mirror of
https://github.com/vitali-fedulov/hyper.git
synced 2025-09-04 19:35:13 +00:00
Compare commits
23 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 90ffdabf78 | | |
| | 57f7f3463b | | |
| | 5e43aec964 | | |
| | 61edb2954d | | |
| | 84f05b9c5c | | |
| | f7ae2a8907 | | |
| | c6d0a02019 | | |
| | bd4ee7fcc6 | | |
| | 53c0313311 | | |
| | af4edd7579 | | |
| | 0cdc91b59c | | |
| | 8330dfbe44 | | |
| | bd63fbbcd4 | | |
| | 5e6500c206 | | |
| | fc910b3659 | | |
| | 1040b4f3a9 | | |
| | 2f83399465 | | |
| | 4d66dfaf90 | | |
| | a96a06a77a | | |
| | b98bf8c70d | | |
| | 5a935f1d16 | | |
| | 8dab742792 | | |
| | a441b7817f | | |
2
LICENSE
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2121 Vitali Fedulov (fedulov.vitali@gmail.com)
+Copyright (c) 2021 Vitali Fedulov (fedulov.vitali@gmail.com)

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
22
README.md
@@ -1,11 +1,19 @@
-# Hashing float vectors in N-dimensions
+# Hashing N-dimensional float vectors

-Package hyper allows fast approximate search of nearest neighbour vectors in n-dimensional space.
+Search nearest neighbour vectors in n-dimensional space with hashes. There are no dependencies in this package.

-Package functions discretize a vector and generate a set of hashes, as described in the [following document](https://vitali-fedulov.github.io/algorithm-for-hashing-high-dimensional-float-vectors.html).
+The algorithm is based on the assumption that two real numbers can be considered equal within certain equality distance. Then quantization is used for comparison. To make sure points near or at quantization borders are also comparable, a vector can be discretized into more than one hash, as described [here](https://vitali-fedulov.github.io/similar.pictures/algorithm-for-hashing-high-dimensional-float-vectors.html) (also as [PDF](https://github.com/vitali-fedulov/research/blob/main/Algorithm%20for%20hashing%20float%20vectors.pdf)). The method indirectly clusters given vectors by hypercubes and their neighbourhoods. It is exhaustive within a set precision.

-To use the package follow the sequence of functions/methods:
-1) CubeSet or CentralCube, depending which one is used for a database record and which one for a query.
-2) HashSet and DecimalHash to get corresponding hash set and central hash from results of (2). If DecimalHash is not suitable because of very large number of buckets or dimensions, use FNV1aHash to get both the hash set and the central hash).
+The algorithm assumes a uniform and normalized vector space - without complex manifolds or very diverse properties of dimensions, which can potentially complicate search. But even with these complications, sufficiently large hypercubes (small number of buckets) will probably work fine for prefiltering or sequential filtering by smaller-dimensional sub-spaces, as briefly mentioned in the article.

-[Example](https://github.com/vitali-fedulov/images3/blob/master/hashes.go) of usage for image comparison.
+It has not been tested on very high-dimensional vectors - but they may produce impractically large hash sets. The linked example below uses only 9 dimensions.
+
+## How to use
+
+1) Normalize each component of your input vectors to the same min/max value range. Use these min/max values in the parameters settings.
+2) Provided a float vector []float64, use `CentralCube` and `CubeSet` functions to generate hypercube coordinates []int and [][]int.
+3) Generate a `DecimalHash`/`FNV1aHash` and `HashSet` for corresponding central hash and hash set from the hypercube coordinates above. The difference between one hash and a hash set is that one corresponds to a hash-table record and the other to a query or vice versa, depending on performance/memory preference. There are 2 alternative hash functions: DecimalHash and FNV1aHash. DecimalHash does not have collisions, but is not suitable for cases with large number of buckets or dimensions. FNV1aHash is applicable for all cases. Hash collisions can be progressively eliminated by using custom hash functions or verifying similarity with the Euclidean metric.
+
+[Example](https://github.com/vitali-fedulov/imagehash2/blob/main/hashes.go) for similar image search and clustering.
+
+[Go doc](https://pkg.go.dev/github.com/vitali-fedulov/hyper) for full code documentation.
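Step 3 of the new README text pairs a single hash (stored with a record) with a hash set (generated for a query), and notes that a decimal hash is collision-free only for a small number of buckets and dimensions. The following is a minimal sketch of that lookup pattern, not the package's own code: `decimalHash`, `index`, `records` and `queryCubes` are hypothetical names, the cubes are written as literals instead of being computed, and the positional base-10 encoding is an assumption about how a collision-free decimal hash could work.

```go
package main

import "fmt"

// decimalHash encodes bucket numbers as digits of a base-10 number.
// Assumption (not taken from the package): every bucket number is in 0..9
// and len(cube) <= 19, so the result fits in uint64 and distinct cubes
// of the same length always get distinct hashes.
func decimalHash(cube []int) uint64 {
	var h, pow uint64 = 0, 1
	for _, b := range cube {
		h += uint64(b) * pow
		pow *= 10
	}
	return h
}

func main() {
	// Each record is stored under the hash of one (central) cube.
	records := map[string][]int{
		"a": {1, 0, 8},
		"b": {2, 0, 8},
	}
	index := map[uint64][]string{}
	for id, cube := range records {
		h := decimalHash(cube)
		index[h] = append(index[h], id)
	}

	// A query carries several cubes (its neighbourhood) and probes
	// the index once per cube.
	queryCubes := [][]int{{0, 0, 8}, {1, 0, 8}}
	for _, c := range queryCubes {
		if ids, ok := index[decimalHash(c)]; ok {
			fmt.Println("candidates:", ids) // prints "candidates: [a]"
		}
	}
}
```

The roles can be swapped (hash sets stored, one hash per query), which trades memory for query speed, as the README notes. With more than 10 buckets or more than 19 dimensions this encoding no longer fits the stated assumptions, which is where a general-purpose hash such as FNV-1a becomes the practical choice.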
2
cubes.go
@@ -18,7 +18,7 @@ type Params struct {
 // CubeSet returns a set of hypercubes, which represent
 // fuzzy discretization of one n-dimensional vector,
 // as described in
-// https://vitali-fedulov.github.io/algorithm-for-hashing-high-dimensional-float-vectors.html
+// https://vitali-fedulov.github.io/similar.pictures/algorithm-for-hashing-high-dimensional-float-vectors.html
 // One hypercube is defined by bucket numbers in each dimension.
 // min and max are minimum and maximum possible values of
 // the vector components. The assumption is that min and max
@@ -17,7 +17,7 @@ func TestDecimalHash(t *testing.T) {
 func TestFNV1aHash(t *testing.T) {
 	cube := Cube{5, 59, 255, 9, 7, 12, 22, 31}
 	hash := cube.FNV1aHash()
-	want := uint64(1659788114117494335)
+	want := uint64(6267598672213710911)
 	if hash != want {
 		t.Errorf(`Got %v, want %v.`, hash, want)
 	}
@@ -31,10 +31,10 @@ func TestHashSet(t *testing.T) {
 		{1, 0, 8, 3, 0, 0, 9}}
 	hashSet := cubes.HashSet((Cube).FNV1aHash)
 	want := []uint64{
-		6172277127052188606,
-		3265650857171344968,
-		13730239218993256724,
-		6843127655045710906}
+		9211138565158515574,
+		6304441926533466432,
+		5296875461196147964,
+		13706017245957046114}
 	if !reflect.DeepEqual(hashSet, want) {
 		t.Errorf(`Got %v, want %v.`, hashSet, want)
 	}
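The tests hash a `Cube` with `FNV1aHash` and pass the method to `HashSet` as a function value. As a rough illustration of that shape, here is a sketch built on the standard library's `hash/fnv`. It is not the package's implementation: `fnv1aHash` and `hashSet` are hypothetical stand-ins, and the byte encoding of each coordinate (8 little-endian bytes) is an assumption, so the output is not expected to match the values in the tests.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// fnv1aHash feeds each bucket number into a 64-bit FNV-1a hash.
// Assumption: every coordinate is encoded as 8 little-endian bytes.
func fnv1aHash(cube []int) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, b := range cube {
		binary.LittleEndian.PutUint64(buf[:], uint64(b))
		h.Write(buf[:]) // Write on a hash.Hash never returns an error
	}
	return h.Sum64()
}

// hashSet applies hashFunc to every cube, keeping the input order.
func hashSet(cubes [][]int, hashFunc func([]int) uint64) []uint64 {
	out := make([]uint64, 0, len(cubes))
	for _, c := range cubes {
		out = append(out, hashFunc(c))
	}
	return out
}

func main() {
	cubes := [][]int{{1, 0, 8, 3}, {1, 0, 8, 4}}
	fmt.Println(hashSet(cubes, fnv1aHash))
}
```

Passing the hash function as a parameter keeps the set-building logic independent of the hash choice, so a decimal or custom hash can be swapped in without touching `hashSet`.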