Similarity-based Selection
Similarity Selection | 1/65 |
Relational selection is based on a boolean condition C: each tuple either satisfies C or it does not, and the result is exactly the set of satisfying tuples.
... Similarity Selection | 2/65 |
Similarity selection is used in contexts where an exact yes/no match is not appropriate, e.g. retrieving multimedia objects that are "similar to" a given query object.
Similarity-based Retrieval | 3/65 |
Similarity-based retrieval typically works as follows: map each object to a point in a high-dimensional feature space, measure similarity between objects via a distance function on their feature vectors, and return the objects whose vectors lie closest to the query's vector.
The solution set can be restricted to only the "most similar" objects either by a threshold dmax (all objects within distance dmax of the query) or by nearest neighbours (the k objects closest to the query).
... Similarity-based Retrieval | 4/65 |
Tuple structure for storing such data typically contains:

   id   (oid)    -- identifier for the object
   val  (bytea)  -- the object's feature vector
... Similarity-based Retrieval | 5/65 |
Naive approach to similarity-based retrieval
q = ...      // query object
dmax = ...   // dmax > 0 => using threshold
knn = ...    // knn > 0 => using nearest-neighbours
Dists = []   // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists   // sorted on d
}
n = 0; Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results   // sorted on d
}
return Results;
Cost = read all r feature vectors + compute distance() for each
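As a concrete illustration, here is a minimal runnable sketch of this naive scan in Python with NumPy; the relation R is modelled as a list of (oid, vector) pairs, and all names are illustrative rather than taken from the slides:

import numpy as np

def dist(x, y):
    # Euclidean distance between two feature vectors
    return float(np.linalg.norm(x - y))

def naive_retrieval(R, q, dmax=0.0, knn=0):
    # R: list of (oid, vector) pairs; q: query vector
    # dmax > 0 => threshold search; knn > 0 => k-NN search
    dists = sorted((dist(val, q), oid) for oid, val in R)
    results = []
    for n, (d, oid) in enumerate(dists, start=1):
        if dmax > 0 and d > dmax:
            break
        if knn > 0 and n > knn:
            break
        results.append((oid, d))
    return results

# e.g. 5 nearest neighbours among 1000 random 64-d vectors
R = [(i, np.random.rand(64)) for i in range(1000)]
print(naive_retrieval(R, np.random.rand(64), knn=5))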
... Similarity-based Retrieval | 6/65 |
For some applications, Cost(dist(x,y)) is comparable to Tr (the time to read a page)
⇒ computing dist(t.val,q) for every tuple t in R is prohibitively expensive.
To improve this aspect: avoid computing the full distance for most objects, e.g. by first applying a much cheaper approximate comparison to eliminate non-candidates.
Further optimisation: dimension-reduction to make vectors smaller, so that each distance computation is cheaper.
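The slides do not prescribe a particular dimension-reduction method; one standard option is random projection, sketched below (it approximately preserves Euclidean distances, per the Johnson-Lindenstrauss lemma):

import numpy as np

def random_projection(vectors, d_new, seed=0):
    # vectors: (r, d) array; returns (r, d_new) array whose pairwise
    # Euclidean distances approximate those of the originals
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    P = rng.normal(size=(d, d_new)) / np.sqrt(d_new)
    return vectors @ P

V = np.random.rand(1000, 256)
W = random_projection(V, 32)   # 256-d -> 32-d: ~8x cheaper distances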
... Similarity-based Retrieval | 7/65 |
Content of feature vectors depends on application ...
Feature vectors represent points in a very high-dimensional space.
Query: feature vector representing one point in vh-dim space.
Answer: list of objects "near to" query object in this space.
Example: Content-based Image Retrieval | 8/65 |
User supplies a description or sample of desired image (features).
System returns a ranked list of "matching" images from database.
... Example: Content-based Image Retrieval | 9/65 |
At the SQL level, this might appear as ...
create view Sunset as
select image
from   MyPhotos
where  title = 'Pittwater Sunset' and taken = '2009-01-01';

create view SimilarSunsets as
select title, image
from   MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order  by (image ~~ (select * from Sunset));
where the ~~ operator returns the "distance" between two images (smaller value = more similar).
... Example: Content-based Image Retrieval | 10/65 |
Implementing content-based retrieval requires a feature extractor for images, plus a measure of the distance between feature vectors, e.g. Euclidean distance:

dist(x,y) = √( (x₁-y₁)² + (x₂-y₂)² + ... + (xₙ-yₙ)² )
... Example: Content-based Image Retrieval | 11/65 |
Data structures for naive similarity-retrieval: just the relation of (id, val) tuples described above, scanned in full for each query.
... Example: Content-based Image Retrieval | 12/65 |
Insertion of an image into the database: compute the image's feature vector and store it, with the object's id, as a new (id, val) tuple.
... Example: Content-based Image Retrieval | 13/65 |
Inputs to content-based similarity-retrieval: a query feature vector q, plus either a distance threshold dmax or a neighbour count k; the output is a list of matching object ids, ranked by distance.
Approaches to kNN Retrieval | 14/65 |
Partition-based: divide the data space into regions (e.g. R-trees, k-d trees, iDistance) and examine only regions that could contain near neighbours.
... Approaches to kNN Retrieval | 15/65 |
Above approaches mostly try to reduce number of objects considered.
Other optimisations to make kNN retrieval faster: reduce the cost of each comparison, e.g. by testing compact approximations of the vectors before the full vectors (as in VA files, next).
VA Files | 16/65 |
VA (Signature) Files | 17/65 |
Vector Approximation (VA) file developed by Weber and Blott.
... VA (Signature) Files | 18/65 |
Uses a signature file "parallel" to the main feature vector file
... VA (Signature) Files | 19/65 |
VA signatures have properties: they are much smaller than the vectors they summarise (m bits per dimension rather than e.g. 32), and they permit cheap lower and upper bounds on the true distance, so most objects can be eliminated without reading their full vectors.
VA-File Signatures | 20/65 |
Computing VA signatures: split each dimension into 2^m intervals, and encode a vector by the m-bit interval index in each dimension, concatenated over all dimensions.
Consider a 3-d RGB feature vector, with 0..255 for each RGB value: with m = 2, each dimension is split into four intervals (0..63, 64..127, 128..191, 192..255), so e.g. (200, 50, 100) has signature 11 00 01.
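A small Python sketch of this signature computation, assuming equal-width intervals and m bits per dimension; the marks array corresponds to the slides' Ri, and the helper names are illustrative:

import numpy as np

def va_signature(v, marks, m):
    # v: d-dim vector; marks[i] holds the 2^m + 1 interval boundaries
    # for dimension i; result is one m-bit region index per dimension
    regions = []
    for i, x in enumerate(v):
        r = int(np.searchsorted(marks[i], x, side='right')) - 1
        regions.append(min(max(r, 0), 2**m - 1))   # clamp to valid range
    return regions

m = 2
marks = [np.linspace(0, 256, 2**m + 1)] * 3    # 0,64,128,192,256 per dimension
print(va_signature([200, 50, 100], marks, m))  # -> [3, 0, 1], i.e. 11 00 01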
... VA-File Signatures | 21/65 |
Insertion with VA-Files | 22/65 |
Given: a feature vector vobj for a new object.
Insertion is extremely simple.
sig = signature(vobj)
append vobj to feature vector file
append sig to signature file
Storage overhead determined by m: with m bits per dimension versus 32-bit values, the signature file is m/32 the size of the vector file (e.g. m=3 gives 3/32 ≈ 10%).
Query with VA-Files | 23/65 |
Input: query vector vq
results = []; maxD = infinity;
for (i = 0; i < r; i++) {
    // fast distance calculation using the signature
    dist = minDistance(region[va[i]], vq)
    if (#results < k || dist < maxD) {
        // full distance calculation using the feature vector
        dist = distance(v[i],vq)
        if (#results < k || dist < maxD) {
            insert (tid[i],dist) into results
                // sorted(results) && length(results) <= k
            maxD = largest distance in results
        }
    }
}
Result is a list of tids of the k database objects nearest to the query.
... Query with VA-Files | 24/65 |
VA signatures allow fast elimination of objects by region
Given query vector q, data vector v, can quickly compute lower(q,v) and upper(q,v), bounds on dist(q,v) computed from v's signature alone.
... Query with VA-Files | 25/65 |
Marks on each dimension i are stored as an array Ri[0], Ri[1], ..., Ri[2^m].
... Query with VA-Files | 26/65 |
Distance bound formula (vi denotes the region index of v in dimension i; each condition compares v's region with the region containing qi):

lower(q,v) = √( ∑i=1..d lowi² )
  where lowi = qi − Ri[vi+1]   if vi < qi
             = 0               if vi = qi
             = Ri[vi] − qi     if vi > qi

upper(q,v) = √( ∑i=1..d uppi² )
  where uppi = qi − Ri[vi]                       if vi < qi
             = max(qi − Ri[vi], Ri[vi+1] − qi)   if vi = qi
             = Ri[vi+1] − qi                     if vi > qi
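A Python sketch of these bounds under the same assumptions as the signature sketch above (marks[i] plays the role of Ri; reg holds the region indexes that form v's signature):

import numpy as np

def va_bounds(q, reg, marks):
    # returns (lower, upper) bounds on dist(q,v), using only v's signature
    low2 = upp2 = 0.0
    for i, qi in enumerate(q):
        lo, hi = marks[i][reg[i]], marks[i][reg[i] + 1]   # v's cell in dim i
        if qi > hi:          # v's region lies entirely below qi
            low, upp = qi - hi, qi - lo
        elif qi < lo:        # v's region lies entirely above qi
            low, upp = lo - qi, hi - qi
        else:                # qi falls inside v's region
            low, upp = 0.0, max(qi - lo, hi - qi)
        low2 += low * low
        upp2 += upp * upp
    return float(np.sqrt(low2)), float(np.sqrt(upp2))

marks = [np.linspace(0, 256, 5)] * 3          # m = 2, as above
print(va_bounds([210, 60, 100], [3, 0, 1], marks))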
Cost of VA File Searching | 27/65 |
Performing VA search requires: reading the entire (small) signature file, computing distance bounds for each signature, and reading the full feature vector for each candidate that survives the bound test.
... Cost of VA File Searching | 28/65 |
Example: d=200, m=3, r=10⁶, B=4096, TD ≅ 0.001 Tr
Improved VA Search Algorithm | 29/65 |
An improved query algorithm that guarantees minimal D computations:
vaq = vasignature(vq)
results = []; maxD = infinity; pending = [];

// Phase 1: scan signatures; keep objects whose lower bound
// beats the k-th smallest upper bound seen so far
for (i = 0; i < r; i++) {
    lowD = lower(vq, region[va[i]])
    uppD = upper(vq, region[va[i]])
    if (#results < k || lowD < maxD) {
        sortedInsert (i,uppD) into results
            // sorted(results) && length(results) <= k
        maxD = largest uppD in results
        heapInsert (i,lowD) into pending
    }
}

// Phase 2: fetch candidate vectors in increasing lowD order
results = [];
heapRemove (i,lowD) from pending
while (lowD < maxD) {
    dist = distance(v[i],vq)
    if (#results < k || dist < maxD) {
        sortedInsert (i,dist) into results
            // sorted(results) && length(results) <= k
        maxD = largest distance in results
    }
    heapRemove (i,lowD) from pending
}
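A self-contained Python sketch of this two-phase strategy: it takes precomputed (lowD, uppD) pairs plus the full vectors, and computes real distances only for candidates whose lower bound beats the current cutoff; the toy bounds in the demo are illustrative only:

import heapq
import numpy as np

def va_knn(bounds, vectors, vq, k):
    # bounds: one (lowD, uppD) pair per object; vectors: full vectors
    # phase 1: keep objects whose lower bound beats the k-th smallest
    # upper bound seen so far
    upp_k = []     # max-heap (negated) of the k smallest upper bounds
    pending = []   # min-heap of (lowD, i) candidates
    for i, (lowD, uppD) in enumerate(bounds):
        if len(upp_k) < k or lowD < -upp_k[0]:
            heapq.heappush(upp_k, -uppD)
            if len(upp_k) > k:
                heapq.heappop(upp_k)
            heapq.heappush(pending, (lowD, i))
    maxD = -upp_k[0]

    # phase 2: visit candidates in increasing lower-bound order;
    # stop once the next lower bound exceeds the k-th real distance
    results = []   # max-heap (negated) of the k best real distances
    while pending:
        lowD, i = heapq.heappop(pending)
        if len(results) == k and lowD >= maxD:
            break
        d = float(np.linalg.norm(vectors[i] - vq))
        if len(results) < k or d < maxD:
            heapq.heappush(results, (-d, i))
            if len(results) > k:
                heapq.heappop(results)
            if len(results) == k:
                maxD = min(maxD, -results[0][0])
    return sorted((-nd, i) for nd, i in results)

V = np.random.rand(1000, 32); q = np.random.rand(32)
true = np.linalg.norm(V - q, axis=1)
bounds = [(max(t - 0.1, 0.0), t + 0.1) for t in true]   # toy bounds, demo only
print(va_knn(bounds, V, q, k=5))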
Curve-based Similarity Search | 30/65 |
Curve-based Searching | 31/65 |
The strategy when using curve-based indexing is as follows: map each d-dimensional feature vector to a 1-dimensional position on a space-filling curve, store (curve-position, tid) pairs in an ordered index (e.g. a B-tree), and answer a query by scanning along the curve near the query's position.
Indexing based on curves needs to preserve locality: objects that are close in the feature space should usually be close on the curve.
... Curve-based Searching | 32/65 |
Mapping from 3-d to 1-d
Note: impossible to preserve all NN-in-space as NN-on-curve.
Curve Index File Organisation | 33/65 |
Data structures for curve-based searching: a database of (id, feature-vector) tuples, plus an ordered index of (curve-position, id) pairs supporting lookup and next/prev scanning along the curve.
Insertion with Curve Index | 34/65 |
For each image inserted into the database: compute its feature vector v, compute the curve position p = map(v), then insert (p, id) into the curve index and (id, v) into the database.
Searching with Curve Index | 35/65 |
Overview of curve-based searching algorithm:
Input: query feature vector vq

p = map(vq)
cur1 = cur2 = lookup(p)
while (not enough answers) {
    // scan forwards along the curve
    (pt,vec,tid) = next(cur1)
    remember tid if D(vec,vq) small enough
    // scan backwards along the curve
    (pt,vec,tid) = prev(cur2)
    remember tid if D(vec,vq) small enough
}

Output: collected tids of similar objects
... Searching with Curve Index | 36/65 |
How curve is scanned to find potential near-neighbours:
... Searching with Curve Index | 37/65 |
What kind of curves are required to make this approach viable? Curves that are cheap to compute and that preserve locality, so that points close in the space tend to be close on the curve: space-filling curves such as the Z-order and Hilbert curves.
2d Space-filling Curves | 38/65 |
... 2d Space-filling Curves | 39/65 |
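The slides show these curves pictorially; as a concrete stand-in, here is a sketch of the Z-order (Morton) mapping for 2-d points, which simply interleaves the bits of the two coordinates:

def morton_key(x, y, bits=16):
    # interleave the bits of x and y; the resulting keys order 2-d
    # points along a Z-order (Morton) space-filling curve
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

# nearby points usually (not always) get nearby keys:
print(morton_key(3, 5), morton_key(4, 5), morton_key(60, 1))  # 39 50 1362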
The CurveIx Scheme | 40/65 |
What we would like from the above curve mapping: points that are near in the vector space should also be near on the curve; since no single curve can achieve this for all neighbour pairs, CurveIx uses several curves.
... The CurveIx Scheme | 41/65 |
Example curveix scheme with 2 curves on a 2d space:
... The CurveIx Scheme | 42/65 |
The basic idea: use m different space-filling curves C1..Cm (each a transformed copy of a base curve); an object close to the query on any one of the curves becomes a candidate, so neighbours missed by one curve are usually caught by another.
Curve Mapping Functions | 43/65 |
Each Ci has the following properties: it maps a d-dimensional vector to a 1-dimensional curve position, it is cheap to compute, and it approximately preserves locality (close in space ⇒ usually close on the curve).
How to generate multiple Cis from a single vector: apply a different transformation of the space (e.g. a distinct shift or rotation) before computing the base curve mapping, as in the search sketch further below.
Data Structures for CurveIx | 44/65 |
Derived from data structures for the QBIC system:
C1..Cm ... a mapping function for each of the m different space-filling curves (i = 1..m)
Db ....... a database of r d-dimensional feature vectors; each entry is a pair (imageId, vector), keyed on imageId
Index .... a database of r·m curve points; each point is represented by a tuple (curveId, point, imageId), keyed on (curveId, point)
vq ....... feature vector for query q
... Data Structures for CurveIx | 45/65 |
Database Construction | 46/65 |
For each image obj, insertion requires:
id = identifier for obj
for (i in 1..m) {
    p = Ci(vobj)
    insert (i, p, id) into Index
}
insert (id, vobj) into Db
Example: Search with 5 Curves | 47/65 |
Finding k-NN in CurveIx | 48/65 |
Given: vq , C1 .. Cm , Index , Db
for (i in 1..m) {
    p = Ci(vq)
    lookup (i,p) in Index
    fetch (i,p1,j)
    while (p1 "close to" p on curve i) {
        collect j as candidate
        fetch next (i,p1,j)
    }
}
for each candidate j {
    lookup j in Db
    fetch (j,vj)
    d = D(vj, vq)
    include j in k-NN if d small enough
}
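A self-contained Python sketch of the whole scheme, using randomly shifted Z-order curves for the Ci and sorted lists (via bisect) in place of B-trees; the shift-based curve family is an illustrative choice, not prescribed by the slides:

import bisect
import numpy as np

BITS = 10   # resolution (bits) per dimension

def curve_key(v, shift):
    # Z-order key of a vector in [0,1)^d, after this curve's shift
    n = 1 << BITS
    coords = [min(int(((x + s) % 1.0) * n), n - 1) for x, s in zip(v, shift)]
    key = 0
    for b in range(BITS):
        for j, coord in enumerate(coords):
            key |= ((coord >> b) & 1) << (b * len(coords) + j)
    return key

def build(db, m, d, seed=0):
    rng = np.random.default_rng(seed)
    shifts = [rng.random(d) for _ in range(m)]
    # one sorted list of (key, imageId) per curve, standing in for B-trees
    index = [sorted((curve_key(v, s), oid) for oid, v in db) for s in shifts]
    return shifts, index

def knn(db, shifts, index, vq, k, rng_width=10):
    cand = set()
    for s, curve in zip(shifts, index):
        pos = bisect.bisect_left(curve, (curve_key(vq, s), -1))
        # collect ids within rng_width positions either side on this curve
        for _, oid in curve[max(0, pos - rng_width):pos + rng_width]:
            cand.add(oid)
    vecs = dict(db)
    return sorted((float(np.linalg.norm(vecs[j] - vq)), j) for j in cand)[:k]

db = [(i, np.random.rand(8)) for i in range(2000)]
shifts, index = build(db, m=20, d=8)
print(knn(db, shifts, index, np.random.rand(8), k=5))

Here rng_width plays the role of the Range parameter in the experiments below: how far to scan either side of the query's position on each curve.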
CurveIx vs. Linear Scan | 49/65 |
Linear scan: Cost = read all N vectors + compute D for each
CurveIx: Cost = m B-tree lookups + compute D for f·N vectors, where f is the (small) fraction of objects that become candidates
Some observations ...
Ci Values as Keys | 50/65 |
One problem raised early in the implementation: curve values Ci(v) are very large numbers (b bits per dimension × d dimensions), far larger than standard numeric key types.
Problems: such values cannot be stored in native integer columns, and multi-word numeric comparison is expensive.
Solution: encode each curve value as a fixed-length byte string whose lexicographic order matches its numeric order, and use that string as the B-tree key.
Performance Analysis | 51/65 |
Since CurveIx does not guarantee to deliver the k-NN among its candidates, we set an "acceptable" accuracy level of 90%.
In other words, CurveIx must deliver at least 0.9·k of the true k nearest neighbours to be considered useful.
The initial experiments aimed to answer the questions: how do accuracy and cost vary with the number of curves, the range scanned on each curve, and the size of the database?
Experiments | 52/65 |
Measures for accuracy (for k = 10): Acc1 = how many of the true 10 nearest neighbours are returned (out of 10); Acc2 = fraction of queries for which all 10 true nearest neighbours are returned.
... Experiments | 53/65 |
To determine how these measures vary: run many sample queries, varying the number of curves, the range scanned on each curve, and the database size.
Sample Comparison | 54/65 |
Linear Scan:

QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 20000
1.93user 1.12system ...

CurveIx (returns the identical 10-NN list):

QUERY: img00102
# dist calcs: 524
1.67user 0.23system ...
Experimental Results | 55/65 |
For fixed database (20K), effect of varying Range, Ncurves
#Curves | Range | Acc1 | Acc2 | #Dist |
20 | 20 | 6.72 | 0.20 | 426 |
20 | 30 | 7.28 | 0.28 | 695 |
20 | 40 | 7.68 | 0.36 | 874 |
40 | 30 | 8.16 | 0.40 | 1301 |
40 | 40 | 8.60 | 0.44 | 1703 |
60 | 30 | 8.40 | 0.44 | 1905 |
60 | 40 | 8.60 | 0.48 | 2413 |
80 | 30 | 8.87 | 0.58 | 2485 |
80 | 40 | 9.20 | 0.72 | 3381 |
100 | 30 | 9.10 | 0.70 | 3061 |
100 | 40 | 9.28 | 0.72 | 4156 |
Results: Size vs. Accuracy | 56/65 |
For fixed CurveIx parameters (80 curves, 30 range), effect of varying database size:
#Images | Acc1 | Acc2 |
5K | 9.72 | 0.76 |
10K | 9.44 | 0.80 |
15K | 9.16 | 0.70 |
20K | 9.04 | 0.64 |
iDistance | 57/65 |
iDistance | 58/65 |
iDistance (Jagadish, Ooi, Tan, Yu, Zhang): partition the data space around m reference points O1..Om, and index each point by a 1-d value derived from its distance to its partition's reference point, using a standard B+-tree.
... iDistance | 59/65 |
Computing the iDistance: D(p) = j·c + dist(p,Oj), for a point p in partition j with reference point Oj,
where c is a constant large enough to "spread" the partitions along the iDistance line without overlap
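A Python sketch of index construction under this scheme, with a sorted array of keys standing in for the B+-tree; the reference points and the value of c are illustrative choices:

import numpy as np

def build_idistance(db, refs, c):
    # db: list of (oid, vector); refs: list of reference points O[j]
    # key each point p by D(p) = j*c + dist(p,O_j), j = nearest reference
    keys = []
    radius = [0.0] * len(refs)   # max distance seen per partition
    for oid, v in db:
        dists = [float(np.linalg.norm(v - O)) for O in refs]
        j = int(np.argmin(dists))
        radius[j] = max(radius[j], dists[j])
        keys.append((j * c + dists[j], j, oid))
    keys.sort()                  # sorted array stands in for the B+-tree
    return keys, radius

d, c = 8, 1000.0                 # c must exceed any possible partition radius
db = [(i, np.random.rand(d)) for i in range(5000)]
refs = [np.random.rand(d) for _ in range(16)]
keys, radius = build_idistance(db, refs, c)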
Searching with iDistance | 60/65 |
The approach: search a sphere of increasing radius r around the query point q; a partition need only be examined if its sphere intersects the query sphere, and within such a partition only points whose iDistance keys fall in [j·c + d − r, j·c + d + r] (where d = dist(q,Oj)) need be fetched; stop once the k nearest neighbours found so far are all within distance r.
... Searching with iDistance | 61/65 |
First iteration of search (small r):
... Searching with iDistance | 62/65 |
Later iteration of search (larger r):
... Searching with iDistance | 63/65 |
Details of search method:
// Inputs:
//   q = query point, k = # nearest-neighbours
//   O[m] = set of reference points
//   rad[m] = radius of partition around O[i]
//   deltaR = search radius increase on each step
//   maxR = maximum search radius

int r = 0;       // search radius around q
int stop = 0;    // flag for when to halt
int seen[];      // partition i already searched?
Obj S[];         // result set, k nearest neighbours
Obj lp[], rp[];  // left/right scan limits within partitions

while (!stop) {
    r = r + deltaR
    stop = SearchRefs(q,r)
}
...
... Searching with iDistance | 64/65 |
int SearchRefs(q,r)
{
    f = furthest(S,q)
    stop = (dist(f,q) < r && |S| == k)
    for (i = 0; i < m; i++) {
        d = dist(O[i],q)
        if (!seen[i]) {
            hs[i] = sphere(O[i],rad[i])
            if (hs[i] contains q) {
                seen[i] = 1
                leaf = BtreeSearch(i*c+d)
                lp[i] = SearchInward(leaf, i*c+d-r)
                rp[i] = SearchOutward(leaf, i*c+d+r)
            }
            elseif (hs[i] intersects sphere(q,r)) {
                seen[i] = 1
                leaf = BtreeSearch(i*c+rad[i])
                lp[i] = SearchInward(leaf, i*c+d-r)
            }
        }
        else {
            if (lp[i] != null)
                lp[i] = SearchInward(left(lp[i]), i*c+d-r)
            if (rp[i] != null)
                rp[i] = SearchOutward(right(rp[i]), i*c+d+r)
        }
    }
    return stop
}
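A simplified runnable version of the expanding-radius search, continuing the build_idistance sketch above; bisect range scans stand in for SearchInward/SearchOutward, and the incremental lp/rp bookkeeping is omitted for clarity:

import bisect
import numpy as np

def idistance_knn(keys, radius, refs, c, vecs, q, k, deltaR=0.05, maxR=10.0):
    r, S, checked = 0.0, [], set()
    while r < maxR:
        r += deltaR
        for j, O in enumerate(refs):
            d = float(np.linalg.norm(q - O))
            if d - radius[j] > r:    # partition sphere misses query sphere
                continue
            # scan keys in [j*c + d - r, j*c + d + r]
            lo = bisect.bisect_left(keys, (j * c + d - r,))
            hi = bisect.bisect_left(keys, (j * c + d + r + 1e-9,))
            for _, part, oid in keys[lo:hi]:
                if part == j and oid not in checked:
                    checked.add(oid)
                    S.append((float(np.linalg.norm(vecs[oid] - q)), oid))
        S = sorted(S)[:k]
        if len(S) == k and S[-1][0] <= r:   # k-th neighbour confirmed
            return S
    return S

# continuing from build_idistance:
vecs = {oid: v for oid, v in db}
print(idistance_knn(keys, radius, refs, c, vecs, np.random.rand(d), k=5))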
... Searching with iDistance | 65/65 |
Cost depends on data distribution and position of Oi
See analysis in ACM Trans on DB Systems, v.30, Jun 2005, p.364
Determining the set of reference points: choose points that follow the data distribution, e.g. cluster centres found by a clustering algorithm, or points on the edges of data partitions.
deltaR trades off the number of search iterations against the amount of extra data scanned on each iteration.
Produced: 17 Jul 2019