Similarity-based Selection
| Similarity Selection | 1/65 |
Relational selection is based on a boolean condition C
| ... Similarity Selection | 2/65 |
Similarity selection is used in contexts where
| Similarity-based Retrieval | 3/65 |
Similarity-based retrieval typically works as follows:
How to restrict solution set to only the "most similar" objects:
| ... Similarity-based Retrieval | 4/65 |
Tuple structure for storing such data typically contains:
   id  : oid    -- object identifier
   val : bytea  -- feature vector
| ... Similarity-based Retrieval | 5/65 |
Naive approach to similarity-based retrieval
q = ...         // query object
dmax = ...      // dmax > 0 => using threshold
knn = ...       // knn > 0 => using nearest-neighbours
Dists = []      // empty list
foreach tuple t in R {
   d = dist(t.val, q)
   insert (t.oid,d) into Dists   // sorted on d
}
n = 0; Results = []
foreach (i,d) in Dists {
   if (dmax > 0 && d > dmax) break;
   if (knn > 0 && ++n > knn) break;
   insert (i,d) into Results     // sorted on d
}
return Results;
Cost = read all r feature vectors + compute distance() for each
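The naive scan above can be sketched as runnable code, assuming Euclidean distance and in-memory (oid, val) pairs (names here are illustrative):

```python
# Naive similarity retrieval: scan all tuples, sort by distance,
# then cut off by threshold (dmax) or by nearest-neighbour count (knn).
import math

def dist(x, y):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def naive_retrieval(R, q, dmax=0, knn=0):
    # R = list of (oid, val) tuples; dmax > 0 => threshold, knn > 0 => k-NN
    dists = sorted((dist(val, q), oid) for oid, val in R)  # read all r vectors
    results = []
    for n, (d, oid) in enumerate(dists, start=1):
        if dmax > 0 and d > dmax:
            break
        if knn > 0 and n > knn:
            break
        results.append((oid, d))    # kept sorted on d
    return results

R = [(1, [0.0, 0.0]), (2, [1.0, 1.0]), (3, [0.1, 0.0]), (4, [5.0, 5.0])]
print(naive_retrieval(R, [0.0, 0.0], knn=2))   # two nearest objects
```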
| ... Similarity-based Retrieval | 6/65 |
For some applications, Cost(dist(x,y)) is comparable to Tr
⇒ computing dist(t.val,q) for every tuple t dominates the cost
To improve this aspect:
Further optimisation: dimension-reduction to make vectors smaller
| ... Similarity-based Retrieval | 7/65 |
Content of feature vectors depends on application ...
Feature vectors represent points in a very high-dimensional space.
Query: feature vector representing one point in vh-dim space.
Answer: list of objects "near to" query object in this space.
| Example: Content-based Image Retrieval | 8/65 |
User supplies a description or sample of desired image (features).
System returns a ranked list of "matching" images from database.
| ... Example: Content-based Image Retrieval | 9/65 |
At the SQL level, this might appear as ...
create view Sunset as
select image from MyPhotos
where title = 'Pittwater Sunset'
and taken = '2009-01-01';
create view SimilarSunsets as
select title, image
from MyPhotos
where (image ~~ (select * from Sunset)) < 0.05
order by (image ~~ (select * from Sunset));
where the ~~ operator computes the distance between two images
| ... Example: Content-based Image Retrieval | 10/65 |
Implementing content-based retrieval requires ...
dist(x,y) = √( (x1-y1)² + (x2-y2)² + ... + (xn-yn)² )
| ... Example: Content-based Image Retrieval | 11/65 |
Data structures for naive similarity-retrieval:
| ... Example: Content-based Image Retrieval | 12/65 |
Insertion of an image into the database:
| ... Example: Content-based Image Retrieval | 13/65 |
Inputs to content-based similarity-retrieval:
| Approaches to kNN Retrieval | 14/65 |
Partition-based
| ... Approaches to kNN Retrieval | 15/65 |
Above approaches mostly try to reduce number of objects considered.
Other optimisations to make kNN retrieval faster
| VA Files |
| VA (Signature) Files | 17/65 |
Vector Approximation (VA) file developed by Weber and Blott.
| ... VA (Signature) Files | 18/65 |
Uses a signature file "parallel" to the main feature vector file
| ... VA (Signature) Files | 19/65 |
VA signatures have properties:
| VA-File Signatures | 20/65 |
Computing VA signatures:
Consider a 3-d RGB feature vector, with 0..255 for each RGB value:
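A sketch of signature computation for this case, assuming m bits per dimension and equally spaced marks over 0..255 (function and parameter names are illustrative):

```python
# Compute a VA signature by mapping each dimension's value to one of
# 2^m cells and packing the m-bit cell indices into one integer.
def va_signature(vec, m=2, lo=0, hi=255):
    cells = 1 << m                      # 2^m cells per dimension
    width = (hi - lo + 1) / cells       # width of each cell
    sig = 0
    for x in vec:
        cell = min(int((x - lo) / width), cells - 1)
        sig = (sig << m) | cell         # pack m bits per dimension
    return sig

# 3-d RGB vector, m=2 bits/dimension => 6-bit signature
print(bin(va_signature([200, 50, 100], m=2)))   # cells 3,0,1 -> 0b110001
```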
| ... VA-File Signatures | 21/65 |
| Insertion with VA-Files | 22/65 |
Given: a feature vector vobj for a new object.
Insertion is extremely simple.
sig = signature(vobj)
append vobj to feature vector file
append sig to signature file
Storage overhead determined by m (e.g. 3/32).
| Query with VA-Files | 23/65 |
Input: query vector vq
results = []; maxD = infinity;
for (i = 0; i < r; i++) {
// fast distance calculation
dist = minDistance(region[va[i]], vq)
if (#results < k || dist < maxD) {
dist = distance(v[i],vq)
if (#results < k || dist < maxD) {
insert (tid[i],dist) into results
// sorted(results) && length(results) <= k
maxD = largest distance in results
}
}
}
Result is a list of tids of the k database objects nearest to the query.
| ... Query with VA-Files | 24/65 |
VA signatures allow fast elimination of objects by region
Given query vector q, data vector v, can quickly compute:
| ... Query with VA-Files | 25/65 |
Marks on each dimension i are stored as an array Ri[0], Ri[1], ..., Ri[2^m].
| ... Query with VA-Files | 26/65 |
Distance bound formulas:

lower(q,v) = √( ∑i=1..d lowi² )
where
   lowi = qi - Ri[vi+1],   if vi < qi
        = 0,               if vi = qi
        = Ri[vi] - qi,     if vi > qi

upper(q,v) = √( ∑i=1..d uppi² )
where
   uppi = qi - Ri[vi],                       if vi < qi
        = max(qi - Ri[vi], Ri[vi+1] - qi),   if vi = qi
        = Ri[vi+1] - qi,                     if vi > qi
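The bounds can be sketched directly from these definitions; here `R[i]` is assumed to hold the 2^m + 1 marks on dimension i, and `sig[i]` is v's cell index on that dimension (the conditions compare q's coordinate against v's cell boundaries):

```python
# Lower/upper bounds on dist(q,v) computed from v's signature cell alone.
import math

def lower_bound(q, sig, R):
    s = 0.0
    for i, qi in enumerate(q):
        vi = sig[i]
        if qi > R[i][vi + 1]:        # v's cell below q (vi < qi)
            low = qi - R[i][vi + 1]
        elif qi < R[i][vi]:          # v's cell above q (vi > qi)
            low = R[i][vi] - qi
        else:                        # q inside v's cell
            low = 0.0
        s += low * low
    return math.sqrt(s)

def upper_bound(q, sig, R):
    s = 0.0
    for i, qi in enumerate(q):
        vi = sig[i]
        if qi > R[i][vi + 1]:
            upp = qi - R[i][vi]
        elif qi < R[i][vi]:
            upp = R[i][vi + 1] - qi
        else:                        # worst case within the cell
            upp = max(qi - R[i][vi], R[i][vi + 1] - qi)
        s += upp * upp
    return math.sqrt(s)

R = [[0, 64, 128, 192, 256]]         # marks for one dimension, m=2
print(lower_bound([100.0], [0], R))  # q above cell [0,64]
print(upper_bound([100.0], [0], R))
```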
| Cost of VA File Searching | 27/65 |
Performing VA search requires:
| ... Cost of VA File Searching | 28/65 |
Example: d=200, m=3, r=10^6, B=4096, TD ≅ 0.001·Tr
| Improved VA Search Algorithm | 29/65 |
An improved query algorithm that guarantees a minimal number of distance computations:
vaq = signature(vq)
results = []; maxD = infinity; pending = [];
// Phase 1: filter candidates using signature bounds only
for (i = 0; i < r; i++) {
   lowD = lower(vq, region[va[i]])
   uppD = upper(vq, region[va[i]])
   if (#results < k || lowD < maxD) {
      sortedInsert (i,uppD) into results
      // keep only the k smallest upper bounds
      maxD = k-th smallest uppD in results (infinity while #results < k)
      heapInsert (i,lowD) into pending
   }
}
// Phase 2: visit candidates in increasing lower-bound order
results = []; maxD = infinity;
heapRemove (i,lowD) from pending
while (lowD < maxD) {
   dist = distance(v[i],vq)
   if (#results < k || dist < maxD) {
      sortedInsert (i,dist) into results
      // sorted(results) && length(results) <= k
      maxD = largest distance in results (infinity while #results < k)
   }
   heapRemove (i,lowD) from pending
}
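The two-phase search can be sketched in runnable form; `lower`/`upper` are passed in as bound functions and, in the usage example, exact distances stand in for the bounds (illustrative only, not the author's implementation):

```python
# Two-phase VA-style k-NN: phase 1 prunes on cheap bounds,
# phase 2 computes exact distances in increasing lower-bound order.
import heapq, math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def va_knn(vectors, lower, upper, q, k):
    # Phase 1: filter on signature bounds only
    maxD = math.inf
    upps, pending = [], []
    for i, v in enumerate(vectors):
        lo, up = lower(q, i), upper(q, i)
        if len(upps) < k or lo < maxD:
            upps = sorted(upps + [up])[:k]       # k smallest upper bounds
            if len(upps) == k:
                maxD = upps[-1]
            heapq.heappush(pending, (lo, i))
    # Phase 2: exact distances in increasing lower-bound order
    results, maxD = [], math.inf
    while pending:
        lo, i = heapq.heappop(pending)
        if lo >= maxD:
            break                                # no closer object possible
        d = dist(vectors[i], q)
        if len(results) < k or d < maxD:
            results = sorted(results + [(d, i)])[:k]
            if len(results) == k:
                maxD = results[-1][0]
    return results

pts = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
exact = lambda q, i: dist(pts[i], q)    # exact bounds, for illustration
print(va_knn(pts, exact, exact, [0.0, 0.0], 2))
```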
| Curve-based Similarity Search |
| Curve-based Searching | 31/65 |
The strategy when using curve-based indexing is as follows:
Indexing based on curves needs to ...
| ... Curve-based Searching | 32/65 |
Mapping from 3-d to 1-d
Note: impossible to preserve all NN-in-space as NN-on-curve.
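One candidate mapping is the Z-order (Morton) curve, which interleaves the bits of the coordinates; a sketch for integer coordinates (8 bits each, as an illustrative assumption):

```python
# Z-order (Morton) mapping from d-dimensional integer vectors to 1-d keys,
# by interleaving coordinate bits from most-significant down.
def z_order(vec, bits=8):
    key = 0
    for b in range(bits - 1, -1, -1):        # most significant bit first
        for x in vec:
            key = (key << 1) | ((x >> b) & 1)
    return key

# Nearby points usually (not always) map to nearby curve positions:
print(z_order([0, 0, 0]), z_order([1, 0, 0]), z_order([255, 255, 255]))
```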
| Curve Index File Organisation | 33/65 |
Data structures for curve-based searching:
| Insertion with Curve Index | 34/65 |
For each image inserted into the database:
| Searching with Curve Index | 35/65 |
Overview of curve-based searching algorithm:
Input: query feature vector vq
p = map(vq)
cur1 = cur2 = lookup(p)
while (not enough answers) {
(pt,vec,tid) = next(cur1)
remember tid if D(vec,vq) small enough
(pt,vec,tid) = prev(cur2)
remember tid if D(vec,vq) small enough
}
Output: collected tids of similar objects
| ... Searching with Curve Index | 36/65 |
How curve is scanned to find potential near-neighbours:
| ... Searching with Curve Index | 37/65 |
What kind of curves are required to make this approach viable?
| 2d Space-filling Curves | 38/65 |
| ... 2d Space-filling Curves | 39/65 |
| The CurveIx Scheme | 40/65 |
What we would like from the above curve mapping:
| ... The CurveIx Scheme | 41/65 |
Example curveix scheme with 2 curves on a 2d space:
| ... The CurveIx Scheme | 42/65 |
The basic idea:
| Curve Mapping Functions | 43/65 |
Each Ci has the following properties:
How to generate multiple Cis from a single vector:
| Data Structures for CurveIx | 44/65 |
Derived from data structures for the QBIC system:
C1..Cm : a mapping function for each of the m different space-filling curves (i=1..m)
Db     : a database of r d-dimensional feature vectors;
         each entry is a pair (imageId, vector), keyed on imageId
Index  : a database of r·m curve points; each point is represented
         by a tuple (curveId, point, imageId), keyed on (curveId, point)
vq     : feature vector for query q
| ... Data Structures for CurveIx | 45/65 |
| Database Construction | 46/65 |
For each image obj, insertion requires:
id = identifier for obj
for (i in 1..m) {
p = Ci(vobj)
insert (i, p, id) into Index
}
insert (id, vobj) into Db
| Example: Search with 5 Curves | 47/65 |
| Finding k-NN in CurveIx | 48/65 |
Given: vq , C1 .. Cm , Index , Db
for (i in 1..m) {
p = Ci(vq)
lookup (i,p) in Index
fetch (i,p1,j)
while (p1 "close to" p on curve i) {
collect j as candidate
fetch next (i,p1,j)
} }
for each candidate j {
lookup j in Db
fetch (j,vj)
d = D(vj , vq)
include j in k-NN if d small enough
}
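The construction and search above can be sketched end-to-end; using Z-order curves over randomly shifted coordinates as the Ci is an illustrative assumption (one plausible way to generate m curves), not the scheme's only option:

```python
# CurveIx sketch: m curves, each a Z-order mapping of shifted coordinates;
# Index[i] is a sorted list of (point, imageId) standing in for a B-tree.
import bisect, math, random

def z_order(vec, bits=8):
    key = 0
    for b in range(bits - 1, -1, -1):
        for x in vec:
            key = (key << 1) | ((x >> b) & 1)
    return key

def make_curves(m, d, bits=8, seed=0):
    rnd = random.Random(seed)
    shifts = [[rnd.randrange(1 << bits) for _ in range(d)] for _ in range(m)]
    def Ci(i, v):   # curve i = Z-order of vector shifted by shifts[i]
        return z_order([(x + s) % (1 << bits) for x, s in zip(v, shifts[i])])
    return Ci

def build_index(Db, Ci, m):
    return [sorted((Ci(i, v), oid) for oid, v in Db.items())
            for i in range(m)]

def knn(Db, Index, Ci, vq, k, rng=4):
    cand = set()
    for i, idx in enumerate(Index):
        p = bisect.bisect_left(idx, (Ci(i, vq), -1))
        for j in range(max(0, p - rng), min(len(idx), p + rng)):
            cand.add(idx[j][1])          # collect curve-nearby candidates
    d = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(v, vq)))
    return sorted(cand, key=lambda oid: d(Db[oid]))[:k]

Db = {0: [10, 10], 1: [12, 11], 2: [200, 200]}
Ci = make_curves(m=3, d=2)
Index = build_index(Db, Ci, 3)
print(knn(Db, Index, Ci, [10, 10], k=2))   # → [0, 1]
```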
| CurveIx vs. Linear Scan | 49/65 |
Linear scan: Cost = read all N vectors + compute D for each
CurveIx: Cost = m B-tree lookups + compute D for f·N vectors (f ≪ 1)
Some observations ...
| Ci Values as Keys | 50/65 |
One problem raised early in the implementation:
Problems:
Solution:
| Performance Analysis | 51/65 |
Since CurveIx does not guarantee to deliver the k-NN among its candidates, we set an "acceptable" accuracy level of 90%.
In other words, CurveIx must deliver at least 0.9k of the true nearest-neighbours to be considered useful.
The initial experiments aimed to answer the questions:
| Experiments | 52/65 |
Measures for accuracy:
| ... Experiments | 53/65 |
To determine how these measures vary:
| Sample Comparison | 54/65 |
Linear Scan (# dist calcs: 20000; 1.93user 1.12system ...):
QUERY: img00102
   0.000000 img00102
   0.005571 img00713
   0.008826 img00268
   0.011413 img05054
   0.011811 img00631
   0.014259 img04042
   0.027598 img00203
   0.037471 img00287
   0.063639 img00244
   0.067583 img00306

CurveIx (# dist calcs: 524; 1.67user 0.23system ...):
QUERY: img00102
   0.000000 img00102
   0.005571 img00713
   0.008826 img00268
   0.011413 img05054
   0.011811 img00631
   0.014259 img04042
   0.027598 img00203
   0.037471 img00287
   0.063639 img00244
   0.067583 img00306

Both return the same 10 nearest neighbours; CurveIx computes far fewer distances.
| Experimental Results | 55/65 |
For fixed database (20K), effect of varying Range, Ncurves
| #Curves | Range | Acc1 | Acc2 | #Dist |
| 20 | 20 | 6.72 | 0.20 | 426 |
| 20 | 30 | 7.28 | 0.28 | 695 |
| 20 | 40 | 7.68 | 0.36 | 874 |
| 40 | 30 | 8.16 | 0.40 | 1301 |
| 40 | 40 | 8.60 | 0.44 | 1703 |
| 60 | 30 | 8.40 | 0.44 | 1905 |
| 60 | 40 | 8.60 | 0.48 | 2413 |
| 80 | 30 | 8.87 | 0.58 | 2485 |
| 80 | 40 | 9.20 | 0.72 | 3381 |
| 100 | 30 | 9.10 | 0.70 | 3061 |
| 100 | 40 | 9.28 | 0.72 | 4156 |
| Results: Size vs. Accuracy | 56/65 |
For fixed CurveIx parameters (80 curves, 30 range), effect of varying database size:
| #Images | Acc1 | Acc2 |
| 5K | 9.72 | 0.76 |
| 10K | 9.44 | 0.80 |
| 15K | 9.16 | 0.70 |
| 20K | 9.04 | 0.64 |
| iDistance |
| iDistance | 58/65 |
iDistance (Jagadish, Ooi, Tan, Yu, Zhang)
| ... iDistance | 59/65 |
Computing the iDistance: D(p) = j·c + dist(p,Oj)
where Oj is the reference point of the partition containing p,
and c is a constant to "spread" the partitions along the iDistance line
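A sketch of the key computation, where j is the partition whose reference point is nearest to p (function names are illustrative):

```python
# iDistance key: partition id j (nearest reference point) times spread
# constant c, plus the distance from p to that reference point.
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def idistance(p, refs, c):
    # j = index of nearest reference point O_j
    j = min(range(len(refs)), key=lambda i: dist(p, refs[i]))
    return j * c + dist(p, refs[j])

refs = [[0.0, 0.0], [10.0, 10.0]]
print(idistance([1.0, 0.0], refs, c=100))    # partition 0 -> key 1.0
print(idistance([10.0, 9.0], refs, c=100))   # partition 1 -> key 101.0
```

c must exceed every partition radius, so keys from different partitions never interleave on the 1-d line.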
| Searching with iDistance | 60/65 |
The approach:
| ... Searching with iDistance | 61/65 |
First iteration of search (small r):
| ... Searching with iDistance | 62/65 |
Later iteration of search (larger r):
| ... Searching with iDistance | 63/65 |
Details of search method:
// Inputs:
//   q = query point, k = # nearest-neighbours
//   O[m] = set of reference points
//   rad[m] = radius of partition around O[i]
//   deltaR = search radius increase on each step
//   maxR = maximum search radius
int r = 0;       // search radius around q
int stop = 0;    // flag for when to halt
int seen[];      // partition i already searched?
Obj S[];         // result set, k nearest neighbours
Obj lp[], rp[];  // left/right search limits within partitions
while (!stop) {
   r = r + deltaR
   stop = SearchRefs(q,r)
}
...
| ... Searching with iDistance | 64/65 |
int SearchRefs(q,r) {
f = furthest(S,q)
stop = (dist(f,q) < r && |S| == k)
for (i = 0; i < m; i++) {
d = dist(O[i],q)
if (!seen[i]) {
hs[i] = sphere(O[i],rad[i])
if (hs[i] contains q) {
seen[i] = 1
leaf = BtreeSearch(i*c+d)
lp[i] = SearchInward(leaf, i*c+d-r)
rp[i] = SearchOutward(leaf, i*c+d+r)
}
else if (hs[i] intersects sphere(q,r)) {
   seen[i] = 1
   leaf = BtreeSearch(i*c + rad[i])
   lp[i] = SearchInward(leaf, i*c+d-r)
}
}
else {
if (lp[i] != null)
lp[i] = SearchInward(left(lp[i]), i*c+d-r)
if (rp[i] != null)
rp[i] = SearchOutward(right(rp[i]), i*c+d+r)
}
}
return stop
}
| ... Searching with iDistance | 65/65 |
Cost depends on data distribution and position of Oi
See analysis in ACM Trans on DB Systems, v.30, Jun 2005, p.364
Determining the set of reference points:
Produced: 17 Jul 2019