Similarity-based Selection


Similarity Selection

Relational selection is based on a boolean condition C

Uses for relational selection:


... Similarity Selection

Similarity selection is used in contexts where

Uses for similarity matching:


Similarity-based Retrieval

Similarity-based retrieval typically works as follows:

The system can measure distance between any object and q ...

How to restrict solution set to only the "most similar" objects:


... Similarity-based Retrieval

Tuple structure for storing such data typically contains

Properties of typical distance functions   (on objects x,y,z) are normally the metric axioms:

dist(x,y) ≥ 0   and   dist(x,x) = 0
dist(x,y) = dist(y,x)                      (symmetry)
dist(x,z) ≤ dist(x,y) + dist(y,z)          (triangle inequality)


... Similarity-based Retrieval

Naive approach to similarity-based retrieval

q = ...    // query object
dmax = ... // dmax > 0  =>  using threshold
knn = ...  // knn > 0   =>  using nearest-neighbours
Dists = [] // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists  // sorted on d
}
n = 0;  Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results  // sorted on d
}
return Results;

Cost  =  read all r feature vectors  +  compute dist() for each
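
Concretely, the naive scan might look like this in C (a minimal sketch, not the lecture's code; the dimensionality D, Euclidean dist(), and an in-memory array of vectors are assumptions for illustration):

#include <math.h>
#include <stdlib.h>

#define D 64   // dimensionality -- an assumption for illustration

typedef struct { int oid; double dist; } Result;

// Euclidean distance between two D-dimensional vectors
double dist(const double *x, const double *y) {
    double s = 0;
    for (int i = 0; i < D; i++) { double t = x[i] - y[i]; s += t * t; }
    return sqrt(s);
}

// ascending order on distance
static int cmpResult(const void *a, const void *b) {
    double d = ((const Result *)a)->dist - ((const Result *)b)->dist;
    return (d > 0) - (d < 0);
}

// Naive kNN: one distance computation per stored vector (r of them),
// then sort and keep the k smallest. Returns the number of results.
int naiveKNN(double vecs[][D], int r, const double *q, int k, Result *out) {
    Result *all = malloc(r * sizeof(Result));
    for (int i = 0; i < r; i++) {
        all[i].oid = i;
        all[i].dist = dist(vecs[i], q);
    }
    qsort(all, r, sizeof(Result), cmpResult);
    int n = (k < r) ? k : r;
    for (int i = 0; i < n; i++) out[i] = all[i];
    free(all);
    return n;
}

The threshold (dmax) and nearest-neighbour variants differ only in how the sorted list is cut off, as in the pseudocode above.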


... Similarity-based Retrieval

For some applications, Cost(dist(x,y)) is comparable to Tr

  ⇒ computing dist(t.val,q) for every tuple t is infeasible.

To improve this aspect: compute distances on compact feature vectors rather than on the objects themselves,

i.e. replace dist(t,tq) by dist'(vec(t),vec(tq)) in the previous algorithm.

Further optimisation: dimension-reduction to make vectors smaller


... Similarity-based Retrieval

Content of feature vectors depends on application ...

Typically use multiple features, concatenated into single vector.

Feature vectors represent points in a very high-dimensional space.

Query: a feature vector representing one point in this very high-dimensional space.

Answer: list of objects "near to" query object in this space.


Example: Content-based Image Retrieval

User supplies a description or sample of desired image (features).

System returns a ranked list of "matching" images from database.

[Diagram:Pics/select/cbir-sys-small.png]


... Example: Content-based Image Retrieval

At the SQL level, this might appear as ...

create view Sunset as
select image from MyPhotos
where  title = 'Pittwater Sunset'
       and taken = '2009-01-01';

create view SimilarSunsets as
select title, image
from   MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order  by (image ~~ (select * from Sunset));

where the ~~ operator measures distance between images.


... Example: Content-based Image Retrieval

Implementing content-based retrieval requires ...


... Example: Content-based Image Retrieval

Data structures for naive similarity-retrieval:

[Diagram:Pics/select/vectors-small.png]


... Example: Content-based Image Retrieval

Insertion of an image into the database:

Insertion cost:


... Example: Content-based Image Retrieval

Inputs to content-based similarity-retrieval:

Outputs from content-based similarity-retrieval:


Approaches to kNN Retrieval

Partition-based

Approximation-based


... Approaches to kNN Retrieval

Above approaches mostly try to reduce number of objects considered.

Other optimisations to make kNN retrieval faster


VA Files


VA (Signature) Files

Vector Approximation (VA) file developed by Weber and Blott.

Why give up on sub-linear complexity? For large d (note: d = number of dimensions), partition-based methods end up examining most of the data anyway, so a fast linear scan becomes competitive.


... VA (Signature) Files

Uses a signature file "parallel" to the main feature vector file

[Diagram:Pics/select/sigfile-small.png]


... VA (Signature) Files

VA signatures have properties:

Approach to querying:


VA-File Signatures

Computing VA signatures:

Implemented by taking m high-order bits from each feature vector element.

Consider a 3-d RGB feature vector, with 0..255 for each RGB value:
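
For instance, with m=3 the signature keeps the top 3 bits of each 8-bit component (a minimal C sketch; va_signature and the packed uint32_t representation are illustrative assumptions):

#include <stdio.h>
#include <stdint.h>

// m-bit-per-dimension VA signature for a d-dimensional vector of
// 8-bit values: keep the m high-order bits of each element.
uint32_t va_signature(const uint8_t *v, int d, int m) {
    uint32_t sig = 0;
    for (int i = 0; i < d; i++) {
        uint32_t cell = v[i] >> (8 - m);  // top m bits = region index
        sig = (sig << m) | cell;          // pack, one dimension at a time
    }
    return sig;
}

int main(void) {
    uint8_t rgb[3] = {192, 66, 129};  // example RGB feature vector
    // 192 -> 110, 66 -> 010, 129 -> 100, so the signature is 110 010 100
    printf("signature = %03o (octal)\n", va_signature(rgb, 3, 3));
    return 0;
}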


... VA-File Signatures

[Diagram:Pics/select/va-partition-small.png]


Insertion with VA-Files

Given: a feature vector vobj for a new object.

Insertion is extremely simple.

sig = signature(vobj)
append vobj to feature vector file
append sig to signature file

Storage overhead determined by m   (e.g. m=3 bits per dimension vs 32-bit elements ⇒ signatures are 3/32 the size of the vectors).


Query with VA-Files

Input:   query vector vq

results = [];  maxD = infinity;
for (i = 0; i < r; i++) {
   // fast lower-bound distance from the signature alone
   lowD = minDistance(region[va[i]], vq)
   if  (#results < k  ||  lowD < maxD)  {
      // candidate: fetch the vector, compute the exact distance
      dist = distance(v[i],vq)
      if (#results < k  ||  dist < maxD)  {
         insert (tid[i],dist) into results
         // sorted(results) && length(results) <= k
         maxD = largest distance in results
      }
   }
}

Result is a list of tids of the k database objects nearest to the query.


... Query with VA-Files

VA signatures allow fast elimination of objects by region

Given query vector q, data vector v, can quickly compute bounds lower(q,v) ≤ D(q,v) ≤ upper(q,v) from the signatures alone.

Thus, instead of r expensive D computations, we require r cheap bound computations, plus exact D computations for only the few candidates whose bounds survive.


... Query with VA-Files

[Diagram:Pics/select/va-bounds-small.png]

Marks on each dimension i are stored as array Ri[0], Ri[1], ..., Ri[2^m].


... Query with VA-Files

Distance bound formula:

lower(q,v) = sqrt( ∑i=1..d lowi² )
   where
      lowi = qi - Ri[vi+1]                    if vi < qi
           = 0                                if vi = qi
           = Ri[vi] - qi                      if vi > qi

upper(q,v) = sqrt( ∑i=1..d uppi² )
   where
      uppi = qi - Ri[vi]                      if vi < qi
           = max(qi - Ri[vi], Ri[vi+1] - qi)  if vi = qi
           = Ri[vi+1] - qi                    if vi > qi
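
In C, the lower bound might be computed as below (a sketch; it compares q's coordinates against the boundaries of v's region, which is equivalent to the region-index comparison above, and upper(q,v) would be analogous):

#include <math.h>

// Lower bound on dist(q,v) using only v's VA region.
// R[i][j] = j-th mark on dimension i (j = 0..2^m),
// va[i]   = region index of v on dimension i, q[i] = query coordinate.
double lower_bound(const double *q, const int *va, double **R, int d) {
    double sum = 0;
    for (int i = 0; i < d; i++) {
        double low = 0;
        if (q[i] > R[i][va[i] + 1])       // q above v's region
            low = q[i] - R[i][va[i] + 1];
        else if (q[i] < R[i][va[i]])      // q below v's region
            low = R[i][va[i]] - q[i];
        // otherwise q lies in the same slice: low stays 0
        sum += low * low;
    }
    return sqrt(sum);
}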


Cost of VA File Searching

Performing VA search requires:

A problem with VA files: uniform partitioning suits uniformly distributed data, but real data is usually clustered, so some regions hold many points while most hold none. Can fix this problem using uneven region sizes (but needs more storage for the marks)


... Cost of VA File Searching

Example:   d=200, m=3, r=10^6, B=4096, TD ≅ 0.001·Tr

Cost (without VA file)  =  100000·Tr + 10^6·TD  ≅  101000·Tr
Cost (with VA file)     =  18518·Tr + 1000·Tr + 1000·TD  ≅  19519·Tr
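
The page counts can be reconstructed from the parameters (a sketch of the arithmetic; the vector size and candidate count are inferred from the figures above):

signatures:   200 dims × 3 bits ≈ 75 bytes  ⇒  ⌊4096/75⌋ = 54 per page  ⇒  ~18518 pages
vectors:      ~400 bytes each, i.e. 10 per page  ⇒  10^6/10 = 100000 pages
with VA file: ~1000 candidates survive the bound test  ⇒  ~1000 vector-page reads + 1000 exact distance computations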


Improved VA Search Algorithm

An improved query algorithm that guarantees minimal D computations:

vaq = vasignature(vq)
// Phase 1: scan signatures, collect candidates with their bounds
results = [];  maxD = infinity;  pending = [];
for (i = 0; i < r; i++) {
   lowD = lower(vq,region[va[i]])
   uppD = upper(vq,region[va[i]])
   if  (#results < k  ||  lowD < maxD)  {
      sortedInsert (i,uppD) into results   // keep k smallest uppD
      maxD = k-th smallest uppD in results
      heapInsert (i,lowD) into pending     // min-heap on lowD
   }
}
// Phase 2: visit candidates in increasing lowD order; stop once
// no remaining lower bound can beat the current k-th distance
results = [];  heapRemove (i,lowD) from pending
while (lowD < maxD) {
   dist = distance(v[i],vq)
   if (#results < k  ||  dist < maxD) {
      sortedInsert (i,dist) into results
      // sorted(results) && length(results) <= k
      if (#results == k)
         maxD = largest distance in results
   }
   heapRemove (i,lowD) from pending
}


Curve-based Similarity Search


Curve-based Searching

The strategy when using curve-based indexing is as follows:

Indexing based on curves needs to ...


... Curve-based Searching

Mapping from 3-d to 1-d

[Diagram:Pics/select/curve-search-small.png]

Note: impossible to preserve all NN-in-space as NN-on-curve.


Curve Index File Organisation

Data structures for curve-based searching:

[Diagram:Pics/select/curve-file-small.png]


Insertion with Curve Index

For each image inserted into the database:


Searching with Curve Index

Overview of curve-based searching algorithm:

Input: query feature vector vq

p = map(vq)              // query's point on the curve
cur1 = cur2 = lookup(p)  // two index cursors starting at p
while (not enough answers) {
    (pt,vec,tid) = next(cur1)   // scan forwards along the curve
    remember tid if D(vec,vq) small enough
    (pt,vec,tid) = prev(cur2)   // scan backwards along the curve
    remember tid if D(vec,vq) small enough
}

Output: collected tids of similar objects


... Searching with Curve Index

How curve is scanned to find potential near-neighbours:

[Diagram:Pics/select/curve-scan-small.png]


... Searching with Curve Index

What kind of curves are required to make this approach viable?

Possible candidate curves are shown on the following slides. Experiments have shown that the Hilbert curve is the most suitable for data access.


2d Space-filling Curves

[Diagram:Pics/select/curves1-small.png]


... 2d Space-filling Curves

[Diagram:Pics/select/curves2-small.png]


The CurveIx Scheme

What we would like from the above curve mapping:

With a single Hilbert curve (although not some other curves), most points that are close in space are also close on the curve, but some near neighbours are inevitably mapped far apart. One curve is not enough, so use several "complementary" curves.


... The CurveIx Scheme

Example CurveIx scheme with 2 curves on a 2d space:

[Diagram:Pics/select/c-curves-small.png]


... The CurveIx Scheme

The basic idea:

We want the candidate set to contain most/all of the k-NN ...


Curve Mapping Functions

Each Ci has the following properties:

Points within a small region of the curve are likely to have been mapped from vectors that are close in d-space.
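
For concreteness, here is one such mapping function, a z-order (bit-interleaving) curve, in C; a sketch only, since the Hilbert curves preferred above need a more involved mapping:

#include <stdint.h>

// Z-order (Morton) mapping: interleave the bits of the d coordinates,
// most significant bits first, giving one position on a 1-d curve.
// Assumes b-bit coordinates with d*b <= 64.
uint64_t zorder(const uint32_t *v, int d, int b) {
    uint64_t key = 0;
    for (int bit = b - 1; bit >= 0; bit--)
        for (int i = 0; i < d; i++)
            key = (key << 1) | ((v[i] >> bit) & 1);
    return key;
}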

How to generate multiple Cis from a single vector:


Data Structures for CurveIx

Derived from data structures for the QBIC system:

Ci      a mapping function for each of the m different space-filling curves (i=1..m)
Db      a database of r d-dimensional feature vectors; each entry is a pair
        (imageId, vector), where imageId forms a primary key
Index   a database of r·m curve points; each point is represented by a tuple
        (curveId, point, imageId); the pair (curveId, point) forms an ordered
        primary key, where point = C_curveId(v_imageId)
vq      feature vector for query q
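
In C, these structures might be sketched as follows (field names and widths are illustrative assumptions, not taken from QBIC):

#include <stdint.h>

#define D 64                   // feature vector dimensionality (assumed)

typedef struct {               // one entry in Db
    char   imageId[16];        // primary key
    double vec[D];             // d-dimensional feature vector
} DbEntry;

typedef struct {               // one entry in Index (r*m of these)
    int      curveId;          // which curve, 1..m
    uint64_t point;            // C_curveId(vector): position on that curve
    char     imageId[16];      // back-reference into Db
} IndexEntry;                  // ordered on the key (curveId, point)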


... Data Structures for CurveIx

[Diagram:Pics/select/curveix-file-small.png]


Database Construction

For each image obj, insertion requires:

id = identifier for obj
for (i in 1..m) {
     p = Ci(vobj)
     insert (i, p, id) into Index
}
insert (id, vobj) into Db


Example: Search with 5 Curves

[Diagram:Pics/select/curves-small.png]


Finding k-NN in CurveIx

Given:   vq ,   C1 .. Cm ,   Index ,   Db

for (i in 1..m) {
    p = Ci(vq)                  // query's point on curve i
    lookup (i,p) in Index
    fetch (i,p1,j)
    while (p1 "close to" p on curve i) {
        collect j as candidate
        fetch next (i,p1,j)
    }
}
for each candidate j {
     lookup j in Db
     fetch (j,vj)
     d = D(vj, vq)
     include j in k-NN if d small enough
}


CurveIx vs. Linear Scan

Linear scan:   Cost  =  read all N vectors  +  compute D for each

CurveIx:   Cost  =  m B-tree lookups  +  compute D  f·N times,  where f is the (small) fraction of objects that become candidates

Some observations ...

Difficult to formally analyse further, so we implemented the system to see how it performed ...


Ci Values as Keys

One problem raised early in the implementation:

To use such values as database keys, we need them at fixed precision, so we limit curve expansion (to 4 levels).

Problems:

Solution:


Performance Analysis

Since CurveIx does not guarantee to deliver the k-NN among its candidates, we set an "acceptable" accuracy level of 90%.

In other words, CurveIx must deliver at least 0.9·k of the true k nearest neighbours to be considered useful.

The initial experiments aimed to answer the questions:


Experiments

Measures for accuracy:

Measures for efficiency:


... Experiments

To determine how these measures vary:

Also implemented a linear scan version for comparison and to collect the exact answer sets.


Sample Comparison

CurveIx Indexing:

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 524
1.67user 0.23system ...

Linear Scan:

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 20000
1.93user 1.12system ...


Experimental Results

For fixed database (20K), effect of varying Range, Ncurves

  #Curves   Range   Acc1   Acc2   #Dist
     20       20    6.72   0.20     426
     20       30    7.28   0.28     695
     20       40    7.68   0.36     874
     40       30    8.16   0.40    1301
     40       40    8.60   0.44    1703
     60       30    8.40   0.44    1905
     60       40    8.60   0.48    2413
     80       30    8.87   0.58    2485
     80       40    9.20   0.72    3381
    100       30    9.10   0.70    3061
    100       40    9.28   0.72    4156


Results: Size vs. Accuracy

For fixed CurveIx parameters (80 curves, 30 range), effect of varying database size:

  #Images   Acc1   Acc2
     5K     9.72   0.76
    10K     9.44   0.80
    15K     9.16   0.70
    20K     9.04   0.64


iDistance


iDistance

iDistance   (Jagadish, Ooi, Tan, Yu, Zhang)

The basic idea:


... iDistance

Computing the iDistance:   iDist(p) = j·c + dist(p,Oj),   for a point p in the partition of reference point Oj

[Diagram:Pics/select/iDistance-small.png]

where c is a constant to "spread" the partitions over the iDistance line
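
Mapping a point to its 1-d key might look like this in C (a sketch following the definition above; euclid() and idistance_key() are illustrative names):

#include <math.h>

// Euclidean distance between two d-dimensional points
double euclid(const double *x, const double *y, int d) {
    double s = 0;
    for (int i = 0; i < d; i++) { double t = x[i] - y[i]; s += t * t; }
    return sqrt(s);
}

// iDistance key: assign p to its nearest reference point Oj, then
// map it to the 1-d value j*c + dist(p,Oj). The constant c must
// exceed the largest partition radius, so that the partitions
// occupy disjoint intervals of the iDistance line.
double idistance_key(const double *p, double **O, int m, int d, double c) {
    int j = 0;
    double best = euclid(p, O[0], d);
    for (int i = 1; i < m; i++) {
        double di = euclid(p, O[i], d);
        if (di < best) { best = di; j = i; }
    }
    return j * c + best;
}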


Searching with iDistance

The approach:


... Searching with iDistance

First iteration of search (small r):

[Diagram:Pics/select/iD-search1-small.png]


... Searching with iDistance

Later iteration of search (larger r):

[Diagram:Pics/select/iD-search2-small.png]


... Searching with iDistance

Details of search method:

// Inputs:
//   q = query point, k = # nearest-neighbours
//   O[m] = set of reference points
//   rad[m] = radius of partition around O[i]
//   deltaR = search radius increase on each step
//   maxR   = maximum search radius
int r = 0;       // current search radius around q
int stop = 0;    // flag for when to halt
int seen[];      // partition i already searched?
Obj S[];         // result set, k nearest neighbours
Obj lp[], rp[];  // left/right scan limits within partitions

while (!stop && r < maxR) {
   r = r + deltaR
   stop = SearchRefs(q,r)
}
...


... Searching with iDistance

int SearchRefs(q,r) {
   f = furthest(S,q)
   stop = (dist(f,q) < r && |S| == k)   // all k NN already within radius r
   for (i = 0; i < m; i++) {
      d = dist(O[i],q)
      if (!seen[i]) {
         hs[i] = sphere(O[i],rad[i])    // region covered by partition i
         if (hs[i] contains q) {
            // q lies inside partition i: scan both ways from q's key
            seen[i] = 1
            leaf = BtreeSearch(i*c+d)
            lp[i] = SearchInward(leaf, i*c+d-r)
            rp[i] = SearchOutward(leaf, i*c+d+r)
         }
         elseif (hs[i] intersects sphere(q,r)) {
            // partition i only overlaps the query sphere:
            // start at the partition's outer edge and scan inward
            seen[i] = 1
            leaf = BtreeSearch(i*c+rad[i])
            lp[i] = SearchInward(leaf, i*c+d-r)
         }
      }
      else {
         // partition already visited: extend the previous scans
         if (lp[i] != null)
            lp[i] = SearchInward(left(lp[i]), i*c+d-r)
         if (rp[i] != null)
            rp[i] = SearchOutward(right(rp[i]), i*c+d+r)
      }
   }
   return stop
}


... Searching with iDistance

Cost depends on data distribution and position of Oi

See the analysis in ACM Transactions on Database Systems, vol. 30, June 2005, p. 364.

Determining the set of reference points:

Determining the size of deltaR:


Produced: 17 Jul 2019