COMP9315 Week 05 Thursday Lecture

COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [0/45]
❖ Things To Note

❖ Debugging Assignment 1


The lz4 bug ...

The fix ...
❖ Debugging Assignment 1 (cont)

./run_test.py is a convenience feature, but not essential

Also, it ceases to be useful once the load average on vxdb exceeds 20

You can DIY and get more fine-grained feedback

Testing directories contain

❖ Debugging Assignment 1 (cont)

To do some manual debugging ...

... assume we have done env and p1 ...
$ cd /localstorage/$USER/testing
$ make   ... fix errors in pname.c and pname.source
$ dropdb xyz
$ createdb xyz
$ psql xyz -f pname.sql  ... fix any errors in pname.source
$ cd tests/some_test_directory
$ psql xyz -f schema.sql
$ psql xyz -f data1.sql
$ psql xyz -f queries1.sql
$ ... compare output with expected-data1-queries1.log ...

If a query crashes the server, run the queries from queries1.sql individually

to find which query causes the first crash

❖ Debugging Assignment 1 (cont)

Having found the offending query, it may point to a specific function

$ cd ../..
$ edit pname.c  ... add some elog() statements
$ make
... may need to rebuild the xyz database ...
$ cd tests/some_test_directory
$ psql xyz
xyz=# run the problem query ... examine elog output
xyz=# \q
$ cd ../..
$ edit pname.c  ... maybe remove elog() statements
... repeat above steps until query runs successfully ...
$ ... compare output with expected-data1-queries1.log ...

Problem solved!

❖ Tree Indexes for N-d Selection

❖ Multi-dimensional Tree Indexes

Many multi-dimensional tree index schemes have been developed over the last 20 years, from a range of problem areas

Consider three popular schemes:   kd-trees, Quad-trees, R-trees.

Example data for multi-d trees is based on the following relation:

create table Rel (
    X char(1) check (X between 'a' and 'z'),
    Y integer check (Y between 0 and 9)
);

❖ Multi-dimensional Tree Indexes (cont)

Example tuples:

Rel('a',1)  Rel('a',5)  Rel('b',2)  Rel('d',1)
Rel('d',2)  Rel('d',4)  Rel('d',8)  Rel('g',3)
Rel('j',7)  Rel('m',1)  Rel('r',5)  Rel('z',9)

The tuple-space for the above tuples:

[Diagram:Pics/select/2d-space.png]

❖ Exercise: Query Types and Tuple Space

Which part of the tuple-space does each query represent?


Q1: select * from Rel where X = 'd' and Y = 4
Q2: select * from Rel where X > 'j' and X <= 'r'
Q3: select * from Rel where X > 'm' and Y > 4
Q4: select * from Rel where X between 'k' and 'p' and Y between 3 and 6

[Diagram:Pics/select/2d-space.png]

❖ kd-Trees

kd-trees are multi-way search trees where

[Diagram:Pics/select/kd-tree.png]

❖ kd-Trees (cont)

How this tree partitions the tuple space:

[Diagram:Pics/select/kd-tree-space.png]

❖ Searching in kd-Trees

// Started by Search(Q, R, 0, kdTreeRoot)
Search(Query Q, Relation R, Level L, Node N)
{
   if (isDataPage(N)) {
      Buf = getPage(fileOf(R),idOf(N))
      check Buf for matching tuples
   } else {
      a = attrLev[L]
      if (!hasValue(Q,a))
         nextNodes = all children of N
      else {
         val = getAttr(Q,a)
         nextNodes = find(N,Q,a,val)
      }
      for each C in nextNodes
         Search(Q, R, L+1, C)
}  }
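The pseudocode above can be sketched as a small in-memory kd-tree. This is an illustrative assumption, not the lecture's own code: `insert`, the `'?'`/`-1` wildcard convention for unknown attribute values, and the strict X/Y alternation of discriminators are all choices made here for the sketch.

```c
#include <stdlib.h>

typedef struct Node {
    char x; int y;               /* tuple stored at this node */
    struct Node *lo, *hi;        /* subtrees: < and >= on this level's attr */
} Node;

/* Levels 0,2,4,... discriminate on X; levels 1,3,5,... on Y */
Node *insert(Node *t, char x, int y, int level) {
    if (t == NULL) {
        t = malloc(sizeof(Node));
        t->x = x; t->y = y; t->lo = t->hi = NULL;
        return t;
    }
    int goLo = (level % 2 == 0) ? (x < t->x) : (y < t->y);
    if (goLo) t->lo = insert(t->lo, x, y, level + 1);
    else      t->hi = insert(t->hi, x, y, level + 1);
    return t;
}

/* Count matching tuples; qx == '?' / qy == -1 mean "value unknown",
 * in which case BOTH subtrees must be searched at that level. */
int search(Node *t, char qx, int qy, int level) {
    if (t == NULL) return 0;
    int n = (qx == '?' || t->x == qx) && (qy == -1 || t->y == qy);
    int known = (level % 2 == 0) ? (qx != '?') : (qy != -1);
    if (!known)
        n += search(t->lo, qx, qy, level + 1)
           + search(t->hi, qx, qy, level + 1);
    else {
        int goLo = (level % 2 == 0) ? (qx < t->x) : (qy < t->y);
        n += search(goLo ? t->lo : t->hi, qx, qy, level + 1);
    }
    return n;
}
```

Note how a query like (a,?) follows a single branch at X-levels but fans out at Y-levels, while (?,?) degenerates to visiting the whole tree.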

❖ Exercise: Searching in kd-Trees

Using the following kd-tree index

[Diagram:Pics/select/kd-tree.png]

Answer the queries   (m,1),   (a,?),   (?,1),   (?,?)

❖ Quad Trees

Quad trees use regular, disjoint partitioning of tuple space.

Example:

[Diagram:Pics/select/quad-tree-space.png]

❖ Quad Trees (cont)

Basis for the partitioning:

Note: effective for d≤5, ok for 6≤d≤10, ineffective for d>10
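The regular partitioning can be sketched as a single step of quadrant selection: each level halves the region in both dimensions, and a point's sequence of quadrant choices identifies its leaf cell. The `Region` type and the SW/SE/NW/NE encoding (with y increasing northward) are assumptions of this sketch.

```c
typedef struct { double x0, y0, x1, y1; } Region;   /* low and high corners */

/* Return the quadrant of (x,y) within r, and shrink r to that quadrant.
 * Encoding: 0=SW, 1=SE, 2=NW, 3=NE (y increases northward). */
int quadrant(Region *r, double x, double y) {
    double mx = (r->x0 + r->x1) / 2, my = (r->y0 + r->y1) / 2;
    int east  = (x >= mx), north = (y >= my);
    if (east)  r->x0 = mx; else r->x1 = mx;
    if (north) r->y0 = my; else r->y1 = my;
    return north * 2 + east;
}
```

Repeatedly calling `quadrant` descends one level of the quad-tree per call, which is why the scheme degrades in high dimensions: a d-dimensional version has 2^d children per node.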
❖ Quad Trees (cont)

The previous partitioning gives this tree structure, e.g.

[Diagram:Pics/select/quad-tree1.png]

In this and the following examples, we give the coords of the top-left and bottom-right corners of a region

❖ Searching in Quad-tree

Space query example:

[Diagram:Pics/select/quad-query.png]

Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).

❖ Exercise: Searching in Quad-trees

Using the following quad-tree index

[Diagram:Pics/select/quad-tree-space.png]

Answer the queries   (m,1),   (a,?),   (?,1),   (?,?)

❖ R-Trees

R-trees use a flexible, overlapping partitioning of tuple space.

Overlap and partial cover means ...

Aim: height-balanced, partly-full index pages   (cf. B-trees)
❖ R-Trees (cont)

[Diagram:Pics/select/r-tree.png]

❖ Insertion into R-tree

Insertion of an object R occurs as follows:

Note that R may be a point or a polygon.
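The descent step of insertion is commonly done with a least-enlargement heuristic (as in Guttman's original R-tree): at each internal node, choose the child whose MBR needs the smallest area increase to cover the new object. This sketch assumes that heuristic; the `MBR` type and function names are illustrative.

```c
typedef struct { double x0, y0, x1, y1; } MBR;   /* min/max corners */

double area(MBR m) { return (m.x1 - m.x0) * (m.y1 - m.y0); }

/* Smallest rectangle covering both a and b */
MBR combine(MBR a, MBR b) {
    MBR c = { a.x0 < b.x0 ? a.x0 : b.x0,
              a.y0 < b.y0 ? a.y0 : b.y0,
              a.x1 > b.x1 ? a.x1 : b.x1,
              a.y1 > b.y1 ? a.y1 : b.y1 };
    return c;
}

/* Pick the child MBR needing least area enlargement to include r */
int chooseChild(MBR child[], int n, MBR r) {
    int best = 0; double bestGrow = -1;
    for (int i = 0; i < n; i++) {
        double grow = area(combine(child[i], r)) - area(child[i]);
        if (bestGrow < 0 || grow < bestGrow) { bestGrow = grow; best = i; }
    }
    return best;
}
```

An object already contained in some child costs zero enlargement there, so it is routed into that subtree; overlapping children are what make later searches visit multiple subtrees.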
❖ Query with R-trees

Designed to handle space queries and "where-am-I" queries.

"Where-am-I" query: find all regions containing a given point P:

Space (region) queries are handled in a similar way
❖ Exercise: Query with R-trees

Using the following R-tree:

[Diagram:Pics/select/r-treeA.png]

Show how the following queries would be answered:


Q1: select * from Rel where X = 'a' and Y = 4
Q2: select * from Rel where X = 'i' and Y = 6
Q3: select * from Rel where X between 'c' and 'j' and Y = 5
Q4: select * from Rel where X = 'c'

Note: can view unknown value X=? as range min(X) ≤ X ≤ max(X)

❖ Multi-d Trees in PostgreSQL

Up to version 8.2, PostgreSQL had an R-tree implementation

Superseded by GiST = Generalized Search Trees

GiST indexes parameterise: data type, searching, splitting

GiST trees have the following structural constraints ...

Details: src/backend/access/gist
❖ Costs of Search in Multi-d Trees

Difficult to determine cost precisely.

Best case: pmr query where all attributes have known values

Typical case: some attributes are unknown or defined by range
❖ Similarity Retrieval

❖ Similarity Selection

Relational selection is based on a boolean condition C

Uses for relational selection:
❖ Similarity Selection (cont)

Similarity selection is used in contexts where

Uses for similarity matching:
❖ Example: Content-based Image Retrieval

User supplies a description or sample of desired image.

System returns a ranked list of "matching" images from database.

[Diagram:Pics/select/cbir-sys.png]

❖ Example: Content-based Image Retrieval (cont)

At the SQL level, this might appear as ...

-- relational matching
create view Sunset as
select image from MyPhotos
where  title = 'Pittwater Sunset'
       and taken = '2012-01-01';
-- similarity matching with threshold
create view SimilarSunsets as
select title, image
from   MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order  by (image ~~ (select * from Sunset));

where (imaginary) ~~ operator measures how "alike" images are

❖ Similarity-based Retrieval

The database contains media objects, as well as ordinary tuples, e.g.


BLOB = Binary Large OBject

❖ Similarity-based Retrieval (cont)

Similarity-based retrieval requires a distance measure dist(x,y)

where x and y are two objects (in the database)

Note: distance calculation often requires substantial computational effort


How to restrict the solution set to only the "most similar" objects:

apply a threshold dmax, keeping only objects with dist(x,q) < dmax
take only the k nearest neighbours, i.e. the k objects with smallest dist(x,q)

BUT both of the above methods require knowing the distance between the query object and all objects in the DB
❖ Similarity-based Retrieval (cont)

Naive approach to similarity-based retrieval

q = ...    // query object
dmax = ... // dmax > 0  =>  using threshold
knn = ...  // knn > 0   =>  using nearest-neighbours
Dists = [] // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists  // sorted on d
}
n = 0;  Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results  // sorted on d
}
return Results;

Cost  =  fetch all r objects  +  compute distance()  for each
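The naive algorithm above can be sketched as runnable C. The sketch stands in 1-D "objects" and absolute difference for real feature vectors and a real distance measure; `naiveRetrieve` and its conventions (dmax <= 0 disables the threshold, knn <= 0 disables the k cutoff) are assumptions of this sketch.

```c
#include <stdlib.h>
#include <math.h>

typedef struct { int oid; double d; } Entry;

static int byDist(const void *a, const void *b) {
    double da = ((const Entry *)a)->d, db = ((const Entry *)b)->d;
    return (da > db) - (da < db);
}

/* Fill results[] with oids of matches, sorted by distance; return count. */
int naiveRetrieve(const double val[], int r, double q,
                  double dmax, int knn, int results[]) {
    Entry *dists = malloc(r * sizeof(Entry));
    for (int i = 0; i < r; i++) {            /* "fetch" + distance, all r objects */
        dists[i].oid = i;
        dists[i].d = fabs(val[i] - q);
    }
    qsort(dists, r, sizeof(Entry), byDist);  /* Dists sorted on d */
    int n = 0;
    for (int i = 0; i < r; i++) {
        if (dmax > 0 && dists[i].d > dmax) break;   /* threshold cutoff */
        if (knn > 0 && n >= knn) break;             /* nearest-neighbour cutoff */
        results[n++] = dists[i].oid;
    }
    free(dists);
    return n;
}
```

The cutoffs only trim the output; the O(r) distance computations and the sort are paid regardless, which is exactly the cost the next slide attacks.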

❖ Similarity-based Retrieval (cont)

For some applications, Cost(dist(x,y)) is comparable to Tr

⇒  computing dist(t.val,q) for every tuple t is infeasible.

To improve this ...

i.e. replace dist(t,q) by dist'(vec(t),vec(q)) in previous algorithm.

Further optimisation: dimension-reduction to make vectors smaller

❖ Similarity-based Retrieval (cont)

Feature vectors ...

Content of feature vectors depends on application ...

Query: feature vector representing one point in vh-dim space

Answer: list of objects "near to" query object in this space

❖ Similarity-based Retrieval (cont)

Inputs to content-based similarity-retrieval:

Outputs from content-based similarity-retrieval:
❖ Approaches to kNN Retrieval

Partition-based

Approximation-based
❖ Approaches to kNN Retrieval (cont)

Above approaches try to reduce number of objects considered.

Other optimisations to make kNN retrieval faster
❖ Similarity Retrieval in PostgreSQL

PostgreSQL has always supported simple "similarity" on strings

-- for most SQL implementations
select * from Students where name like '%oo%';
-- and PostgreSQL-specific
select * from Students where name ~ '[Ss]mit';


Also provides support for ranked similarity on text values

❖ Similarity Retrieval in PostgreSQL (cont)

Example of PostgreSQL text retrieval:

create table Docs
   ( id integer, title text, body text );
-- add column to hold document feature vectors
alter table Docs add column features tsvector;
update Docs set features =
   to_tsvector('english', title||' '||body);
-- run a query and get results in ranked order
select title, ts_rank(d.features, query) as rank
from   Docs d,
       to_tsquery('potter|(roger&rabbit)') as query
where  query @@ d.features
order  by rank desc
limit  10;

For more details, see PostgreSQL documentation, Chapter 12.

❖ Signature-based Selection

❖ Indexing with Signatures

Signature-based indexing:

Each tuple is associated with a signature

Instead of scanning/testing tuples, do pre-filtering via signatures.
❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

❖ Signatures

A signature "summarises" the data in one tuple

A tuple consists of n attribute values A1 .. An

A codeword cw(Ai) is

A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
❖ Generating Codewords

Generating a k-in-m codeword for attribute Ai

typedef unsigned int bits;   // enough bits for m <= 32

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;   // count of set bits
   bits cword = 0;
   srandom(hash(attr_value));   // hash() = any string hash function
   while (nbits < k) {
      int i = random() % m;
      if (((1u << i) & cword) == 0) {
         cword |= (1u << i);
         nbits++;
      }
   }
   return cword;  // m bits with k 1-bits and m-k 0-bits
}
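Building on the function above, a tuple descriptor is formed by superimposing (OR-ing) the attribute codewords, and pre-filtering tests whether all of a query's bits are present in a tuple's signature. This sketch supplies a concrete `hash()` (djb2, an assumption; the lecture leaves it unspecified), and the attribute values in the usage test are made up.

```c
#include <stdlib.h>

typedef unsigned int bits;

static unsigned int hash(const char *s) {        /* djb2 string hash (assumed) */
    unsigned int h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

bits codeword(const char *attr_value, int m, int k) {
    int  nbits = 0;
    bits cword = 0;
    srandom(hash(attr_value));                   /* deterministic per value */
    while (nbits < k) {
        int i = random() % m;
        if (((1u << i) & cword) == 0) { cword |= (1u << i); nbits++; }
    }
    return cword;
}

/* Tuple signature = OR of the codewords of its n attribute values */
bits signature(const char *attrs[], int n, int m, int k) {
    bits sig = 0;
    for (int i = 0; i < n; i++) sig |= codeword(attrs[i], m, k);
    return sig;
}

/* A query signature is built the same way, from the known attributes only
 * (unknown attributes contribute no bits).  A tuple MAY match iff all
 * query bits are present in its signature: false positives are possible,
 * false negatives are not, so matches must still be checked on the data. */
int maybeMatches(bits tupleSig, bits querySig) {
    return (tupleSig & querySig) == querySig;
}
```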



Produced: 14 Mar 2024