COMP9315 Week 05 Thursday Lecture
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [0/45]
- Assignment 1
- due midnight Friday 15 March (TOMORROW!)
- standard UNSW late penalties apply: 5%/day for 5 days
- 3 weeks ago, I said "Don't leave it to the last minute"
- Quiz 3
- out Monday 18 March ... due Friday 22 March
- Assignment 2
- does not involve PostgreSQL
- build a DBMS component (indexing scheme) in C
- out Friday 22 March ... due Friday 12 April
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [1/45]
The lz4 bug ...
- you mess up setting the length of a PersonName object
- PostgreSQL sees it as a large object and wants to TOAST it
- TOASTed objects are compressed using lz4 compression
The fix ...
- set the size of PersonName objects using SET_VARSIZE()
- make sure the size includes VARHDRSZ + name length + 1
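A minimal sketch of an input function that sets the size correctly; the PersonName layout, field names, and pname_in are illustrative assumptions, not the official solution:

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical layout: varlena header followed by a NUL-terminated string */
typedef struct {
    char vl_len_[4];                  /* varlena header; set via SET_VARSIZE() */
    char name[FLEXIBLE_ARRAY_MEMBER]; /* the name string itself */
} PersonName;

PG_FUNCTION_INFO_V1(pname_in);

Datum
pname_in(PG_FUNCTION_ARGS)
{
    char *str = PG_GETARG_CSTRING(0);
    int   len = strlen(str);
    /* total size = VARHDRSZ + name length + 1 (the terminating '\0') */
    PersonName *result = (PersonName *) palloc(VARHDRSZ + len + 1);
    SET_VARSIZE(result, VARHDRSZ + len + 1);
    memcpy(result->name, str, len + 1);
    PG_RETURN_POINTER(result);
}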
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [2/45]
❖ Debugging Assignment 1 (cont)
./run_test.py is a convenience feature, but not essential.
Also, it ceases to be useful once the load average on vxdb exceeds 20.
You can DIY and get more fine-grained feedback
Testing directories contain
- info.txt ... description of what is being tested
- schema.sql ... schema, including PersonNames
- dataX.sql ... sets of tuples, with PersonNames
- queriesY.sql ... set of tests in SQL
- expected-dataX-queriesY.log ... expected output
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [3/45]
❖ Debugging Assignment 1 (cont)
To do some manual debugging ...
$ cd /localstorage/$USER/testing
$ make
$ dropdb xyz
$ createdb xyz
$ psql xyz -f pname.sql
$ cd tests/some_test_directory
$ psql xyz -f schema.sql
$ psql xyz -f data1.sql
$ psql xyz -f queries1.sql
$
If it crashes the server, run the individual queries from queries1.sql
to find which query causes the first crash.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [4/45]
❖ Debugging Assignment 1 (cont)
Having found the offending query, it may indicate which specific function is at fault
$ cd ../..
$ edit pname.c
$ make
$ cd tests/some_test_directory
$ psql xyz
xyz=# ... run the failing query ...
xyz=# \q
$ cd ../..
$ edit pname.c
$
Problem solved!
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [5/45]
❖ Tree Indexes for N-d Selection
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [6/45]
❖ Multi-dimensional Tree Indexes
Over the last 20 years, from a range of problem areas
- different multi-d tree index schemes have been proposed
- varying primarily in how they partition tuple-space
Consider three popular schemes: kd-trees, Quad-trees, R-trees.
Example data for multi-d trees is based on the following relation:
create table Rel (
X char(1) check (X between 'a' and 'z'),
Y integer check (Y between 0 and 9)
);
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [7/45]
❖ Multi-dimensional Tree Indexes (cont)
Example tuples:
Rel('a',1) Rel('a',5) Rel('b',2) Rel('d',1)
Rel('d',2) Rel('d',4) Rel('d',8) Rel('g',3)
Rel('j',7) Rel('m',1) Rel('r',5) Rel('z',9)
The tuple-space for the above tuples:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [8/45]
❖ Exercise: Query Types and Tuple Space
Which part of the tuple-space does each query represent?
Q1: select * from Rel where X = 'd' and Y = 4
Q2: select * from Rel where 'j' < X ≤ 'r'
Q3: select * from Rel where X > 'm' and Y > 4
Q4: select * from Rel where 'k' ≤ X ≤ 'p' and 3 ≤ Y ≤ 6
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [9/45]
kd-trees are multi-way search trees where
- each level of the tree partitions on a different attribute
- each node contains n-1 key values, pointers to n subtrees
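As a concrete rendering, one possible C layout for such a node; the names and the fixed branching factor are assumptions for this sketch, not lecture code:

#define NWAY 4    /* branching factor: NWAY-1 keys, NWAY subtree pointers */

typedef struct KdNode {
    int  isData;                 /* true if this node refers to a data page */
    int  nkeys;                  /* number of keys in use (<= NWAY-1) */
    int  keys[NWAY - 1];         /* values of the attribute tested at this level */
    struct KdNode *child[NWAY];  /* one subtree per key interval */
} KdNode;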
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [10/45]
How this tree partitions the tuple space:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [11/45]
Search(Query Q, Relation R, Level L, Node N)
{
   if (isDataPage(N)) {
      Buf = getPage(fileOf(R), idOf(N))
      check Buf for matching tuples
   } else {
      a = attrLev[L]
      if (!hasValue(Q,a))
         nextNodes = all children of N
      else {
         val = getAttr(Q,a)
         nextNodes = find(N,Q,a,val)
      }
      for each C in nextNodes
         Search(Q, R, L+1, C)
   }
}
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [12/45]
❖ Exercise: Searching in kd-Trees
Using the following kd-tree index
Answer the queries: (m,1), (a,?), (?,1), (?,?)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [13/45]
Quad trees use regular, disjoint partitioning of tuple space.
- for 2d, partition space into quadrants (NW, NE, SW, SE)
- each quadrant can be further subdivided into four, etc.
Example:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [14/45]
Basis for the partitioning:
- a quadrant that has no sub-partitions is a leaf quadrant
- each leaf quadrant maps to a single data page
- subdivide until points in each quadrant fit into one data page
- ideal: same number of points in each leaf quadrant (balanced)
- point density varies over space
⇒ different regions require different levels of partitioning
- this means that the tree is not necessarily balanced
Note: effective for d ≤ 5, ok for 6 ≤ d ≤ 10, ineffective for d > 10
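To make the quadrant-based descent concrete, a small sketch (types and names made up for illustration) of choosing which quadrant of a region contains a query point; each level of the tree applies this test to a region half the size of its parent:

typedef struct { double x, y; } Point;
typedef struct { Point topLeft, botRight; } Region;

/* Which quadrant of region r contains point p?
 * Returns 0=NW, 1=NE, 2=SW, 3=SE (an arbitrary numbering). */
int quadrantOf(Region r, Point p)
{
    double midX = (r.topLeft.x + r.botRight.x) / 2;
    double midY = (r.topLeft.y + r.botRight.y) / 2;
    int east  = (p.x >= midX);   /* in the right half? */
    int south = (p.y <  midY);   /* in the lower half? (y grows upward) */
    return 2*south + east;
}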
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [15/45]
The previous partitioning gives this tree structure, e.g.
In this and the following examples, we give the coords of the top-left and bottom-right corners of each region
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [16/45]
Space query example:
Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [17/45]
❖ Exercise: Searching in Quad-trees
Using the following quad-tree index
Answer the queries: (m,1), (a,?), (?,1), (?,?)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [18/45]
R-trees use a flexible, overlapping partitioning of tuple space.
- each node in the tree represents a kd hypercube
- its children represent (possibly overlapping) subregions
- the child regions do not need to cover the entire parent region
Overlap and partial cover means:
- can optimize space partitioning wrt data distribution
- so that there are similar numbers of points in each region
Aim: height-balanced, partly-full index pages (cf. B-tree)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [19/45]
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [20/45]
Insertion of an object R occurs as follows:
- start at root, look for children that completely contain R
- if no child completely contains R, choose one of the children
  and expand its boundaries so that it does contain R
- if several children contain R, choose one and proceed to child
- repeat above containment search in children of current node
- once we reach a data page, insert R if there is room
- if no room in data page, replace by two data pages
- partition existing objects between two data pages
- update node pointing to data pages
(may cause B-tree-like propagation of node changes up into tree)
Note that R may be a point or a polygon.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [21/45]
Designed to handle space queries and "where-am-I" queries.
"Where-am-I" query: find all regions containing a given point P:
- start at root, select all children whose subregions contain P
- if there are zero such regions, search finishes with P not found
- otherwise, recursively search within node for each subregion
- once we reach a leaf, we know that region contains P
Space (region) queries are handled in a similar way
- we traverse down any path that intersects the query region
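A sketch of the "where-am-I" traversal in C (node layout and names invented for illustration; real R-tree code differs). Note that the loop may recurse into several children, since sibling regions can overlap:

#include <stdio.h>

#define MAXENTRIES 4

typedef struct { double x, y; } Point;
typedef struct { Point lo, hi; } MBR;    /* minimum bounding rectangle */

typedef struct RTNode {
    int  isLeaf;
    int  nentries;
    MBR  mbr[MAXENTRIES];                /* region covered by each entry */
    struct RTNode *child[MAXENTRIES];    /* subtrees (data refs in leaves) */
} RTNode;

static int contains(MBR r, Point p)
{
    return r.lo.x <= p.x && p.x <= r.hi.x &&
           r.lo.y <= p.y && p.y <= r.hi.y;
}

void whereAmI(RTNode *n, Point p)
{
    for (int i = 0; i < n->nentries; i++) {
        if (!contains(n->mbr[i], p)) continue;
        if (n->isLeaf)
            printf("leaf region %d contains P\n", i);
        else
            whereAmI(n->child[i], p);    /* may follow several paths */
    }
}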
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [22/45]
❖ Exercise: Query with R-trees
Using the following R-tree:
Show how the following queries would be answered:
Q1: select * from Rel where X='a' and Y=4
Q2: select * from Rel where X='i' and Y=6
Q3: select * from Rel where 'c'≤X≤'j' and Y=5
Q4: select * from Rel where X='c'
Note: an unknown value X=? can be viewed as the range min(X) ≤ X ≤ max(X)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [23/45]
❖ Multi-d Trees in PostgreSQL
Up to version 8.2, PostgreSQL had an R-tree implementation.
Superseded by GiST = Generalized Search Trees.
GiST indexes parameterise: data type, searching, splitting
- via seven user-defined functions (e.g. picksplit())
GiST trees have the following structural constraints:
- every node is at least fraction f full (e.g. 0.5)
- the root node has at least two children (unless it is also a leaf)
- all leaves appear at the same level
Details: src/backend/access/gist
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [24/45]
❖ Costs of Search in Multi-d Trees
Difficult to determine cost precisely.
Best case: pmr query where all attributes have known values
- in kd-trees and quad-trees, follow single tree path
- cost is equal to depth D of tree
- in R-trees, may follow several paths (overlapping partitions)
Typical case: some attributes are unknown or defined by range
- need to visit multiple sub-trees
- how many depends on: range, choice-points in tree nodes
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [25/45]
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [26/45]
Relational selection is based on a boolean condition C
- evaluate C for each tuple t
- if C(t) is true, add t to the result set
- if C(t) is false, t is not part of the solution
- result is a set of tuples { t1, t2, ..., tn } all of which satisfy C
Uses for relational selection:
- precise matching on structured data
- using individual attributes with known, exact values
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [27/45]
❖ Similarity Selection (cont)
Similarity selection is used in contexts where
- cannot define a precise matching condition
- can define a measure d of "distance" between tuples
- d=0 is an exact match, d>0 is a less accurate match
- result is a list of pairs [ (t1,d1), (t2,d2), ..., (tn,dn) ] (ordered by di)
Uses for similarity matching:
- text or multimedia (image/music) retrieval
- ranked queries in conventional databases
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [28/45]
❖ Example: Content-based Image Retrieval
User supplies a description or sample of desired image.
System returns a ranked list of "matching" images from database.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [29/45]
❖ Example: Content-based Image Retrieval (cont)
At the SQL level, this might appear as ...
create view Sunset as
select image from MyPhotos
where title = 'Pittwater Sunset'
and taken = '2012-01-01';
create view SimilarSunsets as
select title, image
from MyPhotos
where (image ~~ (select * from Sunset)) < 0.05
order by (image ~~ (select * from Sunset));
where the (imaginary) ~~ operator measures how "alike" images are
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [30/45]
❖ Similarity-based Retrieval
Database contains media objects, but also tuples, e.g.
- id to uniquely identify the object (e.g. PostgreSQL oid)
- metadata (e.g. artist, title, genre, date taken, ...)
- value of the object itself (e.g. PostgreSQL BLOB or bytea)
BLOB = Binary Large OBject
- BLOB stored in separate file; tuple contains reference (cf. TOAST)
- BLOBs are typically MB in size (1MB..2GB)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [31/45]
❖ Similarity-based Retrieval (cont)
Similarity-based retrieval requires a distance measure, where x and y are two objects (in the database):
- dist(x,y) ∈ [0..1],  dist(x,x) = 0
- dist(x,y) = dist(y,x)
Note: distance calculation often requires substantial computational effort
How to restrict solution set to only the "most similar" objects:
- threshold dmax ... only objects t such that dist(t,q) ≤ dmax
- count k ... the k closest objects (k nearest neighbours)
BUT both of the above methods require knowing the distance between the query object and all objects in the DB
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [32/45]
❖ Similarity-based Retrieval (cont)
Naive approach to similarity-based retrieval
q = ...      // query object
dmax = ...   // distance threshold (ignored if 0)
knn = ...    // number of nearest neighbours (ignored if 0)
Dists = []
foreach tuple t in R {
   d = dist(t.val, q)
   insert (t.oid,d) into Dists   // keep Dists ordered by d
}
n = 0; Results = []
foreach (i,d) in Dists {
   if (dmax > 0 && d > dmax) break;
   if (knn > 0 && ++n > knn) break;
   insert (i,d) into Results
}
return Results;
Cost = fetch all r objects + compute distance() for each
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [33/45]
❖ Similarity-based Retrieval (cont)
For some applications, Cost(dist(x,y)) is comparable to Tr
⇒ computing dist(t.val,q) for every tuple t is infeasible.
To improve this ...
- compute feature vector to capture "critical" object properties
- store feature vectors "in parallel" with objects (cf. signatures)
- compute distance using feature vectors (not objects)
i.e. replace dist(t,q) by dist'(vec(t),vec(q)) in the previous algorithm.
Further optimisation: dimension-reduction to make vectors smaller
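For concreteness, a sketch of a typical dist' over n-dimensional feature vectors; Euclidean distance is one common choice (scaling the result into [0..1] is assumed to happen elsewhere):

#include <math.h>

/* Euclidean distance between two n-dimensional feature vectors */
double vecDist(const double *vx, const double *vy, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = vx[i] - vy[i];
        sum += diff * diff;
    }
    return sqrt(sum);
}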
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [34/45]
❖ Similarity-based Retrieval (cont)
Feature vectors ...
- often use multiple features, concatenated into single vector
- represent points in a very high-dimensional (vh-dim) space
Content of feature vectors depends on application ...
- image ... colour histogram (e.g. 100's of values/dimensions)
- music ... loudness/pitch/tone (e.g. 100's of values/dimensions)
- text ... term frequencies (e.g. 1000's of values/dimensions)
Query: feature vector representing one point in vh-dim space
Answer: list of objects "near to" query object in this space
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [35/45]
❖ Similarity-based Retrieval (cont)
Inputs to content-based similarity-retrieval:
- a database of r objects (obj1, obj2, ..., objr), plus associated ...
- r n-dimensional feature vectors (vobj1, vobj2, ..., vobjr)
- a query image q with an associated n-dimensional vector vq
- a distance measure D(vi,vj) : [0..1)   (D=0 → vi=vj)
Outputs from content-based similarity-retrieval:
- a list of the k nearest objects in the database [a1, a2, ..., ak]
- ordered by distance: D(va1,vq) ≤ D(va2,vq) ≤ ... ≤ D(vak,vq)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [36/45]
❖ Approaches to kNN Retrieval
Partition-based
- use auxiliary data structure to identify candidates
- space/data-partitioning methods: e.g. k-d-B-tree, R-tree, ...
- unfortunately, such methods "fail" when #dims > 10..20
- an absolute upper bound on d, beyond which a linear scan is always best, is d = 610
Approximation-based
- use approximating data structure to identify candidates
- signatures: VA-files
- projections: iDistance, LSH, MedRank, CurveIX, Pyramid
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [37/45]
❖ Approaches to kNN Retrieval (cont)
Above approaches try to reduce number of objects considered.
- cf. indexes in relational databases
Other optimisations to make kNN retrieval faster
- reduce I/O by reducing size of vectors (compression, d-reduction)
- reduce I/O by placing "similar" records together (clustering)
- reduce I/O by remembering previous pages (caching)
- reduce cpu by making distance computation faster
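As one example of the last point, a standard trick (not specific to any system discussed here) is partial-distance early exit: abandon a distance computation as soon as the running sum exceeds the current k-th best distance:

#include <math.h>

/* Squared Euclidean distance with early exit: valid because every
 * term added is non-negative, so the sum can only grow. */
double vecDistBounded(const double *vx, const double *vy, int n, double bestSq)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = vx[i] - vy[i];
        sum += diff * diff;
        if (sum > bestSq)
            return INFINITY;   /* cannot beat the current k-th best */
    }
    return sum;
}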
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [38/45]
❖ Similarity Retrieval in PostgreSQL
PostgreSQL has always supported simple "similarity" on strings
select * from Students where name like '%oo%';
select * from Students where name ~ '[Ss]mit';
Also provides support for ranked similarity on text values
- using the tsvector data type (stemmed, stopped feature vector for text)
- using the tsquery data type (stemmed, stopped feature vector for strings)
- using the @@ similarity operator
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [39/45]
❖ Similarity Retrieval in PostgreSQL (cont)
Example of PostgreSQL text retrieval:
create table Docs
( id integer, title text, body text );
alter table Docs add column features tsvector;
update Docs set features =
to_tsvector('english', title||' '||body);
select title, ts_rank(d.features, query) as rank
from Docs d,
to_tsquery('potter|(roger&rabbit)') as query
where query @@ d.features
order by rank desc
limit 10;
For more details, see PostgreSQL documentation, Chapter 12.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [40/45]
❖ Signature-based Selection
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [41/45]
❖ Indexing with Signatures
Signature-based indexing:
- designed for pmr queries (conjunction of equalities)
- does not try to achieve better than O(n) performance
- attempts to provide an "efficient" linear scan
Each tuple is associated with a signature
- a compact (lossy) descriptor for the tuple
- formed by combining information from multiple attributes
- stored in a signature file, parallel to data file
Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [42/45]
❖ Indexing with Signatures (cont)
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement ⇒ can use with other indexing.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [43/45]
A signature "summarises" the data in one tuple.
A tuple consists of n attribute values A1 .. An.
A codeword cw(Ai) is
- a bit-string, m bits long, where k bits are set to 1 (k ≪ m)
- derived from the value of a single attribute Ai
A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
- could combine by overlaying (or concatenating) codewords
- aim to have roughly half of the bits set to 1
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [44/45]
Generating a k-in-m codeword for attribute Ai
typedef unsigned int bits;    /* assumes m <= 32; use a wider type for larger m */

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;               /* count of bits set so far */
   bits cword = 0;               /* codeword under construction */
   srandom(hash(attr_value));    /* seed PRNG from the attribute value */
   while (nbits < k) {
      int i = random() % m;      /* choose a random bit position */
      if (((1 << i) & cword) == 0) {
         cword |= (1 << i);      /* set bit i if not already set */
         nbits++;
      }
   }
   return cword;                 /* m-bit codeword with exactly k 1-bits */
}
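To complete the picture, a sketch (same m ≤ 32 assumption, illustrative names) of overlaying codewords into a tuple signature and pre-filtering with it, using codeword() from above:

/* Overlay (OR together) the codewords of all n attribute values */
bits makeSignature(char *attrs[], int n, int m, int k)
{
    bits sig = 0;
    for (int i = 0; i < n; i++)
        sig |= codeword(attrs[i], m, k);
    return sig;
}

/* A tuple can match the query only if every 1-bit of the query
 * signature is also set in the tuple signature.  False positives
 * are possible (signatures are lossy), so candidate tuples must
 * still be fetched and checked against the actual query. */
int maybeMatches(bits querySig, bits tupleSig)
{
    return (querySig & tupleSig) == querySig;
}

A query signature is formed the same way, from just the attributes with known values.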
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [45/45]
Produced: 14 Mar 2024