COMP9315 Week 05 Thursday Lecture
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [0/45]
- Assignment 1
- due midnight Friday 15 March (TOMORROW!)
- standard UNSW late penalties apply: 5%/day for 5 days
- 3 weeks ago, I said "Don't leave it to the last minute"
- Quiz 3
- out Monday 18 March ... due Friday 22 March
- Assignment 2
- does not involve PostgreSQL
- build a DBMS component (indexing scheme) in C
- out Friday 22 March ... due Friday 12 April
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [1/45]
The lz4 bug ...
- you mess up setting the length of a PersonName object
- PostgreSQL sees it as a large object and wants to TOAST it
- TOASTed objects are compressed using lz4 compression
The fix ...
- set the size of PersonName objects using SET_VARSIZE()
- make sure the size includes VARHDRSZ + name length + 1
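A minimal sketch of an input function that sets the size correctly; the PersonName layout, field names, and pname_in are illustrative assumptions, not the official solution:

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical layout: varlena header followed by a NUL-terminated string */
typedef struct {
    char vl_len_[4];                  /* varlena header; set via SET_VARSIZE() */
    char name[FLEXIBLE_ARRAY_MEMBER]; /* the name string itself */
} PersonName;

PG_FUNCTION_INFO_V1(pname_in);

Datum
pname_in(PG_FUNCTION_ARGS)
{
    char *str = PG_GETARG_CSTRING(0);
    int   len = strlen(str);
    /* total size = VARHDRSZ + name length + 1 (the terminating '\0') */
    PersonName *result = (PersonName *) palloc(VARHDRSZ + len + 1);
    SET_VARSIZE(result, VARHDRSZ + len + 1);
    memcpy(result->name, str, len + 1);
    PG_RETURN_POINTER(result);
}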
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [2/45]
❖ Debugging Assignment 1 (cont)
./run_test.py is a convenience feature, but not essential.
Also, it ceases to be useful once the load average on vxdb exceeds 20.
You can DIY and get more fine-grained feedback
Testing directories contain
- info.txt ... description of what is being tested
- schema.sql ... schema, including PersonNames
- dataX.sql ... sets of tuples, with PersonNames
- queriesY.sql ... set of tests in SQL
- expected-dataX-queriesY.log ... expected output
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [3/45]
❖ Debugging Assignment 1 (cont)
To do some manual debugging ...
$ cd /localstorage/$USER/testing
$ make
$ dropdb xyz
$ createdb xyz
$ psql xyz -f pname.sql
$ cd tests/some_test_directory
$ psql xyz -f schema.sql
$ psql xyz -f data1.sql
$ psql xyz -f queries1.sql
$
If it crashes the server, run the individual queries from queries1.sql
to find which query causes the first crash.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [4/45]
❖ Debugging Assignment 1 (cont)
Having found the offending query, it may indicate which specific function is at fault
$ cd ../..
$ edit pname.c
$ make
$ cd tests/some_test_directory
$ psql xyz
xyz=# ... run the failing query ...
xyz=# \q
$ cd ../..
$ edit pname.c
$
Problem solved!
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [5/45]
❖ Tree Indexes for N-d Selection
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [6/45]
❖ Multi-dimensional Tree Indexes
Over the last 20 years, from a range of problem areas
- different multi-d tree index schemes have been proposed
- varying primarily in how they partition tuple-space
Consider three popular schemes: kd-trees, Quad-trees, R-trees.
Example data for multi-d trees is based on the following relation:
create table Rel (
X char(1) check (X between 'a' and 'z'),
Y integer check (Y between 0 and 9)
);
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [7/45]
❖ Multi-dimensional Tree Indexes (cont)
Example tuples:
Rel('a',1) Rel('a',5) Rel('b',2) Rel('d',1)
Rel('d',2) Rel('d',4) Rel('d',8) Rel('g',3)
Rel('j',7) Rel('m',1) Rel('r',5) Rel('z',9)
The tuple-space for the above tuples:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [8/45]
❖ Exercise: Query Types and Tuple Space
Which part of the tuple-space does each query represent?
Q1: select * from Rel where X = 'd' and Y = 4
Q2: select * from Rel where 'j' < X ≤ 'r'
Q3: select * from Rel where X > 'm' and Y > 4
Q4: select * from Rel where 'k' ≤ X ≤ 'p' and 3 ≤ Y ≤ 6
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [9/45]
kd-trees are multi-way search trees where
- each level of the tree partitions on a different attribute
- each node contains n-1 key values, pointers to n subtrees
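As a concrete rendering, one possible C layout for such a node; the names and the fixed branching factor are assumptions for this sketch, not lecture code:

#define NWAY 4    /* branching factor: NWAY-1 keys, NWAY subtree pointers */

typedef struct KdNode {
    int  isData;                 /* true if this node refers to a data page */
    int  nkeys;                  /* number of keys in use (<= NWAY-1) */
    int  keys[NWAY - 1];         /* values of the attribute tested at this level */
    struct KdNode *child[NWAY];  /* one subtree per key interval */
} KdNode;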
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [10/45]
How this tree partitions the tuple space:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [11/45]
Search(Query Q, Relation R, Level L, Node N)
{
   if (isDataPage(N)) {
      Buf = getPage(fileOf(R), idOf(N))
      check Buf for matching tuples
   } else {
      a = attrLev[L]
      if (!hasValue(Q,a))
         nextNodes = all children of N
      else {
         val = getAttr(Q,a)
         nextNodes = find(N,Q,a,val)
      }
      for each C in nextNodes
         Search(Q, R, L+1, C)
   }
}
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [12/45]
❖ Exercise: Searching in kd-Trees
Using the following kd-tree index
Answer the queries: (m,1), (a,?), (?,1), (?,?)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [13/45]
Quad trees use regular, disjoint partitioning of tuple space.
- for 2d, partition space into quadrants (NW, NE, SW, SE)
- each quadrant can be further subdivided into four, etc.
Example:
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [14/45]
Basis for the partitioning:
- a quadrant that has no sub-partitions is a leaf quadrant
- each leaf quadrant maps to a single data page
- subdivide until points in each quadrant fit into one data page
- ideal: same number of points in each leaf quadrant (balanced)
- point density varies over space
⇒ different regions require different levels of partitioning
- this means that the tree is not necessarily balanced
Note: effective for d ≤ 5, ok for 6 ≤ d ≤ 10, ineffective for d > 10
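To make the quadrant-based descent concrete, a small sketch (types and names made up for illustration) of choosing which quadrant of a region contains a query point; each level of the tree applies this test to a region half the size of its parent:

typedef struct { double x, y; } Point;
typedef struct { Point topLeft, botRight; } Region;

/* Which quadrant of region r contains point p?
 * Returns 0=NW, 1=NE, 2=SW, 3=SE (an arbitrary numbering). */
int quadrantOf(Region r, Point p)
{
    double midX = (r.topLeft.x + r.botRight.x) / 2;
    double midY = (r.topLeft.y + r.botRight.y) / 2;
    int east  = (p.x >= midX);   /* in the right half? */
    int south = (p.y <  midY);   /* in the lower half? (y grows upward) */
    return 2*south + east;
}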
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [15/45]
The previous partitioning gives this tree structure, e.g.
In this and the following examples, we give the coords of the top-left and bottom-right corners of each region
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [16/45]
Space query example:
Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [17/45]
❖ Exercise: Searching in Quad-trees
Using the following quad-tree index
Answer the queries: (m,1), (a,?), (?,1), (?,?)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [18/45]
R-trees use a flexible, overlapping partitioning of tuple space.
- each node in the tree represents a kd hypercube
- its children represent (possibly overlapping) subregions
- the child regions do not need to cover the entire parent region
Overlap and partial cover means:
- can optimize space partitioning wrt data distribution
- so that there are similar numbers of points in each region
Aim: height-balanced, partly-full index pages (cf. B-tree)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [19/45]
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [20/45]
Insertion of an object R occurs as follows:
- start at root, look for children that completely contain R
- if no child completely contains R, choose one of the children
  and expand its boundaries so that it does contain R
- if several children contain R, choose one and proceed to child
- repeat above containment search in children of current node
- once we reach a data page, insert R if there is room
- if no room in data page, replace by two data pages
- partition existing objects between two data pages
- update node pointing to data pages
(may cause B-tree-like propagation of node changes up into tree)
Note that R may be a point or a polygon.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [21/45]
Designed to handle space queries and "where-am-I" queries.
"Where-am-I" query: find all regions containing a given point P:
- start at root, select all children whose subregions contain P
- if there are zero such regions, search finishes with P not found
- otherwise, recursively search within node for each subregion
- once we reach a leaf, we know that region contains P
Space (region) queries are handled in a similar way
- we traverse down any path that intersects the query region
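A sketch of the "where-am-I" traversal in C (node layout and names invented for illustration; real R-tree code differs). Note that the loop may recurse into several children, since sibling regions can overlap:

#include <stdio.h>

#define MAXENTRIES 4

typedef struct { double x, y; } Point;
typedef struct { Point lo, hi; } MBR;    /* minimum bounding rectangle */

typedef struct RTNode {
    int  isLeaf;
    int  nentries;
    MBR  mbr[MAXENTRIES];                /* region covered by each entry */
    struct RTNode *child[MAXENTRIES];    /* subtrees (data refs in leaves) */
} RTNode;

static int contains(MBR r, Point p)
{
    return r.lo.x <= p.x && p.x <= r.hi.x &&
           r.lo.y <= p.y && p.y <= r.hi.y;
}

void whereAmI(RTNode *n, Point p)
{
    for (int i = 0; i < n->nentries; i++) {
        if (!contains(n->mbr[i], p)) continue;
        if (n->isLeaf)
            printf("leaf region %d contains P\n", i);
        else
            whereAmI(n->child[i], p);    /* may follow several paths */
    }
}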
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [22/45]
❖ Exercise: Query with R-trees
Using the following R-tree:
Show how the following queries would be answered:
Q1: select * from Rel where X='a' and Y=4
Q2: select * from Rel where X='i' and Y=6
Q3: select * from Rel where 'c'≤X≤'j' and Y=5
Q4: select * from Rel where X='c'
Note: an unknown value X=? can be viewed as the range min(X) ≤ X ≤ max(X)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [23/45]
❖ Multi-d Trees in PostgreSQL
Up to version 8.2, PostgreSQL had an R-tree implementation.
Superseded by GiST = Generalized Search Trees.
GiST indexes parameterise: data type, searching, splitting
- via seven user-defined functions (e.g. picksplit())
GiST trees have the following structural constraints:
- every node is at least fraction f full (e.g. 0.5)
- the root node has at least two children (unless it is also a leaf)
- all leaves appear at the same level
Details: src/backend/access/gist
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [24/45]
❖ Costs of Search in Multi-d Trees
Difficult to determine cost precisely.
Best case: pmr query where all attributes have known values
- in kd-trees and quad-trees, follow single tree path
- cost is equal to depth D of tree
- in R-trees, may follow several paths (overlapping partitions)
Typical case: some attributes are unknown or defined by range
- need to visit multiple sub-trees
- how many depends on: range, choice-points in tree nodes
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [25/45]
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [26/45]
Relational selection is based on a boolean condition C
- evaluate C for each tuple t
- if C(t) is true, add t to the result set
- if C(t) is false, t is not part of the solution
- result is a set of tuples { t1, t2, ..., tn } all of which satisfy C
Uses for relational selection:
- precise matching on structured data
- using individual attributes with known, exact values
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [27/45]
❖ Similarity Selection (cont)
Similarity selection is used in contexts where
- cannot define a precise matching condition
- can define a measure d of "distance" between tuples
- d=0 is an exact match, d>0 is a less accurate match
- result is a list of pairs [ (t1,d1), (t2,d2), ..., (tn,dn) ] (ordered by di)
Uses for similarity matching:
- text or multimedia (image/music) retrieval
- ranked queries in conventional databases
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [28/45]
❖ Example: Content-based Image Retrieval
User supplies a description or sample of desired image.
System returns a ranked list of "matching" images from database.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [29/45]
❖ Example: Content-based Image Retrieval (cont)
At the SQL level, this might appear as ...
create view Sunset as
select image from MyPhotos
where title = 'Pittwater Sunset'
and taken = '2012-01-01';
create view SimilarSunsets as
select title, image
from MyPhotos
where (image ~~ (select * from Sunset)) < 0.05
order by (image ~~ (select * from Sunset));
where the (imaginary) ~~ operator measures how "alike" images are
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [30/45]
❖ Similarity-based Retrieval
Database contains media objects, but also tuples, e.g.
- id to uniquely identify the object (e.g. PostgreSQL oid)
- metadata (e.g. artist, title, genre, date taken, ...)
- value of the object itself (e.g. PostgreSQL BLOB or bytea)
BLOB = Binary Large OBject
- BLOB stored in separate file; tuple contains reference (cf. TOAST)
- BLOBs are typically MB in size (1MB..2GB)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [31/45]
❖ Similarity-based Retrieval (cont)
Similarity-based retrieval requires a distance measure, where x and y are two objects (in the database):
- dist(x,y) ∈ [0..1],  dist(x,x) = 0
- dist(x,y) = dist(y,x)
Note: distance calculation often requires substantial computational effort
How to restrict solution set to only the "most similar" objects:
- threshold dmax ... only objects t such that dist(t,q) ≤ dmax
- count k ... the k closest objects (k nearest neighbours)
BUT both of the above methods require knowing the distance between the query object and all objects in the DB
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [32/45]
❖ Similarity-based Retrieval (cont)
Naive approach to similarity-based retrieval
q = ...      // query object
dmax = ...   // distance threshold (ignored if 0)
knn = ...    // number of nearest neighbours (ignored if 0)
Dists = []
foreach tuple t in R {
   d = dist(t.val, q)
   insert (t.oid,d) into Dists   // keep Dists ordered by d
}
n = 0; Results = []
foreach (i,d) in Dists {
   if (dmax > 0 && d > dmax) break;
   if (knn > 0 && ++n > knn) break;
   insert (i,d) into Results
}
return Results;
Cost = fetch all r objects + compute distance() for each
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [33/45]
❖ Similarity-based Retrieval (cont)
For some applications, Cost(dist(x,y)) is comparable to Tr
⇒ computing dist(t.val,q) for every tuple t is infeasible.
To improve this ...
- compute feature vector to capture "critical" object properties
- store feature vectors "in parallel" with objects (cf. signatures)
- compute distance using feature vectors (not objects)
i.e. replace dist(t,q) by dist'(vec(t),vec(q)) in the previous algorithm.
Further optimisation: dimension-reduction to make vectors smaller
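For concreteness, a sketch of a typical dist' over n-dimensional feature vectors; Euclidean distance is one common choice (scaling the result into [0..1] is assumed to happen elsewhere):

#include <math.h>

/* Euclidean distance between two n-dimensional feature vectors */
double vecDist(const double *vx, const double *vy, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = vx[i] - vy[i];
        sum += diff * diff;
    }
    return sqrt(sum);
}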
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [34/45]
❖ Similarity-based Retrieval (cont)
Feature vectors ...
- often use multiple features, concatenated into single vector
- represent points in a very high-dimensional (vh-dim) space
Content of feature vectors depends on application ...
- image ... colour histogram (e.g. 100's of values/dimensions)
- music ... loudness/pitch/tone (e.g. 100's of values/dimensions)
- text ... term frequencies (e.g. 1000's of values/dimensions)
Query: feature vector representing one point in vh-dim space
Answer: list of objects "near to" query object in this space
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [35/45]
❖ Similarity-based Retrieval (cont)
Inputs to content-based similarity-retrieval:
- a database of r objects (obj1, obj2, ..., objr), plus associated ...
- r n-dimensional feature vectors (vobj1, vobj2, ..., vobjr)
- a query image q with an associated n-dimensional vector vq
- a distance measure D(vi,vj) : [0..1)   (D=0 → vi=vj)
Outputs from content-based similarity-retrieval:
- a list of the k nearest objects in the database [a1, a2, ..., ak]
- ordered by distance: D(va1,vq) ≤ D(va2,vq) ≤ ... ≤ D(vak,vq)
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [36/45]
❖ Approaches to kNN Retrieval
Partition-based
- use auxiliary data structure to identify candidates
- space/data-partitioning methods: e.g. k-d-B-tree, R-tree, ...
- unfortunately, such methods "fail" when #dims > 10..20
- an absolute upper bound on d, beyond which a linear scan is always best, is d = 610
Approximation-based
- use approximating data structure to identify candidates
- signatures: VA-files
- projections: iDistance, LSH, MedRank, CurveIX, Pyramid
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [37/45]
❖ Approaches to kNN Retrieval (cont)
Above approaches try to reduce number of objects considered.
- cf. indexes in relational databases
Other optimisations to make kNN retrieval faster
- reduce I/O by reducing size of vectors (compression, d-reduction)
- reduce I/O by placing "similar" records together (clustering)
- reduce I/O by remembering previous pages (caching)
- reduce cpu by making distance computation faster
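As one example of the last point, a standard trick (not specific to any system discussed here) is partial-distance early exit: abandon a distance computation as soon as the running sum exceeds the current k-th best distance:

#include <math.h>

/* Squared Euclidean distance with early exit: valid because every
 * term added is non-negative, so the sum can only grow. */
double vecDistBounded(const double *vx, const double *vy, int n, double bestSq)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = vx[i] - vy[i];
        sum += diff * diff;
        if (sum > bestSq)
            return INFINITY;   /* cannot beat the current k-th best */
    }
    return sum;
}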
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [38/45]
❖ Similarity Retrieval in PostgreSQL
PostgreSQL has always supported simple "similarity" on strings
select * from Students where name like '%oo%';
select * from Students where name ~ '[Ss]mit';
Also provides support for ranked similarity on text values
- using the tsvector data type (stemmed, stopped feature vector for text)
- using the tsquery data type (stemmed, stopped feature vector for strings)
- using the @@ similarity operator
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [39/45]
❖ Similarity Retrieval in PostgreSQL (cont)
Example of PostgreSQL text retrieval:
create table Docs
( id integer, title text, body text );
alter table Docs add column features tsvector;
update Docs set features =
to_tsvector('english', title||' '||body);
select title, ts_rank(d.features, query) as rank
from Docs d,
to_tsquery('potter|(roger&rabbit)') as query
where query @@ d.features
order by rank desc
limit 10;
For more details, see PostgreSQL documentation, Chapter 12.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [40/45]
❖ Signature-based Selection
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [41/45]
❖ Indexing with Signatures
Signature-based indexing:
- designed for pmr queries (conjunction of equalities)
- does not try to achieve better than O(n) performance
- attempts to provide an "efficient" linear scan
Each tuple is associated with a signature
- a compact (lossy) descriptor for the tuple
- formed by combining information from multiple attributes
- stored in a signature file, parallel to data file
Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [42/45]
❖ Indexing with Signatures (cont)
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement ⇒ can use with other indexing.
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [43/45]
A signature "summarises" the data in one tuple.
A tuple consists of n attribute values A1 .. An.
A codeword cw(Ai) is
- a bit-string, m bits long, where k bits are set to 1 (k ≪ m)
- derived from the value of a single attribute Ai
A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
- could combine by overlaying (or concatenating) codewords
- aim to have roughly half of the bits set to 1
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [44/45]
Generating a k-in-m codeword for attribute Ai
typedef unsigned int bits;    /* assumes m <= 32; use a wider type for larger m */

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;               /* count of bits set so far */
   bits cword = 0;               /* codeword under construction */
   srandom(hash(attr_value));    /* seed PRNG from the attribute value */
   while (nbits < k) {
      int i = random() % m;      /* choose a random bit position */
      if (((1 << i) & cword) == 0) {
         cword |= (1 << i);      /* set bit i if not already set */
         nbits++;
      }
   }
   return cword;                 /* m-bit codeword with exactly k 1-bits */
}
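To complete the picture, a sketch (same m ≤ 32 assumption, illustrative names) of overlaying codewords into a tuple signature and pre-filtering with it, using codeword() from above:

/* Overlay (OR together) the codewords of all n attribute values */
bits makeSignature(char *attrs[], int n, int m, int k)
{
    bits sig = 0;
    for (int i = 0; i < n; i++)
        sig |= codeword(attrs[i], m, k);
    return sig;
}

/* A tuple can match the query only if every 1-bit of the query
 * signature is also set in the tuple signature.  False positives
 * are possible (signatures are lossy), so candidate tuples must
 * still be fetched and checked against the actual query. */
int maybeMatches(bits querySig, bits tupleSig)
{
    return (querySig & tupleSig) == querySig;
}

A query signature is formed the same way, from just the attributes with known values.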
COMP9315 24T1 ♢ Week 5 Thursday Lecture ♢ [45/45]
Produced: 14 Mar 2024