❖ N-dimensional Queries |
Have looked at one-dimensional queries, e.g.
select * from R where a = K
select * from R where a between Lo and Hi
and at heaps, hashing and indexing as ways of implementing them efficiently.
Now consider techniques for efficient multi-dimensional queries.
Compared to 1-d queries, multi-dimensional queries are more complex to support efficiently, since the selection condition may involve any combination of the relation's attributes.
❖ Operations for Nd Select |
N-dimensional select queries = condition on ≥1 attributes.
select * from Employees where job = 'Manager' and gender = 'M';
select * from Employees where age between 20 and 50 and salary between 40000 and 60000;
❖ N-d Selection via Heaps |
Heap files can handle pmr (partial-match retrieval) or space (range) queries using the standard scan method:
// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
    buf = getPage(file(r), p);
    for (i = 0; i < nTuples(buf); i++) {
        t = getTuple(buf,i);
        if (matches(t,C))
            add t to result set
    }
}
Cost_pmr = Cost_space = b
❖ N-d Selection via Multiple Indexes |
DBMSs already support building multiple indexes on a table.
Which indexes to build depends on which queries are asked.
create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);
But more indexes ⇒ space + update overheads.
❖ N-d Queries and Indexes |
Generalised view of pmr and space queries:
select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
pmr: all op_i are equality tests.  space: some op_i are range tests.
Possible approaches to handling such queries ...
❖ N-d Queries and Indexes (cont) |
If using just one of several indexes, which one to use?
select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
The one with the best selectivity for a_i op_i C_i (i.e. fewest matching tuples)
Factors determining selectivity of a_i op_i C_i
❖ N-d Queries and Indexes (cont) |
Implementing selection using one of several indices:
// Query: select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
// choose a_i with best selectivity
TupleIDs = IndexLookup(R, a_i, op_i, C_i)
// gives { tid1, tid2, ...} for tuples satisfying a_i op_i C_i
PageIDs = { }
foreach tid in TupleIDs {
    PageIDs = PageIDs ∪ {pageOf(tid)}
}
// PageIDs = a set of b_qix page numbers
...
Cost = Cost_index + b_qix   (some pages do not contain answers, so b_qix > b_q)
DBMSs typically maintain statistics to assist with determining selectivity
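For illustration, a minimal sketch (catalog fields and names are assumptions) of using such statistics to pick the most selective equality condition, under a uniform-distribution assumption:

    // Hypothetical catalog statistics for one attribute
    typedef struct {
        char *attrName;
        int   nDistinct;   // number of distinct values in this attribute
    } AttrStats;

    // Expected matches for an equality test = r / nDistinct (uniformity);
    // return index of the attribute with the fewest expected matches
    int bestEqualityAttr(AttrStats stats[], int n, int r)
    {
        int best = 0;
        double bestEst = (double)r / stats[0].nDistinct;
        for (int i = 1; i < n; i++) {
            double est = (double)r / stats[i].nDistinct;
            if (est < bestEst) { bestEst = est; best = i; }
        }
        return best;
    }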
❖ N-d Queries and Indexes (cont) |
Implementing selection using multiple indices:
// Query: select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
// assumes an index on at least a_1
TupleIDs = IndexLookup(R, a_1, op_1, C_1)
foreach attribute a_i with an index {
    tids = IndexLookup(R, a_i, op_i, C_i)
    TupleIDs = TupleIDs ∩ tids
}
PageIDs = { }
foreach tid in TupleIDs {
    PageIDs = PageIDs ∪ {pageOf(tid)}
}
// PageIDs = a set of b_q page numbers
...
Cost = k·Cost_index + b_q   (assuming indexes on k of the n attributes)
❖ Bitmap Indexes |
Alternative index structure, focussing on sets of tuples:
Index contains bit-strings of r bits, one for each value (or value range); bit j is set iff tuple j has that value
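As a minimal illustration (names assumed, and r limited to one machine word), the bit-strings for one attribute could be built like this:

    typedef unsigned int Bits;   // one bit per tuple; assumes r <= 32 here

    // Build one bit-string per distinct attribute value:
    // bit t of maps[i] is set iff tuple t has value values[i]
    void buildBitmaps(int tupleVals[], int r, int values[], int v, Bits maps[])
    {
        for (int i = 0; i < v; i++)
            maps[i] = 0;
        for (int t = 0; t < r; t++)
            for (int i = 0; i < v; i++)
                if (tupleVals[t] == values[i])
                    maps[i] |= (1u << t);   // tuple t matches value i
    }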
❖ Bitmap Indexes (cont) |
Answering queries using bitmap index:
Matches = AllOnes(r)
foreach attribute A with index {
    // select i'th bit-string for attribute A
    // based on value associated with A in WHERE
    Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
    if (Matches[i] == 0) continue;
    Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
    P = getPage(pid)
    extract matching tuples from P
}
❖ Bitmap Indexes (cont) |
Storage costs for bitmap indexes:
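As a rough worked example (numbers assumed): for r = 10^6 tuples, an attribute with 10 distinct values needs 10 bit-strings of 10^6 bits each, i.e. 10 × 122KB ≈ 1.2MB; space grows with r and with the number of values/ranges indexed per attribute.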
❖ Hashing and pmr |
For a pmr query like
select * from R where a_1 = C_1 and ... and a_n = C_n
❖ Hashing and pmr (cont) |
Multi-attribute hashing parameters: a file of b = 2^d pages; each attribute a_i contributes d_i bits of its hash to the d-bit page address (Σ d_i = d); a choice vector specifies which bit of which attribute's hash supplies each address bit.
❖ MA.Hashing Example |
Consider relation Deposit(branch,acctNo,name,amount)
Assume a small data file with 8 main data pages (plus overflows).
Hash parameters:  d=3,  d_1=1,  d_2=1,  d_3=1,  d_4=0
Note that we ignore the amount attribute (d_4 = 0).
Assumes that nobody will want to ask queries like
select * from Deposit where amount=533
Choice vector is designed taking expected queries into account.
❖ MA.Hashing Example (cont) |
Choice vector: specifies, for each of the d page-address bits, which attribute and which bit of that attribute's hash value to use.
This choice vector tells us how to assemble a page address from the hash values of branch, acctNo and name.
❖ MA.Hashing Example (cont) |
Consider the tuple:
branch | acctNo | name | amount |
Downtown | 101 | Johnston | 512 |
Hash value (page address) is computed by taking one bit from each of hash(branch), hash(acctNo) and hash(name), as directed by the choice vector.
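For illustration, with hypothetical hash values: suppose hash('Downtown') has bit 0 = 1, hash('101') has bit 0 = 0, hash('Johnston') has bit 0 = 1, and the choice vector maps bit 0 of branch, acctNo and name to page-address bits 0, 1 and 2 respectively. The page address is then 101 (binary, bits 2..0) = page 5.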
❖ MA.Hashing Hash Functions |
Auxiliary definitions:
#define MaxHashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value h
#define bit(i,h) (((h) & (1 << (i))) >> (i))

// choice vector elems
typedef struct { int attr; int bit; } CVelem;
typedef CVelem ChoiceVec[MaxHashSize];

// hash function for individual attributes
HashVal hash_any(char *val) { ... }
❖ MA.Hashing Hash Functions (cont) |
Produce combined d-bit hash value for tuple t :
HashVal hash(Tuple t, ChoiceVec cv, int d)
{
    HashVal h[nAttr(t)+1];   // hash for each attr
    HashVal res = 0, oneBit;
    int i, a, b;

    for (i = 1; i <= nAttr(t); i++)
        h[i] = hash_any(attrVal(t,i));
    for (i = 0; i < d; i++) {
        a = cv[i].attr;
        b = cv[i].bit;
        oneBit = bit(b, h[a]);
        res = res | (oneBit << i);
    }
    return res;
}
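For illustration, a usage sketch with a hypothetical choice vector for the Deposit example (d = 3; one bit each from branch, acctNo and name):

    // hypothetical: attrs numbered 1..4 = (branch, acctNo, name, amount)
    ChoiceVec cv = {
        {1, 0},   // address bit 0 = bit 0 of hash(branch)
        {2, 0},   // address bit 1 = bit 0 of hash(acctNo)
        {3, 0},   // address bit 2 = bit 0 of hash(name)
    };
    ...
    HashVal pid = hash(t, cv, 3);   // 3-bit page address (0..7) for tuple t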
❖ Queries with MA.Hashing |
In a partial match query, some attribute values are known and the rest are unspecified, e.g.
select amount from Deposit where branch = 'Brighton' and name = 'Green'
for which we use the shorthand (Brighton, ?, Green, ?)
❖ Queries with MA.Hashing (cont) |
Consider query: select amount from Deposit where name='Green'
Matching tuples must be in pages: 100, 101, 110, 111
(the address bit determined by name is fixed; the two bits for the unspecified attributes take all possible values)
❖ MA.Hashing Query Algorithm |
// Builds the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars += d[i]
    }
}
...
❖ MA.Hashing Query Algorithm (cont) |
...
// Use the partial hash to find candidate pages
r = openRelation("R",READ);
for (i = 0; i < 2^nstars; i++) {
    pid = composite hash
    replace *'s in pid using i and choice vector
    Buf = readPage(file(r), pid);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}
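A compilable sketch (representation assumed: the known bits in val, a mask with 1-bits at the * positions) of the *-substitution step:

    #include <stdio.h>

    typedef unsigned int Bits;

    // Spread the low bits of i into the positions marked in starMask
    Bits fillStars(Bits val, Bits starMask, Bits i, int d)
    {
        Bits pid = val;
        for (int b = 0; b < d; b++) {
            if (starMask & (1u << b)) {
                pid |= (i & 1u) << b;   // next bit of i fills this * position
                i >>= 1;
            }
        }
        return pid;
    }

    int main(void)
    {
        Bits val = 4;        // 100 : the known (name) bit
        Bits starMask = 3;   // 011 : the two unknown bits
        int  d = 3, nstars = 2;
        for (Bits i = 0; i < (1u << nstars); i++)
            printf("candidate page %u\n", fillStars(val, starMask, i, d));
        return 0;   // prints pages 4,5,6,7 = 100,101,110,111
    }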
❖ Query Cost for MA.Hashing |
Multi-attribute hashing handles a range of query types, e.g.
select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7
A relation with n attributes has 2^n different query types.
Different query types have different costs (different numbers of *'s in the partial hash value).
Cost(Q) = 2^s, where s = Σ_{i∉Q} d_i   (alternatively, Cost(Q) = Π_{i∉Q} 2^{d_i})
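E.g. in the Deposit example (d_1 = d_2 = d_3 = 1, d_4 = 0), a query specifying only name has s = d_1 + d_2 + d_4 = 2, so Cost(Q) = 2^2 = 4, matching the four candidate pages seen earlier.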
Query distribution gives probability pQ of asking each query type Q.
❖ Query Cost for MA.Hashing (cont) |
Min query cost occurs when all attributes are used in query
Min Cost_pmr = 1
Max query cost occurs when no attributes are specified
Max Cost_pmr = 2^d = b
Average cost is given by weighted sum over all query types:
Avg Cost_pmr = Σ_Q p_Q × Π_{i∉Q} 2^{d_i}
Aim to minimise the weighted average query cost over possible query types
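For illustration, with an assumed query distribution: if queries on name alone (cost 4) have p_Q = 0.75 and fully-specified queries (cost 1) have p_Q = 0.25, then Avg Cost_pmr = 0.75×4 + 0.25×1 = 3.25 page reads.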
❖ Optimising MA.Hashing Cost |
For a given application, it is useful to minimise Cost_pmr.
Can be achieved by choosing appropriate values for the d_i (i.e. by designing the choice vector)
Heuristics: e.g. assign more hash bits (larger d_i) to attributes that appear most often in queries.
This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)
❖ Multi-dimensional Tree Indexes |
Many multi-dimensional tree index structures have been developed over the last 20 years, arising from a range of problem areas.
Consider three popular schemes: kd-trees, Quad-trees, R-trees.
Example data for multi-d trees is based on the following relation:
create table Rel (
    X char(1) check (X between 'a' and 'z'),
    Y integer check (Y between 0 and 9)
);
❖ Multi-dimensional Tree Indexes (cont) |
Example tuples:
Rel('a',1) Rel('a',5) Rel('b',2) Rel('d',1) Rel('d',2) Rel('d',4) Rel('d',8) Rel('g',3) Rel('j',7) Rel('m',1) Rel('r',5) Rel('z',9)
The tuple-space for the above tuples:
❖ kd-Trees |
kd-trees are multi-way search trees where each level of the tree partitions the tuple space on a different single attribute.
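A minimal sketch (field names are assumptions) of a node structure consistent with the search algorithm below:

    // A kd-tree node; leaves refer to data pages, internal nodes
    // hold key values partitioning this level's attribute
    typedef struct KdNode {
        int  isDataPage;         // non-zero => leaf referring to a data page
        int  pageId;             // data page id (leaves only)
        int  nKeys;              // internal: number of partitioning keys
        int *keys;               // keys[0..nKeys-1], sorted
        struct KdNode **child;   // child[0..nKeys], one subtree per key range
    } KdNode;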
❖ Searching in kd-Trees |
// Started by Search(Q, R, 0, kdTreeRoot)
Search(Query Q, Relation R, Level L, Node N)
{
    if (isDataPage(N)) {
        Buf = getPage(fileOf(R), idOf(N))
        check Buf for matching tuples
    } else {
        a = attrLev[L]   // attribute partitioned at this level
        if (!hasValue(Q,a))
            nextNodes = all children of N
        else {
            val = getAttr(Q,a)
            nextNodes = find(N,Q,a,val)
        }
        for each C in nextNodes
            Search(Q, R, L+1, C)
    }
}
❖ Quad Trees |
Quad trees use regular, disjoint partitioning of tuple space.
❖ Quad Trees (cont) |
Basis for the partitioning: recursively divide the tuple space into four equal quadrants (NW, NE, SW, SE), subdividing any quadrant that holds too many tuples.
❖ Quad Trees (cont) |
The previous partitioning gives this tree structure, e.g.
In this and the following examples, we give the coordinates of the top-left and bottom-right corners of each region.
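A minimal sketch (names assumed) of a quad-tree node, with one subtree per quadrant:

    typedef struct { int x, y; } Point;

    // A quad-tree node covering the region (topLeft, botRight);
    // internal nodes split it into four equal, disjoint quadrants
    typedef struct QuadNode {
        Point topLeft, botRight;
        int   pageId;                         // data page id, if leaf
        struct QuadNode *NW, *NE, *SW, *SE;   // NULL for leaves
    } QuadNode;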
❖ Searching in Quad-tree |
Space query example:
Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).
❖ R-Trees |
R-trees use a flexible, overlapping partitioning of tuple space.
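A minimal sketch (field names and capacity are assumptions) of an R-tree node, where each entry pairs a minimum bounding rectangle (MBR) with a subtree:

    #define MAX_ENTRIES 4   // node capacity, kept small for illustration

    typedef struct { int x1, y1, x2, y2; } Rect;   // an MBR

    // Sibling MBRs may overlap, unlike quad-tree regions
    typedef struct RTreeNode {
        int isLeaf;
        int nEntries;
        struct {
            Rect mbr;                  // bounds everything below this entry
            struct RTreeNode *child;   // subtree, or object/page ref at leaves
        } entry[MAX_ENTRIES];
    } RTreeNode;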
❖ Insertion into R-tree |
Insertion of an object R occurs as follows: descend from the root, at each level choosing the subtree whose MBR needs least enlargement to include R; insert R into the chosen leaf; if the leaf overflows, split it and propagate MBR changes up the tree.
❖ Query with R-trees |
Designed to handle space queries and "where-am-I" queries.
"Where-am-I" query: find all regions containing a given point P:
❖ Multi-d Trees in PostgreSQL |
Up to version 8.2, PostgreSQL had a separate R-tree implementation
Superseded by GiST = Generalized Search Trees
GiST indexes parameterise: the data type, the search strategy, and the page-splitting strategy (e.g. the picksplit() method)
❖ Costs of Search in Multi-d Trees |
Difficult to determine cost precisely.
Best case: a pmr query where all attributes have known values ⇒ the search follows a single root-to-leaf path; Cost ≈ depth of tree + 1 data page
❖ Similarity Selection |
Relational selection is based on a boolean condition C: each tuple either satisfies C or does not, and all answers are equally "good".
❖ Similarity Selection (cont) |
Similarity selection is used in contexts where exact matching is inappropriate: the user gives a description or an example, and results are ranked by how closely they match.
❖ Example: Content-based Image Retrieval |
User supplies a description or sample of desired image.
System returns a ranked list of "matching" images from database.
❖ Example: Content-based Image Retrieval (cont) |
At the SQL level, this might appear as ...
-- relational matching
create view Sunset as
select image from MyPhotos
where  title = 'Pittwater Sunset' and taken = '2012-01-01';

-- similarity matching with threshold
create view SimilarSunsets as
select title, image from MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order by (image ~~ (select * from Sunset));
where ~~ is an imaginary image-similarity operator (returning a distance between two images).
❖ Similarity-based Retrieval |
Database contains media objects, but also tuples, e.g.

    id  : oid      -- object identifier
    val : bytea    -- the object data (bytea holds a BLOB = Binary Large OBject)
❖ Similarity-based Retrieval (cont) |
Similarity-based retrieval requires a distance measure dist(x,y) over pairs of objects (smaller distance = more similar).
How to restrict the solution set to only the "most similar" objects: via a distance threshold d_max, or by taking the k nearest neighbours (kNN).
❖ Similarity-based Retrieval (cont) |
Naive approach to similarity-based retrieval
q = ...      // query object
dmax = ...   // dmax > 0 => using threshold
knn = ...    // knn > 0  => using nearest-neighbours
Dists = []   // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists   // sorted on d
}
n = 0;  Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results     // sorted on d
}
return Results;
Cost = fetch all r objects + compute dist() for each
❖ Similarity-based Retrieval (cont) |
For some applications, Cost(dist(x,y)) is comparable to T_r (the cost of a page read)
⇒ computing dist(t.val,q) for every tuple t is prohibitively expensive.
To improve this ... compute distances over compact feature vectors extracted from the objects, rather than over the raw objects.
Further optimisation: dimension-reduction to make vectors smaller
❖ Similarity-based Retrieval (cont) |
Feature vectors represent each object as a point in a high-dimensional space.
Answer: a list of objects "near to" the query object in this space
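For illustration, a minimal sketch of a typical distance measure on feature vectors (Euclidean distance; the vector length DIM is an assumed parameter):

    #include <math.h>

    #define DIM 64   // feature-vector length, chosen for illustration

    // Euclidean distance between two feature vectors
    double dist(const double x[DIM], const double y[DIM])
    {
        double sum = 0.0;
        for (int i = 0; i < DIM; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sqrt(sum);
    }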
❖ Similarity-based Retrieval (cont) |
Inputs to content-based similarity-retrieval: a database of feature vectors, a query feature vector q, a distance measure dist(x,y), and either k (number of neighbours) or a threshold d_max.
❖ Approaches to kNN Retrieval |
Partition-based: use a multi-dimensional index structure (such as the trees above) so that only objects in partitions "near" the query need to be examined.
❖ Approaches to kNN Retrieval (cont) |
The above approaches try to reduce the number of objects considered.
❖ Similarity Retrieval in PostgreSQL |
PostgreSQL has always supported simple "similarity" on strings
-- for most SQL implementations
select * from Students where name like '%oo%';
-- and PostgreSQL-specific
select * from Students where name ~ '[Ss]mit';
Also provides support for ranked similarity on text values:
tsvector: a feature vector extracted from a text value (via to_tsvector())
tsquery: a query over such feature vectors (via to_tsquery())
@@: the match operator between a tsquery and a tsvector
❖ Similarity Retrieval in PostgreSQL (cont) |
Example of PostgreSQL text retrieval:
create table Docs ( id integer, title text, body text );

-- add column to hold document feature vectors
alter table Docs add column features tsvector;
update Docs set features = to_tsvector('english', title||' '||body);

-- ask query and get results in ranked order
select title, ts_rank(d.features, query) as rank
from   Docs d, to_tsquery('potter|(roger&rabbit)') as query
where  query @@ d.features
order  by rank desc limit 10;
For more details, see PostgreSQL documentation, Chapter 12.
❖ Indexing with Signatures |
Signature-based indexing:
Each tuple is associated with a signature
❖ Indexing with Signatures (cont) |
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement
❖ Signatures |
A signature "summarises" the data in one tuple
A tuple consists of n attribute values A_1 .. A_n
A codeword cw(A_i) is an m-bit value with exactly k bits set to 1, derived by hashing A_i
❖ Generating Codewords |
Generating a k-in-m codeword for attribute A_i:

#include <stdlib.h>

typedef unsigned int Bits;

Bits codeword(char *attr_value, int m, int k)
{
    int  nbits = 0;                  // count of set bits
    Bits cword = 0;                  // assumes m <= 32
    srandom(hash_any(attr_value));   // seed RNG with hash of value
    while (nbits < k) {              // until k bits set in cword
        int i = random() % m;        // random bit position
        if ((cword & (1 << i)) == 0) {
            cword |= (1 << i);       // set i'th bit
            nbits++;                 // note one more bit set
        }
    }
    return cword;   // m-bits with k 1-bits and m-k 0-bits
}
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (SIMC) indexing scheme, a tuple descriptor desc(t) is formed by bitwise OR-ing the codewords of all of the tuple's attribute values:

Bits desc = 0
for (i = 1; i <= n; i++) {
    Bits cw = codeword(A[i])
    desc = desc | cw
}
❖ SIMC Example |
Consider the following tuple (from bank deposit database)
Branch | AcctNo | Name | Amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 12, k = 2 )
Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(r) | 010011000111 |
❖ SIMC Queries |
To answer query q in SIMC
desc(q) is formed by OR of codewords for known attributes.
E.g. consider the query (Perryridge, ?, ?, ?)
Ai         | cw(Ai)
Perryridge | 010000000001
?          | 000000000000
?          | 000000000000
?          | 000000000000
desc(q)    | 010000000001
❖ SIMC Queries (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {}
for each descriptor D[i] in signature file {
    if (matches(D[i], desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ {pid}
    }
}
for each P in pagesToCheck {
    Buf = getPage(f,P)
    check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc)  ((rdesc & qdesc) == qdesc)
❖ Example SIMC Query |
Consider the query and the example database:
Query: (Perryridge,?,?,?) with desc(q) = 010000000001

Signature    | Deposit Record
100101001001 | (Brighton,217,Green,750)
010011000111 | (Perryridge,102,Hayes,400)
101001001001 | (Downtown,101,Johnston,512)
101100000011 | (Mianus,215,Smith,700)
010101010101 | (Clearview,117,Throggs,295)
100101010011 | (Redwood,222,Lindsay,695)
Gives two matches: one true match (Perryridge,102,Hayes,400) and one false match (Clearview,117,Throggs,295).
❖ SIMC Parameters |
False match probability p_F = likelihood of a false match
How to reduce likelihood of false matches?
Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased hash collisions.
❖ SIMC Parameters (cont) |
How to determine "optimal" m and k?
m = (1/log_e 2)^2 × n × log_e(1/p_F)
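Applying the formula with assumed numbers: for n = 4 attributes per tuple and a target p_F = 10^-3, m = (1/log_e 2)^2 × 4 × log_e(1000) ≈ 2.08 × 4 × 6.91 ≈ 58 bits, so a 64-bit descriptor suffices (consistent with the m = 64 example below).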
❖ Query Cost for SIMC |
Cost to answer pmr query: Cost_pmr = b_D + b_q
E.g. m = 64 bits = 8 bytes, B = 8192 ⇒ c_D = 8192/8 = 1024 descriptors/page; r = 10^4 ⇒ b_D = ⌈10^4/1024⌉ = 10
b_q includes pages with the r_q matching tuples and the r_F false matches
Expected false matches = r_F = (r - r_q)·p_F ≅ r·p_F if r_q ≪ r
E.g. Worst b_q = r_q + r_F,  Best b_q = 1,  Avg b_q = ⌈b·(r_q + r_F)/r⌉
❖ Page-level SIMC |
SIMC has one descriptor per tuple ... potentially inefficient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor (PD) (clearly larger than tuple descriptor):
Typically, pages are 1..8KB ⇒ 1..9 PD/page (NPD).
❖ Bit-sliced SIMC (cont) |
At query time:

matches = ~0   // all ones
for each bit i set to 1 in desc(q) {
    slice = fetch bit-slice i
    matches = matches & slice
}
for each bit i set to 1 in matches {
    fetch page i
    scan page for matching records
}
Effective because desc(q) typically has fewer than half of its bits set to 1, so only a few bit-slices need to be fetched.