Selection on Multiple Attributes
Multi-dimensional (Nd) Selection
Operations for Nd Select | 2/152 |
N-dimensional select queries = condition on several attributes.
select * from Employees where job = 'Manager' and gender = 'M';
select * from Employees where age > 55 and dept = 'Sales';
Tuple Space | 3/152 |
One view of N-dimensional selection on a relation R...
Heaps | 4/152 |
Heap files can handle pmr or space queries using the standard method:
// select * from R where Cond
f = openFile(fileName("R"), READ);
for (p = 0; p < nPages(f); p++) {
   buf = getPage(f, p);
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf, i);
      if (matches(tup, Cond))
         add tup to result set
   }
}
Cost_pmr = Cost_space = b (i.e. O(n); every page must be read, so this is both the best and the worst case)
Multiple Indexes | 5/152 |
DBMSs already support building multiple indexes on a table.
Which indexes to build, depends on which queries are asked.
If we don't know queries, index all attribute subsets:
create table R (a int, b int, c int);

create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);
... Multiple Indexes | 6/152 |
Characteristics of indexes:
... Multiple Indexes | 7/152 |
Generalised view of pmr and space queries:
select * from R where a1 op1 C1 and a2 op2 C2 and ... and an opn Cn
where, for pmr, all op_i are equality tests.
Possible approaches to handling such queries ...
Selection using One Index | 8/152 |
Using index on one attribute:
... Selection using One Index | 9/152 |
Abstract algorithm:
// select * from R where Cond
// Cond = a1 op1 C1 and a2 op2 C2 and a3 op3 C3 ...

Tids = IndexLookup(R, a[i], op[i], C[i])
       // gives { tid1, tid2, ...} for tuples satisfying ai opi Ci
Pages = { }
foreach tid in Tids {
   if (pageOf(tid) ∉ Pages)
      Pages = Pages ∪ {pageOf(tid)}
}
f = openFile(fileName("R"), READ)
foreach page in Pages {
   Buf = getPage(f, page)
   foreach tup in Buf {
      if (matches(tup, Cond))   // re-check all conditions
         add tup to Results
   }
}
... Selection using One Index | 10/152 |
Cost of this approach to selection:

Cost = (cost of index lookup for ai opi Ci) + (cost of reading each page that holds a candidate tuple)

The index resolves only the single condition ai opi Ci; the full condition is then re-checked on the fetched tuples via matches(tup, Cond).
Selection using Many Indexes | 11/152 |
Using indexes on several attributes:
... Selection using Many Indexes | 12/152 |
Abstract algorithm:
// select * from R where a1 op1 C1 and a2 op2 C2 ...

for (i = 1; i <= #QueryTerms; i++) {
   if (no index on a[i]) continue
   Tids = IndexLookup(R, a[i], op[i], C[i])
   PageSets[i] = { }
   foreach tid in Tids {
      if (pageOf(tid) ∉ PageSets[i])
         PageSets[i] = PageSets[i] ∪ {pageOf(tid)}
   }
}
// PageSets[i] contains pages with tuples matching (ai opi Ci)
...
... Selection using Many Indexes | 13/152 |
...
// compute intersection of PageSets[i]
Pages = PageSets[1]
for (i = 2; i <= #QueryTerms; i++) {
   Pages = Pages ∩ PageSets[i]
}
// finally ... fetch data pages
f = openFile(fileName("R"), READ)
foreach page in Pages {
   Buf = getPage(f, page)   // this page has at least 1 match on each indexed condition
   foreach tup in Buf {
      if (matches(tup, (a1 op1 C1 and a2 op2 C2 ...)))
         add tup to Results
   }
}
Bitmap Indexes | 14/152 |
A bitmap index assists computation of result sets in pmr.
... Bitmap Indexes | 15/152 |
Structure of bitmap index:
... Bitmap Indexes | 16/152 |
Consider using bitmap index to solve query:
select * from Employees where dept='Sales' and gender='M'
Method:
tSales = bitmap[dept, 'Sales']
tMales = bitmap[gender, 'M']
matches = tSales & tMales
for (i in 0..r-1) {
   if (matches[i] == 1) {
      tid = slotToTupleId(i)
      fetch tuple t via TupleId tid
   }
}
... Bitmap Indexes | 17/152 |
Also useful to have a file of tids, ordered so that the i-th tid corresponds to the i-th bit position in each bit-string.
... Bitmap Indexes | 18/152 |
Answering queries using bitmap index:
Matches = AllOnes(r)
foreach attribute A with index {
   // select the i'th bit-string for attribute A,
   // based on the value associated with A in the WHERE clause
   Matches = Matches & Bitmaps[A][i]
}
// Matches contains a 1-bit for each matching tuple
foreach i in 0..r-1 {
   if (Matches[i] == 0) continue
   Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
   P = getPage(pid)
   extract matching tuples from P
}
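To make the bit-string operations concrete, here is a minimal self-contained C sketch of the intersection step, using one 64-bit word per 64 tuple slots; the bitmap contents and sizes are invented for illustration.

#include <stdint.h>
#include <stdio.h>

#define NTUPLES 64                     // r: number of tuple slots (toy value)
#define NWORDS  ((NTUPLES + 63) / 64)  // 64-bit words per bitmap

// intersect two attribute-value bitmaps word by word
void bitmapAnd(const uint64_t *a, const uint64_t *b, uint64_t *res) {
    for (int w = 0; w < NWORDS; w++)
        res[w] = a[w] & b[w];
}

int main(void) {
    // invented bitmaps: bit i is set iff tuple slot i has that value
    uint64_t sales[NWORDS] = { 0xF0F00FULL };  // dept = 'Sales'
    uint64_t males[NWORDS] = { 0xF00F0FULL };  // gender = 'M'
    uint64_t matches[NWORDS];

    bitmapAnd(sales, males, matches);
    for (int i = 0; i < NTUPLES; i++)
        if (matches[i / 64] & (1ULL << (i % 64)))
            printf("tuple slot %d matches\n", i);  // then fetch via Tids[i]
    return 0;
}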
... Bitmap Indexes | 19/152 |
Costs associated with bitmap indexes ...
Storage costs:
N-dimensional Hashing
Hashing and pmr | 21/152 |
Consider these pmr queries on a table hashed on salary:

Q1: select * from Employees where salary = 60000 and gender = 'M'
Q2: select * from Employees where dept = 'Sales' and salary = 60000
Q3: select * from Employees where salary > 60000 and dept = 'Sales'

Cost_Q1 = Cost_Q2 = 1+Ov (equality on the hash key salary), Cost_Q3 = b + b·Ov (range condition on the hash key, so hashing gives no help)
... Hashing and pmr | 22/152 |
Problem: hashing only effective if hash key used in query.
No problem if all queries involve conditions like key=val
But frequently tables are accessed in multiple ways.
Can we devise a hashing scheme that "involves" all attributes?
Yes, using multi-attribute hashing (mah) ...
... Hashing and pmr | 23/152 |
Multi-attribute hashing ...
... Hashing and pmr | 24/152 |
Multi-attribute hashing parameters:
MA.Hashing Example | 25/152 |
Consider the Deposit(branch,acctNo,name,amount) table (+ hash values):
branch | h(B) | acctNo | h(Ac) | name | h(N) | amount |
Brighton | 1011 | 217 | 1001 | Green | 0101 | 750 |
Downtown | 0000 | 101 | 0101 | Johnson | 1101 | 512 |
Mianus | 1101 | 215 | 1011 | Smith | 0001 | 700 |
Perryridge | 0100 | 102 | 0110 | Hayes | 0010 | 400 |
Redwood | 0110 | 222 | 1110 | Lindsay | 1000 | 695 |
Round Hill | 1111 | 305 | 0001 | Turner | 0110 | 350 |
Clearview | 1110 | 117 | 0101 | Throggs | 0110 | 295 |
Note that we ignore the amount attribute (it contributes no bits to the hash), on the assumption that queries like the following are never asked:

select * from Deposit where amount=533
... MA.Hashing Example | 26/152 |
Hash parameters: d=3, d1=1, d2=1, d3=1, d4=0
Choice vector: cv = ⟨(1,0), (2,0), (3,0)⟩

This choice vector tells us: bit 0 of the combined hash is bit 0 of h(branch), bit 1 is bit 0 of h(acctNo), and bit 2 is bit 0 of h(name); amount contributes no bits.
... MA.Hashing Example | 27/152 |
Consider the tuple:
branch | acctNo | name | amount |
Brighton | 217 | Green | 750 |
Hash value (page address) is computed by taking bit 0 of each attribute hash: branch gives bit 0 (h(Brighton)=1011 → 1), acctNo gives bit 1 (h(217)=1001 → 1), name gives bit 2 (h(Green)=0101 → 1), so the page address is 111.
... MA.Hashing Example | 28/152 |
Consider the tuple:
branch | acctNo | name | amount |
Downtown | 101 | Johnson | 512 |
Hash value (page address) is computed by taking bit 0 of each attribute hash: branch gives bit 0 (h(Downtown)=0000 → 0), acctNo gives bit 1 (h(101)=0101 → 1), name gives bit 2 (h(Johnson)=1101 → 1), so the page address is 110.
... MA.Hashing Example | 29/152 |
Consider the tuple:
branch | acctNo | name | amount |
Round Hill | 305 | Turner | 350 |
Hash value (page address) is computed by taking bit 0 of each attribute hash: branch gives bit 0 (h(Round Hill)=1111 → 1), acctNo gives bit 1 (h(305)=0001 → 1), name gives bit 2 (h(Turner)=0110 → 0), so the page address is 011.
... MA.Hashing Example | 30/152 |
Hash values for all tuples:
tuple | Hash |
(Brighton,217,Green,750) | 111 |
(Downtown,101,Johnson,512) | 110 |
(Mianus,215,Smith,700) | 111 |
(Perryridge,102,Hayes,400) | 000 |
(Redwood,222,Lindsay,695) | 000 |
(Round Hill,305,Turner,350) | 011 |
(Clearview,117,Throggs,295) | 010 |
... MA.Hashing Example | 31/152 |
If inserted, in order of appearance, into a database containing 8 pages:
MA.Hashing Hash Functions | 32/152 |
Auxiliary definitions:
#define HashSize 32
typedef unsigned int HashVal;

// extract the i'th bit of a hash value (as 0 or 1)
#define bit(i,h) (((h) >> (i)) & 1)

// choice vector elements
typedef struct { int attr; int bit; } CVelem;
typedef CVelem ChoiceVec[HashSize];

// compute hash for the i'th attribute of tuple t
HashVal hashAttr(Tuple t, int i) { ... }
... MA.Hashing Hash Functions | 33/152 |
Produce combined d-bit hash value for tuple tup:
HashVal hash(Tuple tup, ChoiceVec cv, int d)
{
   HashVal h[nAttr(tup)];   // hash value for each attribute
   HashVal res = 0, oneBit;
   int     i, attr, bpos;

   for (i = 0; i < nAttr(tup); i++)
      h[i] = hashAttr(tup, i);
   for (i = 0; i < d; i++) {
      attr = cv[i].attr;
      bpos = cv[i].bit;
      oneBit = bit(bpos, h[attr]);   // 0 or 1
      res = res | (oneBit << i);     // place it at position i of the result
   }
   return res;
}
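A self-contained sketch of the same combination step, with the example's per-attribute hash values hard-coded (the notes do not give the per-attribute hash function, and attribute numbers here are 0-based); it reproduces page address 111 for (Brighton,217,Green,750).

#include <stdio.h>

typedef unsigned int HashVal;
typedef struct { int attr; int bit; } CVelem;

// bit i of h, as 0 or 1
#define bit(i,h) (((h) >> (i)) & 1)

// combine d bits, chosen by the choice vector, into one page address
HashVal combine(HashVal h[], CVelem cv[], int d) {
    HashVal res = 0;
    for (int i = 0; i < d; i++)
        res |= bit(cv[i].bit, h[cv[i].attr]) << i;
    return res;
}

int main(void) {
    // per-attribute hash values for (Brighton,217,Green,750), from the table
    HashVal h[4] = { 0xB /*1011*/, 0x9 /*1001*/, 0x5 /*0101*/, 0x0 };
    // choice vector: result bit 0 <- bit 0 of h(branch),
    //                result bit 1 <- bit 0 of h(acctNo),
    //                result bit 2 <- bit 0 of h(name); amount contributes nothing
    CVelem cv[3] = { {0,0}, {1,0}, {2,0} };
    printf("page address = %u\n", combine(h, cv, 3));  // prints 7, i.e. 111
    return 0;
}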
Queries with MA.Hashing | 34/152 |
In a partial match query:
select amount from Deposit where branch = 'Brighton' and name = 'Green'
for which we use the shorthand (Brighton, ?, Green, ?)
... Queries with MA.Hashing | 35/152 |
If we try to form a composite hash for a query, we don't know values for some bits:
... Queries with MA.Hashing | 36/152 |
How to answer this query?
query | Hash |
(Brighton, ?, Green, ?) | 1*1 |

Any matching tuples must be in pages with addresses 101 or 111.
We need to read just these two pages.
... Queries with MA.Hashing | 37/152 |
Consider the query:
select amount from Deposit where name = 'Green'
Need to check pages: 100, 101, 110, 111.
(With conventional single-attribute hashing, we would have had to read all pages.)
... Queries with MA.Hashing | 38/152 |
Consider the query:
select amount from Deposit where branch='Brighton' and acctNo=217 and name='Green'
Need to check only page 111
MA.hashing Query Algorithm | 39/152 |
// Build the partial hash value (e.g. 10*0*1)
// Treats the query like a tuple with some attribute values missing

nstars = 0
for each attribute i in query Q {
   if (hasValue(Q,i)) {
      set d[i] bits in composite hash
         using choice vector and hash(Q,i)
   } else {
      set d[i] *'s in composite hash
         using choice vector
      nstars++
   }
}
...
... MA.hashing Query Algorithm | 40/152 |
...
// Use the partial hash to find candidate pages
f = openFile(fileName("R"), READ)
for (i = 0; i < 2**nstars; i++) {
   P = composite hash
   replace *'s in P using the bits of i and the choice vector
   Buf = getPage(f, P)
   for each tuple T in Buf {
      if (T satisfies pmr query)
         add T to results
   }
}
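A small self-contained C sketch of the star-substitution step: given the known bits and a mask of * positions, it enumerates all 2^nstars candidate page addresses. The names and bit-mask representation of the partial hash are assumptions for illustration.

#include <stdio.h>

// Enumerate all page addresses obtained by replacing the * positions
// (marked in starMask) of a d-bit partial hash with every bit pattern.
void candidatePages(unsigned partial, unsigned starMask, int d) {
    int starPos[32], nstars = 0;
    for (int j = 0; j < d; j++)
        if (starMask & (1u << j))
            starPos[nstars++] = j;
    for (unsigned i = 0; i < (1u << nstars); i++) {
        unsigned p = partial;
        for (int s = 0; s < nstars; s++)   // replace *'s using the bits of i
            if (i & (1u << s))
                p |= 1u << starPos[s];
        printf("check page %u\n", p);      // getPage(f, p) in real code
    }
}

int main(void) {
    // query (Brighton, ?, Green, ?): hash = 1*1 -> known bits 101, * at bit 1
    candidatePages(5 /* 101 */, 2 /* 010 */, 3);   // prints pages 5 and 7
    return 0;
}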
Query Cost for MA.Hashing | 41/152 |
Multi-attribute hashing handles a range of query types, e.g.
select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7
A relation with n attributes has 2^n different query types.
Different query types have different costs
(different numbers of *'s in the partial hash).
... Query Cost for MA.Hashing | 42/152 |
A query type Q may be defined via a set of attribute numbers,
indicating which attributes have values specified.
Example Query Types:
Type | Example query |
{1} | select * from R where a=1 |
{1,3} | select * from R where a=1 and c=3 |
{3,4} | select * from R where c=3 and d=4 |
{1,2,3,4} | select * from R where a=1 and b=2 and c=3 and d=4 |
{} | select * from R (open query) |
... Query Cost for MA.Hashing | 43/152 |
In practice, different query types are not equally likely.
E.g. maybe most of the queries asked are of the form
select * from R where a=1
with occasionally a query of the form
select * from R where b=3 and c=4
For each application, we can define a query distribution
which gives the probability pQ of asking each query type Q.
... Query Cost for MA.Hashing | 44/152 |
Consider a query of type Q with m attributes unspecified.
Each unspecified attribute A_i contributes d_i *'s.
Total number of *'s is s = ∑_{i∉Q} d_i.
⇒ Number of pages to read is 2^s = ∏_{i∉Q} 2^{d_i}.
If we assume no overflow pages, Cost(Q) = 2^s (where s is determined by Q).
... Query Cost for MA.Hashing | 45/152 |
Min query cost occurs when all attributes are specified in the query:
Min Cost_pmr = 1
Max query cost occurs when no attributes are specified:
Max Cost_pmr = 2^d = b
Average cost is given by a weighted sum over all query types:
Avg Cost_pmr = ∑_Q p_Q ∏_{i∉Q} 2^{d_i}
Since all query types are possible, aim to minimise average cost.
Optimising MA.Hashing Cost | 46/152 |
For a given application, the aim is to minimise the average Cost_pmr.
This can be achieved by choosing appropriate values for the d_i (i.e. by designing the choice vector).
Heuristics:
This is a combinatorial optimisation problem, and can be handled by standard optimisation techniques e.g. simulated annealing.
MA.Hashing Cost Example | 47/152 |
Our example database has 16 possible query types:
Query type | Cost | p_Q |
(?, ?, ?, ?) | 8 | 0 |
(br, ?, ?, ?) | 4 | 0.25 |
(?, ac, ?, ?) | 4 | 0 |
(br, ac, ?, ?) | 2 | 0 |
(?, ?, nm, ?) | 4 | 0 |
(br, ?, nm, ?) | 2 | 0 |
(?, ac, nm, ?) | 2 | 0.25 |
(br, ac, nm, ?) | 1 | 0 |
(?, ?, ?, amt) | 8 | 0 |
(br, ?, ?, amt) | 4 | 0 |
(?, ac, ?, amt) | 4 | 0 |
(br, ac, ?, amt) | 2 | 0 |
(?, ?, nm, amt) | 4 | 0 |
(br, ?, nm, amt) | 2 | 0.5 |
(?, ac, nm, amt) | 2 | 0 |
(br, ac, nm, amt) | 1 | 0 |
Cost values are based on the earlier choice vector (d_br = d_ac = d_nm = 1, d_amt = 0).
p_Q values can be determined by observing how the database is used.
... MA.Hashing Cost Example | 48/152 |
Consider r=10^6, N_r=100, b=10^4, d=14.
Attribute br occurs in 0.5+0.25 = 0.75 of queries
⇒ allocate many bits to it, e.g. d1=6.
Attribute nm occurs in 0.5+0.25 = 0.75 of queries
⇒ allocate many bits to it, e.g. d3=4.
Attribute amt occurs in 0.5 of queries
⇒ allocate fewer bits to it, e.g. d4=2.
Attribute ac occurs in 0.25 of queries
⇒ allocate the fewest bits to it, e.g. d2=2.
... MA.Hashing Cost Example | 49/152 |
With bits distributed as d1=6, d2=2, d3=4, d4=2:

Query type | Cost | p_Q |
(br, ?, ?, ?) | 2^8 = 256 | 0.25 |
(?, ac, nm, ?) | 2^8 = 256 | 0.25 |
(br, ?, nm, amt) | 2^2 = 4 | 0.5 |

Avg Cost = 0.5 × 2^2 + 0.25 × 2^8 + 0.25 × 2^8 = 130
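A quick sanity check of the weighted-sum formula for this distribution (probabilities and star counts taken from the table above):

#include <stdio.h>

int main(void) {
    // query distribution from the example: (probability, number of *'s s)
    double p[] = { 0.5, 0.25, 0.25 };
    int    s[] = { 2,   8,    8    };
    double avg = 0;
    for (int i = 0; i < 3; i++)
        avg += p[i] * (1 << s[i]);     // Cost(Q) = 2^s pages
    printf("average cost = %.0f pages\n", avg);   // prints 130
    return 0;
}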
MA.Hashing vs One-Attribute Hashing | 50/152 |
Multi-attribute hashing is basically just a different way of forming hash value for each tuple.
Hence, can be used with all flexible hashing schemes.
As the file grows, the di's are increased one by one, using the choice vector.
Difference between mah and standard hashing:
queries "suggest" a set of pages rather than a single page.
Cost for insertion, deletion, ordered scan is same as for underlying hashing scheme.
Grid Files | 51/152 |
The mah approach attempts to combine all attributes in the hash value.
Alternative approach is to keep hash values separate and use auxiliary structure to provide combination.
A generalisation of extendible hashing, called grid files provides this.
Ext.hashing uses a directory on one attribute.
Grid files use a k-dimensional directory to handle k attributes.
... Grid Files | 52/152 |
Selection with Grid Files | 53/152 |
Consider the following grid file directory:
for a table R with two attributes a and b
... Selection with Grid Files | 54/152 |
If d1=3 and d2=2 ...

select ... where a=C1 and b=C2 | one cell |
select ... where a=C1 | one row (four cells) |
select ... where b=C2 | one column (eight cells) |
Number of directory cells visited is similar to the number of pages visited in mah.
... Selection with Grid Files | 55/152 |
Execution of select ... where b=C1, where hash(C1) = ...001:
only the last d2 = 2 bits are used, so every cell in the matching column is examined (a ranges over 000 up to 110, 111).
... Selection with Grid Files | 56/152 |
Selection for pmr in a 2d grid file (nAttrs = 2):
for (i = 0; i < nAttr(Q); i++) {
   if (!hasValue(Q,i)) {
      // attribute unspecified: all cells in this dimension
      lo[i] = 0;  hi[i] = 2**d[i] - 1
   } else {
      val = bits(d[i], getAttr(Q,i))
      lo[i] = val;  hi[i] = val
   }
}
Pages = {}
for (i = lo[0]; i <= hi[0]; i++) {
   for (j = lo[1]; j <= hi[1]; j++) {
      if (!(grid[i][j] in Pages))
         add grid[i][j] to Pages
   }
}
for each P in Pages {
   Buf = getPage(f,P)
   check Buf for solutions
}
Query Cost for Grid Files | 57/152 |
Cost depends on:
... Query Cost for Grid Files | 58/152 |
For pmr with j < k attributes specified, a whole (k−j)-dimensional slab of directory cells must be examined: g_q directory pages are read, plus the data pages they reference.
Typically, 1 ≤ g_q ≤ 50 (i.e. grid directories are relatively small).
Other Operations on Grid Files | 59/152 |
Insertion has several cases:
As with all hashing schemes, grid file does not help ordered scan.
N-dimensional Tree Indexes
Multi-dimensional Tree Indexes | 61/152 |
We have seen how hashing can be extended to handle multiple attributes.
Can tree-based index structures be generalised as well?
Yes ... and a very large number of techniques have been suggested in research literature.
E.g. BSP-trees, Cell-trees, LSD-trees, SS-trees, TV-trees, X-trees, ...
We consider three popular examples: kd-trees, Quad-trees, R-trees.
Example 2d Relation | 62/152 |
As an example for multi-dimensional trees, consider the following relation:
create table Rel (
   A1 char(1),  -- 'a'..'z'
   A2 integer   -- 0..9
);
with example tuples:
Rel('a',1)   Rel('a',5)   Rel('b',2)   Rel('d',1)
Rel('d',2)   Rel('d',4)   Rel('d',8)   Rel('g',3)
Rel('j',7)   Rel('m',1)   Rel('r',5)   Rel('z',9)
... Example 2d Relation | 63/152 |
And the tuple-space for the above tuples:
kd-Trees | 64/152 |
kd-trees are multi-way search trees where each level of the tree partitions the tuple space on one attribute (cycling through the attributes level by level).
Example kd-Tree | 65/152 |
Each leaf node contains a reference to a bucket containing tuples from a small region of k-dimensional space.
... Example kd-Tree | 66/152 |
How this tree partitions the tuple space:
Searching in kd-Trees | 67/152 |
Consider first how to answer pmr queries ...
If all attributes have values specified: follow a single path from the root, at each level taking the subtree that matches the query's value for that level's attribute.
... Searching in kd-Trees | 68/152 |
As a tree traversal, the algorithm is best specified recursively.
Parameters for the search function: the query Q, the current level L (which determines the attribute being tested, via attrLev[L]), and the current node N. The initial call is:

Search(Q, 0, kdTreeRoot)
... Searching in kd-Trees | 69/152 |
Search(Query Q, Level L, Node N)
{
   if (isLeaf(N)) {
      Buf = getPage(f, N.page)
      check Buf for matching tuples
   } else {
      ai = attrLev[L]
      if (hasValue(Q, ai)) {
         Val = getAttr(Q, ai)
         newN = search N for Val
         Search(Q, L+1, newN)
      } else {
         for each child newN of N
            Search(Q, L+1, newN)
      }
   }
}
... Searching in kd-Trees | 70/152 |
For space queries ...
We require a change to the above Search
if (hasValue(Q, ai)) {
   Val = getAttr(Q, ai)
   newN = search N for Val
   Search(Q, L+1, newN)
}
becomes
if (isRange(Q, ai)) {
   for each node N on current level {
      for each value V in range(Q, ai) {
         newN = childOf(N, V)
         Search(Q, L+1, newN)
      }
   }
}
Query cost for kd-Trees | 71/152 |
Different kinds of queries (i.e. different sets of specified attributes) have different costs.
If we include all n attributes as indexes, then tree has depth ≥ n.
Depth of tree is also affected by branching factor at each node.
Note: kd-trees are not necessarily balanced.
... Query cost for kd-Trees | 72/152 |
Query examples:
Open query (?,?,?,?,?,...): no values to direct the search, so every subtree is visited ⇒ cost ≈ size of the whole tree.
Fully-specified query (a,b,c,d,e,...): a single root-to-leaf traversal ⇒ cost ≈ depth of tree + 1 data page.
Partially-specified query (a,?,c,?,e,...): one subtree followed at levels with a value, all subtrees at levels without ⇒ cost between the two extremes.
Insertion into kd-Trees | 73/152 |
Input: tuple with all values specified.
Method:
traverse tree to leaf node (reaching data page P1)
if (P1 is not full)
   insert tuple into P1
else {
   create new data page P2
   compute new partition point Part
   distribute tuples between P1 and P2 (using Part)
   create new internal node N with Part
   link N into tree; link N to P1 and P2
}
write any modified pages (data or index)
... Insertion into kd-Trees | 74/152 |
Example insertion into non-full data page:
... Insertion into kd-Trees | 75/152 |
Example insertion into full data page:
kd-B-Trees | 76/152 |
kd-B-trees: a disk-based index structure using kd-tree ideas.
Aims to have a tree which is balanced, with large page-sized nodes (cf. B-trees).
... kd-B-Trees | 77/152 |
Simple example kd-B-tree (max 2 kd-tree levels in each node):
... kd-B-Trees | 78/152 |
What this produces:
Selection in kd-B-Trees | 79/152 |
To answer a pmr query:
Pages = kdbSearch(query, root)
for each page P in Pages {
   Buf = getPage(f, P)
   scan Buf for matching tuples
}

kdbSearch(Query, Page)
{
   if (Page is a data page)
      Pages = [Page]
   else {
      Buf = getPage(kdbf, Page)
      search kd-tree in Buf using attributes from Query
      Pages = []
      for each indicated external subtree S {
         P = kdbSearch(Query, S)
         add P to Pages
      }
   }
   return Pages
}
... Selection in kd-B-Trees | 80/152 |
Size of kd-B-tree:
... Selection in kd-B-Trees | 81/152 |
If all attributes are specified (i.e. point search):
Cost_pmr = d + 1 (descend d levels of the kd-B-tree, then read one data page)
If some attributes unspecified ...
space queries can be handled similarly to pmr queries.
Insertion/Deletion in kd-B-Trees | 82/152 |
Conceptually, insertion is same as for kd-tree:
... Insertion/Deletion in kd-B-Trees | 83/152 |
Minimum insert cost is 1w (no split, no change to kd-tree)
Insert cost with split and kd-tree update: 3w
(need to write re-partitioned data pages and modified index page)
Worst case insert cost is high (tree/data re-organisation).
Deletion is straightforward (mark as deleted).
If data page occupancy falls too low, might consider merging pages and reorganising the kd-tree.
Quad Trees | 84/152 |
Quad trees use a regular, recursive partitioning of the k-dimensional tuple space.
At each level, each region is partitioned into 2^k non-overlapping hypercubes (quadrants in 2d).
For 2d case (others are much harder to visualise):
... Quad Trees | 85/152 |
Quad-tree on example 2d tuple space:
... Quad Trees | 86/152 |
Basis for the partitioning:
... Quad Trees | 87/152 |
The previous partitioning gives this tree structure, e.g.
... Quad Trees | 88/152 |
Each node of a k-dimensional quad-tree has 2^k children.
Quad-trees originally devised as memory based structures, for spatial data.
In this domain, k = 2 or k = 3 and branching is fairly small.
For index pages in a disk-based data structure:
Insertion into Quad Tree | 89/152 |
Inserting a new tuple into a 2d quad-tree indexed file involves:
traverse tree to leaf node (reaching data page P1)
if (P1 is not full)
   insert tuple into P1
else {
   create new data pages P2, P3, P4
   distribute tuples between P1 .. P4
   create new internal node N
   link N into tree; link N to P1 .. P4
}
write any modified pages (data or index)
... Insertion into Quad Tree | 90/152 |
Example:
Cost = traversal cost + update cost
     = (read ≥ 1 node page) + (write ≥ 1 data page)
... Insertion into Quad Tree | 91/152 |
Potential problems with quad-trees:
Query with Quad Tree | 92/152 |
For pmr query
... Query with Quad Tree | 93/152 |
For space query
... Query with Quad Tree | 94/152 |
Space query example:
Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).
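A minimal C sketch of the pruning test that drives such a traversal: compute the four quadrants of a region and descend only into those that overlap the query rectangle. The coordinates and types are invented for illustration; a real quad-tree would recurse over multiple levels, as in the example above.

#include <stdio.h>
#include <stdbool.h>

// an axis-aligned region of 2d tuple space
typedef struct { double x1, y1, x2, y2; } Region;

// does a quadrant possibly contain answers to a space query?
bool overlaps(Region r, Region q) {
    return r.x1 <= q.x2 && q.x1 <= r.x2 &&
           r.y1 <= q.y2 && q.y1 <= r.y2;
}

// the four quadrants of a region, split at its midpoint
void quadrants(Region r, Region out[4]) {
    double mx = (r.x1 + r.x2) / 2, my = (r.y1 + r.y2) / 2;
    out[0] = (Region){r.x1, my, mx, r.y2};   // NW
    out[1] = (Region){mx, my, r.x2, r.y2};   // NE
    out[2] = (Region){r.x1, r.y1, mx, my};   // SW
    out[3] = (Region){mx, r.y1, r.x2, my};   // SE
}

int main(void) {
    Region root = {0, 0, 16, 16}, query = {2, 9, 6, 14}, q4[4];
    const char *name[4] = {"NW", "NE", "SW", "SE"};
    quadrants(root, q4);
    for (int i = 0; i < 4; i++)        // descend only into overlapping ones
        if (overlaps(q4[i], query))
            printf("traverse %s\n", name[i]);
    return 0;
}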
R-Trees | 95/152 |
R-trees use a flexible, hierarchical partitioning of the k-dimensional tuple space.
... R-Trees | 96/152 |
Example 2d R-tree node (with two levels of children):
... R-Trees | 97/152 |
... R-Trees | 98/152 |
The R-tree was initially designed for indexing polygons
(via their minimum bounding rectangles (MBR))
It adapts naturally to tuple-space points (i.e. tuples)
(each MBR represents a region of the tuple-space)
Goals in constructing R-trees (cf. B-trees)
Query with R-trees | 99/152 |
Designed to handle space queries and "where-am-I" queries.
"Where-am-I" query: find all regions containing a given point P:
... Query with R-trees | 100/152 |
Algorithm for "where am I" query in 2d R-tree:
Regions = RtreeSearch(x, y, root)

RtreeSearch(x, y, Node)
{
   Regions = []
   if (leaf(Node)) {
      for each (x1,y1,x2,y2,P) in Node {
         if ((x1,y1,x2,y2) contains (x,y))
            add (x1,y1,x2,y2) to Regions
      }
   } else {
      for each (x1,y1,x2,y2,Child) in Node {
         if ((x1,y1,x2,y2) contains (x,y)) {
            Rs = RtreeSearch(x, y, Child)
            add Rs to Regions
         }
      }
   }
   return Regions
}
Insertion into R-tree | 101/152 |
Insertion of an object R occurs as follows:
... Insertion into R-tree | 102/152 |
Heuristics in R-tree insertion ...
Choose the child whose region requires the least expansion to contain R.
... Insertion into R-tree | 103/152 |
Quadratic split method to partition objects in S:
find the two objects a and b in S that would produce the largest combined MBR (the most wasteful pair)
place a in S1, place b in S2
for every other object c in S {
   s1 = current size of S1
   i1 = size of S1 with c added
   s2 = current size of S2
   i2 = size of S2 with c added
   if ((i1-s1) < (i2-s2))   // add c to the group that grows least
      add c to S1
   else
      add c to S2
}
R-tree variants | 104/152 |
R+ tree does not permit overlapping regions.
R-trees in PostgreSQL | 105/152 |
Up to version 8.2, PostgreSQL had a separate R-tree implementation (src/backend/access/rtree), e.g. rtget.c (scanning/retrieval) and rtree.c (insertion and splitting), where the function rtpicksplit() in rtree.c implemented the node-splitting strategy. From version 8.3, R-tree indexing has been provided via GiST.
GiST in PostgreSQL | 106/152 |
PostgreSQL provides an implementation of Generalized Search Trees (GiST).
Above discussion of tree-based indexes shows commonalities:
... GiST in PostgreSQL | 107/152 |
GiST trees have the following structural constraints:
Users are free to implement their own predicates p (and associated key methods) to produce their own "flavour" of search tree.
Details: src/backend/access/gist
Signature-based Selection
Indexing with Signatures | 109/152 |
Signature-based indexing: an "efficient" linear scan.
Why abandon sub-linear complexity? (e.g. O(1) for hashing)
Arising from work with high-d feature vectors in multimedia:
... Indexing with Signatures | 110/152 |
A signature is a compact (lossy) descriptor for a tuple.
Scanning the whole data file is too expensive, so we scan a (much smaller) file of signatures instead, and fetch only pages that may contain matching tuples.
Signature Files | 111/152 |
File organisation for signature indexing (two files)
(One signature slot per tuple slot in the data file; unused signature slots are zeroed)
... Signature Files | 112/152 |
Note that signatures do not determine tuple placement.
The only requirement is that the i-th signature slot corresponds to the i-th tuple slot in the data file.
Thus, signatures can be used in conjunction with other indexing schemes.
Signatures | 113/152 |
Signatures (descriptors)
... Signatures | 114/152 |
A tuple r consists of n attributes A1..An.
A codeword cw(Ai) is an m-bit value with exactly k bits set to 1, derived by hashing the value of Ai.
Generating Codewords | 115/152 |
A method for generating a k-in-m codeword for attribute Ai
seed = hash(A[i])     // seeds the PRNG
nbits = 0
codeword = 0          // an m-bit value, initially all zeroes
while (nbits < k) {
   b = random(0, m-1)
   if (bit b not set in codeword) {
      set bit b in codeword
      nbits++
   }
}
Requires bit-string operations and a pseudo-random number generator.
Superimposed Codewords (SIMC) | 116/152 |
In a superimposed codewords (simc) indexing scheme, the tuple descriptor desc(r) is formed by overlaying (bitwise OR) the codewords of all n attribute values:

desc(r) = cw(A1) OR cw(A2) OR ... OR cw(An)
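As a concrete illustration, here is a small self-contained C sketch that builds a SIMC descriptor by OR-ing k-in-m codewords. The string hash and PRNG are crude stand-ins (the notes do not specify them), so the bit patterns will not match the example tables.

#include <stdint.h>
#include <stdio.h>

#define M 12   // bits per descriptor
#define K 2    // bits set per codeword

// toy k-in-m codeword generator, seeded from the attribute value
uint32_t cw(const char *val) {
    uint32_t seed = 0, word = 0;
    int nbits = 0;
    for (const char *p = val; *p; p++)
        seed = seed * 31 + (uint32_t)*p;          // seed = hash(value)
    while (nbits < K) {
        seed = seed * 1103515245u + 12345u;       // PRNG step
        int i = (int)((seed >> 16) % M);          // candidate bit position
        if (!(word & (1u << i))) { word |= 1u << i; nbits++; }
    }
    return word;
}

int main(void) {
    // desc(r) = cw(A1) OR cw(A2) OR ... OR cw(An)
    const char *attrs[4] = { "Perryridge", "102", "Hayes", "400" };
    uint32_t desc = 0;
    for (int i = 0; i < 4; i++)
        desc |= cw(attrs[i]);
    printf("desc(r) = %03x\n", (unsigned)desc);
    return 0;
}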
SIMC Example | 117/152 |
Consider the following tuple (from a bank deposit database)
branch | acctNo | name | amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 12, k = 2)
Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(r) | 010011000111 |
... SIMC Example | 118/152 |
Consider a very small example database, with associated signatures:
Tuple | desc(r) |
(Brighton,217,Green,750) | 100101001001 |
(Clearview,117,Throggs,295) | 010101010101 |
(Downtown,101,Johnson,512) | 101001001001 |
(Mianus,215,Smith,700) | 101100000011 |
(Perryridge,102,Hayes,400) | 010011000111 |
(Redwood,222,Lindsay,695) | 100101010011 |
(Round Hill,305,Turner,350) | 011110111010 |
SIMC Queries | 119/152 |
To answer query q in SIMC
desc(q) is formed by bitwise OR of codewords for supplied attributes.
E.g. consider the query (Perryridge, ?, ?, ?)
Ai | cw(Ai) |
Perryridge | 010000000001 |
? | 000000000000 |
? | 000000000000 |
? | 000000000000 |
desc(q) | 010000000001 |
... SIMC Queries | 120/152 |
Once we have a query descriptor, we search the signature file:
for each descriptor D[i] in signature file {
   if (D[i] matches desc(q))
      mark tuple R[i] as a match
}
for each marked tuple R {
   fetch tuple R
}
The critical assumption to make this approach viable: the signature file is much smaller than the data file, so scanning signatures is far cheaper than scanning tuples.
... SIMC Queries | 121/152 |
What sort of signature comparison is required to answer the example query?
Clearly, any matching tuple must have the value "Perryridge" for A1.
Any tuple containing "Perryridge" must have these bits set: 010000000001.
So, checking whether each descriptor has these bits set, gives us the matches.
This can be generalised to multiple supplied values in the query:
... SIMC Queries | 122/152 |
Query/tuple descriptor matching can be implemented efficiently using logic operations on bit-strings.
Note that if bits(A) ⊆ bits(B) then A AND B = A.
Thus, the query/tuple descriptor matching can be implemented as:

matches(desc(r), desc(q)) ≡ ((desc(r) AND desc(q)) == desc(q))
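In C, the inclusion test is a one-liner over the descriptor words (the descriptor values below are taken from the running example):

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

// desc(r) matches desc(q) iff every 1-bit of desc(q) is also set in desc(r)
static bool matches(uint32_t descR, uint32_t descQ) {
    return (descR & descQ) == descQ;
}

int main(void) {
    uint32_t q = 0x401;                  // 010000000001 = desc(Perryridge,?,?,?)
    printf("%d\n", matches(0x4C7, q));   // 010011000111 -> 1 (match)
    printf("%d\n", matches(0x949, q));   // 100101001001 -> 0 (no match)
    return 0;
}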
Example SIMC Query | 123/152 |
Consider the query and the example database:
Query | desc(q) |
(Perryridge,?,?,?) | 010000000001 |

Tuple | desc(r) | match? |
(Brighton,217,Green,750) | 100101001001 | no |
(Clearview,117,Throggs,295) | 010101010101 | yes |
(Downtown,101,Johnson,512) | 101001001001 | no |
(Mianus,215,Smith,700) | 101100000011 | no |
(Perryridge,102,Hayes,400) | 010011000111 | yes |
(Redwood,222,Lindsay,695) | 100101010011 | no |
(Round Hill,305,Turner,350) | 011110111010 | no |
... Example SIMC Query | 124/152 |
The query has potential matches:
branch | acctNo | name | amount |
Clearview | 117 | Throggs | 295 |
Perryridge | 102 | Hayes | 400 |
The first is an example of a false match.
False matches are caused by:
collisions (different attribute values hashing to the same codeword), OR
overlap (the OR-ing of several codewords happening to set all the bits of the query descriptor).
False Matches | 125/152 |
Example: codewords for the true match and the false match.

(Perryridge,102,Hayes,400):

Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(r) | 010011000111 |

(Clearview,117,Throggs,295):

Ai | cw(Ai) |
Clearview | 010000010000 |
117 | 000100000001 |
Throggs | 000001000100 |
295 | 000100000100 |
desc(r) | 010101010101 |
... False Matches | 126/152 |
How to reduce likelihood of false matches?
Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased codeword collisions.
Thus, for SIMC schemes, there is the notion of choosing optimal m, k.
Design with SIMC | 127/152 |
SIMC schemes have the notion of a false match probability p_F.
As suggested above, p_F is affected by m, k and n (#attributes).
In designing a SIMC index for a data file, the aim is to choose the indexing parameters (m, k) to minimise the likelihood of false matches:

m = (1/ln 2)^2 × n × ln(1/p_F)
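A small C sketch evaluating this design formula, together with the standard companion formula k = (1/ln 2) × ln(1/p_F) for the codeword weight (that second formula comes from the usual analysis and is not stated in the notes); p_F and n are assumed values:

#include <stdio.h>
#include <math.h>

int main(void) {
    double pF = 1e-3;   // desired false-match probability (assumed)
    int    n  = 4;      // number of indexed attributes
    double ln2 = log(2.0);
    double m = (1.0 / (ln2 * ln2)) * n * log(1.0 / pF); // descriptor width
    double k = (1.0 / ln2) * log(1.0 / pF);             // bits per codeword
    printf("m = %.0f bits, k = %.0f\n", ceil(m), round(k)); // m = 58, k = 10
    return 0;
}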
... Design with SIMC | 128/152 |
Design example:
Query Cost for SIMC | 129/152 |
A page-oriented view of how SIMC queries are answered:
determine query descriptor QD
for each page SB of tuple descriptors {
   for each tuple descriptor RD in page SB {
      if (RD matches QD) {
         add tuple R to list of matches
         add pageOf(R) to Pages
      }
   }
}
for each page P in Pages {
   Buf = getPage(f, P)
   check Buf for matching tuples
}
With a buffer manager, the previous tuple-oriented version would behave like this.
... Query Cost for SIMC | 130/152 |
To answer pmr query, must read r signatures.
Since one signature per tuple, cannot feasibly store them in memory.
If a signature contains m bits, we can fit N_D = ⌊8B/m⌋ signatures per page.
Thus, we must read b_D = ⌈r/N_D⌉ pages of signature data.
E.g. m=32, B=1024, r=10^5 ⇒ N_D = 256, b_D = 391
How many data pages do we read?
... Query Cost for SIMC | 131/152 |
Number of data pages read depends on:
Expected false matches: r_F = (r − r_q) × p_F.
Assume matching tuples are uniformly distributed over the data pages.
Then total data pages read: b_q = ⌈(r_q + r_F)/r × b⌉
Cost_pmr = b_D + b_q
... Query Cost for SIMC | 132/152 |
For one queries, can use same optimisation as for Heaps.
That is, stop the search as soon as the single matching tuple is found.
The average cost in this case would be roughly half of Cost_pmr, i.e. Cost_one ≈ (b_D + b_q)/2.
... Query Cost for SIMC | 133/152 |
SIMC provides no assistance in answering range, space queries, unless the data file is sorted.
If the data file is sorted, a range query can at least restrict its scan (of both data and signatures) to the relevant section of the file.
SIMC provides no assistance at all for sim queries (linear scan).
Applications of SIMC | 134/152 |
SIMC is useful for pmr, which allows a wide range of query types, e.g.
select * from R where a=1
select * from R where a=3 and c=1 and d=4
select * from R where a=2 and b=3 and c=4 and d=5
The above discussion has assumed that all n attributes are indexed.
In fact, SIMC is reasonably flexible about how many attributes are involved
Thus, it also has applications in information retrieval:
Two-level SIMC | 135/152 |
Scanning one descriptor for every tuple is not efficient.
How to reduce number of descriptors?
Have a smaller number of larger descriptors.
E.g. one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
How large is a page descriptor (BD)?
Use the formulae above, but with N_r × n "attributes" (N_r tuples per page, each with n attributes).
Typically, pages are 1..8KB ⇒ 8..64 page descriptors per page (N_BD).
Two-Level SIMC Files | 136/152 |
File organisation for two-level superimposed codeword index
... Two-Level SIMC Files | 137/152 |
Alternative file organisation:
(Assumes that we need to read the data pages anyway, so may as well include tuple descriptors there ... or simply omit them altogether)
Queries with Two-level SIMC | 138/152 |
Method for answering queries:
form query descriptor QD
for each page descriptor BD[P] {
   if (BD[P] matches QD)
      add P to Pages
}
for each page P in Pages {
   read page P
   check for matching tuples
   // could do this using tuple descriptors
}
Cost_pmr = ⌈b/N_BD⌉ + b_q
Bit-sliced SIMC | 139/152 |
Reading all page descriptors is still not efficient, especially when only a few bits are set.
To improve efficiency, reconsider the matching process:
Query | desc(q) |
(Perryridge,?,?,?) | 010000000001 |

Tuple | desc(r) |
(Brighton,217,Green,750) | 100101001001 |
(Clearview,117,Throggs,295) | 010101010101 |
(Downtown,101,Johnson,512) | 101001001001 |
(Mianus,215,Smith,700) | 101100000011 |
(Perryridge,102,Hayes,400) | 010011000111 |
(Redwood,222,Lindsay,695) | 100101010011 |
(Round Hill,305,Turner,350) | 011110111010 |
... Bit-sliced SIMC | 140/152 |
Rather than storing b m-bit page descriptors, store m b-bit descriptor slices.
... Bit-sliced SIMC | 141/152 |
To answer queries:
form query descriptor QD
// Result is a bit-slice, one bit per data page
Result = AllOnes
for each bit position i set in QD {
   read bit-slice i
   Result = Result AND slice[i]
}
Result now has a 1-bit for each page that may contain matching tuples.
Queries much more efficient because often only a few bits set in QD.
However, updates are expensive ... need to set k different slices for each insert.
Changing b is very expensive.
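A minimal C sketch of the slice-intersection step: AND together only the slices selected by the query descriptor's 1-bits. The sizes and slice contents are invented for illustration.

#include <stdint.h>
#include <stdio.h>

#define NPAGES 64                     // b: one bit per data page (toy value)
#define NWORDS ((NPAGES + 63) / 64)

// AND together only the slices selected by the 1-bits of the query descriptor
void sliceSearch(uint64_t slice[][NWORDS], int m, uint32_t qd,
                 uint64_t result[NWORDS]) {
    for (int w = 0; w < NWORDS; w++) result[w] = ~0ULL;   // AllOnes
    for (int i = 0; i < m; i++)
        if (qd & (1u << i))            // slice i is needed only if bit i set
            for (int w = 0; w < NWORDS; w++)
                result[w] &= slice[i][w];
}

int main(void) {
    static uint64_t slice[12][NWORDS];         // m = 12 slices, all zero
    slice[0][0] = 0x5;  slice[10][0] = 0x6;    // invented slice contents
    uint64_t res[NWORDS];
    sliceSearch(slice, 12, (1u << 0) | (1u << 10), res);
    printf("candidate pages bitmap: %llx\n",
           (unsigned long long)res[0]);        // prints 4
    return 0;
}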
Query Cost for Bit-sliced SIMC | 142/152 |
A major cost in SIMC queries is reading the page-descriptor slices.
Each slice is b bits long (one bit for each data file page).
Thus, each slice occupies ⌈b/8⌉ bytes.
Generally, we can fit several page descriptor slices per page.
However, we make the pessimistic assumption that each slice needed costs a full page read.
... Query Cost for Bit-sliced SIMC | 143/152 |
Consider a query (val1, ?, val2, ?, ...) with j attributes supplied: desc(q) has at most j×k 1-bits, and we can choose to read only one slice per supplied attribute (j slices in total).
Thus, the descriptor-reading cost in queries can be made constant.
The downside of this trick is a (slight) increase in likelihood of false matches.
Cost_pmr = j + b_q
Disjoint Codewords (DJC) | 144/152 |
In a disjoint codewords (djc) indexing scheme, the tuple descriptor is formed by concatenating the codewords of all attribute values (one fixed m-bit field per attribute), rather than OR-ing them together.
DJC example | 145/152 |
Consider the following tuple (from a bank deposit database)
branch | acctNo | name | amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 4, k = 2)
Ai | cw(Ai) |
Perryridge | 0101 |
102 | 0011 |
Hayes | 1001 |
400 | 0101 |
desc(r) | 0101 0011 1001 0101 |
(Note: length of tuple = ~28 bytes, length of descriptor = 16 bits = 2 bytes)
DJC Queries | 146/152 |
Use a similar approach to SIMC: form desc(q) by concatenating the codewords of the supplied attributes, using an all-zero codeword (0000) for each unknown attribute.
... DJC Queries | 147/152 |
E.g. consider the query (Perryridge, ?, Hayes, ?)
Ai | cw(Ai) |
Perryridge | 0101 |
? | 0000 |
Hayes | 1001 |
? | 0000 |
desc(q) | 0101 0000 1001 0000 |
If the query has s supplied attributes, then desc(q) has sk bits set to 1.
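A self-contained C sketch of DJC query-descriptor construction: each attribute owns a fixed m-bit field, and unsupplied attributes contribute a zero field. The codeword values are taken from the example above; the helper names are made up.

#include <stdint.h>
#include <stdio.h>

#define M 4   // bits per codeword
#define N 4   // number of attributes

// build a DJC descriptor: one m-bit field per attribute, concatenated;
// unsupplied attributes contribute an all-zero field
uint32_t djcDesc(const uint32_t cw[N], const int supplied[N]) {
    uint32_t desc = 0;
    for (int i = 0; i < N; i++)
        desc = (desc << M) | (supplied[i] ? cw[i] : 0u);
    return desc;
}

int main(void) {
    // codewords for (Perryridge, 102, Hayes, 400), from the example
    uint32_t cw[N] = { 0x5 /*0101*/, 0x3 /*0011*/, 0x9 /*1001*/, 0x5 /*0101*/ };
    int q[N] = { 1, 0, 1, 0 };   // query (Perryridge, ?, Hayes, ?)
    printf("desc(q) = %04x\n",
           (unsigned)djcDesc(cw, q));   // 5090 = 0101 0000 1001 0000
    return 0;
}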
... DJC Queries | 148/152 |
Searching the signature file is done in the same way as for SIMC:
Matches = {}
for each descriptor D[i] in sig file {
   if (D[i] matches desc(q))
      Matches = Matches ∪ {TupleId[i]}
}
for each Tid in Matches
   fetch tuple t via Tid
The matching process is also the same as for SIMC:

matches(desc(t), desc(q)) ≡ ((desc(t) AND desc(q)) == desc(q))
Example DJC Query | 149/152 |
Consider the query and the example database:
Query | desc(q) |
(Perryridge,?,Hayes,?) | 0101 0000 1001 0000 |

Tuple | desc(t) |
(Brighton,217,Green,750) | 1010 0110 1001 0110 |
(Clearview,117,Throggs,295) | 0101 0101 0101 0101 |
(Downtown,101,Johnson,512) | 1010 1010 1001 1010 |
(Mianus,215,Smith,700) | 1010 0011 0011 0101 |
(Perryridge,102,Hayes,400) | 0101 0011 1001 1010 |
(Redwood,222,Lindsay,695) | 1001 0101 0011 1010 |
(Round Hill,305,Turner,350) | 0101 0110 1001 0110 |
False Matches | 150/152 |
The query has potential matches:
branch | acctNo | name | amount |
Perryridge | 102 | Hayes | 400 |
Round Hill | 305 | Turner | 350 |
DJC also suffers from the "false match" problem, due to codeword collisions: different attribute values can hash to the same (or an overlapping) codeword.
... False Matches | 151/152 |
How to reduce likelihood of false matches?
DJC can be "tuned" to the mix of pmr queries used on the relation.
Query Cost for DJC | 152/152 |
If we assume that every attribute is given a codeword of the same length, the cost analysis is the same as for SIMC.
However, we can make sure that certain query types are answered with less likelihood of false matches by adjusting the codeword lengths.
The optimisations from SIMC (two-level, bit-slice) can also be applied.
As for SIMC, DJC assists with one and pmr, but not range, space, sim.
Produced: 12 Jul 2019