❖ Things To Note |
❖ N-dimensional Queries |
Have looked at one-dimensional queries, e.g.
select * from R where a = K select * from R where a between Lo and Hi
and heaps, hashing, indexing as ways of efficient implementation.
Now consider techniques for efficient multi-dimensional queries.
Compared to 1-d queries, multi-dimensional queries
❖ Operations for Nd Select |
N-dimensional select queries = condition on ≥1 attributes.
select * from Employees where job = 'Manager' and gender = 'M';
select * from Employees where 20 ≤ age ≤ 50 and 40K ≤ salary ≤ 60K
❖ N-d Selection via Heaps |
Heap files can handle pmr or space using standard method:
// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
buf = getPage(file(r), p);
for (i = 0; i < nTuples(buf); i++) {
t = getTuple(buf,i);
if (matches(t,C))
add t to result set
}
}
Costpmr = Costspace = b
❖ N-d Selection via Multiple Indexes |
DBMSs already support building multiple indexes on a table.
Which indexes to build depends on which queries are asked.
create table R (a int, b int, c int); create index Rax on R (a); create index Rbx on R (b); create index Rcx on R (c); create index Rabx on R (a,b); create index Racx on R (a,c); create index Rbcx on R (b,c); create index Rallx on R (a,b,c);
But more indexes ⇒ space + update overheads.
❖ N-d Queries and Indexes |
Generalised view of pmr and space queries:
select * from R where a1 op1 C1 and ... and an opn Cn
pmr : all opi are equality tests. space : some opi are range tests.
Possible approaches to handling such queries ...
❖ N-d Queries and Indexes (cont) |
If using just one of several indexes, which one to use?
select * from R where a1 op1 C1 and ... and an opn Cn
The one with best selectivity for ai opi Ci (i.e. fewest matches)
Factors determining selectivity of ai opi Ci
❖ N-d Queries and Indexes (cont) |
Implementing selection using one of several indices:
// Query: select * from R where a1op1C1 and ... and anopnCn // choose ai with best selectivity TupleIDs = IndexLookup(R,ai,opi,Ci) // gives { tid1, tid2, ...} for tuples satisfying aiopiCi PageIDs = { } foreach tid in TupleIDs { PageIDs = PageIDs ∪ {pageOf(tid)} } // PageIDs = a set of bqix page numbers ...
Cost = Costindex + bqix (some pages do not contain answers, bqix > bq)
DBMSs typically maintain statistics to assist with determining selectivity
❖ N-d Queries and Indexes (cont) |
Implementing selection using multiple indices:
// Query: select * from R where a1op1C1 and ... and anopnCn // assumes an index on at least ai TupleIDs = IndexLookup(R,a1,op1,C1) foreach attribute ai with an index { tids = IndexLookup(R,ai,opi,Ci) TupleIDs = TupleIDs ∩ tids } PageIDs = { } foreach tid in TupleIDs { PageIDs = PageIDs ∪ {pageOf(tid)} } // PageIDs = a set of bq page numbers ...
Cost = k.Costindex + bq (assuming indexes on k of n attrs)
❖ Exercise: One vs Multiple Indices |
Consider a relation with r = 100,000, B = 4096, defined as:
create table Students ( id integer primary key, name char(10), -- simplified gender char(1), -- 'm','f','?' birthday char(5) -- 'MM-DD' );
Assumptions:
❖ Exercise: One vs Multiple Indices (cont) |
For Students(id,name,gender,birthday)
select * from Students where name='John' and birthday='04-01'
name
birthday
❖ Bitmap Indexes |
Alternative index structure, focussing on sets of tuples:
Index contains bit-strings of r bits, one for each value/range
❖ Bitmap Indexes (cont) |
Answering queries using bitmap index:
Matches = AllOnes(r) foreach attribute A with index { // select ith bit-string for attribute A // based on value associated with A in WHERE Matches = Matches & Bitmaps[A][i] } // Matches contains 1-bit for each matching tuple foreach i in 0..r-1 { if (Matches[i] == 0) continue; Pages = Pages ∪ {pageOf(Tids[i])} } foreach pid in Pages { P = getPage(pid) extract matching tuples from P }
❖ Exercise: Bitmap Index |
Using the following file structure:
Show how the following queries would be answered:
select * from Parts where colour='red' and price < 4.00 select * from Parts where colour='green' or colour ='blue'
❖ Bitmap Indexes |
Storage costs for bitmap indexes:
❖ Hashing and pmr |
For a pmr query like
select * from R where a1 = C1 and ... and an = Cn
❖ Hashing and pmr (cont) |
Multi-attribute hashing parameters:
❖ MA.Hashing Example |
Consider relation Deposit(branch,acctNo,name,amount)
Assume a small data file with 8 main data pages (plus overflows).
Hash parameters: d=3 d1=1 d2=1 d3=1 d4=0
Note that we ignore the amount
Assumes that nobody will want to ask queries like
select * from Deposit where amount=533
Choice vector is designed taking expected queries into account.
❖ MA.Hashing Example (cont) |
Choice vector:
This choice vector tells us:
❖ MA.Hashing Example (cont) |
Consider the tuple:
branch | acctNo | name | amount |
Downtown | 101 | Johnston | 512 |
Hash value (page address) is computed by:
❖ MA.Hashing Hash Functions |
Auxiliary definitions:
#define MaxHashSize 32 typedef unsigned int HashVal; // extracts i'th bit from hash value h #define bit(i,h) (((h) & (1 << (i))) >> (i)) // choice vector elems typedef struct { int attr, int bit } CVelem; typedef CVelem ChoiceVec[MaxHashSize]; // hash function for individual attributes HashVal hash_any(char *val) { ... }
❖ MA.Hashing Hash Functions (cont) |
Produce combined d-bit hash value for tuple t :
HashVal hash(Tuple t, ChoiceVec cv, int d)
{
HashVal h[nAttr(t)+1]; // hash for each attr
HashVal res = 0, oneBit;
int i, a, b;
for (i = 1; i <= nAttr(t); i++)
h[i] = hash_any(attrVal(t,i));
for (i = 0; i < d; i++) {
a = cv[i].attr;
b = cv[i].bit;
oneBit = bit(b, h[a]);
res = res | (oneBit << i);
}
return res;
}
❖ Exercise: Multi-attribute Hashing |
Compute the hash value for the tuple
('John Smith','BSc(CompSci)',1990,99.5)
where d=6, d1=3, d2=2, d3=1, and
'John Smith'
...0101010110110100
'BSc(CompSci)'
...1011111101101111
1990
...0001001011000000
❖ Queries with MA.Hashing |
In a partial match query:
select amount from Deposit where branch = 'Brighton' and name = 'Green'
for which we use the shorthand (Brighton, ?, Green, ?)
❖ Queries with MA.Hashing (cont) |
Consider query: select amount from Deposit where name='Green'
Matching tuples must be in pages: 100
101
110
111
❖ Exercise: Partial hash values in MAH |
Given the following:
❖ MA.Hashing Query Algorithm |
// Builds the partial hash value (e.g. 10*0*1) // Treats query like tuple with some attr values missing nstars = 0; for each attribute i in query Q { if (hasValue(Q,i)) { set d[i] bits in composite hash using choice vector and hash(Q,i) } else { set d[i] *'s in composite hash using choice vector nstars += d[i] } } ...
❖ MA.Hashing Query Algorithm (cont) |
... // Use the partial hash to find candidate pages r = openRelation("R",READ); for (i = 0; i < 2nstars ; i++) { pid = composite hash replace *'s in pid using i and choice vector Buf = readPage(file(r), pid); for each tuple T in Buf { if (T satisfies pmr query) add T to results } }
❖ Exercise: Representing Stars |
Our hash values are bit-strings (e.g. 100101110101
MA.Hashing introduces a third value (*
How could we represent "bit"-strings like 1011*1*0**010
❖ Exercise: MA.Hashing Query Cost |
Consider R(x,y,z)
d = 9 dx = 5 dy = 3 dz = 1
How many buckets are accessed in answering each query?
select * from R where x = 4 and y = 2 and z = 1
select * from R where x = 5 and y = 3
select * from R where y = 99
select * from R where z = 23
select * from R where x > 5
❖ Query Cost for MA.Hashing |
Multi-attribute hashing handles a range of query types, e.g.
select * from R where a=1 select * from R where d=2 select * from R where b=3 and c=4 select * from R where a=5 and b=6 and c=7
A relation with n attributes has 2n different query types.
Different query types have different costs
(different no. of *
Cost(Q) = 2s where s = ∑i ∉ Q di (alternatively Cost(Q) = ∏i ∉ Q 2di)
Query distribution gives probability pQ of asking each query type Q.
❖ Query Cost for MA.Hashing (cont) |
Min query cost occurs when all attributes are used in query
Min Costpmr = 1
Max query cost occurs when no attributes are specified
Max Costpmr = 2d = b
Average cost is given by weighted sum over all query types:
Avg Costpmr = ∑Q pQ ∏i ∉ Q 2di
Aim to minimise the weighted average query cost over possible query types
❖ Optimising MA.Hashing Cost |
For a given application, useful to minimise Costpmr.
Can be achieved by choosing appropriate values for di (cv)
Heuristics:
This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)
❖ Exercise: MA.Hashing Design |
Consider relation Person(name,gender,age)
pQ | Query Type Q |
0.5 | select name from Person where gender=X and age=Y |
0.25 | select age from Person where name=X |
0.25 | select name from Person where gender=X |
Assume that all other query types have pQ=0.
Design a choice vector to minimise average selection cost.
Produced: 11 Mar 2024