COMP9315 Week 05 Monday Lecture

❖ Things To Note

Assignment 1
- due midnight Friday 15 March (this Friday)
- before asking on Forum ... (a) search, (b) FAQ
Assignment 2
- released end week 6 ... due end week 9
Quiz 3
- released Monday week 6 ... due Friday week 6

❖ N-dimensional Selection

❖ N-dimensional Queries

Have looked at one-dimensional queries, e.g.

select * from R where a = K
select * from R where a between Lo and Hi

and heaps, hashing, indexing as ways of efficient implementation.

Now consider techniques for efficient multi-dimensional queries.

Compared to 1-d queries, multi-dimensional queries

- typically produce fewer results
- require us to consider more information
- require more effort to produce results

❖ Operations for Nd Select

N-dimensional select queries = condition on ≥1 attributes.

pmr = partial-match retrieval (equality tests), e.g.

select * from Employees
where  job = 'Manager' and gender = 'M';

space = tuple-space queries (range tests), e.g.

select * from Employees
where  age between 20 and 50 and salary between 40000 and 60000;

❖ N-d Selection via Heaps

Heap files can handle pmr or space using standard method:

// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
    buf = getPage(file(r), p);
    for (i = 0; i < nTuples(buf); i++) {
        t = getTuple(buf,i);
        if (matches(t,C))
            add t to result set
    }
}

Cost_pmr = Cost_space = b   (every one of the b data pages is read)

❖ N-d Selection via Multiple Indexes

DBMSs already support building multiple indexes on a table.

Which indexes to build depends on which queries are asked.

create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);

But more indexes ⇒ space + update overheads.

❖ N-d Queries and Indexes

Generalised view of pmr and space queries:

select * from R
where  a₁ op₁ C₁ and ... and a_n op_n C_n

pmr : all op_i are equality tests.
space : some op_i are range tests.

Possible approaches to handling such queries ...

- use index on one a_i to reduce tuple tests
- use indexes on all a_i, and intersect answer sets

❖ N-d Queries and Indexes (cont)

If using just one of several indexes, which one to use?

select * from R
where  a₁ op₁ C₁ and ... and a_n op_n C_n

The one with best selectivity for a_i op_i C_i (i.e. fewest matches)

Factors determining selectivity of a_i op_i C_i

- assume uniform distribution of values in dom(a_i)
- equality test on primary key gives at most one match
- equality test on larger dom(a_i) ⇒ fewer matches
- range test over large part of dom(a_i) ⇒ many matches
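
The factors above can be turned into rough numbers. The following is a minimal sketch, assuming uniformly distributed values; the function names are hypothetical and a real DBMS would use its stored statistics instead.

```c
#include <assert.h>

/* Hypothetical helper: estimated fraction of tuples matching a_i = C,
   assuming values are uniformly distributed over domSize distinct values. */
double selEquality(double domSize)
{
    assert(domSize >= 1);
    return 1.0 / domSize;
}

/* Estimated fraction of tuples matching Lo <= a_i <= Hi, where the range
   covers rangeSize of the domSize values in the domain. */
double selRange(double rangeSize, double domSize)
{
    assert(domSize >= 1 && rangeSize >= 0 && rangeSize <= domSize);
    return rangeSize / domSize;
}

/* Expected number of matching tuples in a relation of r tuples. */
double expectedMatches(double r, double selectivity)
{
    return r * selectivity;
}
```

The index whose condition gives the smallest expectedMatches() value is the one with the best selectivity.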

❖ N-d Queries and Indexes (cont)

Implementing selection using one of several indices:

// Query: select * from R where a₁op₁C₁ and ... and a_nop_nC_n
// choose a_i with best selectivity
TupleIDs = IndexLookup(R,a_i,op_i,C_i)
// gives { tid₁, tid₂, ...} for tuples satisfying a_iop_iC_i
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }

// PageIDs = a set of b_{q_ix} page numbers
...

Cost = Cost_index + b_{q_ix} (some pages fetched do not contain answers, so b_{q_ix} ≥ b_q)

DBMSs typically maintain statistics to assist with determining selectivity

❖ N-d Queries and Indexes (cont)

Implementing selection using multiple indices:

// Query: select * from R where a₁op₁C₁ and ... and a_nop_nC_n
// assumes an index on at least a₁
TupleIDs = IndexLookup(R,a₁,op₁,C₁)
foreach other attribute a_i with an index {
   tids = IndexLookup(R,a_i,op_i,C_i)
   TupleIDs = TupleIDs ∩ tids
}
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }
// PageIDs = a set of b_q page numbers
...

Cost = k.Cost_index + b_q (assuming indexes on k of n attrs)
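
The intersection step can be sketched in C as below. This is a minimal sketch under stated assumptions: each index lookup returns its TIDs as an ascending, duplicate-free array, and a TID is an unsigned int with the page number above the low 8 bits (the TupleID layout and both function names are hypothetical).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int TupleID;          /* assumption: page number in high bits, */
#define pageOf(tid) ((tid) >> 8)       /* slot number in the low 8 bits         */

/* Intersect two ascending, duplicate-free TID arrays; writes the common
   TIDs to out (capacity >= min(na,nb)) and returns how many there are. */
size_t intersectTids(const TupleID *a, size_t na,
                     const TupleID *b, size_t nb, TupleID *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[n++] = a[i]; i++; j++; }
    }
    return n;
}

/* Count distinct pages referenced by an ascending TID array, i.e. the
   number of data pages that must be read after the intersection. */
size_t countPages(const TupleID *tids, size_t n)
{
    size_t pages = 0;
    for (size_t k = 0; k < n; k++)
        if (k == 0 || pageOf(tids[k]) != pageOf(tids[k-1]))
            pages++;
    return pages;
}
```

With k indexes, intersectTids() would be applied k-1 times, feeding each result into the next call.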

❖ Exercise: One vs Multiple Indices

Consider a relation with r = 100,000 tuples and page size B = 4096 bytes, defined as:

create table Students (
    id       integer primary key,
    name     char(10), -- simplified
    gender   char(1),  -- 'm','f','?'
    birthday char(5)   -- 'MM-DD'
);

Assumptions:

- data file is not ordered on any attribute
- has a dense B-tree index on each attribute
- 96 bytes of header in each data/index page
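
The size calculations all follow the same pattern, sketched below. The function names are hypothetical, and the figures in the note assume a 4-byte integer and no padding, so a tuple occupies 4 + 10 + 1 + 5 = 20 bytes.

```c
#include <assert.h>

/* Tuples (or index entries) that fit in one page after the header. */
int entriesPerPage(int pageSize, int headerSize, int entrySize)
{
    assert(entrySize > 0 && pageSize > headerSize);
    return (pageSize - headerSize) / entrySize;
}

/* Pages needed to hold n entries at c entries per page (rounded up). */
int pagesNeeded(int n, int c)
{
    assert(n >= 0 && c > 0);
    return (n + c - 1) / c;
}
```

Under those assumptions, entriesPerPage(4096, 96, 20) gives 200 tuples per page and pagesNeeded(100000, 200) gives 500 data pages. The same two functions apply to each index once an entry size (key plus TID) has been assumed.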

❖ Exercise: One vs Multiple Indices (cont)

For Students(id,name,gender,birthday) ...

- calculate the size of the data file and each index
- describe the selectivity of each attribute

Now consider a query on this relation:

select * from Students
where  name='John' and birthday='04-01'

- estimate the cost of answering using name index
- estimate the cost of answering using birthday index
- estimate the cost of answering using both indices

❖ Bitmap Indexes

Alternative index structure, focussing on sets of tuples:

Index contains bit-strings of r bits, one for each value/range

❖ Bitmap Indexes (cont)

Also useful to have a file of tids, giving file structures:

[Diagram:Pics/file-struct/bitmap-files.png]

❖ Bitmap Indexes (cont)

Answering queries using bitmap index:


Matches = AllOnes(r)
foreach attribute A with index {
   // select i^th bit-string for attribute A
   // based on value associated with A in WHERE
   Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
   if (Matches[i] == 0) continue;
   Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
   P = getPage(pid)
   extract matching tuples from P
}
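
The bitmap steps above can be sketched in C as follows. This is a minimal sketch, assuming bitmaps are byte arrays in which bit i lives in byte i/8 at position i%8; that layout and the function names are assumptions, not something the slides fix.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char Byte;

/* AND bitmap src into dst; both hold nbytes bytes (one bit per tuple). */
void bitmapAnd(Byte *dst, const Byte *src, size_t nbytes)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nbytes; i++)
        dst[i] &= src[i];
}

/* Collect the positions of the 1-bits among the first r bits of bm.
   Each position is an index into the Tids file. Returns the count. */
size_t matchingTuples(const Byte *bm, size_t r, size_t *out)
{
    assert(bm != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < r; i++)
        if (bm[i / 8] & (1u << (i % 8)))
            out[n++] = i;
    return n;
}
```

For an OR condition (e.g. colour='green' or colour='blue') the same loop would use |= on the relevant bitmaps instead of &=.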

❖ Exercise: Bitmap Index

Using the following file structure:

Show how the following queries would be answered:

select * from Parts
where colour='red' and price < 4.00

select * from Parts
where colour='green' or colour ='blue'

❖ Bitmap Indexes

Storage costs for bitmap indexes:

- one bitmap for each value/range for each indexed attribute
- each bitmap has length ceil(r/8) bytes
- e.g. with 50K records and 8KB pages, bitmap fits in one page
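
The storage figure can be checked with a small calculation. A minimal sketch, with hypothetical function names, and ignoring any page header:

```c
#include <assert.h>

/* Bytes needed for a bitmap with one bit per tuple: ceil(r/8). */
long bitmapBytes(long r)
{
    assert(r >= 0);
    return (r + 7) / 8;
}

/* Does one bitmap for r tuples fit in a single page of pageSize bytes? */
int fitsInPage(long r, long pageSize)
{
    return bitmapBytes(r) <= pageSize;
}
```

Taking 50K as 50,000 records, bitmapBytes(50000) is 6250 bytes, which is below the 8192 bytes of an 8KB page.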

Query execution costs for bitmap indexes:

- read one bitmap for each indexed attribute in query
- perform bitwise AND on bitmaps (in memory)
- read pages containing matching tuples

Note: bitmaps could index pages rather than tuples (shorter bitmaps)

❖ Hashing for N-d Selection

❖ Hashing and pmr

For a pmr query like

select * from R where a₁ = C₁ and ... and a_n = C_n

- if one a_i is the hash key, query is very efficient
- if no a_i is the hash key, need to use linear scan

Can be alleviated using multi-attribute hashing (mah)

- form a composite hash value involving all attributes
- at query time, some components of composite hash are known (allows us to limit the number of data pages which need to be checked)

MA.hashing works in conjunction with any dynamic hash scheme.

❖ Hashing and pmr (cont)

Multi-attribute hashing parameters:

- file size = b = 2^d pages ⇒ use d-bit hash values
- relation has n attributes: a₁, a₂, ... a_n
- attribute a_i has hash function h_i
- attribute a_i contributes d_i bits (to the combined hash value)
- total bits d = ∑_{i=1}^{n} d_i
- a choice vector (cv) specifies, for each bit position k in 0..d-1, which bit j of which h_i(a_i) supplies bit k of the combined hash value

❖ MA.Hashing Example

Consider relation Deposit(branch,acctNo,name,amount)

Assume a small data file with 8 main data pages (plus overflows).

Hash parameters: d=3, d₁=1, d₂=1, d₃=1, d₄=0

Note that we ignore the amount attribute (d₄=0)

Assumes that nobody will want to ask queries like

select * from Deposit where amount=533

Choice vector is designed taking expected queries into account.

❖ MA.Hashing Example (cont)

Choice vector:

This choice vector tells us:

- bit 0 in hash comes from bit 0 of hash₁(a₁) (b_{1,0})
- bit 1 in hash comes from bit 0 of hash₂(a₂) (b_{2,0})
- bit 2 in hash comes from bit 0 of hash₃(a₃) (b_{3,0})
- bit 3 in hash comes from bit 1 of hash₁(a₁) (b_{1,1})
- etc. (up to as many bits of hashing as required, e.g. 32)

❖ MA.Hashing Example (cont)

Consider the tuple:

branch    acctNo  name      amount
Downtown  101     Johnston  512

Hash value (page address) is computed by:

❖ MA.Hashing Hash Functions

Auxiliary definitions:

#define MaxHashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value h
#define bit(i,h) (((h) >> (i)) & 1u)

// choice vector elems
typedef struct { int attr; int bit; } CVelem;
typedef CVelem ChoiceVec[MaxHashSize];

// hash function for individual attributes
HashVal hash_any(char *val) { ... }

❖ MA.Hashing Hash Functions (cont)

Produce combined d-bit hash value for tuple t :


HashVal hash(Tuple t, ChoiceVec cv, int d)
{
    HashVal h[nAttr(t)+1];  // hash for each attr
    HashVal res = 0, oneBit;
    int     i, a, b;
    for (i = 1; i <= nAttr(t); i++)
        h[i] = hash_any(attrVal(t,i));
    for (i = 0; i < d; i++) {
        a = cv[i].attr;
        b = cv[i].bit;
        oneBit = bit(b, h[a]);
        res = res | (oneBit << i);
    }
    return res;
}

❖ Exercise: Multi-attribute Hashing

Compute the hash value for the tuple

('John Smith','BSc(CompSci)',1990,99.5)

where d=6, d₁=3, d₂=2, d₃=1, and

cv = <(1,0), (1,1), (2,0), (3,0), (1,2), (2,1), (3,1), (1,3), ...>
hash₁('John Smith') = ...0101010110110100
hash₂('BSc(CompSci)') = ...1011111101101111
hash₃(1990) = ...0001001011000000
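
One way to check a hand calculation is to run the choice-vector step on the given hash values. The sketch below is a simplified variant of hash() that takes the per-attribute hash values directly; the function name combine() is hypothetical, and only the low 16 bits shown above are used (0x55B4, 0xBF6F, 0x12C0 in hex).

```c
#include <assert.h>

#define MaxHashSize 32
typedef unsigned int HashVal;
typedef struct { int attr; int bit; } CVelem;

/* Build a d-bit combined hash from per-attribute hash values.
   h[a] holds the hash of attribute a (attributes numbered from 1);
   bit i of the result is bit cv[i].bit of h[cv[i].attr]. */
HashVal combine(const HashVal h[], const CVelem cv[], int d)
{
    assert(d >= 0 && d <= MaxHashSize);
    HashVal res = 0;
    for (int i = 0; i < d; i++) {
        HashVal oneBit = (h[cv[i].attr] >> cv[i].bit) & 1u;
        res |= oneBit << i;
    }
    return res;
}
```

With h = {0, 0x55B4, 0xBF6F, 0x12C0} and the first six elements of cv, the bits chosen for positions 0..5 are 0,0,1,0,1,1, so combine() returns 110100 in binary (52).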

❖ Queries with MA.Hashing

In a partial match query:

- values of some attributes are known
- values of other attributes are unknown

E.g.

select amount
from   Deposit
where  branch = 'Brighton' and name = 'Green'

for which we use the shorthand (Brighton, ?, Green, ?)

❖ Queries with MA.Hashing (cont)

Consider query: select amount from Deposit where name='Green'

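With d=3 and the choice vector above, name supplies only bit 2 of the page address, so the query hash has the form 1** (taking bit 0 of hash₃('Green') to be 1).

Matching tuples must be in pages: 100, 101, 110, 111.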

❖ Exercise: Partial hash values in MAH

Given the following:

d=6, b=2⁶, CV = <(1,0),(1,1),(2,0),(3,0),(2,1),(1,2), ...>
hash (a) = ...00101101001101
hash (b) = ...00101101001101
hash (c) = ...00101101001101

What are the query hashes for each of the following:

(a,b,c), (?,b,c), (a,?,?), (?,?,?)

❖ MA.Hashing Query Algorithm

// Builds the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars += d[i]
    }
}
...

❖ MA.Hashing Query Algorithm (cont)

...
// Use the partial hash to find candidate pages

r = openRelation("R",READ);
for (i = 0; i < (1 << nstars); i++) {   // 2^nstars candidate pages
    pid = composite hash
    replace *'s in pid
        using i and choice vector
    Buf = readPage(file(r), pid);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}
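
The page-enumeration loop can be sketched in C once a representation for partial hashes is chosen. The sketch below assumes one possible representation (two words, a value and a star mask); the PartialHash type and candidatePages() are hypothetical names, and reading each page is left out.

```c
#include <assert.h>

typedef unsigned int HashVal;

/* A partial hash stored as two words (one possible representation):
   known = bit values at the known positions (0 at starred positions)
   stars = mask with a 1 at every starred (unknown) position.        */
typedef struct { HashVal known, stars; } PartialHash;

/* Write every page address consistent with q into pids (capacity must
   be at least 2^nstars) and return how many were written. */
int candidatePages(PartialHash q, HashVal pids[])
{
    assert((q.known & q.stars) == 0);
    int n = 0;
    HashVal sub = 0;
    do {
        pids[n++] = q.known | sub;
        /* step to the next subset of the star mask, in increasing order */
        sub = (sub - q.stars) & q.stars;
    } while (sub != 0);
    return n;
}
```

For the partial hash 1** (known = 100, stars = 011 in binary) this yields pages 100, 101, 110, 111. Stepping through subsets of the mask avoids having to scatter the bits of a counter i into the star positions.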

❖ Exercise: Representing Stars

Our hash values are bit-strings (e.g. 100101110101)

MA.Hashing introduces a third value (* = unknown)

How could we represent "bit"-strings like 1011*1*0**010?

❖ Exercise: MA.Hashing Query Cost

Consider R(x,y,z) using multi-attribute hashing where

d = 9, d_x = 5, d_y = 3, d_z = 1

How many buckets are accessed in answering each query?

select * from R where x = 4 and y = 2 and z = 1
select * from R where x = 5 and y = 3
select * from R where y = 99
select * from R where z = 23
select * from R where x > 5

❖ Query Cost for MA.Hashing

Multi-attribute hashing handles a range of query types, e.g.

select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7

A relation with n attributes has 2ⁿ different query types.

Different query types have different costs (different no. of *'s)

Cost(Q) = 2^s where s = ∑_{i ∉ Q} d_i (alternatively Cost(Q) = ∏_{i ∉ Q} 2^{d_i})

Query distribution gives probability p_Q of asking each query type Q.

❖ Query Cost for MA.Hashing (cont)

Min query cost occurs when all attributes are used in query

Min Cost_pmr = 1

Max query cost occurs when no attributes are specified

Max Cost_pmr = 2^d = b

Average cost is given by weighted sum over all query types:

Avg Cost_pmr = ∑_Q p_Q ∏_{i ∉ Q} 2^{d_i}

Aim to minimise the weighted average query cost over possible query types
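
The cost formulas translate directly into code. A minimal sketch, assuming each query type is encoded as a bitmask of the attributes given equality values (the encoding and function names are hypothetical), and counting cost in main data pages only:

```c
#include <assert.h>

/* Cost(Q) = 2^s, where s sums d[i] over attributes NOT fixed by Q.
   known is a bitmask: bit i set means attribute i has an equality value. */
unsigned long queryCost(const int d[], int nattrs, unsigned known)
{
    assert(nattrs > 0 && nattrs <= 32);
    int s = 0;
    for (int i = 0; i < nattrs; i++)
        if (!(known & (1u << i)))
            s += d[i];
    assert(s < 32);
    return 1UL << s;
}

/* Weighted average over the nq query types with non-zero probability. */
double avgCost(const int d[], int nattrs,
               const unsigned known[], const double p[], int nq)
{
    double total = 0.0;
    for (int q = 0; q < nq; q++)
        total += p[q] * (double)queryCost(d, nattrs, known[q]);
    return total;
}
```

A query that fixes every attribute has s = 0 and costs 1 page; a query that fixes none has s = d and costs 2^d = b pages, matching the min and max above.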

❖ Optimising MA.Hashing Cost

For a given application, useful to minimise Cost_pmr.

Can be achieved by choosing appropriate values for d_i (cv)

Heuristics:

- distribution of query types (more bits to frequently used attributes)
- size of attribute domain (allocate no more bits than are needed to represent all values in the domain)
- discriminatory power (more bits to highly discriminating attributes)

Trade-off: making Q_j more efficient makes Q_k less efficient.

This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)
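
For a handful of attributes the search space is small enough to enumerate. The sketch below uses plain exhaustive search rather than simulated annealing, for a relation with three attributes; NATTRS, the bitmask encoding of query types, the maxBits limits and the function names are all assumptions made for illustration.

```c
#include <assert.h>

#define NATTRS 3   /* assumption: a three-attribute relation */

/* Average cost of the nq query types for bit allocation d[]. */
double avgCost(const int d[], const unsigned known[],
               const double p[], int nq)
{
    double total = 0.0;
    for (int q = 0; q < nq; q++) {
        int s = 0;                     /* number of *'s for this query type */
        for (int i = 0; i < NATTRS; i++)
            if (!(known[q] & (1u << i))) s += d[i];
        total += p[q] * (double)(1UL << s);
    }
    return total;
}

/* Try every (d0,d1,d2) with d0+d1+d2 == dTotal and d[i] <= maxBits[i];
   store the cheapest allocation in best[] and return its average cost
   (or -1.0 if no allocation satisfies the limits). */
double bestAllocation(int dTotal, const int maxBits[], const unsigned known[],
                      const double p[], int nq, int best[])
{
    assert(dTotal >= 0 && dTotal < 32);
    double bestCost = -1.0;
    for (int d0 = 0; d0 <= dTotal && d0 <= maxBits[0]; d0++)
        for (int d1 = 0; d0 + d1 <= dTotal && d1 <= maxBits[1]; d1++) {
            int d[NATTRS] = { d0, d1, dTotal - d0 - d1 };
            if (d[2] > maxBits[2]) continue;
            double c = avgCost(d, known, p, nq);
            if (bestCost < 0 || c < bestCost) {
                bestCost = c;
                for (int i = 0; i < NATTRS; i++) best[i] = d[i];
            }
        }
    return bestCost;
}
```

The maxBits array captures the domain-size heuristic. The result only fixes the d_i values; the order in which bits are interleaved in the choice vector still has to be chosen.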

❖ Exercise: MA.Hashing Design

Consider relation Person(name,gender,age) ...

p_Q	Query Type Q
0.5	`select name from Person where gender=X and age=Y`
0.25	`select age from Person where name=X`
0.25	`select name from Person where gender=X`

Assume that all other query types have p_Q=0.

Design a choice vector to minimise average selection cost.
