COMP9315 Week 05 Monday Lecture

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [0/37]
❖ Things To Note


COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [1/37]
❖ N-dimensional Selection

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [2/37]
❖ N-dimensional Queries

Have looked at one-dimensional queries, e.g.

select * from R where a = K
select * from R where a between Lo and Hi

and heaps, hashing, indexing as ways of efficient implementation.

Now consider techniques for efficient multi-dimensional queries.

Compared to 1-d queries, multi-dimensional queries

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [3/37]
❖ Operations for Nd Select

N-dimensional select queries = condition on ≥1 attributes.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [4/37]
❖ N-d Selection via Heaps

Heap files can handle pmr or space using standard method:

// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
    buf = getPage(file(r), p);
    for (i = 0; i < nTuples(buf); i++) {
        t = getTuple(buf,i);
        if (matches(t,C))
            add t to result set
    }
}

Costpmr  =  Costspace  =  b

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [5/37]
❖ N-d Selection via Multiple Indexes

DBMSs already support building multiple indexes on a table.

Which indexes to build depends on which queries are asked.

create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);

But more indexes space + update overheads.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [6/37]
❖ N-d Queries and Indexes

Generalised view of pmr and space queries:

select * from R
where  a1 op1 C1 and ... and an opn Cn 

pmr : all opi are equality tests.     space : some opi are range tests.

Possible approaches to handling such queries ...

  1. use index on one ai to reduce tuple tests
  2. use indexes on all ai, and intersect answer sets
COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [7/37]
❖ N-d Queries and Indexes (cont)

If using just one of several indexes, which one to use?

select * from R
where  a1 op1 C1 and ... and an opn Cn 

The one with best selectivity for ai opi Ci   (i.e. fewest matches)

Factors determining selectivity of ai opi Ci

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [8/37]
❖ N-d Queries and Indexes (cont)

Implementing selection using one of several indices:

// Query: select * from R where a1op1C1 and ... and anopnCn
// choose ai with best selectivity
TupleIDs = IndexLookup(R,ai,opi,Ci)
// gives { tid1, tid2, ...} for tuples satisfying aiopiCi
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }

// PageIDs = a set of bqix page numbers
...

Cost = Costindex + bqix    (some pages do not contain answers, bqix > bq)

DBMSs typically maintain statistics to assist with determining selectivity

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [9/37]
❖ N-d Queries and Indexes (cont)

Implementing selection using multiple indices:

// Query: select * from R where a1op1C1 and ... and anopnCn
// assumes an index on at least ai
TupleIDs = IndexLookup(R,a1,op1,C1)
foreach attribute ai with an index {
   tids = IndexLookup(R,ai,opi,Ci)
   TupleIDs = TupleIDs ∩ tids
}
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }
// PageIDs = a set of bq page numbers
...

Cost = k.Costindex + bq     (assuming indexes on k of n attrs)

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [10/37]
❖ Exercise: One vs Multiple Indices

Consider a relation with r = 100,000, B = 4096, defined as:

create table Students (
    id       integer primary key,
    name     char(10), -- simplified
    gender   char(1),  -- 'm','f','?'
    birthday char(5)   -- 'MM-DD'
);

Assumptions:

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [11/37]
❖ Exercise: One vs Multiple Indices (cont)

For Students(id,name,gender,birthday) ...

Now consider a query on this relation:

select * from Students
where  name='John' and birthday='04-01'

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [12/37]
❖ Bitmap Indexes

Alternative index structure, focussing on sets of tuples:

[Diagram:Pics/select/bitmap-index.png]

Index contains bit-strings of r bits, one for each value/range

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [13/37]
❖ Bitmap Indexes (cont)

Also useful to have a file of tids, giving file structures:


[Diagram:Pics/file-struct/bitmap-files.png]

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [14/37]
❖ Bitmap Indexes (cont)

Answering queries using bitmap index:


Matches = AllOnes(r)
foreach attribute A with index {
   // select ith bit-string for attribute A
   // based on value associated with A in WHERE
   Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
   if (Matches[i] == 0) continue;
   Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
   P = getPage(pid)
   extract matching tuples from P
}

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [15/37]
❖ Exercise: Bitmap Index

Using the following file structure:

[Diagram:Pics/select/bitmap-index.png]

Show how the following queries would be answered:

select * from Parts
where colour='red' and price < 4.00

select * from Parts
where colour='green' or colour ='blue'

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [16/37]
❖ Bitmap Indexes

Storage costs for bitmap indexes:

Query execution costs for bitmap indexes: Note: bitmaps could index pages rather than tuples (shorter bitmaps)
COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [17/37]
❖ Hashing for N-d Selection

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [18/37]
❖ Hashing and pmr

For a pmr query like

select * from R where a1 = C1 and ... and an = Cn

Can be alleviated using multi-attribute hashing (mah) MA.hashing works in conjunction with any dynamic hash scheme.
COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [19/37]
❖ Hashing and pmr (cont)

Multi-attribute hashing parameters:

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [20/37]
❖ MA.Hashing Example

Consider relation Deposit(branch,acctNo,name,amount)

Assume a small data file with 8 main data pages (plus overflows).

Hash parameters:    d=3     d1=1    d2=1    d3=1    d4=0

Note that we ignore the amount attribute (d4=0)

Assumes that nobody will want to ask queries like

select * from Deposit where amount=533

Choice vector is designed taking expected queries into account.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [21/37]
❖ MA.Hashing Example (cont)

Choice vector:

[Diagram:Pics/select/choice-vector.png]

This choice vector tells us:

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [22/37]
❖ MA.Hashing Example (cont)

Consider the tuple:

branch acctNo name amount
Downtown 101 Johnston 512

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins2.png]

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [23/37]
❖ MA.Hashing Hash Functions

Auxiliary definitions:

#define MaxHashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value h
#define bit(i,h) (((h) & (1 << (i))) >> (i))

// choice vector elems
typedef struct { int attr, int bit } CVelem;
typedef CVelem ChoiceVec[MaxHashSize];

// hash function for individual attributes
HashVal hash_any(char *val) { ... }

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [24/37]
❖ MA.Hashing Hash Functions (cont)

Produce combined d-bit hash value for tuple t :


HashVal hash(Tuple t, ChoiceVec cv, int d)
{
    HashVal h[nAttr(t)+1];  // hash for each attr
    HashVal res = 0, oneBit;
    int     i, a, b;
    for (i = 1; i <= nAttr(t); i++)
        h[i] = hash_any(attrVal(t,i));
    for (i = 0; i < d; i++) {
        a = cv[i].attr;
        b = cv[i].bit;
        oneBit = bit(b, h[a]);
        res = res | (oneBit << i);
    }
    return res;
}

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [25/37]
❖ Exercise: Multi-attribute Hashing

Compute the hash value for the tuple

('John Smith','BSc(CompSci)',1990,99.5)

where d=6,   d1=3,   d2=2,   d3=1, and

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [26/37]
❖ Queries with MA.Hashing

In a partial match query:

E.g.

select amount
from   Deposit
where  branch = 'Brighton' and name = 'Green'

for which we use the shorthand   (Brighton, ?, Green, ?)

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [27/37]
❖ Queries with MA.Hashing (cont)

Consider query: select amount from Deposit where name='Green'


[Diagram:Pics/select/mahquery2.png]


Matching tuples must be in pages: 100, 101, 110, 111.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [28/37]
❖ Exercise: Partial hash values in MAH

Given the following:

What are the query hashes for each of the following:
COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [29/37]
❖ MA.Hashing Query Algorithm

// Builds the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars += d[i]
    }
}
...

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [30/37]
❖ MA.Hashing Query Algorithm (cont)

...
// Use the partial hash to find candidate pages

r = openRelation("R",READ);
for (i = 0; i < 2nstars; i++) {
    pid = composite hash
    replace *'s in pid
        using i and choice vector
    Buf = readPage(file(r), pid);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [31/37]
❖ Exercise: Representing Stars


Our hash values are bit-strings (e.g. 100101110101)

MA.Hashing introduces a third value (* = unknown)

How could we represent "bit"-strings like 1011*1*0**010?

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [32/37]
❖ Exercise: MA.Hashing Query Cost

Consider R(x,y,z) using multi-attribute hashing where

d = 9     dx = 5     dy = 3     dz = 1

How many buckets are accessed in answering each query?

  1. select * from R where x = 4 and y = 2 and z = 1
  2. select * from R where x = 5 and y = 3
  3. select * from R where y = 99
  4. select * from R where z = 23
  5. select * from R where x > 5
COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [33/37]
❖ Query Cost for MA.Hashing

Multi-attribute hashing handles a range of query types, e.g.

select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7

A relation with n  attributes has 2n different query types.

Different query types have different costs   (different no. of *'s)

Cost(Q) = 2s   where s = i ∉ Q di     (alternatively Cost(Q) = i ∉ Q 2di)

Query distribution gives probability pQ of asking each query type Q.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [34/37]
❖ Query Cost for MA.Hashing (cont)

Min query cost occurs when all attributes are used in query

Min Costpmr  =  1

Max query cost occurs when no attributes are specified

Max Costpmr  =  2d  =  b

Average cost is given by weighted sum over all query types:

Avg Costpmr  =  ∑Q pQi ∉ Q 2di


Aim to minimise the weighted average query cost over possible query types

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [35/37]
❖ Optimising MA.Hashing Cost

For a given application, useful to minimise Costpmr.

Can be achieved by choosing appropriate values for di   (cv)

Heuristics:

Trade-off: making Qj more efficient makes Qk less efficient.

This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [36/37]
❖ Exercise: MA.Hashing Design

Consider relation Person(name,gender,age) ...

pQ Query Type Q
0.5 select name from Person
where gender=X and age=Y
0.25 select age from Person
where name=X
0.25 select name from Person
where gender=X

Assume that all other query types have pQ=0.

Design a choice vector to minimise average selection cost.

COMP9315 24T1 ♢ Week 5 Monday Lecture ♢ [37/37]


Produced: 11 Mar 2024