E: Implementing Selection (multi-attribute)

COMP9315 24T1 ♢ Lectures Part C ♢ [0/74]
❖ N-dimensional Queries

Have looked at one-dimensional queries, e.g.

select * from R where a = K
select * from R where a between Lo and Hi

and heaps, hashing, indexing as ways of efficient implementation.

Now consider techniques for efficient multi-dimensional queries.

Compared to 1-d queries, multi-dimensional queries

COMP9315 24T1 ♢ Lectures Part C ♢ [1/74]
❖ Operations for Nd Select

N-dimensional select queries = condition on ≥1 attributes.

COMP9315 24T1 ♢ Lectures Part C ♢ [2/74]
❖ N-d Selection via Heaps

Heap files can handle pmr or space using standard method:

// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
    buf = getPage(file(r), p);
    for (i = 0; i < nTuples(buf); i++) {
        t = getTuple(buf,i);
        if (matches(t,C))
            add t to result set
    }
}

Costpmr  =  Costspace  =  b

COMP9315 24T1 ♢ Lectures Part C ♢ [3/74]
❖ N-d Selection via Multiple Indexes

DBMSs already support building multiple indexes on a table.

Which indexes to build depends on which queries are asked.

create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);

But more indexes space + update overheads.

COMP9315 24T1 ♢ Lectures Part C ♢ [4/74]
❖ N-d Queries and Indexes

Generalised view of pmr and space queries:

select * from R
where  a1 op1 C1 and ... and an opn Cn 

pmr : all opi are equality tests.     space : some opi are range tests.

Possible approaches to handling such queries ...

  1. use index on one ai to reduce tuple tests
  2. use indexes on all ai, and intersect answer sets
COMP9315 24T1 ♢ Lectures Part C ♢ [5/74]
❖ N-d Queries and Indexes (cont)

If using just one of several indexes, which one to use?

select * from R
where  a1 op1 C1 and ... and an opn Cn 

The one with best selectivity for ai opi Ci   (i.e. fewest matches)

Factors determining selectivity of ai opi Ci

COMP9315 24T1 ♢ Lectures Part C ♢ [6/74]
❖ N-d Queries and Indexes (cont)

Implementing selection using one of several indices:

// Query: select * from R where a1op1C1 and ... and anopnCn
// choose ai with best selectivity
TupleIDs = IndexLookup(R,ai,opi,Ci)
// gives { tid1, tid2, ...} for tuples satisfying aiopiCi
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }

// PageIDs = a set of bqix page numbers
...

Cost = Costindex + bqix    (some pages do not contain answers, bqix > bq)

DBMSs typically maintain statistics to assist with determining selectivity

COMP9315 24T1 ♢ Lectures Part C ♢ [7/74]
❖ N-d Queries and Indexes (cont)

Implementing selection using multiple indices:

// Query: select * from R where a1op1C1 and ... and anopnCn
// assumes an index on at least ai
TupleIDs = IndexLookup(R,a1,op1,C1)
foreach attribute ai with an index {
   tids = IndexLookup(R,ai,opi,Ci)
   TupleIDs = TupleIDs ∩ tids
}
PageIDs = { }
foreach tid in TupleIDs
   { PageIDs = PageIDs ∪ {pageOf(tid)} }
// PageIDs = a set of bq page numbers
...

Cost = k.Costindex + bq     (assuming indexes on k of n attrs)

COMP9315 24T1 ♢ Lectures Part C ♢ [8/74]
❖ Bitmap Indexes

Alternative index structure, focussing on sets of tuples:

[Diagram:Pics/select/bitmap-index.png]

Index contains bit-strings of r bits, one for each value/range

COMP9315 24T1 ♢ Lectures Part C ♢ [9/74]
❖ Bitmap Indexes (cont)

Also useful to have a file of tids, giving file structures:


[Diagram:Pics/file-struct/bitmap-files.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [10/74]
❖ Bitmap Indexes (cont)

Answering queries using bitmap index:


Matches = AllOnes(r)
foreach attribute A with index {
   // select ith bit-string for attribute A
   // based on value associated with A in WHERE
   Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
   if (Matches[i] == 0) continue;
   Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
   P = getPage(pid)
   extract matching tuples from P
}

COMP9315 24T1 ♢ Lectures Part C ♢ [11/74]
❖ Bitmap Indexes (cont)

Storage costs for bitmap indexes:

Query execution costs for bitmap indexes: Note: bitmaps could index pages rather than tuples (shorter bitmaps)
COMP9315 24T1 ♢ Lectures Part C ♢ [12/74]
❖ Hashing and pmr

For a pmr query like

select * from R where a1 = C1 and ... and an = Cn

Can be alleviated using multi-attribute hashing (mah) MA.hashing works in conjunction with any dynamic hash scheme.
COMP9315 24T1 ♢ Lectures Part C ♢ [13/74]
❖ Hashing and pmr (cont)

Multi-attribute hashing parameters:

COMP9315 24T1 ♢ Lectures Part C ♢ [14/74]
❖ MA.Hashing Example

Consider relation Deposit(branch,acctNo,name,amount)

Assume a small data file with 8 main data pages (plus overflows).

Hash parameters:    d=3     d1=1    d2=1    d3=1    d4=0

Note that we ignore the amount attribute (d4=0)

Assumes that nobody will want to ask queries like

select * from Deposit where amount=533

Choice vector is designed taking expected queries into account.

COMP9315 24T1 ♢ Lectures Part C ♢ [15/74]
❖ MA.Hashing Example (cont)

Choice vector:

[Diagram:Pics/select/choice-vector.png]

This choice vector tells us:

COMP9315 24T1 ♢ Lectures Part C ♢ [16/74]
❖ MA.Hashing Example (cont)

Consider the tuple:

branch acctNo name amount
Downtown 101 Johnston 512

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins2.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [17/74]
❖ MA.Hashing Hash Functions

Auxiliary definitions:

#define MaxHashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value h
#define bit(i,h) (((h) & (1 << (i))) >> (i))

// choice vector elems
typedef struct { int attr, int bit } CVelem;
typedef CVelem ChoiceVec[MaxHashSize];

// hash function for individual attributes
HashVal hash_any(char *val) { ... }

COMP9315 24T1 ♢ Lectures Part C ♢ [18/74]
❖ MA.Hashing Hash Functions (cont)

Produce combined d-bit hash value for tuple t :


HashVal hash(Tuple t, ChoiceVec cv, int d)
{
    HashVal h[nAttr(t)+1];  // hash for each attr
    HashVal res = 0, oneBit;
    int     i, a, b;
    for (i = 1; i <= nAttr(t); i++)
        h[i] = hash_any(attrVal(t,i));
    for (i = 0; i < d; i++) {
        a = cv[i].attr;
        b = cv[i].bit;
        oneBit = bit(b, h[a]);
        res = res | (oneBit << i);
    }
    return res;
}

COMP9315 24T1 ♢ Lectures Part C ♢ [19/74]
❖ Queries with MA.Hashing

In a partial match query:

E.g.

select amount
from   Deposit
where  branch = 'Brighton' and name = 'Green'

for which we use the shorthand   (Brighton, ?, Green, ?)

COMP9315 24T1 ♢ Lectures Part C ♢ [20/74]
❖ Queries with MA.Hashing (cont)

Consider query: select amount from Deposit where name='Green'


[Diagram:Pics/select/mahquery2.png]


Matching tuples must be in pages: 100, 101, 110, 111.

COMP9315 24T1 ♢ Lectures Part C ♢ [21/74]
❖ MA.Hashing Query Algorithm

// Builds the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars += d[i]
    }
}
...

COMP9315 24T1 ♢ Lectures Part C ♢ [22/74]
❖ MA.Hashing Query Algorithm (cont)

...
// Use the partial hash to find candidate pages

r = openRelation("R",READ);
for (i = 0; i < 2nstars; i++) {
    pid = composite hash
    replace *'s in pid
        using i and choice vector
    Buf = readPage(file(r), pid);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}

COMP9315 24T1 ♢ Lectures Part C ♢ [23/74]
❖ Query Cost for MA.Hashing

Multi-attribute hashing handles a range of query types, e.g.

select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7

A relation with n  attributes has 2n different query types.

Different query types have different costs   (different no. of *'s)

Cost(Q) = 2s   where s = i ∉ Q di     (alternatively Cost(Q) = i ∉ Q 2di)

Query distribution gives probability pQ of asking each query type Q.

COMP9315 24T1 ♢ Lectures Part C ♢ [24/74]
❖ Query Cost for MA.Hashing (cont)

Min query cost occurs when all attributes are used in query

Min Costpmr  =  1

Max query cost occurs when no attributes are specified

Max Costpmr  =  2d  =  b

Average cost is given by weighted sum over all query types:

Avg Costpmr  =  ∑Q pQi ∉ Q 2di


Aim to minimise the weighted average query cost over possible query types

COMP9315 24T1 ♢ Lectures Part C ♢ [25/74]
❖ Optimising MA.Hashing Cost

For a given application, useful to minimise Costpmr.

Can be achieved by choosing appropriate values for di   (cv)

Heuristics:

Trade-off: making Qj more efficient makes Qk less efficient.

This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)

COMP9315 24T1 ♢ Lectures Part C ♢ [26/74]
❖ Multi-dimensional Tree Indexes

Over the last 20 years, from a range of problem areas

Consider three popular schemes:   kd-trees, Quad-trees, R-trees.

Example data for multi-d trees is based on the following relation:

create table Rel (
    X char(1) check (X between 'a' and 'z'),
    Y integer check (Y between 0 and 9)
);

COMP9315 24T1 ♢ Lectures Part C ♢ [27/74]
❖ Multi-dimensional Tree Indexes (cont)

Example tuples:

Rel('a',1)  Rel('a',5)  Rel('b',2)  Rel('d',1)
Rel('d',2)  Rel('d',4)  Rel('d',8)  Rel('g',3)
Rel('j',7)  Rel('m',1)  Rel('r',5)  Rel('z',9)

The tuple-space for the above tuples:

[Diagram:Pics/select/2d-space.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [28/74]
❖ kd-Trees

kd-trees are multi-way search trees where

[Diagram:Pics/select/kd-tree.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [29/74]
❖ kd-Trees (cont)

How this tree partitions the tuple space:

[Diagram:Pics/select/kd-tree-space.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [30/74]
❖ Searching in kd-Trees

// Started by Search(Q, R, 0, kdTreeRoot)
Search(Query Q, Relation R, Level L, Node N)
{
   if (isDataPage(N)) {
      Buf = getPage(fileOf(R),idOf(N))
      check Buf for matching tuples
   } else {
      a = attrLev[L]
      if (!hasValue(Q,a))
         nextNodes = all children of N
      else {
         val = getAttr(Q,a)
         nextNodes = find(N,Q,a,val)
      }
      for each C in nextNodes
         Search(Q, R, L+1, C)
}  }

COMP9315 24T1 ♢ Lectures Part C ♢ [31/74]
❖ Quad Trees

Quad trees use regular, disjoint partitioning of tuple space.

Example:

[Diagram:Pics/select/quad-tree-space.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [32/74]
❖ Quad Trees (cont)

Basis for the partitioning:

Note: effective for d≤5, ok for 6≤d≤10, ineffective for d>10
COMP9315 24T1 ♢ Lectures Part C ♢ [33/74]
❖ Quad Trees (cont)

The previous partitioning gives this tree structure, e.g.

[Diagram:Pics/select/quad-tree1.png]

In this and following examples, we give coords of top-left,bottom-right of a region

COMP9315 24T1 ♢ Lectures Part C ♢ [34/74]
❖ Searching in Quad-tree

Space query example:

[Diagram:Pics/select/quad-query.png]

Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).

COMP9315 24T1 ♢ Lectures Part C ♢ [35/74]
❖ R-Trees

R-trees use a flexible, overlapping partitioning of tuple space.

Overlap and partial cover means: Aim: height-balanced, partly-full index pages   (cf. B-tree)
COMP9315 24T1 ♢ Lectures Part C ♢ [36/74]
❖ R-Trees (cont)

[Diagram:Pics/select/r-tree.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [37/74]
❖ Insertion into R-tree

Insertion of an object R occurs as follows:

Note that R may be a point or a polygon.
COMP9315 24T1 ♢ Lectures Part C ♢ [38/74]
❖ Query with R-trees

Designed to handle space queries and "where-am-I" queries.

"Where-am-I" query: find all regions containing a given point P:

Space (region) queries are handled in a similar way
COMP9315 24T1 ♢ Lectures Part C ♢ [39/74]
❖ Multi-d Trees in PostgreSQL

Up to version 8.2, PostgreSQL had R-tree implementation

Superseded by GiST = Generalized Search Trees

GiST indexes parameterise: data type, searching, splitting

GiST trees have the following structural constraints: Details: src/backend/access/gist
COMP9315 24T1 ♢ Lectures Part C ♢ [40/74]
❖ Costs of Search in Multi-d Trees

Difficult to determine cost precisely.

Best case: pmr query where all attributes have known values

Typical case: some attributes are unknown or defined by range
COMP9315 24T1 ♢ Lectures Part C ♢ [41/74]
❖ Similarity Selection

Relational selection is based on a boolean condition C

Uses for relational selection:
COMP9315 24T1 ♢ Lectures Part C ♢ [42/74]
❖ Similarity Selection (cont)

Similarity selection is used in contexts where

Uses for similarity matching:
COMP9315 24T1 ♢ Lectures Part C ♢ [43/74]
❖ Example: Content-based Image Retrieval

User supplies a description or sample of desired image.

System returns a ranked list of "matching" images from database.

[Diagram:Pics/select/cbir-sys.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [44/74]
❖ Example: Content-based Image Retrieval (cont)

At the SQL level, this might appear as ...

// relational matching
create view Sunset as
select image from MyPhotos
where  title = 'Pittwater Sunset'
       and taken = '2012-01-01';
// similarity matching with threshold
create view SimilarSunsets as
select title, image
from   MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order  by (image ~~ (select * from Sunset));

where (imaginary) ~~ operator measures how "alike" images are

COMP9315 24T1 ♢ Lectures Part C ♢ [45/74]
❖ Similarity-based Retrieval

Database contains media objects, but also tuples, e.g.


BLOB = Binary Large OBject

COMP9315 24T1 ♢ Lectures Part C ♢ [46/74]
❖ Similarity-based Retrieval (cont)

Similarity-based retrieval requires a distance measure

where x and y are two objects (in the database)

Note: distance calculation often requires substantial computational effort


How to restrict solution set to only the "most similar" objects:

BUT both above methods require knowing distance between query object and all objects in DB
COMP9315 24T1 ♢ Lectures Part C ♢ [47/74]
❖ Similarity-based Retrieval (cont)

Naive approach to similarity-based retrieval

q = ...    // query object
dmax = ... // dmax > 0  =>  using threshold
knn = ...  // knn > 0   =>  using nearest-neighbours
Dists = [] // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists  // sorted on d
}
n = 0;  Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results  // sorted on d
}
return Results;

Cost  =  fetch all r objects  +  compute distance()  for each

COMP9315 24T1 ♢ Lectures Part C ♢ [48/74]
❖ Similarity-based Retrieval (cont)

For some applications, Cost(dist(x,y)) is comparable to Tr

  computing dist(t.val,q) for every tuple t is infeasible.

To improve this ...

i.e. replace dist(t,q) by dist'(vec(t),vec(q)) in previous algorithm.

Further optimisation: dimension-reduction to make vectors smaller

COMP9315 24T1 ♢ Lectures Part C ♢ [49/74]
❖ Similarity-based Retrieval (cont)

Feature vectors ...

Content of feature vectors depends on application ... Query: feature vector representing one point in vh-dim space

Answer: list of objects "near to" query object in this space

COMP9315 24T1 ♢ Lectures Part C ♢ [50/74]
❖ Similarity-based Retrieval (cont)

Inputs to content-based similarity-retrieval:

Outputs from content-based similarity-retrieval:
COMP9315 24T1 ♢ Lectures Part C ♢ [51/74]
❖ Approaches to kNN Retrieval

Partition-based

Approximation-based
COMP9315 24T1 ♢ Lectures Part C ♢ [52/74]
❖ Approaches to kNN Retrieval (cont)

Above approaches try to reduce number of objects considered.

Other optimisations to make kNN retrieval faster
COMP9315 24T1 ♢ Lectures Part C ♢ [53/74]
❖ Similarity Retrieval in PostgreSQL

PostgreSQL has always supported simple "similarity" on strings

-- for most SQL implementations
select * from Students where name like '%oo%';
-- and PostgreSQL-specific
select * from Students where name ~ '[Ss]mit';


Also provides support for ranked similarity on text values

COMP9315 24T1 ♢ Lectures Part C ♢ [54/74]
❖ Similarity Retrieval in PostgreSQL (cont)

Example of PostgreSQL text retrieval:

create table Docs
   ( id integer, title text, body text );
// add column to hold document feature vectors
alter table Docs add column features tsvector;
update Docs set features =
   to_tsvector('english', title||' '||body);
// ask query and get results in ranked order
select title, ts_rank(d.features, query) as rank
from   Docs d,
       to_tsquery('potter|(roger&rabbit)') as query
where  query @@ d.features
order  by rank desc
limit  10;

For more details, see PostgreSQL documentation, Chapter 12.

COMP9315 24T1 ♢ Lectures Part C ♢ [55/74]
❖ Indexing with Signatures

Signature-based indexing:

Each tuple is associated with a signature

Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 24T1 ♢ Lectures Part C ♢ [56/74]
❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

COMP9315 24T1 ♢ Lectures Part C ♢ [57/74]
❖ Signatures

A signature "summarises" the data in one tuple

A tuple consists of N attribute values A1 .. An

A codeword cw(Ai) is

A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
COMP9315 24T1 ♢ Lectures Part C ♢ [58/74]
❖ Indexing with Signatures

Signature-based indexing:

Each tuple is associated with a signature

Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 24T1 ♢ Lectures Part C ♢ [59/74]
❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

COMP9315 24T1 ♢ Lectures Part C ♢ [60/74]
❖ Signatures

A signature "summarises" the data in one tuple

A tuple consists of N attribute values A1 .. An

A codeword cw(Ai) is

A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
COMP9315 24T1 ♢ Lectures Part C ♢ [61/74]
❖ Generating Codewords

Generating a k-in-m codeword for attribute Ai

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;   // count of set bits
   Bits Codeword = 0;
   seed random number generator with hash of attr_value
   while (less than k bits set in Codeword) {
      i = generate random bit position
      if (i'th bit in Codeword not already set) {
         set i'th bit in Codeword
         note one more bit set
      }
   }
   return Codeword;
   // m-bits with k 1-bits and m-k 0-bits
}

COMP9315 24T1 ♢ Lectures Part C ♢ [62/74]
❖ Superimposed Codewords (SIMC)

In a superimposed codewords (simc) indexing scheme

A tuple descriptor desc(r) is Method (assuming all n attributes are used in descriptor):

bits desc = 0 
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i])
   desc = desc | cw
}

COMP9315 24T1 ♢ Lectures Part C ♢ [63/74]
❖ SIMC Example

Consider the following tuple (from bank deposit database)

Branch AcctNo Name Amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 12,   k = 2 )

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111
COMP9315 24T1 ♢ Lectures Part C ♢ [64/74]
❖ SIMC Queries

To answer query q in SIMC

desc(q) is formed by OR of codewords for known attributes.

E.g. consider the query (Perryridge, ?, ?, ?).

Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001
COMP9315 24T1 ♢ Lectures Part C ♢ [65/74]
❖ SIMC Queries (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ pid
    }
}
for each P in pagesToCheck {
    Buf = getPage(f,P)
    check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc)
               ((rdesc & qdesc) == qdesc)

COMP9315 24T1 ♢ Lectures Part C ♢ [66/74]
❖ Example SIMC Query

Consider the query and the example database:

Signature Deposit Record
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010011000111 (Perryridge,102,Hayes,400)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010101010101 (Clearview,117,Throggs,295)
100101010011 (Redwood,222,Lindsay,695)

Gives two matches:  one true match,  one false match.

COMP9315 24T1 ♢ Lectures Part C ♢ [67/74]
❖ SIMC Parameters

False match probablity pF  =  likelihood of a false match

How to reduce likelihood of false matches?


Larger m means reading more descriptor data.

Having k too high    increased overlapping.
Having k too low    increased hash collisions.

COMP9315 24T1 ♢ Lectures Part C ♢ [68/74]
❖ SIMC Parameters (cont)

How to determine "optimal" m and k?

  1. start by choosing acceptable pF
      (e.g. pF ≤ 10-5 i.e. one false match in 10,000)
  2. then choose m and k to achieve no more than this pF.
Formulae to derive m and k given pF and n:

k  =  1/loge2 . loge ( 1/pF )

m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )

COMP9315 24T1 ♢ Lectures Part C ♢ [69/74]
❖ Query Cost for SIMC

Cost to answer pmr query: Costpmr = bD + bq

bD = ceil( r/cD )  and  cD = floor(B/ceil(m/8))

E.g. m=64,   B=8192,   r=104    ⇒    cD = 1024,   bD=10

bq includes pages with rq matching tuples and rF false matches

Expected false matches = rF  =  (r - rq).pF  ≅  r.pF   if rq ≪ r

E.g. Worst bq = rq+rF,   Best bq = 1,   Avg bq = ceil(b(rq+rF)/r)

COMP9315 24T1 ♢ Lectures Part C ♢ [70/74]
❖ Page-level SIMC

SIMC has one descriptor per tuple ... potentially inefficient.

Alternative approach: one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

Size of page descriptor (PD) (clearly larger than tuple descriptor):

E.g. n = 4, c = 128, pF = 10-3   ⇒   m ≅ 7000bits ≅ 900bytes

Typically, pages are 1..8KB    1..9 PD/page (NPD).

COMP9315 24T1 ♢ Lectures Part C ♢ [71/74]
❖ Page-Level SIMC Files

File organisation for page-level superimposed codeword index

[Diagram:Pics/select/simc-2level.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [72/74]
❖ Bit-sliced SIMC

Improvement: store b m-bit page descriptors as m b-bit "bit-slices"


[Diagram:Pics/select/bit-sliced.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [73/74]
❖ Bit-sliced SIMC (cont)

At query time

matches = ~0  //all ones
for each bit i set to 1 in desc(q) {
   slice = fetch bit-slice i
   matches = matches & slice
}
for each bit i set to 1 in matches {
   fetch page i
   scan page for matching records
}


Effective because desc(q) typically has less than half bits set to 1

COMP9315 24T1 ♢ Lectures Part C ♢ [74/74]


Produced: 30 Apr 2024