❖ N-dimensional Queries |
Have looked at one-dimensional queries, e.g.
select * from R where a = K
select * from R where a between Lo and Hi
and at heaps, hashing and indexing as ways of implementing them efficiently.
Now consider techniques for efficient multi-dimensional queries.
Compared to 1-d queries, multi-dimensional queries are more complex to support efficiently, since the selection condition may involve any combination of the relation's attributes.
❖ Operations for Nd Select |
N-dimensional select queries = condition on ≥1 attributes.
select * from Employees where job = 'Manager' and gender = 'M';
select * from Employees where age between 20 and 50 and salary between 40000 and 60000;
❖ N-d Selection via Heaps |
Heap files can handle pmr (partial-match retrieval) or space (range) queries using the standard scan method:
// select * from R where C
r = openRelation("R",READ);
for (p = 0; p < nPages(r); p++) {
    buf = getPage(file(r), p);
    for (i = 0; i < nTuples(buf); i++) {
        t = getTuple(buf,i);
        if (matches(t,C))
            add t to result set
    }
}
Cost_pmr = Cost_space = b
❖ N-d Selection via Multiple Indexes |
DBMSs already support building multiple indexes on a table.
Which indexes to build depends on which queries are asked.
create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);
But more indexes ⇒ space + update overheads.
❖ N-d Queries and Indexes |
Generalised view of pmr and space queries:
select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
pmr: all op_i are equality tests.  space: some op_i are range tests.
Possible approaches to handling such queries ...
❖ N-d Queries and Indexes (cont) |
If using just one of several indexes, which one to use?
select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
The one with the best selectivity for a_i op_i C_i (i.e. fewest matching tuples)
Factors determining selectivity of a_i op_i C_i
❖ N-d Queries and Indexes (cont) |
Implementing selection using one of several indices:
// Query: select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
// choose a_i with best selectivity
TupleIDs = IndexLookup(R, a_i, op_i, C_i)
// gives { tid1, tid2, ...} for tuples satisfying a_i op_i C_i
PageIDs = { }
foreach tid in TupleIDs {
    PageIDs = PageIDs ∪ {pageOf(tid)}
}
// PageIDs = a set of b_qix page numbers
...
Cost = Cost_index + b_qix   (some pages do not contain answers, so b_qix > b_q)
DBMSs typically maintain statistics to assist with determining selectivity
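For illustration, a minimal sketch (catalog fields and names are assumptions) of using such statistics to pick the most selective equality condition, under a uniform-distribution assumption:

    // Hypothetical catalog statistics for one attribute
    typedef struct {
        char *attrName;
        int   nDistinct;   // number of distinct values in this attribute
    } AttrStats;

    // Expected matches for an equality test = r / nDistinct (uniformity);
    // return index of the attribute with the fewest expected matches
    int bestEqualityAttr(AttrStats stats[], int n, int r)
    {
        int best = 0;
        double bestEst = (double)r / stats[0].nDistinct;
        for (int i = 1; i < n; i++) {
            double est = (double)r / stats[i].nDistinct;
            if (est < bestEst) { bestEst = est; best = i; }
        }
        return best;
    }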
❖ N-d Queries and Indexes (cont) |
Implementing selection using multiple indices:
// Query: select * from R where a_1 op_1 C_1 and ... and a_n op_n C_n
// assumes an index on at least a_1
TupleIDs = IndexLookup(R, a_1, op_1, C_1)
foreach attribute a_i with an index {
    tids = IndexLookup(R, a_i, op_i, C_i)
    TupleIDs = TupleIDs ∩ tids
}
PageIDs = { }
foreach tid in TupleIDs {
    PageIDs = PageIDs ∪ {pageOf(tid)}
}
// PageIDs = a set of b_q page numbers
...
Cost = k·Cost_index + b_q   (assuming indexes on k of the n attributes)
❖ Bitmap Indexes |
Alternative index structure, focussing on sets of tuples:
Index contains bit-strings of r bits, one for each value (or value range); bit j is set iff tuple j has that value
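As a minimal illustration (names assumed, and r limited to one machine word), the bit-strings for one attribute could be built like this:

    typedef unsigned int Bits;   // one bit per tuple; assumes r <= 32 here

    // Build one bit-string per distinct attribute value:
    // bit t of maps[i] is set iff tuple t has value values[i]
    void buildBitmaps(int tupleVals[], int r, int values[], int v, Bits maps[])
    {
        for (int i = 0; i < v; i++)
            maps[i] = 0;
        for (int t = 0; t < r; t++)
            for (int i = 0; i < v; i++)
                if (tupleVals[t] == values[i])
                    maps[i] |= (1u << t);   // tuple t matches value i
    }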
❖ Bitmap Indexes (cont) |
Answering queries using bitmap index:
Matches = AllOnes(r)
foreach attribute A with index {
    // select i'th bit-string for attribute A
    // based on value associated with A in WHERE
    Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
    if (Matches[i] == 0) continue;
    Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
    P = getPage(pid)
    extract matching tuples from P
}
❖ Bitmap Indexes (cont) |
Storage costs for bitmap indexes:
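As a rough worked example (numbers assumed): for r = 10^6 tuples, an attribute with 10 distinct values needs 10 bit-strings of 10^6 bits each, i.e. 10 × 122KB ≈ 1.2MB; space grows with r and with the number of values/ranges indexed per attribute.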
❖ Hashing and pmr |
For a pmr query like
select * from R where a_1 = C_1 and ... and a_n = C_n
❖ Hashing and pmr (cont) |
Multi-attribute hashing parameters: a file of b = 2^d pages; each attribute a_i contributes d_i bits of its hash to the d-bit page address (Σ d_i = d); a choice vector specifies which bit of which attribute's hash supplies each address bit.
❖ MA.Hashing Example |
Consider relation Deposit(branch,acctNo,name,amount)
Assume a small data file with 8 main data pages (plus overflows).
Hash parameters:  d=3,  d_1=1,  d_2=1,  d_3=1,  d_4=0
Note that we ignore the amount attribute (d_4 = 0).
Assumes that nobody will want to ask queries like
select * from Deposit where amount=533
Choice vector is designed taking expected queries into account.
❖ MA.Hashing Example (cont) |
Choice vector: specifies, for each of the d page-address bits, which attribute and which bit of that attribute's hash value to use.
This choice vector tells us how to assemble a page address from the hash values of branch, acctNo and name.
❖ MA.Hashing Example (cont) |
Consider the tuple:
branch | acctNo | name | amount |
Downtown | 101 | Johnston | 512 |
Hash value (page address) is computed by taking one bit from each of hash(branch), hash(acctNo) and hash(name), as directed by the choice vector.
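For illustration, with hypothetical hash values: suppose hash('Downtown') has bit 0 = 1, hash('101') has bit 0 = 0, hash('Johnston') has bit 0 = 1, and the choice vector maps bit 0 of branch, acctNo and name to page-address bits 0, 1 and 2 respectively. The page address is then 101 (binary, bits 2..0) = page 5.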
❖ MA.Hashing Hash Functions |
Auxiliary definitions:
#define MaxHashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value h
#define bit(i,h) (((h) & (1 << (i))) >> (i))

// choice vector elems
typedef struct { int attr; int bit; } CVelem;
typedef CVelem ChoiceVec[MaxHashSize];

// hash function for individual attributes
HashVal hash_any(char *val) { ... }
❖ MA.Hashing Hash Functions (cont) |
Produce combined d-bit hash value for tuple t :
HashVal hash(Tuple t, ChoiceVec cv, int d)
{
    HashVal h[nAttr(t)+1];   // hash for each attr
    HashVal res = 0, oneBit;
    int i, a, b;

    for (i = 1; i <= nAttr(t); i++)
        h[i] = hash_any(attrVal(t,i));
    for (i = 0; i < d; i++) {
        a = cv[i].attr;
        b = cv[i].bit;
        oneBit = bit(b, h[a]);
        res = res | (oneBit << i);
    }
    return res;
}
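For illustration, a usage sketch with a hypothetical choice vector for the Deposit example (d = 3; one bit each from branch, acctNo and name):

    // hypothetical: attrs numbered 1..4 = (branch, acctNo, name, amount)
    ChoiceVec cv = {
        {1, 0},   // address bit 0 = bit 0 of hash(branch)
        {2, 0},   // address bit 1 = bit 0 of hash(acctNo)
        {3, 0},   // address bit 2 = bit 0 of hash(name)
    };
    ...
    HashVal pid = hash(t, cv, 3);   // 3-bit page address (0..7) for tuple t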
❖ Queries with MA.Hashing |
In a partial match query, some attribute values are known and the rest are unspecified, e.g.
select amount from Deposit where branch = 'Brighton' and name = 'Green'
for which we use the shorthand (Brighton, ?, Green, ?)
❖ Queries with MA.Hashing (cont) |
Consider query: select amount from Deposit where name='Green'
Matching tuples must be in pages: 100, 101, 110, 111
(the address bit determined by name is fixed; the two bits for the unspecified attributes take all possible values)
❖ MA.Hashing Query Algorithm |
// Builds the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars += d[i]
    }
}
...
❖ MA.Hashing Query Algorithm (cont) |
...
// Use the partial hash to find candidate pages
r = openRelation("R",READ);
for (i = 0; i < 2^nstars; i++) {
    pid = composite hash
    replace *'s in pid using i and choice vector
    Buf = readPage(file(r), pid);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}
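A compilable sketch (representation assumed: the known bits in val, a mask with 1-bits at the * positions) of the *-substitution step:

    #include <stdio.h>

    typedef unsigned int Bits;

    // Spread the low bits of i into the positions marked in starMask
    Bits fillStars(Bits val, Bits starMask, Bits i, int d)
    {
        Bits pid = val;
        for (int b = 0; b < d; b++) {
            if (starMask & (1u << b)) {
                pid |= (i & 1u) << b;   // next bit of i fills this * position
                i >>= 1;
            }
        }
        return pid;
    }

    int main(void)
    {
        Bits val = 4;        // 100 : the known (name) bit
        Bits starMask = 3;   // 011 : the two unknown bits
        int  d = 3, nstars = 2;
        for (Bits i = 0; i < (1u << nstars); i++)
            printf("candidate page %u\n", fillStars(val, starMask, i, d));
        return 0;   // prints pages 4,5,6,7 = 100,101,110,111
    }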
❖ Query Cost for MA.Hashing |
Multi-attribute hashing handles a range of query types, e.g.
select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7
A relation with n attributes has 2^n different query types.
Different query types have different costs (different numbers of *'s in the partial hash value).
Cost(Q) = 2^s, where s = Σ_{i∉Q} d_i   (alternatively, Cost(Q) = Π_{i∉Q} 2^{d_i})
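E.g. in the Deposit example (d_1 = d_2 = d_3 = 1, d_4 = 0), a query specifying only name has s = d_1 + d_2 + d_4 = 2, so Cost(Q) = 2^2 = 4, matching the four candidate pages seen earlier.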
Query distribution gives probability pQ of asking each query type Q.
❖ Query Cost for MA.Hashing (cont) |
Min query cost occurs when all attributes are used in query
Min Cost_pmr = 1
Max query cost occurs when no attributes are specified
Max Cost_pmr = 2^d = b
Average cost is given by weighted sum over all query types:
Avg Cost_pmr = Σ_Q p_Q × Π_{i∉Q} 2^{d_i}
Aim to minimise the weighted average query cost over possible query types
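For illustration, with an assumed query distribution: if queries on name alone (cost 4) have p_Q = 0.75 and fully-specified queries (cost 1) have p_Q = 0.25, then Avg Cost_pmr = 0.75×4 + 0.25×1 = 3.25 page reads.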
❖ Optimising MA.Hashing Cost |
For a given application, it is useful to minimise Cost_pmr.
Can be achieved by choosing appropriate values for the d_i (i.e. by designing the choice vector)
Heuristics: e.g. assign more hash bits (larger d_i) to attributes that appear most often in queries.
This is a combinatorial optimisation problem
(solve via standard optimisation techniques e.g. simulated annealing)
❖ Multi-dimensional Tree Indexes |
Many multi-dimensional tree index structures have been developed over the last 20 years, arising from a range of problem areas.
Consider three popular schemes: kd-trees, Quad-trees, R-trees.
Example data for multi-d trees is based on the following relation:
create table Rel (
    X char(1) check (X between 'a' and 'z'),
    Y integer check (Y between 0 and 9)
);
❖ Multi-dimensional Tree Indexes (cont) |
Example tuples:
Rel('a',1) Rel('a',5) Rel('b',2) Rel('d',1) Rel('d',2) Rel('d',4) Rel('d',8) Rel('g',3) Rel('j',7) Rel('m',1) Rel('r',5) Rel('z',9)
The tuple-space for the above tuples:
❖ kd-Trees |
kd-trees are multi-way search trees where each level of the tree partitions the tuple space on a different single attribute.
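A minimal sketch (field names are assumptions) of a node structure consistent with the search algorithm below:

    // A kd-tree node; leaves refer to data pages, internal nodes
    // hold key values partitioning this level's attribute
    typedef struct KdNode {
        int  isDataPage;         // non-zero => leaf referring to a data page
        int  pageId;             // data page id (leaves only)
        int  nKeys;              // internal: number of partitioning keys
        int *keys;               // keys[0..nKeys-1], sorted
        struct KdNode **child;   // child[0..nKeys], one subtree per key range
    } KdNode;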
❖ Searching in kd-Trees |
// Started by Search(Q, R, 0, kdTreeRoot)
Search(Query Q, Relation R, Level L, Node N)
{
    if (isDataPage(N)) {
        Buf = getPage(fileOf(R), idOf(N))
        check Buf for matching tuples
    } else {
        a = attrLev[L]   // attribute partitioned at this level
        if (!hasValue(Q,a))
            nextNodes = all children of N
        else {
            val = getAttr(Q,a)
            nextNodes = find(N,Q,a,val)
        }
        for each C in nextNodes
            Search(Q, R, L+1, C)
    }
}
❖ Quad Trees |
Quad trees use regular, disjoint partitioning of tuple space.
❖ Quad Trees (cont) |
Basis for the partitioning: recursively divide the tuple space into four equal quadrants (NW, NE, SW, SE), subdividing any quadrant that holds too many tuples.
❖ Quad Trees (cont) |
The previous partitioning gives this tree structure, e.g.
In this and the following examples, we give the coordinates of the top-left and bottom-right corners of each region.
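A minimal sketch (names assumed) of a quad-tree node, with one subtree per quadrant:

    typedef struct { int x, y; } Point;

    // A quad-tree node covering the region (topLeft, botRight);
    // internal nodes split it into four equal, disjoint quadrants
    typedef struct QuadNode {
        Point topLeft, botRight;
        int   pageId;                         // data page id, if leaf
        struct QuadNode *NW, *NE, *SW, *SE;   // NULL for leaves
    } QuadNode;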
❖ Searching in Quad-tree |
Space query example:
Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).
❖ R-Trees |
R-trees use a flexible, overlapping partitioning of tuple space.
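A minimal sketch (field names and capacity are assumptions) of an R-tree node, where each entry pairs a minimum bounding rectangle (MBR) with a subtree:

    #define MAX_ENTRIES 4   // node capacity, kept small for illustration

    typedef struct { int x1, y1, x2, y2; } Rect;   // an MBR

    // Sibling MBRs may overlap, unlike quad-tree regions
    typedef struct RTreeNode {
        int isLeaf;
        int nEntries;
        struct {
            Rect mbr;                  // bounds everything below this entry
            struct RTreeNode *child;   // subtree, or object/page ref at leaves
        } entry[MAX_ENTRIES];
    } RTreeNode;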
❖ Insertion into R-tree |
Insertion of an object R occurs as follows: descend from the root, at each level choosing the subtree whose MBR needs least enlargement to include R; insert R into the chosen leaf; if the leaf overflows, split it and propagate MBR changes up the tree.
❖ Query with R-trees |
Designed to handle space queries and "where-am-I" queries.
"Where-am-I" query: find all regions containing a given point P:
❖ Multi-d Trees in PostgreSQL |
Up to version 8.2, PostgreSQL had a separate R-tree implementation
Superseded by GiST = Generalized Search Trees
GiST indexes parameterise: the data type, the search strategy, and the page-splitting strategy (e.g. the picksplit() method)
❖ Costs of Search in Multi-d Trees |
Difficult to determine cost precisely.
Best case: a pmr query where all attributes have known values ⇒ the search follows a single root-to-leaf path; Cost ≈ depth of tree + 1 data page
❖ Similarity Selection |
Relational selection is based on a boolean condition C: each tuple either satisfies C or does not, and all answers are equally "good".
❖ Similarity Selection (cont) |
Similarity selection is used in contexts where exact matching is inappropriate: the user gives a description or an example, and results are ranked by how closely they match.
❖ Example: Content-based Image Retrieval |
User supplies a description or sample of desired image.
System returns a ranked list of "matching" images from database.
❖ Example: Content-based Image Retrieval (cont) |
At the SQL level, this might appear as ...
-- relational matching
create view Sunset as
select image from MyPhotos
where  title = 'Pittwater Sunset' and taken = '2012-01-01';

-- similarity matching with threshold
create view SimilarSunsets as
select title, image from MyPhotos
where  (image ~~ (select * from Sunset)) < 0.05
order by (image ~~ (select * from Sunset));
where ~~ is an imaginary image-similarity operator (returning a distance between two images).
❖ Similarity-based Retrieval |
Database contains media objects, but also tuples, e.g.

    id  : oid      -- object identifier
    val : bytea    -- the object data (bytea holds a BLOB = Binary Large OBject)
❖ Similarity-based Retrieval (cont) |
Similarity-based retrieval requires a distance measure dist(x,y) over pairs of objects (smaller distance = more similar).
How to restrict the solution set to only the "most similar" objects: via a distance threshold d_max, or by taking the k nearest neighbours (kNN).
❖ Similarity-based Retrieval (cont) |
Naive approach to similarity-based retrieval
q = ...      // query object
dmax = ...   // dmax > 0 => using threshold
knn = ...    // knn > 0  => using nearest-neighbours
Dists = []   // empty list
foreach tuple t in R {
    d = dist(t.val, q)
    insert (t.oid,d) into Dists   // sorted on d
}
n = 0;  Results = []
foreach (i,d) in Dists {
    if (dmax > 0 && d > dmax) break;
    if (knn > 0 && ++n > knn) break;
    insert (i,d) into Results     // sorted on d
}
return Results;
Cost = fetch all r objects + compute dist() for each
❖ Similarity-based Retrieval (cont) |
For some applications, Cost(dist(x,y)) is comparable to T_r (the cost of a page read)
⇒ computing dist(t.val,q) for every tuple t is prohibitively expensive.
To improve this ... compute distances over compact feature vectors extracted from the objects, rather than over the raw objects.
Further optimisation: dimension-reduction to make vectors smaller
❖ Similarity-based Retrieval (cont) |
Feature vectors represent each object as a point in a high-dimensional space.
Answer: a list of objects "near to" the query object in this space
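For illustration, a minimal sketch of a typical distance measure on feature vectors (Euclidean distance; the vector length DIM is an assumed parameter):

    #include <math.h>

    #define DIM 64   // feature-vector length, chosen for illustration

    // Euclidean distance between two feature vectors
    double dist(const double x[DIM], const double y[DIM])
    {
        double sum = 0.0;
        for (int i = 0; i < DIM; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sqrt(sum);
    }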
❖ Similarity-based Retrieval (cont) |
Inputs to content-based similarity-retrieval: a database of feature vectors, a query feature vector q, a distance measure dist(x,y), and either k (number of neighbours) or a threshold d_max.
❖ Approaches to kNN Retrieval |
Partition-based: use a multi-dimensional index structure (such as the trees above) so that only objects in partitions "near" the query need to be examined.
❖ Approaches to kNN Retrieval (cont) |
The above approaches try to reduce the number of objects considered.
❖ Similarity Retrieval in PostgreSQL |
PostgreSQL has always supported simple "similarity" on strings
-- for most SQL implementations
select * from Students where name like '%oo%';
-- and PostgreSQL-specific
select * from Students where name ~ '[Ss]mit';
Also provides support for ranked similarity on text values:
tsvector: a feature vector extracted from a text value (via to_tsvector())
tsquery: a query over such feature vectors (via to_tsquery())
@@: the match operator between a tsquery and a tsvector
❖ Similarity Retrieval in PostgreSQL (cont) |
Example of PostgreSQL text retrieval:
create table Docs ( id integer, title text, body text );

-- add column to hold document feature vectors
alter table Docs add column features tsvector;
update Docs set features = to_tsvector('english', title||' '||body);

-- ask query and get results in ranked order
select title, ts_rank(d.features, query) as rank
from   Docs d, to_tsquery('potter|(roger&rabbit)') as query
where  query @@ d.features
order  by rank desc limit 10;
For more details, see PostgreSQL documentation, Chapter 12.
❖ Indexing with Signatures |
Signature-based indexing:
Each tuple is associated with a signature
❖ Indexing with Signatures (cont) |
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement
❖ Signatures |
A signature "summarises" the data in one tuple
A tuple consists of n attribute values A_1 .. A_n
A codeword cw(A_i) is an m-bit value with exactly k bits set to 1, derived by hashing A_i
❖ Generating Codewords |
Generating a k-in-m codeword for attribute A_i:

#include <stdlib.h>

typedef unsigned int Bits;

Bits codeword(char *attr_value, int m, int k)
{
    int  nbits = 0;                  // count of set bits
    Bits cword = 0;                  // assumes m <= 32
    srandom(hash_any(attr_value));   // seed RNG with hash of value
    while (nbits < k) {              // until k bits set in cword
        int i = random() % m;        // random bit position
        if ((cword & (1 << i)) == 0) {
            cword |= (1 << i);       // set i'th bit
            nbits++;                 // note one more bit set
        }
    }
    return cword;   // m-bits with k 1-bits and m-k 0-bits
}
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (SIMC) indexing scheme, a tuple descriptor desc(t) is formed by bitwise OR-ing the codewords of all of the tuple's attribute values:

Bits desc = 0
for (i = 1; i <= n; i++) {
    Bits cw = codeword(A[i])
    desc = desc | cw
}
❖ SIMC Example |
Consider the following tuple (from bank deposit database)
Branch | AcctNo | Name | Amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 12, k = 2 )
Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(r) | 010011000111 |
❖ SIMC Queries |
To answer query q in SIMC
desc(q) is formed by OR of codewords for known attributes.
E.g. consider the query (Perryridge, ?, ?, ?)
Ai         | cw(Ai)
Perryridge | 010000000001
?          | 000000000000
?          | 000000000000
?          | 000000000000
desc(q)    | 010000000001
❖ SIMC Queries (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {}
for each descriptor D[i] in signature file {
    if (matches(D[i], desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ {pid}
    }
}
for each P in pagesToCheck {
    Buf = getPage(f,P)
    check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc)  ((rdesc & qdesc) == qdesc)
❖ Example SIMC Query |
Consider the query and the example database:
Query: (Perryridge,?,?,?) with desc(q) = 010000000001

Signature    | Deposit Record
100101001001 | (Brighton,217,Green,750)
010011000111 | (Perryridge,102,Hayes,400)
101001001001 | (Downtown,101,Johnston,512)
101100000011 | (Mianus,215,Smith,700)
010101010101 | (Clearview,117,Throggs,295)
100101010011 | (Redwood,222,Lindsay,695)
Gives two matches: one true match (Perryridge,102,Hayes,400) and one false match (Clearview,117,Throggs,295).
❖ SIMC Parameters |
False match probability p_F = likelihood of a false match
How to reduce likelihood of false matches?
Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased hash collisions.
❖ SIMC Parameters (cont) |
How to determine "optimal" m and k?
m = (1/log_e 2)^2 × n × log_e(1/p_F)
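Applying the formula with assumed numbers: for n = 4 attributes per tuple and a target p_F = 10^-3, m = (1/log_e 2)^2 × 4 × log_e(1000) ≈ 2.08 × 4 × 6.91 ≈ 58 bits, so a 64-bit descriptor suffices (consistent with the m = 64 example below).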
❖ Query Cost for SIMC |
Cost to answer pmr query: Cost_pmr = b_D + b_q
E.g. m = 64 bits = 8 bytes, B = 8192 ⇒ c_D = 8192/8 = 1024 descriptors/page; r = 10^4 ⇒ b_D = ⌈10^4/1024⌉ = 10
b_q includes pages with the r_q matching tuples and the r_F false matches
Expected false matches = r_F = (r - r_q)·p_F ≅ r·p_F if r_q ≪ r
E.g. Worst b_q = r_q + r_F,  Best b_q = 1,  Avg b_q = ⌈b·(r_q + r_F)/r⌉
❖ Page-level SIMC |
SIMC has one descriptor per tuple ... potentially inefficient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor (PD) (clearly larger than tuple descriptor):
Typically, pages are 1..8KB ⇒ 1..9 PD/page (NPD).
❖ Bit-sliced SIMC (cont) |
At query time:

matches = ~0   // all ones
for each bit i set to 1 in desc(q) {
    slice = fetch bit-slice i
    matches = matches & slice
}
for each bit i set to 1 in matches {
    fetch page i
    scan page for matching records
}
Effective because desc(q) typically has fewer than half of its bits set to 1, so only a few bit-slices need to be fetched.