Selection on Multiple Attributes


Multi-dimensional (Nd) Selection


Operations for Nd Select2/152

N-dimensional select queries = condition on several attributes.


Tuple Space3/152

One view of N-dimensional selection on a relation R...

E.g. if N=D, we are checking existence of a tuple at a point


Heaps4/152

Heap files can handle pmr or space using standard method:

// select * from R where Cond
f = openFile(fileName("R"),READ);
for (p = 0; p < nPages(f); p++) {
    buf = getPage(f, p);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,Cond))
            add tup to result set
    }
}

Costpmr  =  Costspace  =  b     (i.e. O(n), worst case scenario)


Multiple Indexes5/152

DBMSs already support building multiple indexes on a table.

Which indexes to build, depends on which queries are asked.

If we don't know queries, index all attribute subsets:

create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);


... Multiple Indexes6/152

Characteristics of indexes:

Since single-attribute indexes are common in DBMSs, consider first how to use these.


... Multiple Indexes7/152

Generalised view of pmr and space queries:

select * from R
where  a1 op1 C1 and a2 op1 C2
       and ... and an opn Cn 

where, for pmr, all opi are equality tests

Possible approaches to handling such queries ...

  1. use index on one ai to reduce tuple tests
  2. use indexes on all ai, and intersect answer sets


Selection using One Index8/152

Using index on one attribute:

If several attributes have indexes:


... Selection using One Index9/152

Abstract algorithm:

// select * from R where Cond
// Cond = a1op1C1 and a2op2C2 and a3op3C3 ...
Tids = IndexLookup(R, a[i], op[i], C[i])
// gives { tid1, tid2, ...} for tuples satisfying aiopiCi
Pages = { }
foreach tid in Tids {
    if (pageOf(tid)  Pages)
        Pages = Pages  {pageOf(tid)}
}
f = openFile(fileName("R"),READ)
foreach page in Pages {
    Buf = getPage(f,page);
    foreach tup in Buf {
        if (matches(tup, (a1 op1 C1 ...))
            add tup to Results
}   }


... Selection using One Index10/152

Cost of this approach to selection:

Note: some pages might contain


Selection using Many Indexes11/152

Using indexes on several attributes:

Mi = { tid(t) | t ∈ R ∧ t[ai] opi Ci }


... Selection using Many Indexes12/152

Abstract algorithm:

// select * from R where a1 op1 C1 and a2 op2 C2 ...
for (i = 1; i <= #QueryTerms; i++) {
    if (no index on a[i]) continue
    Tids = IndexLookup(R, a[i], op[i], C[i])
    PageSets[i] = { }
    foreach tid in Tids {
        if (pageOf(tid)  PageSets[i])
            PageSets[i] = PageSets[i]  pageOf(tid)
}   }
// PageSets[i] contains pages with tuples matching (ai opi Ci)
...


... Selection using Many Indexes13/152

...
// compute intersection of Pages[i]
Pages = PageSets[1]
for (i = 2; i <= #QueryTerms; i++) {
    Pages = Pages  PageSets[i];
}
// finally ... fetch data pages
f = openFile(fileName("R"),READ);
foreach page in Pages {
    Buf = getPage(f,page);
    // this page has at least 1 match
    foreach tup in Buf {
        if (matches(tup, (a1 op1 C1 ...))
            add tup to Results
}   }


Bitmap Indexes14/152

A bitmap index assists computation of result sets in pmr.

[Diagram:Pics/select/bitmap-index-small.png]


... Bitmap Indexes15/152

Structure of bitmap index:


... Bitmap Indexes16/152

Consider using bitmap index to solve query:

select * from Employees where dept='Sales' and gender='M'

Method:

tSales = bitmap[dept,'Sales']
tMales = bitmap[gender,'M']
matches = tSales & tMales
for (i in 0..r-1) {
   if (matches[i] == 1) {
      tid = slotToTupleId(i)
      fetch tuple t via TupleId tid
}  }


... Bitmap Indexes17/152

Also useful to have a file of tids, giving file structures:


[Diagram:Pics/file-struct/bitmap-files-small.png]


... Bitmap Indexes18/152

Answering queries using bitmap index:


Matches = AllOnes(r)
foreach attribute A with index {
   // select ith bit-string for attribute A
   // based on value associated with A in WHERE
   Matches = Matches & Bitmaps[A][i]
}
// Matches contains 1-bit for each matching tuple
foreach i in 0..r-1 {
   if (Matches[i] == 0) continue;
   Pages = Pages ∪ {pageOf(Tids[i])}
}
foreach pid in Pages {
   P = getPage(pid) 
   extract matching tuples from P
}


... Bitmap Indexes19/152

Costs associated with bitmap indexes ...

Storage costs:

Query-time costs:


N-dimensional Hashing


Hashing and pmr21/152

Consider these pmr queries on a table hashed using salary

Q1: select * from Employees
    where  salary = 60000 and gender = 'M'

Q2: select * from Employees
    where  dept = 'Sales' and salary = 60K

Q3: select * from Employees
    where  salary > 60000 and dept = 'Sales'

CostQ1 = 1+Ov,     CostQ2  =  b+bOv,     CostQ3  =  b+bOv


... Hashing and pmr22/152

Problem: hashing only effective if hash key used in query.

No problem if all queries involve conditions like key=val

But frequently tables are accessed in multiple ways.

Can we devise hashing that "involves" attributes?

Yes, using multi-attribute hashing (mah) ...


... Hashing and pmr23/152

Multi-attribute hashing ...

Multi-attribute hashing "parameterises" the indexing scheme: Not as good as hashing on one attribute; better than linear scan.


... Hashing and pmr24/152

Multi-attribute hashing parameters:


MA.Hashing Example25/152

Consider branch,acctNo,name,amount) table (+ hash values)

branch h(B) acctNo h(Ac) name h(N) amount
Brighton 1011 217 1001 Green 0101 750
Downtown 0000 101 0101 Johnson 1101 512
Mianus 1101 215 1011 Smith 0001 700
Perryridge 0100 102 0110 Hayes 0010 400
Redwood 0110 222 1110 Lindsay 1000 695
Round Hill 1111 305 0001 Turner 0110 350
Clearview 1110 117 0101 Throggs 0110 295

Note that we ignore the amount attribute; we are assuming, effectively, that nobody will want to ask queries like   select * from where amount=533


... MA.Hashing Example26/152

Hash parameters:    d=3     d1=1    d2=1    d3=1

Choice vector:

[Diagram:Pics/select/choice-vector-small.png]

This choice vector tells us:


... MA.Hashing Example27/152

Consider the tuple:

branch acctNo name amount
Brighton 217 Green 750

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins1-small.png]


... MA.Hashing Example28/152

Consider the tuple:

branch acctNo name amount
Downtown 101 Johnston 512

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins2-small.png]


... MA.Hashing Example29/152

Consider the tuple:

branch acctNo name amount
Round Hill 305 Turner 350

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins3-small.png]


... MA.Hashing Example30/152

Hash values for all tuples:

tuple Hash
(Brighton,217,Green,750) 111
(Downtown,101,Johnshon,512) 110
(Mianus,215,Smith,700) 111
(Perryridge,102,Hayes,400) 000
(Redwood,222,Lindsay,695) 000
(Round Hill,305,Turner,350) 011
(Clearview,117,Throggs,295) 010


... MA.Hashing Example31/152

If inserted, in order of appearance, into a database containing 8 pages:

[Diagram:Pics/select/mahex1-small.png]


MA.Hashing Hash Functions32/152

Auxiliary definitions:

#define HashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value
#define bit(i,h) ((h) & (1 << (i)))

// choice vector elems
typedef struct { int attr, int bit } CVelem;
typedef CVelem ChoiceVec[HashSize];

// compute hash for i'th attribute of tuple t
HashVal hash(Tuple t, int i) { ... }


... MA.Hashing Hash Functions33/152

Produce combined d-bit hash value for tuple tup:

HashVal hash(Tuple tup, ChoiceVec cv, int d)
{
    HashVal h[nAttr(tup)];  // hash for each attr
    HashVal res = 0, oneBit;
    int     i, attr, bpos;
    for (i = 0; i < nAttr(tup); i++)
        h[i] = hash(tup,i);
    for (i = 0; i < d; i++) {
        attr = cv[i].attr;  bpos = cv[i].bit;
        oneBit = bit(bpos, h[attr]);
        res = res | (oneBit << (i-bpos));
    }
    return res;
}


Queries with MA.Hashing34/152

In a partial match query:

E.g.

select amount
from   Deposit
where  branch = 'Brighton' and name = 'Green'

for which we use the shorthand   (Brighton, ?, Green, ?)


... Queries with MA.Hashing35/152

If we try to form a composite hash for a query, we don't know values for some bits:

[Diagram:Pics/select/mahquery1-small.png]


... Queries with MA.Hashing36/152

How to answer this query?

query Hash
(Brighton, ?, Green, ?) 1*1

Any matching tuples must be in pages with addresses 101 or 111.

We need to read just these pages.


... Queries with MA.Hashing37/152

Consider the query:

select amount
from   Deposit
where  name = 'Green'

[Diagram:Pics/select/mahquery2-small.png]

Need to check pages: 100, 101, 110, 111.

(With original hashing scheme, would have read all pages)


... Queries with MA.Hashing38/152

Consider the query:

select amount
from   Deposit
where  branch='Brighton' and acctNo=217 and name='Green'

[Diagram:Pics/select/mahquery3-small.png]

Need to check only page 111.


MA.hashing Query Algorithm39/152

// Build the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars++;
    }
}
...


... MA.hashing Query Algorithm40/152

...
// Use the partial hash to find candidate pages
f = openFile(fileName("R"),READ);
for (i = 0; i < 2**nstars; i++) {
    P = composite hash
    replace *'s in P
        using i and choice vector
    Buf = getPage(f, P);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}


Query Cost for MA.Hashing41/152

Multi-attribute hashing handles a range of query types, e.g.

select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7

A relation with n attributes has 2n different query types.

Different query types have different costs   (different no. of *'s)


... Query Cost for MA.Hashing42/152

A query type Q may be defined via a set of attribute numbers,
indicating which attributes have values specified.

Example Query Types:

Query Type
(v1, ?, ?, ?) {1}
(v1, ?, v3, ?) {1,3}
(?, ?, v3, v4) {3,4}
(v1, v2, v3, v4) {1,2,3,4}
(?, ?, ?, ?) {}   (open query)


... Query Cost for MA.Hashing43/152

In practice, different query types are not equally likely.

E.g. maybe most of the queries asked are of the form

select * from R where a=1

with occasionally a query of the form

select * from R where b=3 and c=4

For each application, we can define a query distribution
which gives the probability pQ of asking each query type Q.


... Query Cost for MA.Hashing44/152

Consider a query of type Q with m attributes unspecified.

Each unspecified Ai contributes di *'s.

Total number of *'s is   s  =  ∑i ∉ Q di.

Number of pages to read is   2s  =  ∏i ∉ Q 2di.

If we assume no overflow pages, Cost(Q) = 2s (where s is determined by Q)


... Query Cost for MA.Hashing45/152

Min query cost occurs when all attributes are used in query

Min Costpmr  =  1

Max query cost occurs when no attributes are specified

Max Costpmr  =  2d  =  b

Average cost is given by weighted sum over all query types:

Avg Costpmr  =  ∑Q pQi ∉ Q 2di

Since all query types are possible, aim to minimise average cost.


Optimising MA.Hashing Cost46/152

For a given application, the aim is to minimise Costpmr.

Can be achieved by choosing appropriate values for di   (cv)

Heuristics:

Trade-off: making query type Qj more efficient makes Qk less efficient.

This is a combinatorial optimisation problem, and can be handled by standard optimisation techniques e.g. simulated annealing.


MA.Hashing Cost Example47/152

Our example database has 16 possible query types:

Query type Cost pQ
(?, ?, ?, ?) 8 0
(br, ?, ?, ?) 4 0.25
(?, ac, ?, ?) 4 0
(br, ac, ?, ?) 2 0
(?, ?, nm, ?) 4 0
(br, ?, nm, ?) 2 0
(?, ac, nm, ?) 2 0.25
(br, ac, nm, ?) 1 0
(?, ?, ?, amt) 8 0
(br, ?, ?, amt) 4 0
(?, ac, ?, amt) 4 0
(br, ac, ?, amt) 2 0
(?, ?, nm, amt) 4 0
(br, ?, nm, amt) 2 0.5
(?, ac, nm, amt) 2 0
(br, ac, nm, amt) 1 0

Cost values are based on earlier choice vector   (dbr = dac = dnm = 1)
pQ values can be determined by observation of DB use.


... MA.Hashing Cost Example48/152

Consider r=106, Nr=100, b=104, d=14.

Attribute br occurs in 0.5+0.25 used query types
allocate many bits to it e.g. d1=6.

Attribute nm occurs in 0.5+0.25 of queries
allocate many bits to it e.g. d3=4.

Attribute amt occurs in 0.5 of queries
allocate less bits to it e.g. d4=2.

Attribute ac occurs in 0.25 of queries
allocate least bits to it e.g. d2=2.


... MA.Hashing Cost Example49/152

With bits distributed as: d1=6, d2=2, d3=4, d4=2

Query type Cost pQ
(br, ?, ?, ?) 28 = 256 0.25
(?, ac, nm, ?) 28 = 256 0.25
(br, ?, nm, amt) 22 = 4 0.5

Cost  =  0.5 × 22 + 0.25 × 28 + 0.25 × 28  =  130


MA.Hashing vs One-Attribute Hashing50/152

Multi-attribute hashing is basically just a different way of forming hash value for each tuple.

Hence, can be used with all flexible hashing schemes.

As the file grows, the di's are increased one by one, using the choice vector.

Difference between mah and standard hashing:
queries "suggest" a set of pages rather than a single page.

Cost for insertion, deletion, ordered scan is same as for underlying hashing scheme.


Grid Files51/152

The mah approach attempts to combine all attributes in the hash value.

Alternative approach is to keep hash values separate and use auxiliary structure to provide combination.

A generalisation of extendible hashing, called grid files provides this.

Ext.hashing uses a directory on one attribute.

Grid files use a k-dimensional directory to handle k attributes.


... Grid Files52/152


[Diagram:Pics/select/gridfile-small.png]

A simple 2-dimensional grid file


Selection with Grid Files53/152

Consider the following grid file directory:

[Diagram:Pics/select/gridquerys-small.png]

for a table R with two attributes a and b


... Selection with Grid Files54/152

If  d1=3  and  d2=2 ...

select...where a=C1 and b=C2 one cell
select...where a=C1 one row (four cells)
select...where b=C2 one column (eight cells)

Number of directory cells visited is similar to number of pages visted in mah.


... Selection with Grid Files55/152

Execution of select...where b=C2:

[Diagram:Pics/select/gridquery1-small.png]

where  hash(C1) = ...001  and  hash(C2) = ...110


... Selection with Grid Files56/152

Selection for pmr in a 2d grid file ( nAttrs = d )

for (i = 0; i < nAttr(tup); i++) {
    if (!hasValue(tup,i))
        { lo[i] = 0; hi[i] = 2**d-1 }
    else {
        val = bits(d[i], getAttr(tup,i))
        lo[i] = val; hi[i] = val
}   }
Pages = []
for (i = lo[1]; i <= hi[1]; i++) {
    for (j = lo[2]; j <= lo[2]; j++) {
        if (!(grid[i][j] in Pages))
            add grid[i][j] to Pages
}   }
for each P in Pages {
    Buf = getPage(f,P)
    check Buf for solutions
}


Query Cost for Grid Files57/152

Cost depends on:

For pmr with all attributes specified C(x,y,z)  =  2


... Query Cost for Grid Files58/152

For pmr with j < k attributes specified

C(?,y,?)  =  (gq + m)

Typically, 1 ≤ gq ≤ 50   (i.e. grid directories are relatively small)


Other Operations on Grid Files59/152

Insertion has several cases:

Deletion can be achieved by marking (so Costpmr + bq w)

As with all hashing schemes, grid file does not help ordered scan.


N-dimensional Tree Indexes


Multi-dimensional Tree Indexes61/152

We have seen how hashing can be extended to handle multiple attributes.

Can tree-based index structures be generalised as well?

Yes ... and a very large number of techniques have been suggested in research literature.

E.g.   BSP-trees, Cell-trees, LSD-trees, SS-trees, TV-trees, X-trees, ...

We consider three popular examples:   kd-trees, Quad-trees, R-trees.


Example 2d Relation62/152

As an example for multi-dimensional trees, consider the following relation:

create table Rel (
    A1 char(1),  # 'a'..'z'
    A2 integer   # 0..9
);

with example tuples:

Rel('a',1)  Rel('a',5)  Rel('b',2)
Rel('d',1)  Rel('d',2)  Rel('d',4)
Rel('d',8)  Rel('g',3)  Rel('j',7)
Rel('m',1)  Rel('r',5)  Rel('z',9)


... Example 2d Relation63/152

And the tuple-space for the above tuples:

[Diagram:Pics/select/2d-space-small.png]


kd-Trees64/152

kd-trees are multi-way search trees where

For a tree with L levels and k attributes: Partitions "tuple space" into smaller and smaller regions at each level.


Example kd-Tree65/152

[Diagram:Pics/select/kd-tree-small.png]

Each leaf node contains a reference to a bucket containing tuples from a small region of k-dimensional space.


... Example kd-Tree66/152

How this tree partititons the tuple space:

[Diagram:Pics/select/kd-tree-space-small.png]


Searching in kd-Trees67/152

Consider first how to answer pmr queries ...

If all attributes have values specified:

If e.g. attribute A2 is unspecified


... Searching in kd-Trees68/152

As a tree traversal, the algorithm is best specified recursively.

Parameters for search function:

Other available information: The search commences via   Search(Q, 0, kdTreeRoot)


... Searching in kd-Trees69/152

Search(Query Q, Level L, Node N)
{
    if (isLeaf(N)) {
        Buf = getPage(f,N.page)
        check Buf for matching tuples
    } else {
        ai = attrLev[L]
        if (hasValue(Q,ai)) {
            Val = getAttr(Q,ai)
            newN = search N for Val
            Search(Q, L+1, newN)
        } else {
            for each child newN of N
                Search(Q, L+1, newN)
}   }   }


... Searching in kd-Trees70/152

For space queries ...

We require a change to the above Search algorithm.

if (hasValue(Q,ai)) {
    Val = getAttr(Q,ai)
    newN = search N for Val
    Search(Q, L+1, newN)
}

becomes

if (isRange(Q,ai)) {
    for each node N on current level {
        for each value V in range(Q,ai)) {
            newN = childOf(N,V)
            Search(Q, L+1, newN)
}   }   }


Query cost for kd-Trees71/152

Different kinds of queries (i.e. different specified attributes)

The query cost will clearly also be affected by the depth of the tree.

If we include all n attributes as indexes, then tree has depth ≥ n.

Depth of tree is also affected by branching factor at each node.

Note: kd-trees are not necessarily balanced.


... Query cost for kd-Trees72/152

Query examples:

Open query:   (?,?,?,?,?,...)   (zero attributes specified)

Point query:   (a,b,c,d,e,...)   (all attributes specified) Generic pmr query:   (a,?,c,?,e,...)


Insertion into kd-Trees73/152

Input: tuple with all values specified.

Method:

traverse tree to leaf node
    (reaching data page P1)
if (not full P1)
    insert tuple
else {
    create new data page P2
    compute new partition point Part
    distribute tuples between P1 and P2
        (using Part)
    create new internal node N with Part
    link N into tree, link N to P1 and P2
}
write any modified pages (data or index)


... Insertion into kd-Trees74/152

Example insertion into non-full data page:

[Diagram:Pics/select/kd-tree1-small.png]


... Insertion into kd-Trees75/152

Example insertion into full data page:

[Diagram:Pics/select/kd-tree2-small.png]


kd-B-Trees76/152

kd-B-trees: a disk-based index structure using kd-tree ideas.

Aims to have a tree which is:

Basic idea: distribute kd-tree structure over a number of pages: Requires modification of insertion method to do re-balancing
(changes to insertion make it more complex and potentially much more expensive)


... kd-B-Trees77/152

Simple example kd-B-tree   (max 2 kd-tree levels in each node):

[Diagram:Pics/select/k-d-b-tree-small.png]


... kd-B-Trees78/152

What this produces:

Potential problems:


Selection in kd-B-Trees79/152

To answer a pmr query:

Pages = kdBsearch(query,root)
for each page P in Pages {
    Buf = getPage(f,P)
    scan Buf for matching tuples
}

kdbSearch(Query, Page)
{
    if (Page is a data page)
        Pages = Page
    else {
        Buf = getPage(kdbf,Page)
        search kd-tree in Buf
            using attributes from Query
        Pages = []
        for each indicated external subtree S {
            P = kdbSearch(Query, S)
            add P to Pages
    }   }
    return Pages
}


... Selection in kd-B-Trees80/152

Size of kd-B-tree:

Obviously, depth increases for not perfectly-balanced trees.


... Selection in kd-B-Trees81/152

If all attributes are specified (i.e. point search):

Costpmr  =  (d+1)

If some attributes unspecified ...

Can potentially search a large region of tree:

space queries can be handled similarly to pmr queries.


Insertion/Deletion in kd-B-Trees82/152

Conceptually, insertion is same as for kd-tree:

Added complication: kd-tree is spread over many index pages


... Insertion/Deletion in kd-B-Trees83/152

Minimum insert cost is 1w   (no split, no change to kd-tree)

Insert cost with split and kd-tree update: 3w
(need to write re-partitioned data pages and modified index page)

Worst case insert cost is high (tree/data re-organisation).

Deletion is straightforward (mark as deleted).

If data page occupancy falls too low, might consider


Quad Trees84/152

Quad trees use a regular, recursive partitioning of kd tuple space.

At each level, partition space into non-overlapping kd hypercubes.

For 2d case (others are much harder to visualise):

Aim: each "leaf" partition contains approx same number of tuples.


... Quad Trees85/152

Quad-tree on example 2d tuple space:

[Diagram:Pics/select/quad-tree-space-small.png]


... Quad Trees86/152

Basis for the partitioning:


... Quad Trees87/152

The previous partitioning gives this tree structure, e.g.

[Diagram:Pics/select/quad-tree-small.png]


... Quad Trees88/152

Each node of a k dimensional quad-tree has 2k children.

Quad-trees originally devised as memory based structures, for spatial data.

In this domain, k = 2 or k = 3 and branching is fairly small.

For index pages in a disk-based data structure:


Insertion into Quad Tree89/152

Inserting a new tuple into a 2d quad-tree indexed file involves:

traverse tree to leaf node
    (reaching data page P1)
if (not full P1)
    insert tuple
else {
    create new data pages P2, P3, P4
    distribute tuples between P1 .. P4
    create new internal node N
    link N into tree, link N to P1 .. P4
}
write any modified pages (data or index)


... Insertion into Quad Tree90/152

Example:

[Diagram:Pics/select/quad-insert-small.png]

 
Cost   =   traversal cost + update cost
  =   read ≥ 1 node pages + write ≥ 1 data pages


... Insertion into Quad Tree91/152

Potential problems with quad-trees:

The problems can be overcome by:


Query with Quad Tree92/152

For pmr query

For kd quad-tree, and query with n specified attributes:


... Query with Quad Tree93/152

For space query

Example: 2d quad-tree with query a ≤ A1 ≤ b ∧ c ≤ A2 ≤ d


... Query with Quad Tree94/152

Space query example:

[Diagram:Pics/select/quad-query-small.png]

Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).


R-Trees95/152

R-trees use a flexible, hierachical partitioning of kd tuple space.

Could be viewed as a more flexible version of quad-tree.


... R-Trees96/152

Example 2d R-tree node (with two levels of children):

[Diagram:Pics/select/r-tree-node-small.png]


... R-Trees97/152

[Diagram:Pics/select/r-tree-small.png]


... R-Trees98/152

The R-tree was initally designed for indexing polygons
  (via their minimum bounding rectangles (MBR))

It adapts naturally to tuple-space points (i.e. tuples)
  (each MBR represents a region of the tuple-space)

Goals in constructing R-trees (cf. B-trees)


Query with R-trees99/152

Designed to handle space queries and "where-am-I" queries.

"Where-am-I" query: find all regions containing a given point P:

Space (region) queries are handled in a similar way


... Query with R-trees100/152

Algorithm for "where am I" query in 2d R-tree:

Regions = RtreeSearch(a,b,root)

RtreeSearch(x,y,Node) {
    Regions = []
    if (leaf(Node)) {
        for each (x1,y1,x2,y2,P) in Node {
            if (x1,y1,x2,y2) contains (a,b) {
                add (x1,y1,x2,y2) to Regions
    }   }   }
    else {
        for each (x1,y1,x2,y2,Child) in Node {
            if (x1,y1,x2,y2) contains (a,b) {
                Rs = RtreeSearch(x,y,Child)
                add Rs to Regions
    }   }   }
    return Regions;
}


Insertion into R-tree101/152

Insertion of an object R occurs as follows:

Note that R may be a point or a polygon.


... Insertion into R-tree102/152

Heuristics in R-tree insertion ...

Choose child to expand to contain R

Choose a child (from several) to contain R Partition objects between old/new data pages


... Insertion into R-tree103/152

Quadratic split method to partition objects in S:

find objects a and b in S
    that produce the largest MBR
place a in S1
place b in S2
for every other object c in the S {
    s1 = current size of S1
    i1 = size of S1 with c added
    s2 = current size of S2
    i2 = size of S2 with c added
    if ((s1-i1) < (s2-i2))
        add c to S1
    else
        add c to S2
}


R-tree variants104/152

R+ tree does not permit overlapping regions.

R* tree uses delayed splitting on insertion. SS-tree uses (hyper)spherical regions rather than (hyper)cubes


R-trees in PostgreSQL105/152

Up to version 8.2, PostgreSQL had separate R-tree implementation

From version 8.2 on, R-trees implemented using GiST.


GiST in PostgreSQL106/152

PostgreSQL provides an implementation of Generalized Search Trees (GiST).

Above discussion of tree-based indexes shows commonalities:

Some examples:


... GiST in PostgreSQL107/152

GiST trees have the following structural constraints:

However, the "predicate" constraints are looser than those for any trees discussion so far.

Users are free to implement their own p to produce their own "flavour" of search tree.

Details: src/backend/access/gist


Signature-based Selection


Indexing with Signatures109/152

Signature-based indexing: an "efficient" linear scan.

Why abandon sub-linear complexity?   (e.g. O(1) for hashing)

Arising from work with high-d feature vectors in multimedia:

We consider two applications of signatures to indexing:


... Indexing with Signatures110/152

A signature is a compact (lossy) descriptor for a tuple.

Scanning the whole data file is too expensive, so

For this to be better than a linear scan of the data, require:


Signature Files111/152

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile-small.png]

(One signature slot per tuple slot in the data file; unused signature slots are zeroed)


... Signature Files112/152

Note that signatures do not determine tuple placement.

The only requirement is that

The data file could be organised via hashing, B-tree, curves, etc.

Thus, signatures can be used in conjunction with other indexing schemes.


Signatures113/152

Signatures (descriptors)

There are two main types of tuple signature: We use the following terminology consistently:


... Signatures114/152

A tuple r consists of n attributes A1..An.

A codeword cw(Ai) is

A tuple descriptor is built by combining n codewords.


Generating Codewords115/152

A method for generating a k-in-m codeword for attribute Ai

seed = hash(A[i])
nbits = 0
codeword = 0  /* m zero bits */
while (nbits < k) {
    i = random(0,m-1)
    if (bit i not set in codeword) {
        set bit i in codeword
        nbits++
    }
}

Requires bit-string operations and a pseudo-random number generator.


Superimposed Codewords (SIMC)116/152

In a superimposed codewords (simc) indexing scheme

A tuple descriptor desc(r) is


SIMC Example117/152

Consider the following tuple (from a bank deposit database)

branch acctNo name amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 12,   k = 2)

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111


... SIMC Example118/152

Consider a very small example database, with associated signatures:

Signature Tuple
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


SIMC Queries119/152

To answer query q in SIMC

desc(q) is formed by bitwise OR of codewords for supplied attributes.

E.g. consider the query (Perryridge, ?, ?, ?).

Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001


... SIMC Queries120/152

Once we have a query descriptor, we search the signature file:

for each descriptor D[i] in signature file {
    if (D[i] matches desc(q))
        mark tuple R[i] as a match
}
for each marked tuple R {
    fetch tuple R
}

The critical assumption to make this approach viable:

scanning the signature file and comparing descriptors
is faster than scanning the tuple file and checking tuples


... SIMC Queries121/152

What sort of signature comparison is required to answer the example query?

Clearly, any matching tuple must have the value "Perryridge" for A1.

Any tuple with "Perryridge" must have these bits 010000000001 set.

So, checking whether each descriptor has these bits set, gives us the matches.

This can be generalised to multiple supplied values in the query:

In other words ...
for any tuple r that is a match for query q
the 1-bits in desc(q) are a subset of the 1-bits in desc(r)


... SIMC Queries122/152

Query/tuple descriptor matching can be implemented efficiently using logic operations on bit-strings.

Note that if   bits(A) ⊆ bits(B)   then   A AND B = A.

Thus, the query/tuple descriptor matching can be implemented as:

matches(Ri,q) if desc(q) AND desc(Ri) = desc(q)


Example SIMC Query123/152

Consider the query and the example database:

Signature Tuple
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


... Example SIMC Query124/152

The query has potential matches:

branch acctNo name amount
Clearview 117 Throggs 295
Perryridge 102 Hayes 400

The first is an example of a false match.

False matches are caused by:


False Matches125/152

Example:

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111
Ai cw(Ai)
Clearview 010000010000
117 000100000001
Throggs 000001000100
295 000100000100
desc(r) 010101010101


... False Matches126/152

How to reduce likelihood of false matches?

Increasing m helps, but at the expense of reading more descriptor data.

Having k too high    increased overlapping.

Having k too low    increased codeword collisions.

Thus, for SIMC schemes, there is the notion of choosing optimal m, k.


Design with SIMC127/152

SIMC schemes have the notion of false match probability pF.

As suggested above, pF is affected by m, k and n (#attributes)

In designing a SIMC index for a data file, the aim is to choose indexing parameters (m, k) to minimise the likelihood of false matches.

  1. start by choosing acceptable pF   (e.g. pF ≤ 10-5 i.e. one false match in 10,000)
  2. then choose m and k to achieve no more than this pF.
Formulae to derive m and k given pF and n:

k  =  1/loge2 . loge ( 1/pF )

m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )


... Design with SIMC128/152

Design example:


Query Cost for SIMC129/152

A page-oriented view of how SIMC queries are answered:

determine query descriptor QD
for each page SB of tuple descriptors {
    for each tuple descriptor RD in page SB {
        if (RD matches QD) {
	    add tuple R to list of matches
            add pageOf(R) to Pages
        }
    }
}
for each page P in Pages {
    Buf = getPage(f,P)
    check Buf for matching tuples
}

With a buffer manager, the previous tuple-oriented version would behave like this.


... Query Cost for SIMC130/152

To answer pmr query, must read r signatures.

Since one signature per tuple, cannot feasibly store them in memory.

If signature contains m bits, can fit ND = 8B/m signatures per page.

Thus, must read bD = r/ND pages of signature data.

E.g. m=32,   B=1024,   r=105     ⇒     ND = 256,   bD=391

How many data pages do we read?


... Query Cost for SIMC131/152

Number of data pages read depends on:

Total tuples to read = rq + rF.

Expected false matches = rF  =  (r - rq).pF.

Assume matching tuples are uniformly distributed over the data pages.

Then total data pages read = bq  =  (rq+rF)/r × b

Costpmr  =  bD + bq


... Query Cost for SIMC132/152

For one queries, can use same optimisation as for Heaps.

That is, stop the search as soon as the single matching tuple is found.

The average cost in the case would be:

Costone :   Best = 2     Average = bD/2 + δ     Worst = bD + b


... Query Cost for SIMC133/152

SIMC provides no assistance in answering range, space queries, unless the data file is sorted.

If the data file is sorted,

SIMC provides no asistance at all for sim queries   (linear scan).


Applications of SIMC134/152

SIMC is useful for pmr, which allows a wide range of query types, e.g.

select * from R where a=1
select * from R where a=3 and c=1 and d=4
select * from R where a=2 and b=3 and c=4 and d=5

The above discussion has assumed that n attributes are indexed.

In fact, SIMC is reasonably flexible about how many attributes are involved

Thus, it also has applications in information retrieval:


Two-level SIMC135/152

Scanning one descriptor for every tuple is not efficient.

How to reduce number of descriptors?

Have a smaller number of larger descriptors.

E.g. one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

How large is a page descriptor (BD)?

Use formulae above, but with Nr.n ``attributes''.

Typically, pages are 1..8KB    8..64 BD/page (NBD).


Two-Level SIMC Files136/152

File organisation for two-level superimposed codeword index

[Diagram:Pics/select/simc-2lev-small.png]


... Two-Level SIMC Files137/152

Alternative file organisation:

[Diagram:Pics/select/simc-2lev1-small.png]

(Assumes that we need to read the data pages anyway, so may as well include tuple descriptors there ... or simply omit them altogether)


Queries with Two-level SIMC138/152

Method for answering queries:

form query descriptor QD
for each page descriptor BD {
    if (BD matches QD)
        add P to Pages
}
for each page P in Pages {
    read page P
    check for matching tuples
    // could do this using tuple descriptors
}

Costpmr  =  b/NBD + bq


Bit-sliced SIMC139/152

Reading all page descriptors is still not efficient, especially when only a few bits are set.

To improve efficiency, reconsider the matching process:

Signature Tuple
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


... Bit-sliced SIMC140/152

Rather than storing b m-bit page descriptors, store m b-bit descriptor slices.

[Diagram:Pics/select/bit-slice-small.png]


... Bit-sliced SIMC141/152

To answer queries:

form query descriptor QD
// Result is a bit-slice
Result = AllOnes
for each bit b set in QD {
    read bit-slice b
    Result = Result AND b
}

Result has bit i set if page i is candidate page.

Queries much more efficient because often only a few bits set in QD.

However, updates are expensive ... need to set k different slices for each insert.

Changing b is very expensive.


Query Cost for Bit-sliced SIMC142/152

A major cost in SIMC queries is reading the page-descriptor slices.

Each slice is b bits long (one bit for each data file page).

Thus, each slice occupies 8B/b bytes.

Generally, we can fit several page descriptor slices per page.

However, we make the pessimistic assumption ...

reading a slice means reading one page from the page descriptor slices file


... Query Cost for Bit-sliced SIMC143/152

Consider a query (val1, ?, val2, ?, ...) for an n attribute relation:

In fact, we can optimise further by always reading just the first j of these mk slices.

Thus, the descriptor-reading cost in queries can be made constant.

The downside of this trick is a (slight) increase in likelihood of false matches.

Costpmr   =   j + bq


Disjoint Codewords (DJC)144/152

In a disjoint codewords (djc) indexing scheme

A tuple descriptor desc(t) is


DJC example145/152

Consider the following tuple (from a bank deposit database)

branch acctNo name amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 4,   k = 2)

Ai cw(Ai)
Perryridge 0101
102 0011
Hayes 1001
400 0101
desc(t) 0101001110010101

(Note: length of tuple = ~28 bytes,   length of descriptor = 16 bits = 2 bytes)


DJC Queries146/152

Use a similar approach to SIMC:

desc(q) is formed by concatentation of codewords


... DJC Queries147/152

E.g. consider the query (Perryridge, ?, Hayes, ?)

Ai cw(Ai)
Perryridge 0101
? 0000
Hayes 1001
? 0000
desc(t) 0101000010010000

If the query has s supplied attributes, then desc(q) has sk bits set to 1.


... DJC Queries148/152

Searching the signature file is done in the same way as for SIMC:

Matches = {}
for each descriptor D[i] in sig file {
    if (D[i] matches desc(q))
        Matches = Matches  TupleId[i]
}
for each Tid in Matches
    fetch tuple t via Tid

The matching process is also the same as for SIMC

matches(Ti,q) if desc(q) AND desc(Ti) = desc(q)


Example DJC Query149/152

Consider the query and the example database:

Signature Tuple
0101 0000 1001 0000 (Perryridge,?,Hayes,?)
1010 0110 1001 0110 (Brighton,217,Green,750)
0101 0101 0101 0101 (Clearview,117,Throggs,295)
1010 1010 1001 1010 (Downtown,101,Johnson,512)
1010 0011 0011 0101 (Mianus,215,Smith,700)
0101 0011 1001 1010 (Perryridge,102,Hayes,400)
1001 0101 0011 1010 (Redwood,222,Lindsay,695)
0101 0110 1001 0110 (Round Hill,305,Turner,350)


False Matches150/152

The query has potential matches:

branch acctNo name amount
Perryridge 102 Hayes 400
Round Hill 305 Turner 350

DJC also suffers from the "false match" problem, due to:

Note:


... False Matches151/152

How to reduce likelihood of false matches?

Since descriptor is formed from concatentation of codewords: Thus, for each Ai, mi bits in the codeword, with mi/2 1-bits.

DJC can be "tuned" to the mix of pmr queries used on the relation.


Query Cost for DJC152/152

If we assume that

then the average query cost analysis is exactly the same as for SIMC.

However, we can make sure that certain query types are answered with less likelihood of false matches by adjusting the codeword lengths.

The optimisations from SIMC (two-level, bit-slice) can also be applied.

As for SIMC, DJC assists with one and pmr, but not range, space, sim.


Produced: 12 Jul 2019