Signature-based Indexing

COMP9315 21T1 ♢ Signature-based Indexing ♢ [0/12]
❖ Indexing with Signatures

Signature-based indexing:

Each tuple is associated with a signature

Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 21T1 ♢ Signature-based Indexing ♢ [1/12]
❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

COMP9315 21T1 ♢ Signature-based Indexing ♢ [2/12]
❖ Signatures

A signature "summarises" the data from one tuple

A tuple consists of n attribute values A1 .. An

A codeword cw(Ai) is

A tuple descriptor (signature) is built Two strategies for building signatures: overlay, concatenate
COMP9315 21T1 ♢ Signature-based Indexing ♢ [3/12]
❖ Generating Codewords

Generating a k-in-m codeword for attribute Ai

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;   // count of set bits
   bits cword = 0;   // assuming m <= 32 bits
   srandom(hash(attr_value));
   while (nbits < k) {
      int i = random() % m;
      if (((1 << i) & cword) == 0) {
         cword |= (1 << i);
         nbits++;
      }
   }
   return cword;  // m-bits with k 1-bits and m-k 0-bits
}

COMP9315 21T1 ♢ Signature-based Indexing ♢ [4/12]
❖ Superimposed Codewords (SIMC)

In a superimposed codewords (simc) indexing scheme


[Diagram:Pics/select/simc-sig.png]

COMP9315 21T1 ♢ Signature-based Indexing ♢ [5/12]
❖ Superimposed Codewords (SIMC) (cont)

A SIMC tuple descriptor desc(t) is

Method (assuming all n attributes are used in descriptor):

Bits desc = 0 
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i],m,k)
   desc = desc | cw
}

COMP9315 21T1 ♢ Signature-based Indexing ♢ [6/12]
❖ Concatenated Codewords (CATC)

In a concatenated codewords (catc) indexing schema


[Diagram:Pics/select/catc-sig.png]

COMP9315 21T1 ♢ Signature-based Indexing ♢ [7/12]
❖ Concatenated Codewords (CATC) (cont)

A CATC tuple descriptor desc(t) is

Each codeword is p = m/n bits long, with k = p/2 bits set to 1

Method (assuming all n attributes are used in descriptor):

Bits desc = 0 ;  int p = m/n
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i],p,k)
   desc = desc | (cw << p*(n-i))
}

COMP9315 21T1 ♢ Signature-based Indexing ♢ [8/12]
❖ Queries using Signatures

To answer query q with a signature-based index

desc(q) is formed from codewords of known attributes.

Effectively, any unknown attribute Ai  has cw(Ai) = 0

E.g. for SIMC (a,?,c,?,e) = cw(a)  OR  0  OR  cw(c)  OR  0  OR  cw(e)

COMP9315 21T1 ♢ Signature-based Indexing ♢ [9/12]
❖ Queries using Signatures (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
// scan r descriptors
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ pid
    }
}
// scan bq + δ data pages
for each pid in pagesToCheck {
    Buf = getPage(dataFile,pid)
    check tuples in Buf for answers
}

COMP9315 21T1 ♢ Signature-based Indexing ♢ [10/12]
❖ False Matches

Both SIMC and CATC can produce false matches

Why does this happen? To mitigate this, need to choose "good" m  and k
COMP9315 21T1 ♢ Signature-based Indexing ♢ [11/12]
❖ SIMC vs CATC

Both build m-bit wide signatures, with ~1/2 bits set to 1

Both have codewords with ~m/2n bits set to 1

CATC: codewords are m/n = p-bits wide

SIMC: codewords are also m-bits wide

CATC has option of different length codeword pi for each Ai    (∑pi = m)
COMP9315 21T1 ♢ Signature-based Indexing ♢ [12/12]


Produced: 20 Mar 2021