❖ Indexing with Signatures |
Signature-based indexing:
Each tuple is associated with a signature
❖ Indexing with Signatures (cont) |
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement
❖ Signatures |
A signature "summarises" the data from one tuple
A tuple consists of n attribute values A1 .. An
A codeword cw(Ai) is
❖ Generating Codewords |
Generating a k-in-m codeword for attribute Ai
bits codeword(char *attr_value, int m, int k) { int nbits = 0; // count of set bits bits cword = 0; // assuming m <= 32 bits srandom(hash(attr_value)); while (nbits < k) { int i = random() % m; if (((1 << i) & cword) == 0) { cword |= (1 << i); nbits++; } } return cword; // m-bits with k 1-bits and m-k 0-bits }
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (simc) indexing scheme
❖ Superimposed Codewords (SIMC) (cont) |
A SIMC tuple descriptor desc(t) is
OR
OR
OR
Bits desc = 0 for (i = 1; i <= n; i++) { bits cw = codeword(A[i],m,k) desc = desc | cw }
❖ Concatenated Codewords (CATC) |
In a concatenated codewords (catc) indexing schema
❖ Concatenated Codewords (CATC) (cont) |
A CATC tuple descriptor desc(t) is
Method (assuming all n attributes are used in descriptor):
Bits desc = 0 ; int p = m/n for (i = 1; i <= n; i++) { bits cw = codeword(A[i],p,k) desc = desc | (cw << p*(n-i)) }
❖ Queries using Signatures |
To answer query q with a signature-based index
Effectively, any unknown attribute Ai has cw(Ai) = 0
E.g. for SIMC (a,?,c,?,e)
a
OR
OR
c
OR
OR
e
❖ Queries using Signatures (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {} // scan r descriptors for each descriptor D[i] in signature file { if (matches(D[i],desc(q))) { pid = pageOf(tupleID(i)) pagesToCheck = pagesToCheck ∪ pid } } // scan bq + δ data pages for each pid in pagesToCheck { Buf = getPage(dataFile,pid) check tuples in Buf for answers }
❖ False Matches |
Both SIMC and CATC can produce false matches
matches(D[i],desc(q))
Tup[i]
q
hash(key1) == hash(key2)
key1 != key2
❖ SIMC vs CATC |
Both build m-bit wide signatures, with ~1/2 bits set to 1
Both have codewords with ~m/2n bits set to 1
CATC: codewords are m/n = p-bits wide
SIMC: codewords are also m-bits wide
Produced: 20 Mar 2021