❖ Indexing with Signatures |
Signature-based indexing:
Each tuple is associated with a signature
❖ Indexing with Signatures (cont) |
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement
❖ Signatures |
A signature "summarises" the data from one tuple
A tuple consists of n attribute values A1 .. An
A codeword cw(Ai) is
❖ Generating Codewords |
Generating a k-in-m codeword for attribute Ai
bits codeword(char *attr_value, int m, int k)
{
int nbits = 0; // count of set bits
bits cword = 0; // assuming m <= 32 bits
srandom(hash(attr_value));
while (nbits < k) {
int i = random() % m;
if (((1 << i) & cword) == 0) {
cword |= (1 << i);
nbits++;
}
}
return cword; // m-bits with k 1-bits and m-k 0-bits
}
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (simc) indexing scheme
❖ Superimposed Codewords (SIMC) (cont) |
A SIMC tuple descriptor desc(t) is
OROROR
Bits desc = 0
for (i = 1; i <= n; i++) {
bits cw = codeword(A[i],m,k)
desc = desc | cw
}
❖ Concatenated Codewords (CATC) |
In a concatenated codewords (catc) indexing schema
❖ Concatenated Codewords (CATC) (cont) |
A CATC tuple descriptor desc(t) is
Method (assuming all n attributes are used in descriptor):
Bits desc = 0 ; int p = m/n
for (i = 1; i <= n; i++) {
bits cw = codeword(A[i],p,k)
desc = desc | (cw << p*(n-i))
}
❖ Queries using Signatures |
To answer query q with a signature-based index
Effectively, any unknown attribute Ai has cw(Ai) = 0
E.g. for SIMC (a,?,c,?,e)aORORcORORe
❖ Queries using Signatures (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {}
// scan r descriptors
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = pageOf(tupleID(i))
pagesToCheck = pagesToCheck ∪ pid
}
}
// scan bq + δ data pages
for each pid in pagesToCheck {
Buf = getPage(dataFile,pid)
check tuples in Buf for answers
}
❖ False Matches |
Both SIMC and CATC can produce false matches
matches(D[i],desc(q))Tup[i]q
hash(key1) == hash(key2)key1 != key2
❖ SIMC vs CATC |
Both build m-bit wide signatures, with ~1/2 bits set to 1
Both have codewords with ~m/2n bits set to 1
CATC: codewords are m/n = p-bits wide
SIMC: codewords are also m-bits wide
Produced: 20 Mar 2021