❖ Hashing |
Basic idea: use key value to compute page address of tuple.
e.g. tuple with key = v is stored in page i
Requires: hash function h(v) that maps KeyDomain → [0..b-1].
❖ Hashing (cont) |
PostgreSQL hash function (simplified):
Datum hash_any(unsigned char *k, int keylen)
{
uint32 a, b, c, len, *ka = (uint32 *)k;
/* Set up the internal state */
len = keylen;
a = b = c = 0x9e3779b9+len+3923095;
/* handle most of the key */
while (len >= 12) {
a += ka[0]; b += ka[1]; c += ka[2];
mix(a, b, c);
ka += 3; len -= 12;
}
... collect data from remaining bytes into a,b,c ...
mix(a, b, c);
return UInt32GetDatum(c);
}
See backend/access/hash/hashfunc.cmix()
❖ Hashing (cont) |
hash_any()uint32
Two ways to map raw hash value into a page address:
uint32 hashToPageNum(uint32 hval) {
uint32 mask = 0xFFFFFFFF;
return (hval & (mask >> (32-k)));
}
uint32 hashToPageNum(uint32 hval) {
return (hval % b);
}
❖ Hashing Performance |
Aims:
Best case: every bucket contains same number of tuples.
Worst case: every tuple hashes to same bucket.
Average case: some buckets have more tuples than others.
Use overflow pages to handle "overfull" buckets (cf. sorted files)
All tuples in each bucket must have same hash value.
❖ Hashing Performance (cont) |
Two important measures for hash files:
| Case | L | Ov |
| Best | ≅ 1 | 0 |
| Worst | >> 1 | ** |
| Average | < 1 | 0<Ov<1 |
To achieve average case, aim for 0.75 ≤ L ≤ 0.9.
❖ Selection with Hashing |
Select via hashing on unique key k (one)
// select * from R where k = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.k == val) return t
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.k == val) return t
} }
Costone : Best = 1, Avg = 1+Ov/2 Worst = 1+max(OvLen)
❖ Selection with Hashing (cont) |
Working out which page, given a key ...
getPageViaHash(Reln R, Value key, Page p)
{
uint32 h = hash_any(key, len(key));
PageID pid = h % nPages(R);
get_page(R, pid, buf);
}
❖ Selection with Hashing (cont) |
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.nk == val) add t to results
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.nk == val) add t to results
} }
return results
Costpmr = 1 + Ov
If Ov is small (e.g. 0 or 1), very good retrieval cost
❖ Selection with Hashing (cont) |
Hashing does not help with range queries** ...
Costrange = b + bov
Selection on attribute j which is not hash key ...
Costone, Costrange, Costpmr = b + bov
** unless the hash function is order-preserving (and most aren't)
❖ Insertion with Hashing |
Insertion uses similar process to one queries.
// insert tuple t with key=val into rel R
getPageViaHash(R, val, P)
if room in page P {
insert t into P; return
}
for each overflow page Q of P {
if room in page Q {
insert t into Q; return
} }
add new overflow page Q
link Q to previous page
insert t into Q
Costinsert : Best: 1r + 1w Worst: 1+max(OvLen))r + 2w
❖ Deletion with Hashing |
Similar performance to select on non-unique key:
// delete from R where k = val // f = data file ... ovf = ovflow file getPageViaHash(R, val, P) ndel = delTuples(P,k,val) if (ndel > 0) put_page(dataFile(R),P.pid,P) for each overflow page Q of P { ndel = delTuples(Q,k,val) if (ndel > 0) put_page(ovFile(R),Q.pid,Q) }
Extra cost over select is cost of writing back modified pages.
Method works for both unique and non-unique hash keys.
❖ Problem with Hashing... |
So far, discussion of hashing has assumed a fixed file size (b).
What size file to use?
Methods for hashing with files whose size changes:
❖ Flexible Hashing |
All flexible hashing methods ...
uint32 hash(unsigned char *val)
Require a function to extract d bits from bit-string:
unit32 bits(int d, uint32 val)
Use result of bits()
❖ Flexible Hashing (cont) |
Important concept for flexible hashing: splitting
1010101110101011011101101❖ Flexible Hashing (cont) |
Example of splitting:
Tuples only show key value; assume h(val) = val
Produced: 7 Mar 2021