❖ Hashing |
Key-indexed arrays had "perfect" search performance O(1)
❖ ... Hashing |
The ideal for key-indexed collections:
courses["COMP3311"] = "Database Systems";
printf("%s\n", courses["COMP3311"]);
Almost as good:
courses[h("COMP3311")] = "Database Systems";
printf("%s\n", courses[h("COMP3311")]);
In practice:
item = {"COMP3311","Database Systems"};
courses = insert(courses, item);
printf("%s\n", search(courses, "COMP3311"));
❖ ... Hashing |
To use arbitrary values as keys, we need three things:
KeyItemItemKeyKey❖ ... Hashing |
Generalised ADT for a collection of Item
Interface:
typedef struct CollectionRep *Collection; Collection newCollection(); // make new empty collection Item *search(Collection, Key); // find item with key void insert(Collection, Item); // add item into collection void delete(Collection, Key); // drop item with key
Implementation:
typedef struct CollectionRep {
... some data structure to hold multiple Items ...
} CollectionRep;
❖ ... Hashing |
For hash tables, we make one change to interface:
typedef struct HashTabRep *HashTable; // make new empty table of size N HashTable newHashTable(int); Item *search(HashTable, Key); // find item with key void insert(HashTable, Item); // add item into collection void delete(HashTable, Key); // drop item with key
❖ ... Hashing |
Example hash table implementation:
typedef struct HashTabRep {
int N; // size of array
Item **items; // array of (Item *)
} HashTabRep;
HashTable newHashTable(int N)
{
HashTable new = malloc(sizeof(HashTabRep));
new->items = malloc(N*sizeof(Item *));
new->N = N;
for (int i = 0; i < N; i++)
{ new->items[i] = NULL; }
return new;
}
❖ Hash Functions |
Points to note:
Keymod❖ ... Hash Functions |
Basic idea behind hash function
int hash(Key key, int N)
{
int val = convert key to int;
return val % N;
}
If keys are int
How to convert keys which are strings? (e.g. "COMP1927""9300035"
❖ Exercise : Hash Functions (i) |
Consider this potential hash function:
int hash(char *key, int N)
{
int h = 0; char *c;
for (c = key; *c != '\0'; c++)
h = h + *c;
return h % N;
}
How does this function convert strings to int
What are the deficiencies with this function and how can it be improved?
❖ ... Hash Functions |
A slightly more sophisticated hash function
int hash(char *key, int N)
{
int h = 0; char *c;
int a = 127; // a prime number
for (c = key; *c != '\0'; c++)
h = (a * h + *c) % N;
return h;
}
Converts strings into integers in table range.
But poor choice of a
❖ ... Hash Functions |
To use all of value in hash, with suitable "randomization":
int hash(char *key, int N)
{
int h = 0, a = 31415, b = 21783;
char *c;
for (c = key; *c != '\0'; c++) {
a = a*b % (N-1);
h = (a * h + *c) % N;
}
return h;
}
This approach is known as universal hashing.
❖ ... Hash Functions |
A real hash function (from PostgreSQL DBMS):
hash_any(unsigned char *k, register int keylen, int N)
{
register uint32 a, b, c, len;
// set up internal state
len = keylen;
a = b = 0x9e3779b9;
c = 3923095;
// handle most of the key, in 12-char chunks
while (len >= 12) {
a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24));
b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24));
c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24));
mix(a, b, c);
k += 12; len -= 12;
}
// collect any data from remaining bytes into a,b,c
mix(a, b, c);
return c % N;
}
❖ ... Hash Functions |
Where mix
#define mix(a,b,c) \
{ \
a -= b; a -= c; a ^= (c>>13); \
b -= c; b -= a; b ^= (a<<8); \
c -= a; c -= b; c ^= (b>>13); \
a -= b; a -= c; a ^= (c>>12); \
b -= c; b -= a; b ^= (a<<16); \
c -= a; c -= b; c ^= (b>>5); \
a -= b; a -= c; a ^= (c>>3); \
b -= c; b -= a; b ^= (a<<10); \
c -= a; c -= b; c ^= (b>>15); \
}
i.e. scrambles all of the bits from the bytes of the key value
❖ Hash Table ADT |
Enhanced concrete data representation:
#include "Item.h" // Item has key and data
#define NoItem distinguished Item value
typedef struct HashTabRep {
Item *items; // array of Items
int nslots; // # elements in array (was called N)
int nitems; // # items stored in array
} HashTabRep;
typedef HashTabRep *HashTable;
❖ Exercise : NoItem values |
Suggest suitable NoItem
items[](Item *)❖ ... Hash Table ADT |
Hash table initialisation:
NoItem
HashTable newHashTable(int N)
{
HashTabRep *new = malloc(sizeof(HashTabRep));
assert(new != NULL);
new->items = malloc(N*sizeof(Item));
assert(new->items != NULL);
for (int i = 0; i < N; i++)
new->items[i] = NoItem;
new->nitems = 0; new->nslots = N;
return new;
}
❖ Problems with Hashing |
In ideal scenarios, search cost in hash table is O(1).
Problems with hashing:
Item
size(IndexSpace), collisions inevitable
nitemsnslots❖ Collision Resolution |
Three approaches to dealing with hash collisions:
Itemhash()❖ Separate Chaining |
Solve collisions by having multiple items per array entry.
Make each element the start of linked-list of Items.
❖ ... Separate Chaining |
Concrete data structure for hashing via chaining
typedef struct HashTabRep {
List *lists; // array of Lists of Items
int nslots; // # elements in array
int nitems; // # items stored in HashTable
} HashTabRep;
HashTable newHashTable(int N)
{
HashTabRep *new = malloc(sizeof(HashTabRep));
assert(new != NULL);
new->lists = malloc(N*sizeof(List));
assert(new->lists != NULL);
for (int i = 0; i < N; i++)
new->lists[i] = newList();
new->nslots = N; new->nitems = 0;
return new;
}
❖ ... Separate Chaining |
Using the List
#include "List.h"
Item *search(HashTable ht, Key k)
{
int i = hash(k, ht->nslots);
return ListSearch(ht->lists[i], k);
}
Even without List
Using sorted lists gives only small performance gain.
❖ ... Separate Chaining |
Other list operations are also simple:
#include "List.h"
void insert(HashTable ht, Item it) {
Key k = key(it);
int i = hash(k, ht->nslots);
ListInsert(ht->lists[i], it);
}
void delete(HashTable ht, Key k) {
int i = hash(k, ht->nslots);
ListDelete(ht->lists[i], k);
}
Essentially: select a list; operate on that list.
❖ ... Separate Chaining |
Cost analysis:
❖ Linear Probing |
Collision resolution by finding a new location for Item
❖ ... Linear Probing |
Insert function for linear probing:
void insert(HashTable ht, Item it)
{
assert(ht->nitems < ht->nslots);
int N = ht->nslots;
Item *a = ht->items;
Key k = key(it);
int i, j, h = hash(k,N);
for (j = 0; j < N; j++) {
i = (h+j)%N;
if (a[i] == NoItem) break;
if (eq(k,key(a[i]))) break;
}
if (a[i] == NoItem) ht->nitems++;
a[i] = it;
}
❖ ... Linear Probing |
Search function for linear probing:
Item *search(HashTable ht, Key k)
{
int N = ht->nslots;
Item *a = ht->items;
int i, j, h = hash(k,N);
for (j = 0; j < N; j++) {
i = (h+j)%N;
if (a[i] == NoItem) return NULL;
if (eq(k,key(a[i]))) return &(a[i]);
}
return NULL;
}
❖ ... Linear Probing |
Search cost analysis:
Item| load (α) | 1/2 | 2/3 | 3/4 | 9/10 |
| search hit | 1.5 | 2.0 | 3.0 | 5.5 |
| search miss | 2.5 | 5.0 | 8.5 | 55.5 |
❖ ... Linear Probing |
Deletion slightly tricky for linear probing.
Need to ensure no NoItem
(i.e. previously relocated items moved to appropriate location)
❖ ... Linear Probing |
Delete function for linear probing:
void delete(HashTable ht, Key k)
{
int N = ht->nslots;
Item *a = ht->items;
int i, j, h = hash(k,N);
for (j = 0; j < N; j++) {
i = (h+j)%N;
if (a[i] == NoItem) return; // k not in table
if (eq(k,key(a[i]))) break;
}
a[i] = NoItem;
ht->nitems--;
// clean up probe path
j = i+1;
while (a[j] != NoItem) {
Item it = a[j];
a[j] = NoItem; // remove 'it'
ht->nitems--;
insert(ht, it); // insert 'it' again
j = (j+1)%N;
}
}
❖ Exercise : Linear Probing Example |
Consider a linear-probed hash table
1, 2, 3, 4, 5, 6, 7, 8, 915, 6, 20, 3, 17, 14, 33, 5❖ ... Linear Probing |
A problem with linear probing: clusters
E.g. insert 5, 6, 15, 16, 7, 17, with hash = k%10
❖ Double Hashing |
Double hashing improves on linear probing:
hash2()❖ ... Double Hashing |
Concrete data structures for hashing via double hashing:
typedef struct HashTabRep {
Item *items; // array of Items
int nslots; // # elements in array
int nitems; // # items stored in HashTable
int nhash2; // second hash mod
} HashTabRep;
#define hash2(k,N2) (((k)%N2)+1)
HashTable newHashTable(int N)
{
HashTabRep *new = malloc(sizeof(HashTabRep));
assert(new != NULL);
new->items = malloc(N*sizeof(Item));
assert(new->items != NULL);
for (int i = 0; i < N; i++)
new->items[i] = NoItem;
new->nslots = N; new->nitems = 0;
new->nhash2 = findSuitablePrime(N);
return new;
}
❖ ... Double Hashing |
Search function for double hashing:
Item *search(HashTable ht, Key k)
{
int N = ht->nslots;
Item *data = ht->items;
int i, j, h = hash(k,N);
int incr = hash2(k,ht->nhash2);
for (j = 0, i = h; j < N; j++) {
if (eq(k,key(data[i]) == 0)
return &(data[i]);
i = (i+incr)%N;
}
return NULL;
}
❖ ... Double Hashing |
Insert function for double hashing:
void insert(HashTable ht, Item it)
{
int N = ht->nslots;
Item *data = ht->items;
Key k = key(it);
int i, j, h = hash(k,N);
int incr = hash2(k,ht->nhash2);
for (j = 0; j < N; j += incr) {
ix = (i+j)%N;
if (cmp(k,key(data[ix]) == 0)
break;
else if (data[ix] == NoItem)
break;
}
assert(j != N); // table full
if (data[ix] == NoItem) ht->nitems++;
data[ix] = it;
}
❖ ... Double Hashing |
Costs for double hashing:
| load (α) | 1/2 | 2/3 | 3/4 | 9/10 |
| search hit | 1.4 | 1.6 | 1.8 | 2.6 |
| search miss | 1.5 | 2.0 | 3.0 | 5.5 |
Can be significantly better than linear probing
❖ Hashing Summary |
Collision resolution approaches:
Once M exceeds initial choice of N,
Produced: 15 Mar 2019