Hashing <small>♢ COMP2521 ♢ (19T1)</small>

∧ >>

❖ Hashing

Key-indexed arrays had "perfect" search performance O(1)

but required a dense range of index values
used a fixed-size array (max size ever needed)
bigger array ⇒ more useful but wastes more space

Hashing allows us to approximate this performance, but

allows arbitrary types of keys
map (hash) keys into compact range of index values
store items in array, accessed by index value

<< ∧ >>

❖ ... Hashing

The ideal for key-indexed collections:

courses["COMP3311"] = "Database Systems";
printf("%s\n", courses["COMP3311"]);

Almost as good:

courses[h("COMP3311")] = "Database Systems";
printf("%s\n", courses[h("COMP3311")]);

[Diagram:Pics/hashing/hashing-review.png]

In practice:

item = {"COMP3311","Database Systems"};
courses = insert(courses, item);
printf("%s\n", search(courses, "COMP3311"));

<< ∧ >>

❖ ... Hashing

To use arbitrary values as keys, we need three things:

set of Key values, each key identifies one Item
an array (of size N) to store Items
a hash function h() of type Key→[0..N-1]
- requirement: if (x == y) then h(x) == h(y)
- requirement: h(x) always returns same value for given x
a collision resolution method
- collision = (x != y && h(x) == h(y))
- collisions are inevitable when dom(Key) >> N

<< ∧ >>

❖ ... Hashing

Generalised ADT for a collection of Items

Interface:

typedef struct CollectionRep *Collection;

Collection newCollection();    // make new empty collection
Item *search(Collection, Key); // find item with key
void insert(Collection, Item); // add item into collection
void delete(Collection, Key);  // drop item with key

Implementation:

typedef struct CollectionRep {
   ... some data structure to hold multiple Items ...
} CollectionRep;

<< ∧ >>

❖ ... Hashing

For hash tables, we make one change to interface:

typedef struct HashTabRep *HashTable;
// make new empty table of size N
HashTable newHashTable(int);
Item *search(HashTable, Key); // find item with key
void insert(HashTable, Item); // add item into collection
void delete(HashTable, Key);  // drop item with key

<< ∧ >>

❖ ... Hashing

Example hash table implementation:

typedef struct HashTabRep {
   int  N;       // size of array
   Item **items; // array of (Item *)
} HashTabRep;

HashTable newHashTable(int N)
{
   HashTable new = malloc(sizeof(HashTabRep));
   new->items = malloc(N*sizeof(Item *));
   new->N = N;
   for (int i = 0; i < N; i++)
      { new->items[i] = NULL; }
   return new;
}

<< ∧ >>

❖ Hash Functions

Points to note:

converts Key value to index value [0..N-1]
deterministic (key value k always maps to same value)
use mod function to map hash value to index value
spread key values uniformly over address range
(assumes that keys themselves are uniformly distributed)
as much as possible, h(k) ≠ h(j) if j ≠ k
cost of computing hash function must be cheap

<< ∧ >>

❖ ... Hash Functions

Basic idea behind hash function

int hash(Key key, int N)
{
   int val = convert key to int;
   return val % N;
}

If keys are ints, conversion is easy (identity function)

How to convert keys which are strings? (e.g. "COMP1927" or "9300035")

<< ∧ >>

❖ Exercise : Hash Functions (i)

Consider this potential hash function:

int hash(char *key, int N)
{
    int h = 0; char *c;
    for (c = key; *c != '\0'; c++)
        h = h + *c;
    return h % N;
}

How does this function convert strings to ints?

What are the deficiencies with this function and how can it be improved?

<< ∧ >>

❖ ... Hash Functions

A slightly more sophisticated hash function

int hash(char *key, int N)
{
   int h = 0;  char *c;
   int a = 127; // a prime number
   for (c = key; *c != '\0'; c++)
      h = (a * h + *c) % N;
   return h;
}

Converts strings into integers in table range.

But poor choice of a (e.g. 128) can result in poor hashing.

<< ∧ >>

❖ ... Hash Functions

To use all of value in hash, with suitable "randomization":

int hash(char *key, int N)
{
   int h = 0, a = 31415, b = 21783;
   char *c;
   for (c = key; *c != '\0'; c++) {
      a = a*b % (N-1);
      h = (a * h + *c) % N;
   }
   return h;
}

This approach is known as universal hashing.

<< ∧ >>

❖ ... Hash Functions

A real hash function (from PostgreSQL DBMS):

hash_any(unsigned char *k, register int keylen, int N)
{
    register uint32 a, b, c, len;
    // set up internal state
    len = keylen;
    a = b = 0x9e3779b9;
    c = 3923095;
    // handle most of the key, in 12-char chunks
    while (len >= 12) {
        a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24));
        b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24));
        c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24));
        mix(a, b, c);
        k += 12; len -= 12;
    }
    // collect any data from remaining bytes into a,b,c
    mix(a, b, c);
    return c % N;
}

<< ∧ >>

❖ ... Hash Functions

Where mix is defined as:


#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

i.e. scrambles all of the bits from the bytes of the key value

<< ∧ >>

❖ Hash Table ADT

Enhanced concrete data representation:

#include "Item.h"  // Item has key and data

#define NoItem distinguished Item value

typedef struct HashTabRep {
   Item *items; // array of Items
   int  nslots; // # elements in array  (was called N)
   int  nitems; // # items stored in array
} HashTabRep;

typedef HashTabRep *HashTable;

<< ∧ >>

❖ Exercise : NoItem values

Suggest suitable NoItem values if

keys are integers
keys are strings
items[] is an array of (Item *)

<< ∧ >>

❖ ... Hash Table ADT

Hash table initialisation:

create a Rep and an array, fill array with NoItem

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->items = malloc(N*sizeof(Item));
   assert(new->items != NULL);
   for (int i = 0; i < N; i++)
      new->items[i] = NoItem;
   new->nitems = 0; new->nslots = N;
   return new;
}

<< ∧ >>

❖ Problems with Hashing

In ideal scenarios, search cost in hash table is O(1).

Problems with hashing:

hash function relies on size of array (⇒ can't expand)
- changing the size of the array changes the hash function
- could make array larger, but would need to re-insert all Items
items are stored in (effectively) random order
if size(KeySpace) size(IndexSpace), collisions inevitable
- collision: k != j && hash(k,N) == hash(j,N)
if nitems > nslots, collisions inevitable

<< ∧ >>

❖ Collision Resolution

Three approaches to dealing with hash collisions:

allow multiple Items in a single array location
- e.g. array of linked lists (mix of O(1) and O(N))
systematically compute new indexes until find a free slot
- need strategies for computing new indexes (aka probing)
increase the size of the array
- needs a method to "adjust" hash() (e.g. linear hashing)

<< ∧ >>

❖ Separate Chaining

Solve collisions by having multiple items per array entry.

Make each element the start of linked-list of Items.

<< ∧ >>

❖ ... Separate Chaining

Concrete data structure for hashing via chaining

typedef struct HashTabRep {
   List *lists; // array of Lists of Items
   int  nslots; // # elements in array
   int  nitems; // # items stored in HashTable
} HashTabRep;

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->lists = malloc(N*sizeof(List));
   assert(new->lists != NULL);
   for (int i = 0; i < N; i++)
      new->lists[i] = newList();
   new->nslots = N; new->nitems = 0;
   return new;
}

<< ∧ >>

❖ ... Separate Chaining

Using the List ADT, search becomes:

#include "List.h" 
Item *search(HashTable ht, Key k)
{
   int i = hash(k, ht->nslots);
   return ListSearch(ht->lists[i], k);
}

Even without List abstraction, easy to implement.

Using sorted lists gives only small performance gain.

<< ∧ >>

❖ ... Separate Chaining

Other list operations are also simple:

#include "List.h"

void insert(HashTable ht, Item it) {
   Key k = key(it);
   int i = hash(k, ht->nslots);
   ListInsert(ht->lists[i], it);
}
void delete(HashTable ht, Key k) {
   int i = hash(k, ht->nslots);
   ListDelete(ht->lists[i], k);
}

Essentially: select a list; operate on that list.

<< ∧ >>

❖ ... Separate Chaining

Cost analysis:

N array entries (slots), M stored items
average list length L = M/N
best case: all lists are same length L
worst case: h(k)=0, one list of length M
searching within a list of length n:
- best: 1, worst: n, average: n/2
if good hash and M≤N, cost is 1
if good hash and M>N, cost is (M/N)/2

Ratio of items/slots is called load α = M/N

<< ∧ >>

❖ Linear Probing

Collision resolution by finding a new location for Item

hash indicates slot i which is already used
try next slot, then next, until we find a free slot
insert item in available slot

Examples:

<< ∧ >>

❖ ... Linear Probing

Insert function for linear probing:

void insert(HashTable ht, Item it)
{
   assert(ht->nitems < ht->nslots);
   int N = ht->nslots;
   Item *a = ht->items;
   Key k = key(it);
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) break;
      if (eq(k,key(a[i]))) break;
   }
   if (a[i] == NoItem) ht->nitems++;
   a[i] = it;
}

<< ∧ >>

❖ ... Linear Probing

Search function for linear probing:

Item *search(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *a = ht->items;
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) return NULL;
      if (eq(k,key(a[i]))) return &(a[i]);
   }
   return NULL;
}

<< ∧ >>

❖ ... Linear Probing

Search cost analysis:

cost to reach first Item is O(1)
subsequent cost depends how much we need to scan
affected by load α = M/N (i.e. how "full" is the table)
Avg Cost for successful search = 0.5*(1 + 1/(1-α))
Avg Cost for unsuccessful search = 0.5*(1 + 1/(1-α)²)

Example costs:

load (α)	1/2	2/3	3/4	9/10
search hit	1.5	2.0	3.0	5.5
search miss	2.5	5.0	8.5	55.5

Assumes reasonably uniform data and good hash function.

<< ∧ >>

❖ ... Linear Probing

Deletion slightly tricky for linear probing.

Need to ensure no NoItem in middle of "probe path"
(i.e. previously relocated items moved to appropriate location)

[Diagram:Pics/hashing/hash-probe-delete.png]

<< ∧ >>

❖ ... Linear Probing

Delete function for linear probing:

void delete(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *a = ht->items;
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) return; // k not in table
      if (eq(k,key(a[i]))) break;
   }
   a[i] = NoItem;
   ht->nitems--;
   // clean up probe path
   j = i+1;
   while (a[j] != NoItem) {
      Item it = a[j];
      a[j] = NoItem; // remove 'it'
      ht->nitems--;
      insert(ht, it); // insert 'it' again
      j = (j+1)%N;
   }

}

<< ∧ >>

❖ Exercise : Linear Probing Example

Consider a linear-probed hash table

N = 10 table slots, hash(k) = k%10

Show the result of inserting items with these keys

1, 2, 3, 4, 5, 6, 7, 8, 9
15, 6, 20, 3, 17, 14, 33, 5

into an initially empty table

<< ∧ >>

❖ ... Linear Probing

A problem with linear probing: clusters

E.g. insert 5, 6, 15, 16, 7, 17, with hash = k%10

<< ∧ >>

❖ Double Hashing

Double hashing improves on linear probing:

by using an increment which ...
- is based on a secondary hash of the key
- ensures that all elements are visited
  (can be ensured by using an increment which is relatively prime to N)
tends to eliminate clusters ⇒ shorter probe paths

To generate relatively prime

set table size to prime e.g. N=127
hash2() in range [1..N1] where N1 < 127 and prime

<< ∧ >>

❖ ... Double Hashing

Concrete data structures for hashing via double hashing:

typedef struct HashTabRep {
   Item *items; // array of Items
   int  nslots; // # elements in array
   int  nitems; // # items stored in HashTable
   int  nhash2; // second hash mod
} HashTabRep;

#define hash2(k,N2) (((k)%N2)+1)

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->items = malloc(N*sizeof(Item));
   assert(new->items != NULL);
   for (int i = 0; i < N; i++)
      new->items[i] = NoItem;
   new->nslots = N; new->nitems = 0;
   new->nhash2 = findSuitablePrime(N);
   return new;
}

<< ∧ >>

❖ ... Double Hashing

Search function for double hashing:

Item *search(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *data = ht->items;
   int i, j, h = hash(k,N);
   int incr = hash2(k,ht->nhash2);
   for (j = 0, i = h; j < N; j++) {
      if (eq(k,key(data[i]) == 0)
         return &(data[i]);
      i = (i+incr)%N;
   }
   return NULL;
}

<< ∧ >>

❖ ... Double Hashing

Insert function for double hashing:

void insert(HashTable ht, Item it)
{
   int N = ht->nslots;
   Item *data = ht->items;
   Key k = key(it);
   int i, j, h = hash(k,N);
   int incr = hash2(k,ht->nhash2);
   for (j = 0; j < N; j += incr) {
      ix = (i+j)%N;
      if (cmp(k,key(data[ix]) == 0)
         break;
      else if (data[ix] == NoItem)
         break;
   }
   assert(j != N); // table full
   if (data[ix] == NoItem) ht->nitems++;
   data[ix] = it;
}

<< ∧ >>

❖ ... Double Hashing

Costs for double hashing:

load (α)	1/2	2/3	3/4	9/10
search hit	1.4	1.6	1.8	2.6
search miss	1.5	2.0	3.0	5.5

Can be significantly better than linear probing

especially if table is heavily loaded

<< ∧

❖ Hashing Summary

Collision resolution approaches:

chaining: easy to implement, allows α > 1
linear probing: fast if α << 1, complex deletion
double hashing: faster than linear probing, esp for α ≅ 1

Only chaining allows α > 1, but performance degrades once α > 1

Once M exceeds initial choice of N,

need to expand size of array (N)
problem: hash function relies on N,
so changing array size potentially requires rebuiling whole table
dynamic hashing methods exist to avoid this