Hashing ♢ COMP2521 ♢ (19T1)

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [0/37]
❖ Hashing

Key-indexed arrays had "perfect" search performance O(1)

Hashing allows us to approximate this performance, but
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [1/37]
❖ ... Hashing

The ideal for key-indexed collections:

courses["COMP3311"] = "Database Systems";
printf("%s\n", courses["COMP3311"]);

Almost as good:

courses[h("COMP3311")] = "Database Systems";
printf("%s\n", courses[h("COMP3311")]);

[Diagram:Pics/hashing/hashing-review.png]

In practice:

item = {"COMP3311","Database Systems"};
courses = insert(courses, item);
printf("%s\n", search(courses, "COMP3311"));

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [2/37]
❖ ... Hashing

To use arbitrary values as keys, we need three things:

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [3/37]
❖ ... Hashing

Generalised ADT for a collection of Items

Interface:

typedef struct CollectionRep *Collection;

Collection newCollection();    // make new empty collection
Item *search(Collection, Key); // find item with key
void insert(Collection, Item); // add item into collection
void delete(Collection, Key);  // drop item with key

Implementation:

typedef struct CollectionRep {
   ... some data structure to hold multiple Items ...
} CollectionRep;

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [4/37]
❖ ... Hashing

For hash tables, we make one change to interface:

typedef struct HashTabRep *HashTable;
// make new empty table of size N
HashTable newHashTable(int);
Item *search(HashTable, Key); // find item with key
void insert(HashTable, Item); // add item into collection
void delete(HashTable, Key);  // drop item with key

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [5/37]
❖ ... Hashing

Example hash table implementation:

typedef struct HashTabRep {
   int  N;       // size of array
   Item **items; // array of (Item *)
} HashTabRep;

HashTable newHashTable(int N)
{
   HashTable new = malloc(sizeof(HashTabRep));
   new->items = malloc(N*sizeof(Item *));
   new->N = N;
   for (int i = 0; i < N; i++)
      { new->items[i] = NULL; }
   return new;
}
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [6/37]
❖ Hash Functions

Points to note:

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [7/37]
❖ ... Hash Functions

Basic idea behind hash function

int hash(Key key, int N)
{
   int val = convert key to int;
   return val % N;
}

If keys are ints, conversion is easy (identity function)

How to convert keys which are strings?   (e.g. "COMP1927" or "9300035")

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [8/37]
❖ Exercise : Hash Functions (i)

Consider this potential hash function:

int hash(char *key, int N)
{
    int h = 0; char *c;
    for (c = key; *c != '\0'; c++)
        h = h + *c;
    return h % N;
}

How does this function convert strings to ints?

What are the deficiencies with this function and how can it be improved?

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [9/37]
❖ ... Hash Functions

A slightly more sophisticated hash function

int hash(char *key, int N)
{
   int h = 0;  char *c;
   int a = 127; // a prime number
   for (c = key; *c != '\0'; c++)
      h = (a * h + *c) % N;
   return h;
}

Converts strings into integers in table range.

But poor choice of a (e.g. 128) can result in poor hashing.

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [10/37]
❖ ... Hash Functions

To use all of value in hash, with suitable "randomization":

int hash(char *key, int N)
{
   int h = 0, a = 31415, b = 21783;
   char *c;
   for (c = key; *c != '\0'; c++) {
      a = a*b % (N-1);
      h = (a * h + *c) % N;
   }
   return h;
}

This approach is known as universal hashing.

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [11/37]
❖ ... Hash Functions

A real hash function (from PostgreSQL DBMS):

hash_any(unsigned char *k, register int keylen, int N)
{
    register uint32 a, b, c, len;
    // set up internal state
    len = keylen;
    a = b = 0x9e3779b9;
    c = 3923095;
    // handle most of the key, in 12-char chunks
    while (len >= 12) {
        a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24));
        b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24));
        c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24));
        mix(a, b, c);
        k += 12; len -= 12;
    }
    // collect any data from remaining bytes into a,b,c
    mix(a, b, c);
    return c % N;
}
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [12/37]
❖ ... Hash Functions

Where mix is defined as:


#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

i.e. scrambles all of the bits from the bytes of the key value

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [13/37]
❖ Hash Table ADT

Enhanced concrete data representation:

#include "Item.h"  // Item has key and data

#define NoItem distinguished Item value

typedef struct HashTabRep {
   Item *items; // array of Items
   int  nslots; // # elements in array  (was called N)
   int  nitems; // # items stored in array
} HashTabRep;

typedef HashTabRep *HashTable;

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [14/37]
❖ Exercise : NoItem values

Suggest suitable NoItem values if

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [15/37]
❖ ... Hash Table ADT

Hash table initialisation:

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->items = malloc(N*sizeof(Item));
   assert(new->items != NULL);
   for (int i = 0; i < N; i++)
      new->items[i] = NoItem;
   new->nitems = 0; new->nslots = N;
   return new;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [16/37]
❖ Problems with Hashing

In ideal scenarios, search cost in hash table is O(1).

Problems with hashing:

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [17/37]
❖ Collision Resolution

Three approaches to dealing with hash collisions:

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [18/37]
❖ Separate Chaining

Solve collisions by having multiple items per array entry.

Make each element the start of linked-list of Items.

[Diagram:Pics/hashing/hash-linked.png]

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [19/37]
❖ ... Separate Chaining

Concrete data structure for hashing via chaining

typedef struct HashTabRep {
   List *lists; // array of Lists of Items
   int  nslots; // # elements in array
   int  nitems; // # items stored in HashTable
} HashTabRep;

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->lists = malloc(N*sizeof(List));
   assert(new->lists != NULL);
   for (int i = 0; i < N; i++)
      new->lists[i] = newList();
   new->nslots = N; new->nitems = 0;
   return new;
}
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [20/37]
❖ ... Separate Chaining

Using the List ADT, search becomes:

#include "List.h" 
Item *search(HashTable ht, Key k)
{
   int i = hash(k, ht->nslots);
   return ListSearch(ht->lists[i], k);
}

Even without List abstraction, easy to implement.

Using sorted lists gives only small performance gain.

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [21/37]
❖ ... Separate Chaining

Other list operations are also simple:

#include "List.h"

void insert(HashTable ht, Item it) {
   Key k = key(it);
   int i = hash(k, ht->nslots);
   ListInsert(ht->lists[i], it);
}
void delete(HashTable ht, Key k) {
   int i = hash(k, ht->nslots);
   ListDelete(ht->lists[i], k);
}

Essentially: select a list; operate on that list.

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [22/37]
❖ ... Separate Chaining

Cost analysis:

Ratio of items/slots is called load α = M/N
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [23/37]
❖ Linear Probing

Collision resolution by finding a new location for Item

Examples:

[Diagram:Pics/hashing/hash-linear.png]

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [24/37]
❖ ... Linear Probing

Insert function for linear probing:

void insert(HashTable ht, Item it)
{
   assert(ht->nitems < ht->nslots);
   int N = ht->nslots;
   Item *a = ht->items;
   Key k = key(it);
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) break;
      if (eq(k,key(a[i]))) break;
   }
   if (a[i] == NoItem) ht->nitems++;
   a[i] = it;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [25/37]
❖ ... Linear Probing

Search function for linear probing:

Item *search(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *a = ht->items;
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) return NULL;
      if (eq(k,key(a[i]))) return &(a[i]);
   }
   return NULL;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [26/37]
❖ ... Linear Probing

Search cost analysis:

Example costs:
load (α)1/2  2/3  3/4  9/10
search hit1.52.03.05.5
search miss2.55.08.555.5
Assumes reasonably uniform data and good hash function.
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [27/37]
❖ ... Linear Probing

Deletion slightly tricky for linear probing.

Need to ensure no NoItem in middle of "probe path"
(i.e. previously relocated items moved to appropriate location)

[Diagram:Pics/hashing/hash-probe-delete.png]

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [28/37]
❖ ... Linear Probing

Delete function for linear probing:

void delete(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *a = ht->items;
   int i, j, h = hash(k,N);
   for (j = 0; j < N; j++) {
      i = (h+j)%N;
      if (a[i] == NoItem) return; // k not in table
      if (eq(k,key(a[i]))) break;
   }
   a[i] = NoItem;
   ht->nitems--;
   // clean up probe path
   j = i+1;
   while (a[j] != NoItem) {
      Item it = a[j];
      a[j] = NoItem; // remove 'it'
      ht->nitems--;
      insert(ht, it); // insert 'it' again
      j = (j+1)%N;
   }

}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [29/37]
❖ Exercise : Linear Probing Example

Consider a linear-probed hash table

Show the result of inserting items with these keys
  1. 1, 2, 3, 4, 5, 6, 7, 8, 9
  2. 15, 6, 20, 3, 17, 14, 33, 5
into an initially empty table
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [30/37]
❖ ... Linear Probing

A problem with linear probing: clusters

E.g. insert 5, 6, 15, 16, 7, 17, with hash = k%10

[Diagram:Pics/hashing/clustering.png]

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [31/37]
❖ Double Hashing

Double hashing improves on linear probing:

To generate relatively prime
Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [32/37]
❖ ... Double Hashing

Concrete data structures for hashing via double hashing:

typedef struct HashTabRep {
   Item *items; // array of Items
   int  nslots; // # elements in array
   int  nitems; // # items stored in HashTable
   int  nhash2; // second hash mod
} HashTabRep;

#define hash2(k,N2) (((k)%N2)+1)

HashTable newHashTable(int N)
{
   HashTabRep *new = malloc(sizeof(HashTabRep));
   assert(new != NULL);
   new->items = malloc(N*sizeof(Item));
   assert(new->items != NULL);
   for (int i = 0; i < N; i++)
      new->items[i] = NoItem;
   new->nslots = N; new->nitems = 0;
   new->nhash2 = findSuitablePrime(N);
   return new;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [33/37]
❖ ... Double Hashing

Search function for double hashing:

Item *search(HashTable ht, Key k)
{
   int N = ht->nslots;
   Item *data = ht->items;
   int i, j, h = hash(k,N);
   int incr = hash2(k,ht->nhash2);
   for (j = 0, i = h; j < N; j++) {
      if (eq(k,key(data[i]) == 0)
         return &(data[i]);
      i = (i+incr)%N;
   }
   return NULL;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [34/37]
❖ ... Double Hashing

Insert function for double hashing:

void insert(HashTable ht, Item it)
{
   int N = ht->nslots;
   Item *data = ht->items;
   Key k = key(it);
   int i, j, h = hash(k,N);
   int incr = hash2(k,ht->nhash2);
   for (j = 0; j < N; j += incr) {
      ix = (i+j)%N;
      if (cmp(k,key(data[ix]) == 0)
         break;
      else if (data[ix] == NoItem)
         break;
   }
   assert(j != N); // table full
   if (data[ix] == NoItem) ht->nitems++;
   data[ix] = it;
}

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [35/37]
❖ ... Double Hashing

Costs for double hashing:

load (α)1/2  2/3  3/4  9/10
search hit1.41.61.82.6
search miss1.52.03.05.5

Can be significantly better than linear probing

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [36/37]
❖ Hashing Summary

Collision resolution approaches:

Only chaining allows α > 1, but performance degrades once α > 1

Once M exceeds initial choice of N,

Hashing ♢ COMP2521 (19T1) ♢ Week 06b ♢ [37/37]


Produced: 15 Mar 2019