Week 09 Tutorial Answers
Hash Tables
-
Consider this potential hash function:
int hash(char *key, int N) {
    int h = 0;
    char *c;
    for (c = key; *c != '\0'; c++) {
        h = h + *c;
    }
    return h % N;
}
How does this function convert strings to ints? What are the deficiencies of this function, and how can it be improved?
Answer:
It simply adds up the ASCII values of all of the characters in the string.
A problem with such a simplistic approach is that anagrams all hash to the same value (e.g. "cat"/"act", "listen"/"silent"). This can be improved by incorporating the position of the character into the hash computation, e.g.
int hash(char *key, int N) {
    int h = 0;
    char *c;
    int i;
    for (c = key, i = 0; *c != '\0'; c++, i++) {
        h = h + (*c * i);
    }
    return h % N;
}
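A further improvement that is often used in practice (not part of the original answer) is a polynomial rolling hash, which gives every character a position-dependent weight via repeated multiplication by a small constant; the multiplier 31 below is just a conventional choice:

int hash(char *key, int N) {
    unsigned int h = 0;
    char *c;
    for (c = key; *c != '\0'; c++) {
        // multiplying the running hash before adding each character
        // makes every character's contribution depend on its position
        h = h * 31 + *c;
    }
    return h % N;
}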
-
For a hash table of size \( N \) that uses separate chaining for collision resolution, with the chains sorted in ascending order by key, what are the best case and worst case costs of inserting \( k = 2N \) items into the table? Assume that insertion cost is measured in terms of the number of key comparisons. What will the average search cost be after all the insertions are done?
Answer:
The worst case occurs when the hash function returns a constant value and the keys arrive in ascending order. In this case, there is a single chain attached to the one hash table entry that is ever "hit" by the hash function, and every insertion must scan right to the end of that chain. The cost of the first insertion is \( 0 \): we do no key comparisons, since we discover that the chain is empty. On the second insertion, we do one key comparison; on the third insertion, two key comparisons; and so on. This gives \( 0 + 1 + 2 + \dots + (k - 1) \) comparisons, which is \( k \times (k - 1) / 2 \) comparisons in total.
After insertion, each search requires a scan of the single chain of length \( k \), so the average search cost will be \( k / 2 \).
The best case occurs when we have a hash function that distributes the items uniformly across the table. In this case, we end up with a chain of length \( \lceil k / N \rceil = 2 \) attached to each hash table entry. In building these chains, the first item added to each chain requires no key comparisons (just a check for an empty chain), and adding the second item to each chain requires only one comparison, giving a total of \( N = k / 2 \) comparisons.
After insertion, each search requires a scan of a chain of length \( 2 \), which means half the time, one comparison will be made, and the other half of the time, two comparisons will be made. Thus, the average search cost will be \( 1.5 \).
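As a concrete check of these formulas: with \( N = 4 \) and \( k = 2N = 8 \), the worst case requires \( 8 \times 7 / 2 = 28 \) comparisons to build the single chain, while the best case requires \( 8 / 2 = 4 \) comparisons to build four chains of length 2.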
-
Consider a very small hash table with only 11 entries, and a simple hash function \( h(x) = x\ \%\ 11 \). If we start with an empty table and insert the following keys in the order shown
11 16 27 35 22 20 15 24 29 19 13
give the final state of the table for each of the following collision resolution strategies
- separate chaining, with chains in ascending order
- linear probing
- double hashing, with \( h_2(x) = (x\ \%\ 3) + 1 \)
Answer:
The hash values for each of the keys are given in the following table:

Key           11  16  27  35  22  20  15  24  29  19  13
\( h(x) \)     0   5   5   2   0   9   4   2   7   8   2
\( h_2(x) \)   3   2   1   3   2   3   1   1   3   2   2

Working through the insertions item-by-item for each case gives the following final states:

Separate chaining (each chain in ascending order):

Index 0: 11 -> 22
Index 2: 13 -> 24 -> 35
Index 4: 15
Index 5: 16 -> 27
Index 7: 29
Index 8: 19
Index 9: 20
(all other chains are empty)

Linear probing:

Index   0   1   2   3   4   5   6   7   8   9  10
Key    11  22  35  24  15  16  27  29  19  20  13

Double hashing:

Index   0   1   2   3   4   5   6   7   8   9  10
Key    11  13  35  24  22  16  27  15  19  20  29
-
Consider the following hash table with primary hash function \(h(x) = x\ \%\ 10\), that uses double hashing for collision resolution with secondary hash function \(h_2(x) = x\ \%\ 3 + 1\):
Index   0   1   2   3   4   5   6   7   8   9
Key     0       12      24      36      48

What happens if we try to insert the key 10? Explain why this happens.
Answer:
When we try to insert the key 10, \(h(10) = 0\), but as index 0 is occupied, we use the secondary hash function, \(h_2(10) = 2\), so the increment is 2. Repeatedly incrementing by 2 (and wrapping if the index goes beyond the array bounds) causes us to check indices 2, 4, 6, 8, 0, 2, 4, ... and we never reach one of the empty slots.
This happens because the increment returned by the secondary hash function is not relatively prime to the size of the hash table.
When using double hashing, the secondary hash function should always return a value which is relatively prime to the size of the hash table. A simple way to ensure this is to make the size of the hash table prime, and then make sure that the secondary hash function returns a value smaller than the size, because if \(p\) is prime, then all positive integers less than \(p\) are relatively prime to \(p\).
For example, if the size of the hash table is 11 (which is prime), and the secondary hash function is \(h_2(x) = x\ \%\ 10 + 1\), the above problem would not occur.
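To make the cycle concrete, here is a small standalone sketch (not from the tutorial; the occupied array simply mirrors the table above) that traces the probe sequence when inserting 10:

#include <stdio.h>

#define SIZE 10

int main(void) {
    // slots 0, 2, 4, 6 and 8 are occupied, as in the table above
    int occupied[SIZE] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};

    int key = 10;
    int index = key % SIZE;          // h(10) = 0
    int increment = key % 3 + 1;     // h2(10) = 2, which shares a factor with 10

    for (int probes = 0; probes < SIZE; probes++) {
        printf("probing index %d\n", index);
        if (!occupied[index]) {
            printf("free slot found at index %d\n", index);
            return 0;
        }
        index = (index + increment) % SIZE;
    }
    printf("gave up: the probe sequence only ever visits even indices\n");
    return 0;
}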
-
Both hash tables and balanced search trees (e.g., AVL trees) are efficient data structures that can be used for the insertion, searching and deletion of key-value pairs. Compare the advantages and disadvantages of each.
Answer:
Some advantages and disadvantages of hash tables and balanced search trees:

Hash tables
- Advantages:
  - Insertion, search and deletion are \(O(1)\) on average
- Disadvantages:
  - Worst case complexity of \(O(n)\) (but this can be avoided with a good hash function)
  - Needs to be resized if more items are inserted
  - Can waste memory if very few items are inserted and most of the table goes unused

Balanced search trees
- Advantages:
  - Items are ordered (unlike in a hash table), so it is possible to perform queries such as finding the smallest/greatest element efficiently
  - Allocates memory as needed (but requires extra memory for pointers)
- Disadvantages:
  - Slower than hash tables (but still very efficient): insertion, search and deletion are \(O(\log n)\) in the worst case

-
In lectures, we discussed how to solve the famous two-sum problem efficiently using a hash table. Taking inspiration from that, let's see if you can now solve the three-sum problem!
Here's the problem - given an array of integers and a target sum \(S\), determine whether the array contains three integers that sum to \(S\). Try to find a solution that is \(O(n^2)\) on average.
bool threeSum(int arr[], int size, int sum);
As a reminder, here are the hash table operations:
HashTable HashTableNew(void);
void HashTableFree(HashTable ht);
void HashTableInsert(HashTable ht, int key, int value);
bool HashTableContains(HashTable ht, int key);
int HashTableGet(HashTable ht, int key);
void HashTableDelete(HashTable ht, int key);
int HashTableSize(HashTable ht);
Answer:
The idea is to use a hash table as a set to store sums of pairs of integers from the array.
For each integer \(a\) in the array, check whether \(S - a\) exists in the hash table. If it does, then the answer is yes; if it doesn't, then for each element \(b\) to the left of \(a\), add \(a + b\) to the hash table. (Since the table holds the sums of all pairs of elements that appear before \(a\), a hit means two earlier elements plus \(a\) sum to \(S\).)
bool threeSum(int arr[], int size, int sum) {
    HashTable ht = HashTableNew();
    for (int i = 0; i < size; i++) {
        if (HashTableContains(ht, sum - arr[i])) {
            HashTableFree(ht);
            return true;
        }
        for (int j = 0; j < i; j++) {
            HashTableInsert(ht, arr[i] + arr[j], 0);
        }
    }
    HashTableFree(ht);
    return false;
}
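As a quick sanity check, here is a hedged usage sketch; the array and target below are made up for illustration:

#include <stdio.h>

int main(void) {
    int nums[] = {3, 7, 1, 9, 5};
    // 3 + 7 + 5 = 15, so this should print "found"
    if (threeSum(nums, 5, 15)) {
        printf("found\n");
    } else {
        printf("not found\n");
    }
    return 0;
}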
This solution is \(O(n^2)\) on average, as there are two nested loops, and the operations HashTableContains and HashTableInsert are \(O(1)\) on average.
-
Fun fact: 2521 is a prime number! If a hash table which stores integer keys had 2521 slots and used double hashing for collision resolution, define a suitable secondary hash function \(h_2\) for the table. Briefly comment on what makes it suitable.
Answer:
A suitable function is \(h_2(k) = k\ \%\ 2520 + 1\), which maps keys into the range \([1, 2520]\). This ensures that the probe increment is always relatively prime to the table size. During collision resolution, this property is advantageous as it means we won't revisit a table slot until all other slots have been visited first, reducing key clustering.
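For concreteness, here is a minimal sketch of the pair of hash functions, assuming non-negative integer keys (the function names are illustrative):

#define TABLE_SIZE 2521   // prime

// primary hash: index in the range [0, 2520]
int hash(int key) {
    return key % TABLE_SIZE;
}

// secondary hash: increment in the range [1, 2520], which is always
// relatively prime to the prime table size 2521
int hash2(int key) {
    return key % (TABLE_SIZE - 1) + 1;
}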
-
Calls to functions (e.g. some recursive functions) are sometimes very expensive to compute. One often-employed trick when working with such functions is memoisation: the first time the function is called with some input, we cache this result (i.e. store it somewhere for use later on), and then for subsequent calls to the function with the same input, we consult the cache rather than actually evaluating the call.
- Design an ADT for a memoiser that supports memoisation of an expensive single-argument function.
- One potential problem with memoisation is that the cache's memory usage can grow quite large when the function is called for lots of inputs. This is often solved by limiting the size of the cache and having a policy which can be used to pick entries to remove from the cache to make more room. What are some possible policies you could use?
- Can you think of how you would actually implement some of these policies?
Answer:
- Externally, the ADT can be quite simple: aside from the standard creation and destruction operations, expose a call(x) operation that calls the memoised function with the argument x. Internally, we can use a hash table to store key-value pairs (x, f(x)) representing these function calls. To implement call(x), we first look for the value with key x in the hash table; if it's not there, we compute f(x), put it into the hash table for next time, and return the value. (A sketch of one possible implementation appears after this list.)
- There are lots of possible policies here. Here are a few:
- Evict the oldest result (i.e. the one inserted first among all currently-present results).
- Evict the least recently-used result (i.e. the one which has gone the longest time without being requested among all currently-present results).
- Evict the least frequently-used result (i.e. the one which has had the least number of requests among all currently-present results).
-
The first policy can be done by keeping a queue of key-value pairs alongside the hash table. When you need to do an eviction, dequeue a pair and then remove it from the hash table.
For the third policy, use a separate data structure alongside the hash table to store the number of times call has been used for each input. When you need to do an eviction, search for the input with the least number of calls.
The second policy is a bit more elaborate. Internally, keep a running count of the number of times the call function has been used with any input. Values of this count over time can then act as a timestamp of when call was last used for a specific input. Alongside the main hash table, use a separate data structure to store the timestamp of the last time call was used with each input. When you need to do an eviction, search for the input with the smallest timestamp and then remove that key from the hash table.
There are more sophisticated implementations of the second and third policies that reduce the cost of the required search in each case, but they're much more complex. Doing them well is a challenge!
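Here is a minimal sketch of a memoiser for an expensive int-to-int function, built on the HashTable operations listed earlier and using the first ("evict the oldest") policy via a small circular buffer. The names Memoiser, MemoiserNew, MemoiserFree, MemoiserCall, Func and the fixed MAX_CACHE_SIZE are illustrative assumptions, not part of any provided API.

#include <stdlib.h>

// assumes the HashTable interface shown earlier is available

#define MAX_CACHE_SIZE 1024   // assumed cap on the number of cached results

typedef int (*Func)(int);

typedef struct memoiser {
    HashTable cache;            // maps x -> f(x)
    Func f;                     // the expensive function being memoised
    int fifo[MAX_CACHE_SIZE];   // keys in insertion order, for "evict oldest"
    int start, count;           // circular-buffer bookkeeping
} *Memoiser;

Memoiser MemoiserNew(Func f) {
    Memoiser m = malloc(sizeof(*m));
    m->cache = HashTableNew();
    m->f = f;
    m->start = 0;
    m->count = 0;
    return m;
}

void MemoiserFree(Memoiser m) {
    HashTableFree(m->cache);
    free(m);
}

int MemoiserCall(Memoiser m, int x) {
    if (HashTableContains(m->cache, x)) {
        return HashTableGet(m->cache, x);   // cache hit - no recomputation
    }

    // cache miss - evict the oldest cached result if the cache is full
    if (m->count == MAX_CACHE_SIZE) {
        int oldest = m->fifo[m->start];
        HashTableDelete(m->cache, oldest);
        m->start = (m->start + 1) % MAX_CACHE_SIZE;
        m->count--;
    }

    int result = m->f(x);                   // the expensive call
    HashTableInsert(m->cache, x, result);
    m->fifo[(m->start + m->count) % MAX_CACHE_SIZE] = x;
    m->count++;
    return result;
}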