Extra Lab Exercise

Hashing and Profiling

Objectives

  • To learn about hash tables and the effectiveness of hash functions
  • To learn about analysis of program performance via profiling

Admin

This lab is not marked and there is no submission for it.

Background

In lectures, we examined hashing as a key-based search/storage data structure. Under ideal circumstances, hashing can give \( O(1) \) access to stored items via a key value. Two critical components in a hashing scheme are: the hash function (which converts key values into indexes), and the collision resolution method (which deals with several keys hashing to the same index value). In this lab, we deal with a hash table implementation that uses Sedgewick's "universal" hashing function on character strings, and uses the separate chaining mechanism for collision resolution.
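
For reference, Sedgewick's universal hash for strings (from Algorithms in C) looks roughly like the sketch below; the version in HashTable.c may differ in detail. The coefficient a changes pseudo-randomly from character to character, so every character of the key influences the final index:

// A sketch of Sedgewick's universal string hash; the implementation
// in HashTable.c may differ in detail.
int hash(char *key, int nslots) {
    int h = 0, a = 31415, b = 27183;
    for (char *c = key; *c != '\0'; c++) {
        h = (a * h + *c) % nslots;   // fold in the next character
        a = a * b % (nslots - 1);    // vary the coefficient per position
    }
    return (h < 0) ? h + nslots : h; // guard against overflow going negative
}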

Setting Up

Set up a directory for this lab, change into that directory, and run the following command:

unzip /web/cs2521/24T1/labs/week17/downloads/files.zip

If you're working at home, download files.zip by clicking on the above link and then run the unzip command on the downloaded file.

If you've done the above correctly, you should now have the following files:

Makefile
a set of dependencies used to control compilation
words.c
a main program that reads words and builds a hash table from the words
HashTable.h
the interface for the Hash Table ADT
HashTable.c
an incomplete implementation of the Hash Table ADT
List.h
the interface for the List ADT
List.c
a complete implementation of the List ADT
Item.h
the interface for the Item ADT
Item.c
a complete implementation of the Item ADT
mkwords.c
a main program that generates random words

Compiling with make will produce two executables: mkwords and words. The mkwords program is fully functional and produces sequences of words using a random number generator with an (optional) seed. For example:

./mkwords 10 3
allpnl
ahebpveeloatic
ualoubyy
hssaif
rywt
tiehaelsh
oheom
vmoe
jeabzsa
zqa

The example above produces 10 "words" using a random number generator seed of 3. If you want to generate really random (i.e. unreproducible) word lists, don't supply a seed parameter and mkwords will use an effectively random seed.

Note that mkwords will inevitably produce duplicate words for any reasonably large number of words; the insertion function for the hash table checks for and handles duplicates.

The words program also executes to completion, but since the HashTableStats() function is incomplete, you won't get particularly interesting output. See below for examples of what the output from words looks like.

You should begin by looking primarily at the code in these files:

words.c

This contains a program that reads the words in a specified file, loads them into a hash table, and then prints statistics about the hash table. After inserting each word into the hash table, the program immediately searches for it, to make sure that it was actually inserted (this insert-then-verify pattern is sketched below). It then searches for several "words" not in the input; these should not be found in the hash table. Finally, it frees all of the memory used by the hash table and exits.

The words program takes two command-line parameters: the name of the file to read from, and the number of slots in the hash table (best if this is a prime number).

The words program can also read from its standard input if the filename is given as a single minus sign. However, when reading from standard input, it only performs the tests for words not in the input.
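
The heart of words.c is the insert-then-verify loop mentioned above. The following is a minimal sketch of that pattern only; the helper name and the exact HashTable function signatures here are assumptions, not the actual interfaces in the supplied files:

#include <stdio.h>
#include <assert.h>
#include "HashTable.h"

// Sketch: read words from stream in, inserting each one and then
// immediately searching for it (signatures are assumed, not actual).
void loadWords(FILE *in, HashTable ht) {
    char word[BUFSIZ];
    while (fscanf(in, "%s", word) == 1) {
        HashTableInsert(ht, word);                 // add the word
        assert(HashTableSearch(ht, word) != NULL); // confirm it is there
    }
}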

HashTable.c

This provides an implementation of a hash table that uses separate chaining for collision resolution. The hash table consists of an indexed array of lists, which are based on the List data type. The core hash table functions (insert, delete, search) are all quite simple, and consist of using the hash function to find the appropriate list, and then carrying out the relevant operation using the appropriate List function. All of the hash table functions are complete except for HashTableStats().
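
As a concrete illustration of this pattern, search might look like the sketch below. The struct layout and the ListSearch signature are assumptions for illustration, not the actual declarations in the supplied files:

// Assumed representation: an array of chains plus counters.
typedef struct HashTabRep {
    List *lists;   // one chain (linked list) per slot
    int   nslots;  // number of slots in the array
    int   nitems;  // total number of items stored
} HashTabRep;
typedef HashTabRep *HashTable;

// Hash the key to select a chain, then delegate the linear scan
// to the List ADT; ListSearch's signature here is assumed.
Item HashTableSearch(HashTable ht, char *key) {
    int i = hash(key, ht->nslots);         // index of the relevant chain
    return ListSearch(ht->lists[i], key);  // no-item result if absent
}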

Other relevant code is in List.c which provides a standard implementation of a linked list of items, which you can assume is correct. Similarly, the file Item.c contains an implementation for items; normally, this would be done simply as a set of macros, but we have used functions for some of the operations to create entries in the profile.
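
Why functions rather than macros? gprof can only count and time actual function calls; a macro is expanded inline by the preprocessor and never appears in the profile. A hypothetical illustration (the real Item.c may differ):

#include <string.h>

// As a macro, a key comparison is expanded inline, invisible to gprof:
#define CMP_MACRO(a, b) strcmp((a), (b))

// As a function, every call is counted and timed by the profiler;
// a cmp() function like this shows up in the sample profile below.
int cmp(char *a, char *b) {
    return strcmp(a, b);
}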

Task 1

Your first task is to complete the HashTableStats() function in HashTable.c. This function should print the following information about the hash table:

  • the number of slots in the hash table array
  • the number of items stored in the (lists of the) hash table
  • a table giving the distribution of chain lengths, containing
    • each chain length (counting unused array slots as chains of length zero)
    • the number (frequency) of chains with that length

The table should have a row for each chain length that occurs, from 0 up to the maximum chain length (lengths with a frequency of zero are omitted; see the examples below). You'll need to work out the maximum chain length, and then allocate an array of counters of the appropriate size. One possible approach is sketched below.
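
Here is one possible shape for the function, assuming the illustrative HashTabRep representation sketched earlier (all names here are illustrative; adapt them to the actual HashTable.c):

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

void HashTableStats(HashTable ht) {
    // Pass 1: record each chain's length and find the maximum.
    int *lens = malloc(ht->nslots * sizeof(int));
    assert(lens != NULL);
    int maxlen = 0;
    for (int i = 0; i < ht->nslots; i++) {
        lens[i] = ListLength(ht->lists[i]);  // 0 for an unused slot
        if (lens[i] > maxlen) maxlen = lens[i];
    }
    // Pass 2: tally how many chains have each length 0..maxlen.
    int *freq = calloc(maxlen + 1, sizeof(int));
    assert(freq != NULL);
    for (int i = 0; i < ht->nslots; i++)
        freq[lens[i]]++;
    // Print the table, omitting lengths that never occur.
    printf("Hash Table Stats:\n");
    printf("Number of slots = %d\n", ht->nslots);
    printf("Number of items = %d\n", ht->nitems);
    printf("Chain length distribution\n");
    printf("  Length  #Chains\n");
    for (int len = 0; len <= maxlen; len++)
        if (freq[len] > 0)
            printf("%8d %8d\n", len, freq[len]);
    free(lens);
    free(freq);
}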

When functioning correctly, the program should behave as follows:

./mkwords 2000 13 > wordsfile
./words wordsfile 1777
Reading words from wordsfile
Hash Table Stats:
Number of slots = 1777
Number of items = 1970
Chain length distribution
  Length  #Chains
       0      585
       1      657
       2      351
       3      139
       4       33
       5       11
       7        1
Testing completed OK

Note that you could produce the same output, without needing an intermediate file, using the command:

./mkwords 2000 13 | ./words - 1777

The above commands insert 2000 words (1970 distinct words) into a hash table with 1777 slots. The output tells us that there are 585 unused slots in the hash table (chain length 0), and 657 slots with chains of length 1, etc. If there are no chains of a given length, then nothing is written for that length, e.g.

./mkwords 1000 7 | ./words - 101
Reading words from stdin
Hash Table Stats:
Number of slots = 101
Number of items = 991
Chain length distribution
  Length  #Chains
       2        1
       4        2
       5        2
       6        6
       7       16
       8       12
       9       13
      10       15
      11        8
      12        6
      13        5
      14        5
      15        3
      16        3
      17        1
      18        2
      19        1
Testing completed OK

This output is for a small hash table with just 101 slots. Since there is no entry for 0, all of the slots in the hash table have been used (i.e. no slots with zero items). Since there are no entries for 1 and 3, there are no chains of length 1 or 3; this is not in itself interesting, and is just the way it works out for this data. With 991 items in 101 slots, the average chain length is 991/101 ≈ 9.8, and indeed many of the slots have chains of around length 10; hits on any of these hash table entries will result in us examining up to 10 items. We can also see that the longest chain contains 19 items. Ideally, we would like each chain to contain a small number of items; this means that the hash table needs to have a number of slots that is proportional to the number of items (e.g. if the hash function works perfectly and we have n items and n/3 slots, we'd expect each chain to be of length ~3).

Execution Profiling

Execution profilers measure statistics like the number of times each function is called and the average time spent executing each function, in order to identify which functions are contributing most to the cost of executing the program. Raw timing data (e.g. using the time command) gives you an overview of how fast a program is, but an execution profile gives you detailed information about where it's spending all of its time, and thus where you should focus any attempts at improving its performance.

The standard execution profiler on Linux is gprof. In order to profile a C program, the program needs to be compiled and linked with the -pg flag. When you compiled the words program with make earlier, you may have noticed that -pg appeared on each line with gcc.
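
For example, a typical compile-and-link sequence with profiling enabled might look like the following (the exact flags in the supplied Makefile may differ); note that -pg is needed at both the compile and link steps:

gcc -Wall -g -pg -c words.c
gcc -Wall -g -pg -o words words.o HashTable.o List.o Item.o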

Programs compiled with -pg include additional code that monitors the execution of the program and measures:

  • overall execution time
  • how many times each function is called
  • which functions call which other functions
  • how much time it takes to execute each function

To generate an execution profile, run commands like the following:

./mkwords 100000 3 | ./words - 50033
....
gprof words | less

The mkwords program outputs 100000 words, and passes them to the words program, which inserts them into a hash table with 50033 slots. Since mkwords may produce duplicates when generating large numbers of words, the actual number of distinct words added to the hash table may be less than the number of words requested (in this case, 90936 distinct items are inserted, matching the ListInsert count in the profile below). Remember that for each word in the input, there will be one insert operation (to add it to the hash table) and one search operation (to check that it was added).

Running the instrumented program writes its raw profiling data to a file called gmon.out in the current directory; the gprof command then combines this data with the executable to produce a readable report. Since gprof produces quite a lot of output, it is useful to pipe it through the less command, which allows us to scroll through the output one screenful at a time.

The output from gprof has two parts:

  • flat profile: how much time was spent in each function, and how many times it was called
  • graph profile: which functions called which other functions, and how often

For this lab, we consider only the flat profile, although you might want to check the graph profile to see if it gives you any information that might be useful to tune the program.

If you execute the words program as above, then the flat profile will look approximately like:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 44.50      0.04     0.04   200004   200.24   200.24  hash
 33.37      0.07     0.03   200004   150.18   150.18  ListSearch
 11.12      0.08     0.01   100004   100.12   450.53  HashTableSearch
 11.12      0.09     0.01   100000   100.12   450.53  HashTableInsert
  0.00      0.09     0.00   378051     0.00     0.00  cmp
  0.00      0.09     0.00   100001     0.00     0.00  ItemGet
  0.00      0.09     0.00   100000     0.00     0.00  newItem
  0.00      0.09     0.00    90936     0.00     0.00  ListInsert
  0.00      0.09     0.00    90936     0.00     0.00  dropItem
  0.00      0.09     0.00    50033     0.00     0.00  ListLength
  0.00      0.09     0.00    50033     0.00     0.00  dropList
  0.00      0.09     0.00    50033     0.00     0.00  newList
  0.00      0.09     0.00        1     0.00     0.00  HashTableStats
  0.00      0.09     0.00        1     0.00     0.00  dropHashTable
  0.00      0.09     0.00        1     0.00     0.00  newHashTable

Warning: Note that, for the same input, the counts of function calls should be consistent across all machines, and you should be able to explain (from looking at the code) where the counts come from. However, the timing (being sampling-based) may differ from machine to machine, and even differ between executions of the program on the same machine. This may result in significantly different times, different percentages, and even a different ordering of functions. Given such variations, is this kind of profile even useful? Yes, because the most time-consuming functions will be consistently higher than all of the others, and thus give you a basis for tuning.

Each line gives some statistics for one function in the program. The "% time" column indicates the percentage of execution time that was spent in the function. The (approximate) total running time of the program can be obtained by reading down the "cumulative seconds" column; the final value is the total time. The "self seconds" column gives the total time spent executing that function, during the entire course of program execution; the values in this column are added to obtain the "cumulative seconds" column. The "calls" column indicates the total number of times that the function was called during program execution. The "self ns/call" column gives the average time spent in each call of the function, while the "total ns/call" column gives the time spent in this function plus any of the functions it calls.

In the above profile, you can see that the program takes a total of 0.09 seconds to run (the last value in the "cumulative seconds" column). Function-wise, most of the time is spent in the hash() function. The next most expensive consumers of execution time are ListSearch(), HashTableSearch() and HashTableInsert(). (However, as noted above, you may observe different functions and quite different percentages.)

You might be surprised to see that most of the functions appear to cost 0.00 per call. The most likely explanation is that the total time spent in such a function falls below the profiler's sampling resolution (each sample counts as 0.01 seconds), so it is recorded as zero. Such a small cost means that the function itself is not inherently inefficient; if it nevertheless contributes noticeably to overall running time, you need to consider why it's being called so many times (which is where the graph profile helps).

In the above example, the hash() function is called many times, but can't be called fewer times (why not?), so if we want to tune the performance of this program, we would need to improve the speed of the hash() function (but without sacrificing its good performance in spreading hash values across the index range). If we make the hashing distribution worse while making the hash() function faster, we might simply end up moving the cost from the hash() function to some other function such as ListSearch().
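
For instance, one common speed-up is to replace the per-character modulo with cheaper arithmetic, deferring the reduction to the end. The sketch below is illustrative only; it abandons the "universal" coefficients entirely, so you would need to check its chain length distribution before adopting anything like it:

// Illustrative only: a cheaper multiplicative string hash.  This is
// NOT the supplied universal hash; it does one modulo per string
// rather than one per character, but may distribute keys less evenly.
unsigned int hash(char *key, int nslots) {
    unsigned int h = 0;
    for (char *c = key; *c != '\0'; c++)
        h = h * 31 + (unsigned char)*c;  // cheap multiply-and-add
    return h % nslots;                   // single reduction at the end
}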

Task 2

For testing the words program, you will want to use some reasonably large inputs. Try running your program as follows:

./mkwords 1000000 3 | ./words - 424247

Note that the above command inserts 857424 distinct words into a hash table with 424247 slots. Clearly, since there are more words than slots, chains of length greater than one will occur frequently in this table. You could try adding more slots to the table to see how this improves the average/maximum chain length. You could also try inserting more words, to do some serious stress testing. We suggest keeping the ratio of words to slots at less than 2:1, and ideally closer to 1:1 (which is aiming at one slot per word).

Remember that the whole point of minimising chain lengths is that the worst case lookup cost is proportional to the maximum chain length, and the average case cost is proportional to the average chain length. You will quickly notice, while running with large inputs, that different slot numbers produce different average chain lengths, and those examples with shorter average chain lengths run much faster than those with longer average chain lengths.
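
In the standard analysis of separate chaining, with \( n \) items in \( m \) slots and a well-behaved hash function, the expected chain length is the load factor \( \alpha = n/m \): an unsuccessful search examines about \( \alpha \) items, and a successful search about \( 1 + \alpha/2 \). Keeping \( \alpha \) near 1 is exactly what the suggested word-to-slot ratios above aim for.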

If you want an alternative word set, there is a dictionary of 90,000 English words and a dictionary of 5 million place names (from the GeoNames database) which you could use on the CSE workstations. Try the following commands:

./words /web/cs2521/24T1/labs/week17/data/dict 49999
./words /web/cs2521/24T1/labs/week17/data/dict 50000
./words /web/cs2521/24T1/labs/week17/data/places 1048576
./words /web/cs2521/24T1/labs/week17/data/places 1048573

Don't copy these files to your own directory as they are quite large.

Consider the questions below, using the above commands and variations on them. Put your answers in a text file called answers.txt.

Your answers file should (ultimately) contain:

  • answers to the questions below

  • output from words to illustrate your answers (where appropriate)

  • flat profiles to illustrate your answers (where appropriate)

  • analyses/explanations for all answers

The precise format of the answers file is not important. What is important is that you explain your answers, using relevant evidence from both the profile output and the output of the words program.

  1. The mkwords 1000000 3 command produces 857424 distinct words. What is the maximum chain length if a hash table size of 85711 is used? How does the chain length distribution change if the hash table size is 100000? 214283? 400000? 400837? 857144? 857137?

  2. Every other number above (i.e. 85711, 214283, 400837, 857137) is prime. It is often noted that using prime numbers appropriately in the hash function leads to a better distribution of hash values, and thus generally shorter chains. Does this appear to be the case for the hash table sizes in the previous question?

  3. An "optimal" hash table would have all slots occupied and have all chains of length roughly (nwords/nslots). In practice, this is impossible to achieve in the general case, and what we want is a table with relatively short chains, with as few slots as possible (small size of hash table), and not too many empty slots. Can you find a suitable hash table size that keeps the maximum chain length under 10, and has most chains with length 1 or 2, but also uses more than 2/3 of the slots?

  4. Compare both the outputs and the profiles for the two commands:

    $ ./words /web/cs2521/24T1/labs/week17/data/places 1048576
    $ ./words /web/cs2521/24T1/labs/week17/data/places 1048573
    

    What does this tell you about hash table search performance when the hash function is significantly sub-optimal?

  5. Examine the profiles from running the command:

    $ ./mkwords 1000000 | ./words - N
    

    For a number of different values of N, what are the most costly functions (in terms of overall time)?

  6. Suggest how the individual functions might be improved. Suggest how the overall performance might be improved.

  7. Implement your suggestions and then give a new profile to show the improvement, and explain how the profile shows the improvement.

If you want some prime numbers to use for testing different table sizes, there are a large number of them in:

/web/cs2521/24T1/labs/week17/data/primes