Laboratory
DRAFT: This lab exercise is subject to change.
Objectives
- Learn about balanced trees
- Practise tree implementations
- Learn about hash tables
- Learn about analysis of program performance via profiling
Assessment
- Deadline: 3 February 2019, 23:59:59
- Marks: 5
- Submission: give cs2521 lab08
This lab should be done individually.
Setting up
Create a directory for this lab, change into it, and retrieve the files with the fetch command:
2521 fetch lab08
Compiling with make will produce three executables: mkwords, words, and words2.
The mkwords program is fully functional, and produces sequences of N words using a random number generator with an (optional) seed. For example:
./mkwords 10 3
allpnl ahebpveeloatic ualoubyy hssaif rywt tiehaelsh oheom vmoe jeabzsa zqa
produces 10 “words” using a random number generator seed of 3.
If you want to generate really random (i.e. unreproducible) word lists, don’t supply a seed parameter, and mkwords will use an effectively random seed.
Note that mkwords will inevitably produce duplicate words for any reasonable-sized N.
words and words2 don’t currently work. words reads the dictionary words into a tree structure for fast lookup; words2 reads them into a hash table structure for fast lookup.
Exercise 1: Rotations (1 mark)
rotate_left and rotate_right in Tree.c are broken: they do not update the size information in the nodes. This stops other functions that rely on the size field, such as partition, from working properly.
You must fix these and write some white box tests to check they work correctly.
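For a feel of what the fix involves, here is a minimal sketch of a size-aware right rotation. It assumes each node has left, right and size fields, where size counts all nodes in the subtree; the actual struct and helper names in Tree.c may differ, so treat this as a guide rather than the required code. rotate_left is the mirror image.

// Sketch only: field names are assumptions, check Tree.c.
static size_t size_or_zero (tree_node *n)
{
    return (n == NULL) ? 0 : n->size;
}

static tree_node *rotate_right (tree_node *n)
{
    if (n == NULL || n->left == NULL)
        return n;
    tree_node *root = n->left;
    n->left = root->right;
    root->right = n;
    // only the two nodes whose children changed need new sizes;
    // recompute the lower node first, then the new subtree root
    n->size = 1 + size_or_zero (n->left) + size_or_zero (n->right);
    root->size = 1 + size_or_zero (root->left) + size_or_zero (root->right);
    return root;
}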
Exercise 2: Balancing Trees (1 mark)
You must get Exercise 1 working before you start this, otherwise the balancing functions used in this task will not work!
You should begin by looking primarily at the code in words.c.
a) Understanding the words program
words.c contains a program that reads the words in from a specified file, loads them into a tree, and then prints statistics about the tree.
After inserting each word into the tree, the program immediately searches for it, to make sure that it was actually inserted.
After inserting all words, it then searches for several “words” not in the input; these should not be found in the tree.
The program then prints out the number of nodes and the height of the tree used to store the words. It also prints out what kind of rebalancing strategy was used (the different rebalancing strategies are explained below).
Finally, it clears all of the memory used by the tree and exits.
b) Running the words program
The words program takes two command-line parameters:
- The name of the file to read from
- An integer that represents what tree balancing strategy should be used by the tree.
The words program can also read from its standard input if the filename is given as a single minus sign (-).
Make the file and confirm that it runs using balance strategy 0 (no rebalancing) by running it as follows:
./mkwords 10 | ./words - 0
You can run other balancing strategies (once you have implemented them) by using different command line arguments.
For example, once you have implemented the REBALANCE_1 strategy, you could try rebalancing the tree after every insertion by running:
./mkwords 10 | ./words - 1
To run experiments with sorted data you could do:
./mkwords 10 | sort | ./words - 1
or, for reverse-sorted data:
./mkwords 10 | sort -r | ./words - 1
To time your experiments you could do something like:
./mkwords 10 | sort | time ./words - 1
Note: You may want to comment out the line of code that actually draws the tree once you start running experiments, but it is handy for checking what the different trees do and making sure your program is working.
c) Completing tree_insert
We are going to investigate/implement the following approaches for rebalancing a BST:
- NO_REBALANCE: no rebalancing (normal BST insertion)
- REBALANCE_1: global rebalancing after every insertion
- REBALANCE_100: global rebalancing after every 100 items are inserted
- REBALANCE_1000: global rebalancing after every 1000 items are inserted
- RANDOMISED: randomised BST insertion
- SPLAY: splay insertion
Your task is to complete a function called tree_insert, which is partially implemented in the file Tree.c:
void tree_insert(Tree tree, Item it);
The function should call the appropriate insert_recursive and balance functions according to the specified rebalancing approach. The function already handles the first balancing strategy (not balancing the tree at all); you need to add code to handle the rest.
Note: All the necessary functions for standard insertion, balancing, random insertion and splay insertion are provided. You just need to call them appropriately.
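As a rough guide, the dispatch might look something like the sketch below. It assumes the Tree struct records its strategy and a running count of insertions, and that helpers named insert_recursive, insert_random, insert_splay and balance exist; check Tree.c for the actual field names and signatures.

// Sketch only: field and helper names are assumptions, not
// the definitive Tree.c interface.
void tree_insert (Tree tree, Item it)
{
    switch (tree->strategy) {
    case NO_REBALANCE:
        tree->root = insert_recursive (tree->root, it);
        break;
    case REBALANCE_1:
        tree->root = insert_recursive (tree->root, it);
        tree->root = balance (tree->root);
        break;
    case REBALANCE_100:
        tree->root = insert_recursive (tree->root, it);
        if (++tree->num_inserts % 100 == 0)
            tree->root = balance (tree->root);
        break;
    case REBALANCE_1000:
        tree->root = insert_recursive (tree->root, it);
        if (++tree->num_inserts % 1000 == 0)
            tree->root = balance (tree->root);
        break;
    case RANDOMISED:
        tree->root = insert_random (tree->root, it);
        break;
    case SPLAY:
        tree->root = insert_splay (tree->root, it);
        break;
    }
}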
d) Recording your results
Once you have implemented the above strategies, run all approaches on the different sized inputs and record in a file called lab08.txt:
- the heights of the resulting trees
- the amount of user time it takes to insert and search for all keys
- the amount of user time it takes to insert all keys (comment out the line in the main function that calls the search functions)
Note: Make sure you run the randomised BST approach a number of times to get an average (as this is a randomised approach and heights will differ ‘randomly’ each time).
Note: Our implementation allows duplicates to be inserted into the tree, but searching will just return the first occurrence found.
e) Discussion questions
Discuss (in your lab08.txt file) the reasons for the differences in height and run times between these different approaches and algorithms.
How does each algorithm influence the time to insert vs the time to search vs the height of the resulting tree?
What other tests could you run to compare these implementations?
Exercise 3: HashTable Implementation (1 mark)
a) Understanding the words2 program
words2.c is a program that is similar to words.c, but instead of using trees to store the words it stores them in a hash table.
The program is run in a similar way; however, this time the second command-line argument indicates what size to make the hash table (note: it’s best if this is a prime number).
For example:
./mkwords 10 | ./words2 - 100
reads in 10 words and stores them in a hash table of size 100.
b) Understanding HashTable.c
Inside the HashTable.c file, there is an implementation of a hash table that uses chaining to handle collisions.
The hash table consists of an indexed array of lists, which are based on the List data type.
The core hash table functions (insert, delete, search) are all quite simple: they use the hash function to find the appropriate list, and then carry out the relevant operation using the appropriate List function.
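To make that pattern concrete, here is a minimal sketch of the two central operations. The field names (nslots, lists) and the exact hash() and List.h signatures are assumptions; read HashTable.c and List.h for the real interface.

// Sketch of "hash, then delegate to the list"; names are assumed.
char *hash_table_search (hash_table *ht, char *key)
{
    size_t i = hash (key, ht->nslots);       // pick the slot
    return list_search (ht->lists[i], key);  // scan that chain only
}

void hash_table_insert (hash_table *ht, char *key)
{
    size_t i = hash (key, ht->nslots);
    if (list_search (ht->lists[i], key) == NULL)  // no duplicates
        list_insert (ht->lists[i], key);
}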
All of the hash table functions are complete except for hash_table_stats().
List.c
Other relevant code is in List.c, which provides a standard implementation of a linked list of items; you can assume it is correct.
Item.c
Similarly, the file Item.c contains an implementation for items; normally, this would be done simply as a set of macros, but we have used functions for some of the operations to create entries in the profile.
c) Completing hash_table_stats
Your first task is to complete the hash_table_stats() function in HashTable.c. This function should print the following information about the hash table:
- the number of slots in the hash table array
- the number of items stored in the (lists of the) hash table
- a table of chain-length information, containing:
  - the length of chains (use zero length for unused array slots)
  - the number (frequency) of chains with this length
The table should have a row for each chain length from 0 up to the maximum chain length. Note that you’ll need to work out the maximum chain length, and then allocate an array of counters of the appropriate size.
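One possible shape for the function is sketched below: one pass to find the maximum chain length, a calloc’d array of counters indexed by length, and a second pass to fill it in. It assumes list_length() from List.h and the hash-table fields used in the sketch above; treat it as a starting point, not the required solution.

#include <stdio.h>
#include <stdlib.h>

// Sketch of hash_table_stats(); field and helper names are assumed.
void hash_table_stats (hash_table *ht)
{
    size_t nitems = 0, maxlen = 0;
    for (size_t i = 0; i < ht->nslots; i++) {
        size_t len = list_length (ht->lists[i]);
        nitems += len;
        if (len > maxlen)
            maxlen = len;
    }
    printf ("Number of slots = %zu\n", ht->nslots);
    printf ("Number of items = %zu\n", nitems);

    // one counter per possible chain length, 0 .. maxlen inclusive
    size_t *freq = calloc (maxlen + 1, sizeof (size_t));
    for (size_t i = 0; i < ht->nslots; i++)
        freq[list_length (ht->lists[i])]++;

    printf ("Chain length distribution\n  Length  #Chains\n");
    for (size_t len = 0; len <= maxlen; len++)
        if (freq[len] > 0)   // lengths with no chains are skipped
            printf ("%8zu %8zu\n", len, freq[len]);
    free (freq);
}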
Output
When functioning correctly, the program should behave as follows:
./mkwords 2000 13 > wordsfile
./words2 wordsfile 1777
Reading words from wordsfile
Hash Table Stats:
Number of slots = 1777
Number of items = 1970
Chain length distribution
  Length  #Chains
       0      585
       1      657
       2      351
       3      139
       4       33
       5       11
       7        1
More output
Note that you could produce the same output, without needing an intermediate file, using the command:
./mkwords 2000 13 | ./words2 - 1777
The above commands insert 2000 words (1970 distinct words) into a hash table with 1777 slots. The output tells us that there are 585 unused slots in the hash table (chain length 0), and 657 slots with chains of length 1, etc.
Chain length
If there are no chains of a given length, then nothing is written for that length, e.g.
./mkwords 1000 7 | ./words2 - 101
Reading words from stdin
Hash Table Stats:
Number of slots = 101
Number of items = 991
Chain length distribution
  Length  #Chains
       2        1
       4        2
       5        2
       6        6
       7       16
       8       12
       9       13
      10       15
      11        8
      12        6
      13        5
      14        5
      15        3
      16        3
      17        1
      18        2
      19        1
This output is for a small hash table with just 101 slots. Since there is no entry for 0, all of the slots in the hash table have been used (i.e. no slots with zero items).
Since there are no entries for 1 and 3, there are no chains of length 1 or 3; this is not in itself interesting, and is just the way it works out for this data.
Overflow chains
Many of the slots have overflow chains of length 10; hits on any of these hash table entries will result in us examining up to 10 items. We can also see that the longest chain contains 19 items.
Ideally, we would like each chain to contain a small number of items; this means that the hash table needs to have a number of slots that is proportional to the number of items (e.g. if the hash function works perfectly and we have n items and n/3 slots, we’d expect each chain to be of length ~3).
Note: Our implementation does not allow duplicates to be inserted into the hashtable
Exercise 4: HashTable Analysis (1 mark)
Background: Execution Profiling
Execution profilers measure statistics like the number of times each function is called and the average time spent executing each function, in order to identify which functions are contributing most to the cost of executing the program. Raw timing data (e.g. using the time command) gives you an overview of how fast a program is, but an execution profile gives you detailed information about where it’s spending all of its time, and thus where you should focus any attempts at improving its performance.
The standard execution profiler on Linux is gprof. In order to profile a C program, the program needs to be compiled and linked with the -pg flag.
When you compiled the words program with make earlier, you may have noticed that -pg appeared in each compilation command.
Programs compiled with -pg include additional code that monitors the execution of the program and measures:
- overall execution time
- how many times each function is called
- which functions call which other functions
- how much time it takes to execute each function
Generating an execution profile using gprof
To generate an execution profile, run e.g. the following two commands:
./mkwords 100000 3 | ./words2 - 50033
....
gprof words2 | less
The mkwords program outputs 100000 words, and passes them to the words2 program, which inserts them into a hash table with 50033 slots.
Since mkwords may produce duplicates when generating large numbers of words, the actual number of distinct words added to the hash table may be less than the number of words requested (in this case, 90936 distinct items will be inserted).
Remember that for each word in the input, there will be one insert operation (to add it to the hash table) and one search operation (to check that it was added).
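In effect, the driver loop behind the call counts looks something like the sketch below (variable and function names are assumed, not taken from words2.c); the handful of searches beyond one-per-word come from probing for “words” that are not in the input.

// Sketch of the read loop the profile reflects (names assumed):
// each input word costs one insert plus one verifying search.
char word[BUFSIZ];
while (fscanf (in, "%s", word) == 1) {
    hash_table_insert (ht, word);
    if (hash_table_search (ht, word) == NULL)
        fprintf (stderr, "insert failed for '%s'\n", word);
}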
Since gprof produces quite a lot of output, it is useful to pipe it through the less command, which allows us to scroll through the output one screenful at a time.
Output from gprof
The output from gprof has two parts:
- flat profile: how much time was spent in each function, and how many times it was called
- graph profile: how many times each function was called from which other functions
For this lab, we consider only the flat profile, although you might want to check the graph profile to see if it gives you any information that might be useful to tune the program.
If you execute the words2 program as above, then the flat profile will look approximately like:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
100.00 0.04 0.04 200004 200.00 200.00 hash
0.00 0.04 0.00 278050 0.00 0.00 cmp
0.00 0.04 0.00 200004 0.00 0.00 list_search
0.00 0.04 0.00 100004 0.00 200.00 hash_table_search
0.00 0.04 0.00 100000 0.00 200.00 hash_table_insert
0.00 0.04 0.00 90936 0.00 0.00 list_insert
0.00 0.04 0.00 90936 0.00 0.00 item_drop
0.00 0.04 0.00 50033 0.00 0.00 list_length
0.00 0.04 0.00 50033 0.00 0.00 list_drop
0.00 0.04 0.00 50033 0.00 0.00 list_new
0.00 0.04 0.00 1 0.00 0.00 hash_table_print_stats
0.00 0.04 0.00 1 0.00 0.00 hash_table_drop
0.00 0.04 0.00 1 0.00 0.00 hash_table_new
Output variance
Warning: for the same input, the counts of function calls should be consistent across all machines, and you should be able to explain (from looking at the code) where the counts come from.
However, the timing (being sampling-based) may differ from machine to machine, and even differ between executions of the program on the same machine. This may result in significantly different times, different percentages, and even a different ordering of functions.
Given such variations, is this kind of profile even useful? Yes, because the most time-consuming functions will be consistently higher than all of the others, and thus give you a basis for tuning.
Understanding the output
Each line gives some statistics for one function in the program. The “% time” column indicates the percentage of execution time that was spent in the function. The (approximate) total running time of the program can be obtained by reading down the “cumulative seconds” column; the final value is the total time. The “self seconds” column gives the total time spent executing that function during the entire course of program execution; the values in this column are added to obtain the “cumulative seconds” column. The “calls” column indicates the total number of times that the function was called during program execution. The “self ns/call” column gives the average time spent in each call of the function, while the “total ns/call” column gives the time spent in this function plus any of the functions it calls.
In the above profile, you can see that the program takes a total of 0.04 seconds to run (the last value in the “cumulative seconds” column).
Function-wise, most of the time is spent in the hash() function. (However, as noted above, you may observe different functions and quite different percentages.)
Improving performance
You might be surprised to see that most of the functions appear to take no time at all to run. The most likely explanation is that the total time spent in such a function is below the profiler’s sampling granularity (each sample counts as 0.01 seconds), so it is rounded down to zero.
Such a small cost may mean that the function itself is not inherently inefficient; if it features prominently in the cumulative time, you need to consider why it’s being called so many times (which is where the graph profile helps).
In the above example, the hash() function is called many times, but can’t be called fewer times (why not?), so if we want to tune the performance of this program, we would need to improve the speed of the hash() function (but without sacrificing its good performance in spreading hash values across the index range).
If we make the hashing distribution worse while making the hash() function faster, we might simply end up moving the cost from the hash() function to some other function, such as list_search().
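To illustrate that trade-off (neither function below is claimed to be the hash() in HashTable.c): a bare sum of character codes is about as cheap as a string hash can get, but it sends every anagram to the same slot; Bernstein’s well-known djb2 hash spends a multiply per character and spreads keys far better, which keeps chains short.

// Two illustrative string hashes; neither is necessarily the lab's.
unsigned long hash_sum (char *s, unsigned long nslots)
{
    unsigned long h = 0;
    while (*s != '\0')
        h += (unsigned char) *s++;   // anagrams all collide
    return h % nslots;
}

unsigned long hash_djb2 (char *s, unsigned long nslots)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return h % nslots;
}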
Analysis
For testing the words2 program, you will want to use some reasonably large inputs. Try running your program as follows:
./mkwords 1000000 3 | ./words2 - 424247
Note that the above command inserts 857424 distinct words into a hash table with 424247 slots. Clearly, since there are more words than slots, chains of length greater than one will occur frequently in this table. You could try adding more slots to the table to see how this improves the average/maximum chain length. You could also try inserting more words, to do some serious stress testing. We suggest keeping the ratio of words to slots at less than 2:1, and ideally closer to 1:1 (which is aiming at one slot per word).
Remember that the whole point of minimising chain lengths is that the worst case lookup cost is proportional to the maximum chain length, and the average case cost is proportional to the average chain length.
You will quickly notice, while running with large inputs, that different slot numbers produce different average chain lengths, and those examples with shorter average chain lengths run much faster than those with longer average chain lengths.
If you want an alternative word set, there is a dictionary of 90,000 English words and a dictionary of 5 million place names (from the GeoNames database) which you could use on the CSE workstations.
Try the following commands:
./words2 /web/cs2521/19t0/week08/lab/dict 49999
./words2 /web/cs2521/19t0/week08/lab/dict 50000
./words2 /web/cs2521/19t0/week08/lab/places 1048576
./words2 /web/cs2521/19t0/week08/lab/places 1048573
Don’t copy these files to your own directory as they are quite large.
Discussion questions
Consider the questions below, using the above command and variations on it (see below).
Put your answers in a text file called lab08.txt.
Your answers file should (ultimately) contain:
- answers to the questions below
- output from words2 to illustrate your answers (where appropriate)
- flat profiles to illustrate your answers (where appropriate)
- analyses/explanations for all answers
The precise format of the answers file is not important. What is important is that you explain your answers, using relevant evidence from both the profile output and the output of the words2 program.
- The mkwords 1000000 3 command produces 857424 distinct words.
  - What is the maximum chain length if a hash table size of 85711 is used?
  - How does the chain length distribution change if the hash table size is 100000? 214283? 400000? 400837? 857144? 857137?
  - Every other number above (i.e. 85711, 214283, 400837, 857137) is prime. It is often noted that using prime numbers appropriately in the hash function leads to a better distribution of hash values, and thus generally shorter chains. Does this appear to be the case for the hash table sizes in the previous question?
  - An "optimal" hash table would have all slots occupied and have all chains of length roughly (nwords/nslots). In practice, this is impossible to achieve in the general case; what we want is a table with relatively short chains, with as few slots as possible (small size of hash table), and not too many empty slots. Can you find a suitable hash table size that keeps the maximum chain length under 10, and has most chains with length 1 or 2, but also uses more than 2/3 of the slots?
- Compare both the outputs and the profiles for the two commands:
  ./words2 /web/cs2521/19t0/week08/lab/places 1048576
  ./words2 /web/cs2521/19t0/week08/lab/places 1048573
  What does this tell you about hash table search performance when the hash function is significantly sub-optimal?
- Examine the profiles from running the command:
  ./mkwords 1000000 | ./words2 - N
  for a number of different values of N. What are the most costly functions (in terms of overall time)?
- How could the individual functions be improved? How could the overall performance be improved?
- Implement your suggestions and then give a new profile to show the improvement, and explain how the profile shows the improvement.
If you want some prime numbers to use for testing different table sizes, check out a handy list of primes at primes.utm.edu/lists.
Bonus: Search in a Splay Tree (1 mark)
In lectures, we discussed inserting an item into a splay tree (balanced tree examples). The amortised cost of insert and search operations on splay trees is only valid if we also perform splay rotations when searching for an item.
A partial implementation of search_splay has been provided in Tree.c to get you started. Its prototype is as follows:
static tree_node *search_splay (tree_node *n, Key k, bool *found);
It searches for an item and performs splay rotations on the traversed path, much like the function insert_splay() does.
NOTE: once an item has been found, it is moved to the root of the tree. If the key does not exist in the tree, the last node on the search path should be moved to the root.
You can look at insert_splay() to help you implement this function.
NOTE: because the search function changes the root of the tree, it needs to return a pointer to the new root, so the actual result of the search should be stored in the variable pointed to by found. This should be true if the key was in the tree and false otherwise.
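To get a feel for the shape of the function, here is a deliberately simplified “move-to-root” sketch that uses only the single rotations from Exercise 1; a true splay search should mirror the zig-zig/zig-zag double-rotation cases used by insert_splay(), so treat this as scaffolding rather than a finished answer. The helper names (cmp, rotate_left, rotate_right) are assumed to match Tree.c.

// Simplified move-to-root search: single rotations only.
static tree_node *search_splay (tree_node *n, Key k, bool *found)
{
    if (n == NULL) {
        *found = false;
        return NULL;
    }
    int c = cmp (k, n->key);
    if (c == 0) {
        *found = true;
        return n;                        // target is now subtree root
    } else if (c < 0) {
        if (n->left == NULL) {
            *found = false;              // absent: splay last node seen
            return n;
        }
        n->left = search_splay (n->left, k, found);
        return rotate_right (n);         // lift target one level
    } else {
        if (n->right == NULL) {
            *found = false;
            return n;
        }
        n->right = search_splay (n->right, k, found);
        return rotate_left (n);
    }
}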
The tree_search function needs to be modified so that it calls the appropriate version of search, depending on which balancing strategy is used: if the balance strategy is SPLAY, it uses search_splay; otherwise it uses the standard search_recursive.
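The dispatch itself can be as small as the sketch below; the struct fields and the search_recursive name are assumptions, so match them to your Tree.c.

// Sketch: only the SPLAY strategy restructures the tree, so only
// that branch needs to update tree->root.
bool tree_search (Tree tree, Key k)
{
    bool found = false;
    if (tree->strategy == SPLAY)
        tree->root = search_splay (tree->root, k, &found);
    else
        found = search_recursive (tree->root, k);
    return found;
}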
Devise and run some experiments to compare your splay tree to the previous splay tree that only performed splay operations on insertion. Write it up in your lab08.txt file.
If you have time, try to factor the common splay logic out of insert_splay and search_splay.
Submitting
You must get the lab marked by your tutor in your lab.
Submit your work with the give command, like so:
give cs2521 lab08 Tree.c lab08.txt HashTable.c
Or, if you are working from home, upload the relevant file(s) to the lab08 activity on WebCMS3 or on Give Online.