Laboratory
DRAFT: This lab exercise is subject to change.
Objectives
- Learn about balanced trees
- Practise tree implementations
- Learn about hash tables
- Learn about analysis of program performance via profiling
Assessment
- Deadline: 3 February 2019, 23:59:59
- Marks: 5
- Submission: give cs2521 lab08
This lab should be done individually.
Setting up
Create a directory for this lab, change into it, and retrieve the files with the fetch command:
2521 fetch lab08
Compiling with make will produce three executables: mkwords, words, and words2.
The mkwords program is fully functional, and produces sequences of N words using a random number generator with an (optional) seed. For example:
./mkwords 10 3
allpnl ahebpveeloatic ualoubyy hssaif rywt tiehaelsh oheom vmoe jeabzsa zqa
produces 10 “words” using a random number generator seed of 3.
If you want to generate really random (i.e. unreproducible) word lists, don’t supply a seed parameter, and mkwords will use an effectively random seed.
Note that mkwords will inevitably produce duplicate words for any reasonable-sized N.
words and words2 don’t currently work. words reads the dictionary words into a tree structure for fast lookup; words2 reads them into a hash table structure for fast lookup.
Exercise 1: Rotations (1 mark)
rotate_left and rotate_right in Tree.c are broken: they do not update the size information in the nodes. This stops other functions that rely on the size field, such as partition, from working properly.
You must fix these and write some white box tests to check they work correctly.
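For a feel of what the fix involves, here is a minimal sketch of a size-aware right rotation. It assumes each node has left, right and size fields, where size counts all nodes in the subtree; the actual struct and helper names in Tree.c may differ, so treat this as a guide rather than the required code. rotate_left is the mirror image.

// Sketch only: field names are assumptions, check Tree.c.
static size_t size_or_zero (tree_node *n)
{
    return (n == NULL) ? 0 : n->size;
}

static tree_node *rotate_right (tree_node *n)
{
    if (n == NULL || n->left == NULL)
        return n;
    tree_node *root = n->left;
    n->left = root->right;
    root->right = n;
    // only the two nodes whose children changed need new sizes;
    // recompute the lower node first, then the new subtree root
    n->size = 1 + size_or_zero (n->left) + size_or_zero (n->right);
    root->size = 1 + size_or_zero (root->left) + size_or_zero (root->right);
    return root;
}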
Exercise 2: Balancing Trees (1 mark)
You must get Exercise 1 working before you start this, otherwise the balancing functions used in this task will not work!
You should begin by looking primarily at the code in words.c.
a) Understanding the words program
words.c contains a program that reads the words in from a specified file, loads them into a tree, and then prints statistics about the tree.
After inserting each word into the tree, the program immediately searches for it, to make sure that it was actually inserted.
After inserting all words, it then searches for several “words” not in the input; these should not be found in the tree.
The program then prints out the number of nodes and the height of the tree used to store the words. It also prints out what kind of rebalancing strategy was used (the different rebalancing strategies are explained below).
Finally, it clears all of the memory used by the tree and exits.
b) Running the words program
The words program takes two command-line parameters:
- The name of the file to read from
- An integer that represents what tree balancing strategy should be used by the tree.
The words program can also read from its standard input if the filename is given as a single minus sign (-).
Make the file and confirm that it runs using balance strategy 0 (no rebalancing) by running it as follows:
./mkwords 10 | ./words - 0
You can run other balancing strategies (once you have implemented them) by using different command line arguments.
For example, once you have implemented the REBALANCE_1 strategy, you could try rebalancing the tree after every insertion by running:
./mkwords 10 | ./words - 1
To run experiments with sorted data you could do:
./mkwords 10 | sort | ./words - 1
or, for reverse-sorted data:
./mkwords 10 | sort -r | ./words - 1
To time your experiments you could do something like:
./mkwords 10 | sort | time ./words - 1
Note: You may want to comment out the line of code that actually draws the tree once you start running experiments, but it is handy for checking what the different trees do and making sure your program is working.
c) Completing tree_insert
We are going to investigate/implement the following approaches for rebalancing a BST:
- NO_REBALANCE: no rebalancing (normal BST insertion)
- REBALANCE_1: global rebalancing after every insertion
- REBALANCE_100: global rebalancing after every 100 items are inserted
- REBALANCE_1000: global rebalancing after every 1000 items are inserted
- RANDOMISED: randomised BST insertion
- SPLAY: splay insertion
Your task is to complete a function called tree_insert, which is partially implemented in the file Tree.c:
void tree_insert(Tree tree, Item it);
The function should call the appropriate insert_recursive and balance functions according to the specified rebalancing approach. The function already handles the first balancing strategy (not balancing the tree at all); you need to add code to handle the rest.
Note: All the necessary functions for standard insertion, balancing, random insertion and splay insertion are provided. You just need to call them appropriately.
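As a rough guide, the dispatch might look something like the sketch below. It assumes the Tree struct records its strategy and a running count of insertions, and that helpers named insert_recursive, insert_random, insert_splay and balance exist; check Tree.c for the actual field names and signatures.

// Sketch only: field and helper names are assumptions, not
// the definitive Tree.c interface.
void tree_insert (Tree tree, Item it)
{
    switch (tree->strategy) {
    case NO_REBALANCE:
        tree->root = insert_recursive (tree->root, it);
        break;
    case REBALANCE_1:
        tree->root = insert_recursive (tree->root, it);
        tree->root = balance (tree->root);
        break;
    case REBALANCE_100:
        tree->root = insert_recursive (tree->root, it);
        if (++tree->num_inserts % 100 == 0)
            tree->root = balance (tree->root);
        break;
    case REBALANCE_1000:
        tree->root = insert_recursive (tree->root, it);
        if (++tree->num_inserts % 1000 == 0)
            tree->root = balance (tree->root);
        break;
    case RANDOMISED:
        tree->root = insert_random (tree->root, it);
        break;
    case SPLAY:
        tree->root = insert_splay (tree->root, it);
        break;
    }
}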
d) Recording your results
Once you have implemented the above strategies, run all approaches on the different sized inputs and record in a file called lab08.txt:
- the heights of the resulting trees
- the amount of user time it takes to insert and search for all keys
- the amount of user time it takes to insert all keys (comment out the line in the main function that calls the search functions)
Note: Make sure you run the randomised BST approach a number of times to get an average (as this is a randomised approach and heights will differ ‘randomly’ each time).
Note: Our implementation allows duplicates to be inserted into the tree, but searching will just return the first occurrence found.
e) Discussion questions
Discuss (in your lab08.txt file) the reasons for the differences in height and run times between these different approaches and algorithms.
How does each algorithm influence the time to insert vs the time to search vs the height of the resulting tree?
What other tests could you run to compare these implementations?
Exercise 3: HashTable Implementation (1 mark)
a) Understanding the words2 program
words2.c is a program that is similar to words.c, but instead of using trees to store the words it stores them in a hash table.
The program is run in a similar way; however, this time the second command-line argument indicates what size to make the hash table (note: it’s best if this is a prime number).
For example:
./mkwords 10 | ./words2 - 100
reads in 10 words and stores them in a hash table of size 100.
b) Understanding HashTable.c
Inside the HashTable.c file, there is an implementation of a hash table that uses chaining to handle collisions.
The hash table consists of an indexed array of lists, which are based on the List data type.
The core hash table functions (insert, delete, search) are all quite simple: they use the hash function to find the appropriate list, and then carry out the relevant operation using the appropriate List function.
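To make that pattern concrete, here is a minimal sketch of the two central operations. The field names (nslots, lists) and the exact hash() and List.h signatures are assumptions; read HashTable.c and List.h for the real interface.

// Sketch of "hash, then delegate to the list"; names are assumed.
char *hash_table_search (hash_table *ht, char *key)
{
    size_t i = hash (key, ht->nslots);       // pick the slot
    return list_search (ht->lists[i], key);  // scan that chain only
}

void hash_table_insert (hash_table *ht, char *key)
{
    size_t i = hash (key, ht->nslots);
    if (list_search (ht->lists[i], key) == NULL)  // no duplicates
        list_insert (ht->lists[i], key);
}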
All of the hash table functions are complete except for hash_table_stats().
List.c
Other relevant code is in List.c, which provides a standard implementation of a linked list of items; you can assume it is correct.
Item.c
Similarly, the file Item.c contains an implementation for items; normally, this would be done simply as a set of macros, but we have used functions for some of the operations to create entries in the profile.
c) Completing hash_table_stats
Your first task is to complete the hash_table_stats() function in HashTable.c. This function should print the following information about the hash table:
- the number of slots in the hash table array
- the number of items stored in the (lists of the) hash table
- a table of chain-length information, containing:
  - the length of chains (use zero length for unused array slots)
  - the number (frequency) of chains with this length
The table should have a row for each chain length from 0 up to the maximum chain length. Note that you’ll need to work out the maximum chain length, and then allocate an array of counters of the appropriate size.
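One possible shape for the function is sketched below: one pass to find the maximum chain length, a calloc’d array of counters indexed by length, and a second pass to fill it in. It assumes list_length() from List.h and the hash-table fields used in the sketch above; treat it as a starting point, not the required solution.

#include <stdio.h>
#include <stdlib.h>

// Sketch of hash_table_stats(); field and helper names are assumed.
void hash_table_stats (hash_table *ht)
{
    size_t nitems = 0, maxlen = 0;
    for (size_t i = 0; i < ht->nslots; i++) {
        size_t len = list_length (ht->lists[i]);
        nitems += len;
        if (len > maxlen)
            maxlen = len;
    }
    printf ("Number of slots = %zu\n", ht->nslots);
    printf ("Number of items = %zu\n", nitems);

    // one counter per possible chain length, 0 .. maxlen inclusive
    size_t *freq = calloc (maxlen + 1, sizeof (size_t));
    for (size_t i = 0; i < ht->nslots; i++)
        freq[list_length (ht->lists[i])]++;

    printf ("Chain length distribution\n  Length  #Chains\n");
    for (size_t len = 0; len <= maxlen; len++)
        if (freq[len] > 0)   // lengths with no chains are skipped
            printf ("%8zu %8zu\n", len, freq[len]);
    free (freq);
}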
Output
When functioning correctly, the program should behave as follows:
./mkwords 2000 13 > wordsfile
./words2 wordsfile 1777
Reading words from wordsfile
Hash Table Stats:
Number of slots = 1777
Number of items = 1970
Chain length distribution
  Length  #Chains
       0      585
       1      657
       2      351
       3      139
       4       33
       5       11
       7        1
More output
Note that you could produce the same output, without needing an intermediate file, using the command:
./mkwords 2000 13 | ./words2 - 1777
The above commands insert 2000 words (1970 distinct words) into a hash table with 1777 slots. The output tells us that there are 585 unused slots in the hash table (chain length 0), and 657 slots with chains of length 1, etc.
Chain length
If there are no chains of a given length, then nothing is written for that length, e.g.
./mkwords 1000 7 | ./words2 - 101
Reading words from stdin
Hash Table Stats:
Number of slots = 101
Number of items = 991
Chain length distribution
  Length  #Chains
       2        1
       4        2
       5        2
       6        6
       7       16
       8       12
       9       13
      10       15
      11        8
      12        6
      13        5
      14        5
      15        3
      16        3
      17        1
      18        2
      19        1
This output is for a small hash table with just 101 slots. Since there is no entry for 0, all of the slots in the hash table have been used (i.e. no slots with zero items).
Since there are no entries for 1 and 3, there are no chains of length 1 or 3; this is not in itself interesting, and is just the way it works out for this data.
Overflow chains
Many of the slots have overflow chains of length 10; hits on any of these hash table entries will result in us examining up to 10 items. We can also see that the longest chain contains 19 items.
Ideally, we would like each chain to contain a small number of items; this means that the hash table needs to have a number of slots that is proportional to the number of items (e.g. if the hash function works perfectly and we have n items and n/3 slots, we’d expect each chain to be of length ~3).
Note: Our implementation does not allow duplicates to be inserted into the hashtable
Exercise 4: HashTable Analysis (1 mark)
Background: Execution Profiling
Execution profilers measure statistics like the number of times each function is called and the average time spent executing each function, in order to identify which functions are contributing most to the cost of executing the program. Raw timing data (e.g. using the time command) gives you an overview of how fast a program is, but an execution profile gives you detailed information about where it’s spending all of its time, and thus where you should focus any attempts at improving its performance.
The standard execution profiler on Linux is gprof. In order to profile a C program, the program needs to be compiled and linked with the -pg flag.
When you compiled the words program with make earlier, you may have noticed that -pg appeared in each compilation command.
Programs compiled with -pg include additional code that monitors the execution of the program and measures:
- overall execution time
- how many times each function is called
- which functions call which other functions
- how much time it takes to execute each function
Generating an execution profile using gprof
To generate an execution profile, run e.g. the following two commands:
./mkwords 100000 3 | ./words2 - 50033
....
gprof words2 | less
The mkwords program outputs 100000 words, and passes them to the words2 program, which inserts them into a hash table with 50033 slots.
Since mkwords may produce duplicates when generating large numbers of words, the actual number of distinct words added to the hash table may be less than the number of words requested (in this case, 90936 distinct items will be inserted).
Remember that for each word in the input, there will be one insert operation (to add it to the hash table) and one search operation (to check that it was added).
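In effect, the driver loop behind the call counts looks something like the sketch below (variable and function names are assumed, not taken from words2.c); the handful of searches beyond one-per-word come from probing for “words” that are not in the input.

// Sketch of the read loop the profile reflects (names assumed):
// each input word costs one insert plus one verifying search.
char word[BUFSIZ];
while (fscanf (in, "%s", word) == 1) {
    hash_table_insert (ht, word);
    if (hash_table_search (ht, word) == NULL)
        fprintf (stderr, "insert failed for '%s'\n", word);
}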
Since gprof produces quite a lot of output, it is useful to pipe it through the less command, which allows us to scroll through the output one screenful at a time.
Output from gprof
The output from gprof has two parts:
- flat profile: how much time was spent in each function, and how many times it was called
- graph profile: how many times each function was called from which other functions
For this lab, we consider only the flat profile, although you might want to check the graph profile to see if it gives you any information that might be useful to tune the program.
If you execute the words2 program as above, then the flat profile will look approximately like:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
100.00 0.04 0.04 200004 200.00 200.00 hash
0.00 0.04 0.00 278050 0.00 0.00 cmp
0.00 0.04 0.00 200004 0.00 0.00 list_search
0.00 0.04 0.00 100004 0.00 200.00 hash_table_search
0.00 0.04 0.00 100000 0.00 200.00 hash_table_insert
0.00 0.04 0.00 90936 0.00 0.00 list_insert
0.00 0.04 0.00 90936 0.00 0.00 item_drop
0.00 0.04 0.00 50033 0.00 0.00 list_length
0.00 0.04 0.00 50033 0.00 0.00 list_drop
0.00 0.04 0.00 50033 0.00 0.00 list_new
0.00 0.04 0.00 1 0.00 0.00 hash_table_print_stats
0.00 0.04 0.00 1 0.00 0.00 hash_table_drop
0.00 0.04 0.00 1 0.00 0.00 hash_table_new
Output variance
Warning: for the same input, the counts of function calls should be consistent across all machines, and you should be able to explain (from looking at the code) where the counts come from.
However, the timing (being sampling-based) may differ from machine to machine, and even differ between executions of the program on the same machine. This may result in significantly different times, different percentages, and even a different ordering of functions.
Given such variations, is this kind of profile even useful? Yes, because the most time-consuming functions will be consistently higher than all of the others, and thus give you a basis for tuning.
Understanding the output
Each line gives some statistics for one function in the program. The “% time” column indicates the percentage of execution time that was spent in the function. The (approximate) total running time of the program can be obtained by reading down the “cumulative seconds” column; the final value is the total time. The “self seconds” column gives the total time spent executing that function during the entire course of program execution; the values in this column are added to obtain the “cumulative seconds” column. The “calls” column indicates the total number of times that the function was called during program execution. The “self ns/call” column gives the average time spent in each call of the function, while the “total ns/call” column gives the time spent in this function plus any of the functions it calls.
In the above profile, you can see that the program takes a total of 0.04 seconds to run (the last value in the “cumulative seconds” column).
Function-wise, most of the time is spent in the hash() function. (However, as noted above, you may observe different functions and quite different percentages.)
Improving performance
You might be surprised to see that most of the functions appear to take no time at all to run. The most likely explanation is that the total time spent in such a function is below the profiler’s sampling granularity (each sample counts as 0.01 seconds), so it is rounded down to zero.
Such a small cost may mean that the function itself is not inherently inefficient; if it features prominently in the cumulative time, you need to consider why it’s being called so many times (which is where the graph profile helps).
In the above example, the hash() function is called many times, but can’t be called fewer times (why not?), so if we want to tune the performance of this program, we would need to improve the speed of the hash() function (but without sacrificing its good performance in spreading hash values across the index range).
If we make the hashing distribution worse while making the hash() function faster, we might simply end up moving the cost from the hash() function to some other function, such as list_search().
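To illustrate that trade-off (neither function below is claimed to be the hash() in HashTable.c): a bare sum of character codes is about as cheap as a string hash can get, but it sends every anagram to the same slot; Bernstein’s well-known djb2 hash spends a multiply per character and spreads keys far better, which keeps chains short.

// Two illustrative string hashes; neither is necessarily the lab's.
unsigned long hash_sum (char *s, unsigned long nslots)
{
    unsigned long h = 0;
    while (*s != '\0')
        h += (unsigned char) *s++;   // anagrams all collide
    return h % nslots;
}

unsigned long hash_djb2 (char *s, unsigned long nslots)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return h % nslots;
}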
Analysis
For testing the words2 program, you will want to use some reasonably large inputs. Try running your program as follows:
./mkwords 1000000 3 | ./words2 - 424247
Note that the above command inserts 857424 distinct words into a hash table with 424247 slots. Clearly, since there are more words than slots, chains of length greater than one will occur frequently in this table. You could try adding more slots to the table to see how this improves the average/maximum chain length. You could also try inserting more words, to do some serious stress testing. We suggest keeping the ratio of words to slots at less than 2:1, and ideally closer to 1:1 (which is aiming at one slot per word).
Remember that the whole point of minimising chain lengths is that the worst case lookup cost is proportional to the maximum chain length, and the average case cost is proportional to the average chain length.
You will quickly notice, while running with large inputs, that different slot numbers produce different average chain lengths, and those examples with shorter average chain lengths run much faster than those with longer average chain lengths.
If you want an alternative word set, there is a dictionary of 90,000 English words and a dictionary of 5 million place names (from the GeoNames database) which you could use on the CSE workstations.
Try the following commands:
./words2 /web/cs2521/19t0/week08/lab/dict 49999
./words2 /web/cs2521/19t0/week08/lab/dict 50000
./words2 /web/cs2521/19t0/week08/lab/places 1048576
./words2 /web/cs2521/19t0/week08/lab/places 1048573
Don’t copy these files to your own directory as they are quite large.
Discussion questions
Consider the questions below, using the above command and variations on it (see below).
Put your answers in a text file called lab08.txt.
Your answers file should (ultimately) contain:
- answers to the questions below
- output from words2 to illustrate your answers (where appropriate)
- flat profiles to illustrate your answers (where appropriate)
- analyses/explanations for all answers
The precise format of the answers file is not important. What is important is that you explain your answers, using relevant evidence from both the profile output and the output of the words2 program.
- The mkwords 1000000 3 command produces 857424 distinct words.
  - What is the maximum chain length if a hash table size of 85711 is used?
  - How does the chain length distribution change if the hash table size is 100000? 214283? 400000? 400837? 857144? 857137?
  - Every other number above (i.e. 85711, 214283, 400837, 857137) is prime. It is often noted that using prime numbers appropriately in the hash function leads to a better distribution of hash values, and thus generally shorter chains. Does this appear to be the case for the hash table sizes in the previous question?
  - An "optimal" hash table would have all slots occupied and have all chains of length roughly (nwords/nslots). In practice, this is impossible to achieve in the general case; what we want is a table with relatively short chains, with as few slots as possible (small size of hash table), and not too many empty slots. Can you find a suitable hash table size that keeps the maximum chain length under 10, and has most chains with length 1 or 2, but also uses more than 2/3 of the slots?
- Compare both the outputs and the profiles for the two commands:
  ./words2 /web/cs2521/19t0/week08/lab/places 1048576
  ./words2 /web/cs2521/19t0/week08/lab/places 1048573
  What does this tell you about hash table search performance when the hash function is significantly sub-optimal?
- Examine the profiles from running the command:
  ./mkwords 1000000 | ./words2 - N
  for a number of different values of N. What are the most costly functions (in terms of overall time)?
- How could the individual functions be improved? How could the overall performance be improved?
- Implement your suggestions and then give a new profile to show the improvement, and explain how the profile shows the improvement.
If you want some prime numbers to use for testing different table sizes, check out a handy list of primes at primes.utm.edu/lists.
Bonus: Search in a Splay Tree (1 mark)
In lectures, we discussed inserting an item into a splay tree (balanced tree examples). The amortised cost of insert and search operations on splay trees is only valid if we also perform splay rotations when searching for an item.
A partial implementation of search_splay has been provided in Tree.c to get you started. Its prototype is as follows:
static tree_node *search_splay (tree_node *n, Key k, bool *found);
It searches for an item and performs splay rotations on the traversed path, much like the function insert_splay() does.
NOTE: once an item has been found, it is moved to the root of the tree. If the key does not exist in the tree, the last node on the search path should be moved to the root.
You can look at insert_splay() to help you implement this function.
NOTE: because the search function changes the root of the tree, it needs to return a pointer to the new root, so the actual result of the search should be stored in the variable pointed to by found. This should be true if the key was in the tree and false otherwise.
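To get a feel for the shape of the function, here is a deliberately simplified “move-to-root” sketch that uses only the single rotations from Exercise 1; a true splay search should mirror the zig-zig/zig-zag double-rotation cases used by insert_splay(), so treat this as scaffolding rather than a finished answer. The helper names (cmp, rotate_left, rotate_right) are assumed to match Tree.c.

// Simplified move-to-root search: single rotations only.
static tree_node *search_splay (tree_node *n, Key k, bool *found)
{
    if (n == NULL) {
        *found = false;
        return NULL;
    }
    int c = cmp (k, n->key);
    if (c == 0) {
        *found = true;
        return n;                        // target is now subtree root
    } else if (c < 0) {
        if (n->left == NULL) {
            *found = false;              // absent: splay last node seen
            return n;
        }
        n->left = search_splay (n->left, k, found);
        return rotate_right (n);         // lift target one level
    } else {
        if (n->right == NULL) {
            *found = false;
            return n;
        }
        n->right = search_splay (n->right, k, found);
        return rotate_left (n);
    }
}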
The tree_search function needs to be modified so that it calls the appropriate version of search, depending on which balancing strategy is used: if the balance strategy is SPLAY, it uses search_splay; otherwise it uses the standard search_recursive.
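The dispatch itself can be as small as the sketch below; the struct fields and the search_recursive name are assumptions, so match them to your Tree.c.

// Sketch: only the SPLAY strategy restructures the tree, so only
// that branch needs to update tree->root.
bool tree_search (Tree tree, Key k)
{
    bool found = false;
    if (tree->strategy == SPLAY)
        tree->root = search_splay (tree->root, k, &found);
    else
        found = search_recursive (tree->root, k);
    return found;
}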
Devise and run some experiments to compare your splay tree to the previous splay tree that only performed splay operations on insertion. Write it up in your lab08.txt file.
If you have time, try to factor the common splay logic out of insert_splay and search_splay.
Submitting
You must get the lab marked by your tutor in your lab.
Submit your work with the give command, like so:
give cs2521 lab08 Tree.c lab08.txt HashTable.c
Or, if you are working from home, upload the relevant file(s) to the lab08 activity on WebCMS3 or on Give Online.