Week 03 Lab Exercise

Sort Detective

Objectives

  • To familiarise yourself with practical aspects of computational complexity
  • To practice a systematic approach to problem solving
  • To apply sound scientific reasoning in reaching conclusions
  • To hone your analysis skills
  • To identify algorithms from their behaviour

Admin

Marks
5 (see the Assessment section for more details)
Demo
in the Week 3, 4 or 5 lab session
Submit
see the Submission section
Deadline to submit to give
5pm Monday of Week 4
Late penalty
0.2% per hour (or part thereof) deducted from the attained mark; submissions more than 5 days late will not be accepted

Background

A very long time ago, in a burst of enthusiasm, Richard Buckland wrote a collection of sort programs. Sadly, he forgot to give the programs meaningful names, so when he later passed them on to COMP2521 lecturers to use for this lab, he couldn't tell us which program used which algorithm. All that he could remember is that the sorting algorithms he used came from the following list:

  • Bubble sort
    • bubble sort with early exit
  • Insertion sort
    • standard insertion sort
  • Selection sort
    • standard selection sort
  • Merge sort
    • standard merge sort
  • Naive quicksort
    • uses the leftmost element as the pivot
  • Median-of-three quicksort
    • uses the median of the first, middle and last elements as the pivot
  • Randomised quicksort
    • uses a randomly chosen element as the pivot
  • Bogosort
    • repeatedly generates permutations of its input until one is found that is sorted

Despite not knowing which program uses which algorithm, Richard did remember a few things about the programs:

  • All of the programs read their input from stdin and write the sorted version of the data to stdout
  • Sorting happens line-by-line, like the Unix sort program
  • There is a limit on the size of the input the programs can process (100,000,000 lines), because they read their input into a fixed-size array and sort it there, before writing out the sorted result
  • The programs all expect each line to start with a number, which acts as the sorting key. The sorting is numeric, which means that the programs all behave something like the Unix sort program run with the -n option. If no number is present at the start of a line, it will be treated as a zero value.
./sortX < data > sorted_data
# behaves like
sort -n < data > sorted_data
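
You can quickly confirm the zero-key behaviour described in the last point with a test along these lines (the sample lines are made up; if the rule holds, the line with no leading number should come out first):

printf '2 b\nhello\n1 a\n' | ./sortA
# "hello" has no leading number, so it should be treated as key 0
# and appear before "1 a" and "2 b"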

Your task is to help us identify the specific sorting algorithms used in two of these programs. You will not be able to view the source code. Instead, you will have to try to identify the algorithms using only the programs' observable behaviour when sorting different data. Note that since the programs are only available in binary format, they will most likely only run on the CSE machines, so you'll have to do your work there.

In the setup phase, we will give you access to two different sort programs; each lab pair or individual gets a different (randomly chosen) pair of programs. The first phase of the task is to design and write up the experimental framework that you plan to use to solve the problem. In the second phase, you should gather data on the execution behaviour of the two programs according to your experimental setup. You then add the data to your report and analyse it to reach a conclusion on which algorithm each of the supplied programs contains.

To make your task a little simpler, we've supplied a program to generate data in the correct format for the sorting programs to use.

/web/cs2521/24T1/labs/week03/utils/gen 5 R
1 nwl
5 arz
4 hcd
2 rbb
3 mqb
/web/cs2521/24T1/labs/week03/utils/gen
Not enough arguments
Usage: /web/cs2521/24T1/labs/week03/utils/gen  N  A|D|R  [seed]
       N = number of lines
       A|D|R = Ascending|Descending|Random

Use the gen program however you like, or not at all. Note that the gen program always generates a unique set of keys (from 1..N). This won't enable you to test stability, so you'll need to come up with some data of your own, as shown below. The seed argument lets you regenerate the same sequence of "random" numbers; if you want to test both sort programs on the same random sequence, use the same seed (the seed must be an integer).
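
For example, a minimal stability check might use hand-made data with duplicate keys, where the tag after each key exists only to reveal whether equal keys keep their input order (the file name dups and the tags are arbitrary):

cat > dups << EOF
3 first
1 a
3 second
2 b
3 third
EOF
./sortA < dups
# if the three "3" lines come out in the order first, second, third,
# input order was preserved; one run can rule stability out, but
# repeated runs on varied data are needed before concluding it holds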

Note that the setup script (below) will put a copy of the gen executable into your lab directory, so you can run it as ./gen rather than having to type the long file name.

Setting Up

Create a directory for this lab, change into it, and run the following command:

/web/cs2521/24T1/labs/week03/setup/setup

This must be run on the CSE machines, either via VLAB or SSH. This command will set up two symbolic links in your directory called sortA and sortB which reference executable programs under the class account. The setup command also gives you a copy of the gen program and timeit shell script.

Working in a pair?

If you are working in a pair and you are using your partner's sort programs, run the setup command above, and then delete your own sort programs with the rm command:

rm sortA sortB

Then get your partner to run the following command in their lab directory:

ls -l sortA sortB
lrwxrwxrwx 1 ... sortA -> long-path-ending-with-sort...A
lrwxrwxrwx 1 ... sortB -> long-path-ending-with-sort...B

Finally, copy the long paths that you obtained to create symlinks (shortcuts) to the sort programs:

ln -s long-path-ending-with-sort...A sortA
ln -s long-path-ending-with-sort...B sortB

This will give you the same symbolic links as your partner, so you can both investigate the same sort programs. Make sure you remember whose sort programs you are using.

You can check that the sortA and sortB programs actually do sorting by running something like the following:

./gen 5 R
... unsorted output ...
./gen 5 R | ./sortA
... sorted output ...
./gen 5 R | ./sortB
... sorted output ...

Phase 1: Designing your Experiment

Plan the set of tests you will use to deduce which sorting algorithm each sort program uses. Doing this first ensures that, by the time you begin investigating and experimenting with the actual sort programs, you have thought thoroughly about what kind of behaviour to look for and what further experimentation might be necessary when analysing your findings.

We expect this will involve devising numerous sequences of test data, and stating what differences you expect to observe between different types of sorting algorithms, and why. Typical properties to look for are execution time and output stability.
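
For instance, one possible skeleton for generating a grid of test inputs is sketched below; the sizes are placeholders, and you should choose your own based on your design:

# generate ascending, descending and random inputs at several sizes
for size in 1000 2000 5000 10000; do
    for order in A D R; do
        ./gen $size $order > data_${order}_${size}
    done
done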

Of course, when designing tests you cannot anticipate all possible results which might occur during your experiment. This is the nature of scientific experimentation. But by formalising what you expect to occur and how you will respond, you can better account for unexpected behaviour and sensibly revise your design or create new tests once the experiment is under way.

Your experimental design should detail the tests you have devised and explain, with clear reasons, how you will be able to distinguish each algorithm you might be given. You do not need to include all the input data you intend to use, only a description or small sample of it (you may put this in the appendix if you wish).

Write up the experimental design as Part 1 of your report. You can produce the report using whatever tools you want (e.g., OpenOffice, Google Docs, raw HTML, etc.), but it must eventually be submitted as a PDF. Most document processing systems and Web browsers can produce PDF.

There is no size requirement for the report; it is the quality of the report which matters. If you want to include detailed reporting of timing results, then put these in an appendix. Your report should be clear, scientific/systematic in approach, and all reasoning and assumptions should be explicit. Make sure you ask your tutor if you are unclear about what is expected.

To help you get started, a template for the report is available. Note that a fault in many past reports is that they simply state observations without attempting to analyse them or explain why the results occurred. For this lab, try to go beyond merely stating observations: explain them.

Phase 2: Run Experiment and Analyse Results

The setup command has given you two sort programs to identify. As noted earlier, each sort program reads from standard input and writes to standard output, and assumes that each input line contains a numeric key (first field) and an arbitrary string following the key. The output should be in ascending order, based on the numeric ordering of the keys. All programs should produce the same output for a given input when the keys are unique.

The following examples show some useful ways of running the sort programs, and auxiliary commands to help collect useful data on their behaviour:

# generate some data, put in a file called "mydata"
./gen 100 R > mydata
# sort the data using sortA, put the result in "sortedA"
./sortA < mydata > sortedA
# sort the data using sortB, put the result in "sortedB"
./sortB < mydata > sortedB
# sort the data using Unix sort
sort -n < mydata > sorted
# check that the sortA and sortB programs actually sorted
diff sorted sortedA # should show no diffs
diff sorted sortedB # should show no diffs
# run a large sort and throw away the result
./gen 100000 R | ./sortA > /dev/null
# repeat the above, but get timing data on sortA
./gen 100000 R | time ./sortA > /dev/null
# repeat the timing, but with better output format
./gen 100000 R | /usr/bin/time --format="%U seconds" ./sortA > /dev/null

You should now carry out the experiment you designed in Phase 1. Collect and record all of the data, and then summarise it in your report. You can use whatever tools you like to produce useful summaries (e.g. plot graphs of time vs data size). Then analyse the data, draw conclusions, and explain them.
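
For example, if you have gnuplot available, a two-column file of sizes and average times can be turned into a graph along these lines (the file name times.dat and its layout are assumptions, not something we supply):

# column 1 = input size, column 2 = average user time in seconds
gnuplot -e 'set terminal png; set output "times.png"; plot "times.dat" using 1:2 with linespoints'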

To help with the experiments, we have provided a shell script called timeit to collect timing data. As supplied, this script times the built-in sort program, which is not helpful for you, so you'll need to modify it to use one of your sortA or sortB programs. After you have modified it, you can run it like so:

sh timeit

Note that some tests will take a long time to run with large data. You can remove the large data sizes from the outer for loop if you can't wait, but you should probably add more small sizes so you have enough data points to determine execution-cost trends.

On the other hand, some tests will take a very short time to run, and you can add larger data sizes to the outer for loop. Remember that the sort programs can sort up to 100,000,000 items.
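
If you would rather drive the timing yourself instead of modifying timeit, a minimal loop along the following lines might do; the sizes here are placeholders:

for size in 10000 20000 40000 80000; do
    ./gen $size R > tmpdata
    /usr/bin/time --format="$size lines: %U seconds" ./sortA < tmpdata > /dev/null
done
rm -f tmpdata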

Tips for measuring: the Unix time command works by sampling and will likely produce different results for the same program run multiple times (the timeit script will do this for you). Take an average over a number of timings to account for this. Also, beware of claiming too much accuracy. You can't really claim more than one or two significant digits on an average from the time command.
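
For example, a crude way to average the user time over several runs (mydata is a file you have already generated; five repetitions is arbitrary):

for run in 1 2 3 4 5; do
    /usr/bin/time --format="%U" ./sortA < mydata > /dev/null
done 2>&1 | awk '{ total += $1; n++ } END { printf "%.2f seconds (mean of %d runs)\n", total/n, n }'
# /usr/bin/time writes to stderr, hence the 2>&1 before the pipe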

The precise format of your report is up to you, but it must include:

  • a summary of the results for each program
  • an argument, based on the observed behaviour, for what sorting algorithm you think each program is using

Submission

Your submission for this lab is a report containing an experimental design and results/analysis from carrying out the experiment. It should be submitted as a PDF file called report.pdf. You can submit via the command line using the give command:

give cs2521 lab03 report.pdf

You can also submit via give's web interface, where you can also check the files you have submitted. You can submit multiple times; only your last submission will be marked.

Assessment

There is no automarking for this lab. To receive a mark, you must show your report to your tutor during your Week 3, 4 or 5 lab session. You will be marked based on the following criteria:

Methodology (2 marks)
Your report should contain an easy-to-understand description and justification of your methodology. What experiments and tests did you run, how did you run them, and why?
Results and Analysis (2 marks)
Your report should include the results you obtained from your experiments, including a table of timing results (and possibly some graphs) from your timing experiments, sample input and output from your stability tests, and results from any other experiments you ran. You should also include an analysis of these results: what did the results tell you about what the sorting algorithms could/couldn't be?
Accuracy (1 mark)
This mark is for whether you were able to accurately identify the sorting algorithms used by your sort programs. Your conclusions must logically follow from your results: if they are inconsistent with your results, or if multiple sorting algorithms were possible based on your results and you did not properly narrow it down to one, you may be penalised. If you worked with someone else, make sure to let your tutor know which of you ran the setup command at the start of the lab (and hence whose sort programs you used), since each student is given a different pair of sort programs.