Week 09 Laboratory Exercises

Objectives

  • learning how to access file metadata via stat
  • learning how to use file metadata
  • practicing file operations generally
  • practicing working with the UTF-8 encoding

Preparation

Before the lab you should re-read the relevant lecture slides and their accompanying examples.

Getting Started

Set up for the lab by creating a new directory called lab09 and changing to this directory.
mkdir lab09
cd lab09

There are some provided files for this lab which you can fetch with this command:

1092 fetch lab09

If you're not working at CSE, you can download the provided files as a zip file or a tar file.

Exercise — individual:
Print Select Bytes from a File

Write a C program, print_select_bytes.c, which is given a filename and one or more positions as command line arguments. It should print the byte that is located at each given position within the file. It should print each byte in decimal, two-digit hex, and (if possible) the character itself.

Assume ASCII printable characters are those for which the ctype.h function isprint(3) returns a non-zero value.

Follow the output format below.

Do not read the entire file - use fseek(3) to move around the file.

dcc print_select_bytes.c -o print_select_bytes
./print_select_bytes lorem-ipsum.txt 0 1 2 3 4
76 - 0x4C - 'L'
111 - 0x6F - 'o'
114 - 0x72 - 'r'
101 - 0x65 - 'e'
109 - 0x6D - 'm'
./print_select_bytes lorem-ipsum.txt 542 361 840 97 546 1612 2508 96 85 1382
116 - 0x74 - 't'
101 - 0x65 - 'e'
117 - 0x75 - 'u'
101 - 0x65 - 'e'
108 - 0x6C - 'l'
32 - 0x20 - ' '
111 - 0x6F - 'o'
116 - 0x74 - 't'
114 - 0x72 - 'r'
116 - 0x74 - 't'
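
The seek-then-read step described above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper names byte_at and format_byte are hypothetical, and what to print for a non-printable byte is an assumption (the sample output only shows printable bytes), so check it against autotest.

```c
#include <ctype.h>
#include <stdio.h>

// Hypothetical helper: return the byte at offset `position` in `stream`,
// or EOF if the seek fails or the position is past the end of the file.
int byte_at(FILE *stream, long position) {
    if (fseek(stream, position, SEEK_SET) != 0) {
        return EOF;
    }
    return fgetc(stream);
}

// Hypothetical helper: format one byte (0..255) in the style of the sample
// output. The quoted character is appended only when isprint says the byte
// is printable; omitting it otherwise is an assumption.
// Returns the value returned by snprintf.
int format_byte(int byte, char *buffer, size_t size) {
    if (isprint(byte)) {
        return snprintf(buffer, size, "%d - 0x%02X - '%c'", byte, byte, byte);
    }
    return snprintf(buffer, size, "%d - 0x%02X", byte, byte);
}
```

A caller would open the file with fopen, convert each position argument to a long, call byte_at, and check for EOF before formatting the result.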

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest print_select_bytes 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_print_select_bytes print_select_bytes.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Exercise — individual:
Pack a file containing a bitstring

You have been given matchbox.c, with a stub C function.

Your task is to add code to this function in matchbox.c:

struct packed_matchbox pack_matchbox(char *filename) {
    // TODO: complete this function!
    // You may find the definitions in matchbox.h useful.

    struct packed_matchbox matchbox = {
        .sequence_length = 0,
        .packed_bytes = NULL
    };
    
    return matchbox;
}

Add code to the function pack_matchbox so that, given a filename specifying the path to a file in the matchbox file format, it returns a struct packed_matchbox consisting of the following fields:

  • uint16_t sequence_length, containing the length of the original character sequence.
  • uint8_t *packed_bytes, a pointer to an array of bytes, containing the packed representation of the matchbox.

A matchbox file consists of a two-byte value defining the length of the sequence that follows, followed by some sequence of '0's and '1's.

The format is as follows:

name             length           type                                     description
sequence length  2 B              unsigned, 16-bit, little-endian integer  A two-byte integer indicating the length of the character sequence it precedes.
value            sequence-length  character sequence                       A variable-length sequence of characters, consisting of the character '0' or the character '1'.
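
Reading the two-byte little-endian length field can be sketched as below. The helper name read_u16_le is hypothetical and is not part of matchbox.h. The sketch reads the bytes one at a time so that the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: read a 16-bit little-endian unsigned integer from
// `stream`. Returns 0 on success and -1 if fewer than two bytes remain.
int read_u16_le(FILE *stream, uint16_t *value) {
    int low = fgetc(stream);   // least significant byte is stored first
    int high = fgetc(stream);  // most significant byte is stored second
    if (low == EOF || high == EOF) {
        return -1;
    }
    *value = (uint16_t)(low | (high << 8));
    return 0;
}
```

For the bytes 08 00 this yields 8, which matches the "Sequence length: 8" line in the exactly_one_byte example.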

Your program should read the file, and return a struct packed_matchbox containing the packed representation of the matchbox. The packed representation should be stored in the packed_bytes field of the struct. For each '0' in the file, the corresponding bit in the packed representation should be set to 0. For each '1' in the file, the corresponding bit in the packed representation should be set to 1.

You can store up to 8 bits in a single uint8_t value. For example, the packed representation of the sequence "00010010" is 0b00010010, which is 0x12 in hexadecimal.

For every eight bytes in the character sequence, you should allocate one uint8_t in the packed representation. Should the number of bytes in the character sequence not be a multiple of eight, you should use the last uint8_t to store the remaining bits, and pad any unused bits with 0s.

For example, if the only bytes in the character sequence were "111", then the packed representation would be 0b11100000, which is 0xe0 in hexadecimal.
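
The packing rule above, with the first character of each group of eight going into the most significant bit, can be sketched on an in-memory string as follows. The helper name pack_bits is hypothetical, and the byte count (length + 7) / 8 is an assumption about rounding up; in the lab itself, use the provided num_packed_bytes.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: pack `length` characters ('0' or '1') from `bits`
// into `packed`, most significant bit first. `packed` must have room for
// (length + 7) / 8 bytes; unused trailing bits are left as 0.
void pack_bits(const char *bits, size_t length, uint8_t *packed) {
    size_t n_bytes = (length + 7) / 8;
    for (size_t i = 0; i < n_bytes; i++) {
        packed[i] = 0;
    }
    for (size_t i = 0; i < length; i++) {
        if (bits[i] == '1') {
            // Character i lands in byte i / 8, at bit 7 - (i % 8).
            packed[i / 8] |= (uint8_t)(1u << (7 - i % 8));
        }
    }
}
```

With this sketch, "00010010" packs to 0x12 and "111" packs to 0xe0, agreeing with the two worked examples above.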

An additional function, num_packed_bytes, has been provided for you. It takes the length of the character sequence, and returns the number of bytes required to store the packed representation.

For example:
unzip matchbox_examples.zip
[... a bunch of output ...]
make
[... a bunch of output ...]
xxd examples/exactly_one_byte.matchbox    # Note that 0b10100001 is 0xa1
00000000: 0800 3130 3130 3030 3031                 ..10100001
./matchbox examples/exactly_one_byte.matchbox
Sequence length: 8
a1
xxd examples/padded_byte.matchbox    # Note that this value would produce 0b10100000 which is 0xa0
00000000: 0300 3130 31                             ..101
./matchbox examples/padded_byte.matchbox
Sequence length: 3
a0

Your program should use malloc(3) to allocate the appropriate amount of memory for the packed representation, stored as an array.

You may assume that the file is well-formed, and that the sequence length is less than or equal to 65535. We won't test your program's behaviour on malformed files, or supply paths to non-existent files/files with incorrect permissions, but you should still perform basic error checking.

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest matchbox 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_matchbox matchbox.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Exercise — individual:
Find the First Invalid UTF-8 Byte

You have been given invalid_utf8_byte.c, which contains a C function invalid_utf8_byte, that takes a string and returns 42.

Add code to the function invalid_utf8_byte so that, given a string, it returns the index of the first unexpected byte in the string, or -1 if the string contains no invalid UTF-8 sequences.

For example:

unzip invalid_utf8_byte_examples.zip
(several lines of output)
xxd invalid_utf8_byte_examples/hello_world.txt
00000000: 6865 6c6c 6f20 776f 726c 64              hello world
./invalid_utf8_byte < invalid_utf8_byte_examples/hello_world.txt
No invalid bytes found.
xxd invalid_utf8_byte_examples/bad_hello_world.txt
00000000: 6865 6c6c 6fa1 776f 726c 64              hello.world
./invalid_utf8_byte < invalid_utf8_byte_examples/bad_hello_world.txt
Invalid byte found at index 5.
xxd invalid_utf8_byte_examples/too_few_continuation_bytes.txt
00000000: f09f 98                                  ...
./invalid_utf8_byte < invalid_utf8_byte_examples/too_few_continuation_bytes.txt
Invalid byte found at index 3.
xxd invalid_utf8_byte_examples/too_many_continuation_bytes.txt
00000000: f09f 98b3 98                             .....
./invalid_utf8_byte < invalid_utf8_byte_examples/too_many_continuation_bytes.txt
Invalid byte found at index 4.
xxd invalid_utf8_byte_examples/valid_with_emoji.txt
00000000: 796f 7520 6172 6520 646f 696e 6720 6772  you are doing gr
00000010: 6561 7420 f09f 918d f09f 918d f09f 918d  eat ............
./invalid_utf8_byte < invalid_utf8_byte_examples/valid_with_emoji.txt
No invalid bytes found.
xxd invalid_utf8_byte_examples/invalid_with_emoji.txt
00000000: 7468 6973 f0a1 3923 7374 7269 6e67 a269  this..9#string.i
00000010: 73e7 8936 6e6f 74a0 646f 696e 67c2 c273  s..6not.doing..s
00000020: 6fa0 6772 6561 74f1 3242 a068 6f77 6576  o.great.2B.howev
00000030: 6572 20f0 9f98 ad91                      er .....
./invalid_utf8_byte < invalid_utf8_byte_examples/invalid_with_emoji.txt
Invalid byte found at index 6.
Further example inputs are provided in the invalid_utf8_byte_examples.zip archive.
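
One way to approach the check is to classify each leading byte by its high bits and then confirm that the expected number of continuation bytes follows. The two helpers below are a hypothetical sketch of that classification only. They are not the full function, and they do not reject overlong encodings or other finer points of UTF-8 validity, which this exercise may or may not test.

```c
#include <stdint.h>

// Hypothetical helper: given the first byte of a UTF-8 sequence, return the
// total number of bytes the sequence should occupy (1 to 4), or 0 if the
// byte cannot start a sequence.
int utf8_sequence_length(uint8_t byte) {
    if ((byte & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte (ASCII)
    if ((byte & 0xE0) == 0xC0) return 2;  // 110xxxxx: one continuation byte
    if ((byte & 0xF0) == 0xE0) return 3;  // 1110xxxx: two continuation bytes
    if ((byte & 0xF8) == 0xF0) return 4;  // 11110xxx: three continuation bytes
    return 0;                             // 10xxxxxx or 11111xxx
}

// Hypothetical helper: continuation bytes have the form 10xxxxxx.
int is_continuation_byte(uint8_t byte) {
    return (byte & 0xC0) == 0x80;
}
```

In bad_hello_world.txt, the byte 0xa1 at index 5 is a continuation byte with no leading byte before it, so utf8_sequence_length returns 0 there. In too_few_continuation_bytes.txt, 0xf0 promises three continuation bytes but only two are present, so the terminating '\0' at index 3 is the unexpected byte.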

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest invalid_utf8_byte 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_invalid_utf8_byte invalid_utf8_byte.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Exercise — individual:
Print File Sizes

We are worried about disk usage and would like to know how much space is used by a set of files.

Write a C program, file_sizes.c, which is given one or more filenames as command line arguments. It should print one line for each filename which gives the size in bytes of the file. It should also print a line giving the combined number of bytes in the files.

Follow the output format below.

Do not read the file - obtain the file size from the function stat.

dcc file_sizes.c -o file_sizes
./file_sizes bubblesort.c print_bigger.c swap_numbers.c unordered.c
bubblesort.c: 667 bytes
print_bigger.c: 461 bytes
swap_numbers.c: 565 bytes
unordered.c: 486 bytes
Total: 2179 bytes
./file_sizes bubblesort.s print_bigger.s swap_numbers.s unordered.s numbers1.txt numbers2.txt sorted.txt
bubblesort.s: 1142 bytes
print_bigger.s: 1140 bytes
swap_numbers.s: 1173 bytes
unordered.s: 791 bytes
numbers1.txt: 59 bytes
numbers2.txt: 55 bytes
sorted.txt: 21 bytes
Total: 4381 bytes
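
Obtaining a size from stat without opening the file can be sketched as below. The helper name file_size is hypothetical, and returning -1 on failure is an illustrative choice rather than a stated requirement of the exercise.

```c
#include <stdio.h>
#include <sys/stat.h>

// Hypothetical helper: return the size of `pathname` in bytes as reported
// by stat, or -1 if stat fails (for example, if the file does not exist).
long long file_size(const char *pathname) {
    struct stat s;
    if (stat(pathname, &s) != 0) {
        perror(pathname);
        return -1;
    }
    // st_size has type off_t; widen it so it can be printed with %lld.
    return (long long)s.st_size;
}
```

A main function would call this once per command line argument, print each result, and keep a running total for the final line.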

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest file_sizes 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_file_sizes file_sizes.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Exercise — individual:
Print File Modes

We would like to print the access permissions for a set of files.

Write a C program, file_modes.c, which is given one or more pathnames as command line arguments. It should print one line for each pathname which gives the permissions of the file or directory.

Follow the output format below.

dcc file_modes.c -o file_modes
ls -ld file_modes.c file_modes diary.c diary
-rwxr-xr-x 1 z5555555 z5555555 116744 Nov  2 13:00 diary
-rw-r--r-- 1 z5555555 z5555555    604 Nov  2 12:58 diary.c
-rwxr-xr-x 1 z5555555 z5555555 222672 Nov  2 13:00 file_modes
-rw-r--r-- 1 z5555555 z5555555   2934 Nov  2 12:59 file_modes.c
./file_modes file_modes file_modes.c diary diary.c
-rwxr-xr-x file_modes
-rw-r--r-- file_modes.c
-rwxr-xr-x diary
-rw-r--r-- diary.c
chmod 700 file_modes
chmod 640 diary.c
chmod 600 file_modes.c
ls -ld file_modes.c file_modes diary.c diary
-rwxr-xr-x 1 z5555555 z5555555 116744 Nov  2 13:00 diary
-rw-r----- 1 z5555555 z5555555    604 Nov  2 12:58 diary.c
-rwx------ 1 z5555555 z5555555 222672 Nov  2 13:00 file_modes
-rw------- 1 z5555555 z5555555   2934 Nov  2 12:59 file_modes.c
./file_modes file_modes file_modes.c diary diary.c
-rwx------ file_modes
-rw------- file_modes.c
-rwxr-xr-x diary
-rw-r----- diary.c
The first character on each line should be '-' for ordinary files and 'd' for directories. For example:
./file_modes /tmp /web/cs1521/index.html
drwxrwxrwx /tmp
-rw-r--r-- /web/cs1521/index.html

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest file_modes 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_file_modes file_modes.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Challenge Exercise — individual:
ls -ld

We need a clean-room implementation of the standard Unix program ls.

Write a C program, lsld.c, which is given zero or more pathnames as command line arguments and produces exactly the same output as ls -ld given the same pathnames as arguments.

The one exception is that ls -ld sorts its output lines. You do not have to match this.

Follow the output format below.

dcc lsld.c -o lsld
ls -ld lsld.c lsld file_sizes.c file_sizes /home/cs1521/public_html
drwxr-xr-x 6 cs1511   cs1511      128 Sep 16 08:02 /home/cs1521/public_html
-rwxr-xr-x 1 z5555555 z5555555 116744 Nov  2 13:00 file_sizes
-rw-r--r-- 1 z5555555 z5555555    604 Nov  2 12:58 file_sizes.c
-rwxr-xr-x 1 z5555555 z5555555 222672 Nov  2 13:00 lsld
-rw-r--r-- 1 z5555555 z5555555   2934 Nov  2 12:59 lsld.c
./lsld lsld.c lsld file_sizes.c file_sizes /home/cs1521/public_html
-rw-r--r-- 1 z5555555 z5555555   2934 Nov  2 12:59 lsld.c
-rwxr-xr-x 1 z5555555 z5555555 222672 Nov  2 13:00 lsld
-rw-r--r-- 1 z5555555 z5555555    604 Nov  2 12:58 file_sizes.c
-rwxr-xr-x 1 z5555555 z5555555 116744 Nov  2 13:00 file_sizes
drwxr-xr-x 6 cs1511   cs1511      128 Sep 16 08:02 /home/cs1521/public_html
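
One of the less obvious columns is the modification time. A minimal sketch of producing the "Nov  2 13:00" style with strftime follows. The helper name format_ls_time is hypothetical, and the sketch covers only the month, day and time layout seen in the sample output. Real ls prints a year in place of the time for files that are not recent, and this sketch does not attempt that.

```c
#include <time.h>

// Hypothetical helper: format a broken-down time in the "Nov  2 13:00"
// style. Returns the number of characters written (excluding the '\0'),
// or 0 if `size` is too small.
size_t format_ls_time(const struct tm *when, char *out, size_t size) {
    // %b = abbreviated month name, %e = day of month padded with a space,
    // %H:%M = hours and minutes on a 24-hour clock.
    return strftime(out, size, "%b %e %H:%M", when);
}
```

In lsld.c the struct tm would typically come from localtime applied to the st_mtime field of a struct stat. The owner and group names can be looked up from st_uid and st_gid with getpwuid(3) and getgrgid(3).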

When you think your program is working, you can use autotest to run some simple automated tests:

1092 autotest lsld 

When you are finished working on this exercise, you must submit your work by running give:

give dp1092 lab09_lsld lsld.c

You must run give before Tuesday 29 October 09:00 (2024-10-29 09:00:00) to obtain the marks for this lab exercise. Note that this is an individual exercise: the work you submit with give must be entirely your own.

Submission

When you have finished each exercise, make sure you submit your work by running give.

You can run give multiple times. Only your last submission will be marked.

Don't submit any exercises you haven't attempted.

If you are working at home, you may find it more convenient to upload your work via give's web interface.

Remember you have until Week 10 Tuesday 09:00:00 to submit your work.

You cannot obtain marks by e-mailing your code to tutors or lecturers.

You can check the files you have submitted here.

Automarking will be run by the lecturer several days after the submission deadline, using test cases different to those autotest runs for you. (Hint: do your own testing as well as running autotest.)

After automarking is run by the lecturer you can view your results here. The resulting mark will also be available via give's web interface.

Lab Marks

When all components of a lab are automarked you should be able to view the marks via give's web interface or by running this command on a CSE machine:

1092 classrun -sturec