COMP9315 24T1

Exercises 03
Pages and Tuples

We assume that all of the relevant .h files have been included.

#define PAGE_SIZE 4096

char *getTuple(int inFile, int pageNumber, int recNumber)
{
	// position file at start of page

	off_t pageAddr = (off_t)pageNumber * PAGE_SIZE; // widen before multiplying to avoid int overflow
	if (lseek(inFile, pageAddr, SEEK_SET) < 0)
		return NULL;

	// re-position the file to the start of the tuple directory entry

	off_t dirOffset = recNumber * 3; // 3 bytes per directory entry
	if (lseek(inFile, dirOffset, SEEK_CUR) < 0)
		return NULL;

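	// read 3-byte directory entry for this tuple
	// (zero-initialised so the byte that read() does not fill is defined;
	// reading into the low-order bytes assumes a little-endian host)

	unsigned int dirEntry = 0;
	if (read(inFile, &dirEntry, 3) != 3)
		return NULL;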

	// extract tuple offset and length from directory entry

	unsigned int tupOffset, tupLength;
	unsigned int lengthMask = 0x00000fff; // low-order 12 bits
	unsigned int offsetMask = 0x00fff000; // high-order 12 bits

	tupOffset = (dirEntry & offsetMask) >> 12;
	tupLength = dirEntry & lengthMask;

	// allocate memory buffer to hold tuple data

	char *tupBuf;
	if ((tupBuf = malloc(tupLength)) == NULL)
		return NULL;

	// position file at tuple location

	off_t tupAddr = pageAddr + tupOffset;
	if (lseek(inFile, tupAddr, SEEK_SET) < 0) {
		free(tupBuf); // avoid leaking the buffer on failure
		return NULL;
	}

	// read tuple data into buffer

	if (read(inFile, tupBuf, tupLength) != (ssize_t)tupLength) {
		free(tupBuf);
		return NULL;
	}

	return tupBuf;
}
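
The bit extraction in getTuple() relies on the host's byte order, because it reads 3 raw bytes into an unsigned int. As a minimal sketch, the entry could instead be assembled byte by byte. The helper name decodeDirEntry is hypothetical, and the least-significant-byte-first layout is an assumption; the exercise does not specify how the entry is stored on disk.

```c
#include <assert.h>

// Decode a 3-byte directory entry stored least-significant byte first.
// The byte order is an assumption, not something the exercise defines.
void decodeDirEntry(const unsigned char b[3], unsigned int *offset, unsigned int *length)
{
	assert(b != 0 && offset != 0 && length != 0);
	unsigned int entry = (unsigned int)b[0]
	                   | ((unsigned int)b[1] << 8)
	                   | ((unsigned int)b[2] << 16);
	*length = entry & 0x000fffu;         // low-order 12 bits
	*offset = (entry & 0xfff000u) >> 12; // high-order 12 bits
}
```

With this helper, getTuple() would read the 3 bytes into an unsigned char array and pass that array in, which gives the same result on any host.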

The minimum number of tuples is zero (trick question).
Maximum occurs when all tuples are minimum size (or close to it). Each page uses 300 bytes for the tuple directory, leaving 4096-300=3796 bytes of space for tuples. In theory, this amount of space could hold floor(3796/32) = 118 tuples; however, the page directory only has space for 100 tuples, so 100 tuples is the maximum number of tuples per page. Since we have 100 pages, the file can hold 100*100=10000 tuples.
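
The arithmetic above can be checked with a small C sketch. The function name maxTuplesFixedDir and its parameters are illustrative only; the values used in the note after the code (4096-byte pages, 100 directory slots of 3 bytes each, 32-byte minimum tuples) are the ones from the answer.

```c
#include <assert.h>

// Max tuples per page when the directory has a fixed number of slots.
// Hypothetical helper for checking the arithmetic, not part of the exercise.
int maxTuplesFixedDir(int pageSize, int dirSlots, int dirEntrySize, int minTupSize)
{
	assert(minTupSize > 0);
	int dataSpace = pageSize - dirSlots * dirEntrySize; // bytes left for tuple data
	int bySpace = dataSpace / minTupSize;               // tuples that fit in that space
	return (bySpace < dirSlots) ? bySpace : dirSlots;   // the directory caps the count
}
```

Calling maxTuplesFixedDir(4096, 100, 3, 32) returns 100, so 100 pages hold 100*100 = 10000 tuples.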

The maximum number of tuples still occurs when all tuples are minimum size. However, in this case we need to balance the tuple space against the directory space. For example, if we have 100 tuples, then the top 3200 bytes of the page are occupied by tuple data, leaving 896 (4096-3200) bytes for the slot directory. We can clearly add more tuples, since we have space for them and space to hold their directory entries. Eventually, though, there will be enough tuples that there is no more room to add directory entries for them and the page is full. Since each tuple requires space for its data (32 bytes) plus 3 bytes for a directory entry, we can compute the maximum number of tuples that will fit in a page by finding the maximum N such that (3*N + 32*N) <= 4096. In this case, N=117 (117*35 = 4095), and so the file can hold at most 100*117 = 11700 tuples.

This scenario is not totally implausible since some common tables have fixed-size tuples (consider a "link" table with just two foreign keys). Of course, in such a case, we wouldn't need the tuple directory either, since we could simply compute the tuple offset based on its number in the page.
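
The balance between tuple space and directory space described above reduces to one integer division. A minimal sketch, using the hypothetical helper name maxTuplesGrowingDir:

```c
#include <assert.h>

// Largest N such that N tuples plus N directory entries fit in one page.
// Hypothetical helper; assumes every tuple has the same (minimum) size.
int maxTuplesGrowingDir(int pageSize, int dirEntrySize, int tupSize)
{
	assert(dirEntrySize + tupSize > 0);
	return pageSize / (dirEntrySize + tupSize); // integer division gives the floor
}
```

For a 4096-byte page, 3-byte directory entries and 32-byte tuples this gives 117, matching the value of N above.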

Assume that the record has the following structure:

The fixed storage cost includes:

the record length (4 bytes)
offsets for each of the fields (4 bytes times 6)
fixed-length fields id (4 bytes), ssn (20 bytes), born (4 bytes for date)

This gives a total fixed storage cost of 4+24+4+20+4 = 56 bytes

For the John Smith record, add additional bytes for

name (12 bytes ... 10 bytes rounded up for alignment)
addr (28 bytes ... no rounding up needed)
name (48 bytes ... no rounding up needed)

giving a total of 56+12+28+48 = 144 bytes

For the Jane Brown record, add additional bytes for

name (12 bytes ... 10 bytes, rounded up for alignment)
addr (32 bytes ... 31 bytes, rounded up)
name (44 bytes ... 41 bytes, rounded up)

giving a total of 56+12+32+44 = 144 bytes

It is a coincidence that both records come out with the same length.
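
The two totals can be reproduced with a short sketch that rounds each variable-length field up to a multiple of 4 bytes and adds the 56-byte fixed cost. The names roundUp and recordSize are illustrative, and the raw field lengths are the ones quoted in the answers above.

```c
#include <assert.h>
#include <stddef.h>

// Round n up to the next multiple of align (align must be positive).
size_t roundUp(size_t n, size_t align)
{
	assert(align > 0);
	return ((n + align - 1) / align) * align;
}

// Record size: fixed part plus each variable-length field rounded up to 4 bytes.
size_t recordSize(size_t fixedCost, const size_t varLens[], int nVar)
{
	size_t total = fixedCost;
	for (int i = 0; i < nVar; i++)
		total += roundUp(varLens[i], 4);
	return total;
}
```

With raw lengths {10, 28, 48} for John Smith and {10, 31, 41} for Jane Brown, recordSize(56, ...) returns 144 in both cases.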

Every possible collection of bits/bytes represents a valid Datum value (e.g. you can't simply use zero to represent NULL, because zero is a perfectly useful integer value). Since there is no way to represent NULL as a Datum, we clearly can't include NULL values in the Datum array. This means that we need a separate representation for NULLs; it makes sense to simply use a bit-string, with one bit for each attribute, where a value of 1 means "this attribute is NULL", and a value of 0 means "this attribute has a value; look for it in the Datum array".
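
As a minimal sketch of the bit-string described above, the following helpers set and test one bit per attribute. The names setNull and isNull are hypothetical, and the layout simply follows the description in this answer (1 means NULL); it is not taken from PostgreSQL's own tuple format or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Mark attribute number attr (0-based) as NULL in the bit-string.
void setNull(uint8_t *bits, int attr)
{
	assert(bits != 0 && attr >= 0);
	bits[attr / 8] |= (uint8_t)(1u << (attr % 8));
}

// True if attribute attr is NULL; false means "look in the Datum array".
bool isNull(const uint8_t *bits, int attr)
{
	assert(bits != 0 && attr >= 0);
	return ((bits[attr / 8] >> (attr % 8)) & 1u) != 0;
}
```

A tuple with n attributes needs (n + 7) / 8 bytes for this bit-string, initialised to zero so that all attributes start as non-NULL.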

Fixed-length records with a presence bit-vector ...
1. Show the internal record structure and compute the (average) size of a record ...
  
  Each record has something like the following structure, where fields are arranged to ensure that no numeric field begins on a non-aligned address boundary.
  While character fields don't need to be aligned on 4-byte addresses, they do need to be as large as the maximum number of characters that might be stored in them (e.g. varchar(10) always occupies 10 bytes, regardless of the actual size of the string).
  The size of each record is thus:
  - 4 bytes for the id# field
  - 8 bytes for the birth field
  - 8 bytes for the score field
  - 1 byte for the gender field
  - 30 bytes for the name field
  - 10 bytes for the degree field
  giving a total of 4 + 8 + 8 + 1 + 30 + 10 = 61 bytes.
  
  This will need to be padded to 64 bytes to ensure that the next record in the page also begins on a 4-byte address.
  
  Solution: R = 64
2. Compute how many blocks are needed to store the whole relation
  
  If each record is 64 bytes long, and there are 1024 bytes in a block, then we could potentially store N_r = floor(1024/64) = 16 records in a block. However, we also need to store the presence vector to indicate which record slots are actually filled. This requires at least N_r bits. Since 16 records would fill the block completely, there would be no room for the vector, which reduces the effective number of records per block to 15. The block contains 15*64-byte records along with a 15-bit (= 2-byte) presence vector. This "wastes" 1024 - 15*64 - 2 = 62 bytes in each block, which is unfortunate but unavoidable. Thus, N_r = 15
  If there are 15 records in each block, then we need b = ceil(20,000/15) = 1334 blocks to store all of the records.
  Solution: b = 1334
3. Compute how long it takes to answer a query on id# if the file is sorted on this field (worst case value)
  
  Performing a binary search requires us to examine at most ceil(log₂b) = ceil(log₂1334) = 11 blocks. Since the cost of reading each block is T_r=10ms, then the total i/o cost is 110 ms
  Solution: T_BinarySearch = 110ms
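
The three steps of the fixed-length calculation can be checked with a few integer helpers. This is a sketch only: the function names are hypothetical, and the inputs (64-byte records, 1024-byte blocks, 20,000 records, T_r = 10ms) are the values used in the answers above.

```c
#include <assert.h>

// Largest n such that n records plus an n-bit presence vector fit in a block.
int recsPerBlockBitmap(int blockSize, int recSize)
{
	assert(recSize > 0);
	int n = blockSize / recSize;
	while (n > 0 && n * recSize + (n + 7) / 8 > blockSize)
		n--; // drop a record until the bit-vector also fits
	return n;
}

// ceil(a / b) for positive integers.
int ceilDiv(int a, int b)
{
	assert(a >= 0 && b > 0);
	return (a + b - 1) / b;
}

// ceil(log2(b)): worst-case blocks read by a binary search over b blocks.
int ceilLog2(int b)
{
	assert(b > 0);
	int k = 0;
	while ((1 << k) < b)
		k++;
	return k;
}
```

Here recsPerBlockBitmap(1024, 64) gives 15, ceilDiv(20000, 15) gives 1334, and ceilLog2(1334) gives 11, so the worst-case cost is 11 * 10ms = 110ms.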
Variable-length records with a fixed-size directory ...
1. Show the internal record structure and compute the (average) size of a record
  
  Each record has something like the following structure, where fields are arranged to ensure that no numeric field begins on a non-aligned address boundary.
  
  In this case, one byte of storage is required for each field to hold the offset of the field. Since there are 6 fields, this will require 6 bytes, which then needs to be padded to 8 bytes to ensure that the first numeric field starts on a 4-byte address boundary.
  
  The offset block will be followed by four fixed-size fields:
  - 4 bytes for the id# field
  - 8 bytes for the birth field
  - 8 bytes for the score field
  - 1 byte for the gender field
  These will be followed by the variable-length fields:
  - name, with an average of 15 characters (15 bytes)
  - degree, with an average of 5 characters (5 bytes)
  On average, this gives a total record size of 8 + 4 + 8 + 8 + 1 + 15 + 5 = 49 bytes. This will need to be padded to a multiple of 4 bytes, and so we would expect an effective average record size of 52 bytes.
  Solution: R = 52
2. Compute how many blocks are needed to store the whole relation
  
  If each record is 52 bytes long (on average), and there are 1024 bytes in a block, then we could potentially store N_r = floor(1024/52) = 19 records in a block. However, we also need to store a directory to indicate where each record is located. This requires at least N_r bytes. If the block contains 19*52-byte records, then the amount of space available for the directory is 1024-19*52 = 36 bytes, so there is room for both the directory and all N_r records. This "wastes" 36-19 = 17 bytes in each block (on average), which is unfortunate but unavoidable. Thus, N_r = 19
  
  If there are 19 records in each block, then we need b = ceil(20,000/19) = 1053 blocks to store all of the records.
  
  Solution: b = 1053
3. Compute how long it takes to answer a query on id# if the file is sorted on this field (worst case value)
  
  Performing a binary search requires us to examine at most ceil(log₂b) = ceil(log₂1053) = 11 blocks. Since the cost of reading each block is T_r=10ms, then the total i/o cost is 110 ms
  
  Solution: T_BinarySearch = 110ms
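
The parts of the variable-length calculation that differ from the fixed-length case are the padded offset block and the per-record directory entry. A minimal sketch of both follows; the function names are hypothetical, and the 1-byte offsets, 1-byte directory entries and 4-byte alignment are the assumptions stated in the answers above.

```c
#include <assert.h>

// Round n up to a multiple of 4 (the alignment assumed in the answer).
int pad4(int n)
{
	assert(n >= 0);
	return (n + 3) / 4 * 4;
}

// Average record size: padded offset block (1 byte per field),
// then fixed-size and average variable-length data, padded again.
int avgVarRecordSize(int nFields, int fixedBytes, int avgVarBytes)
{
	assert(nFields > 0);
	int offsets = pad4(nFields);
	return pad4(offsets + fixedBytes + avgVarBytes);
}

// Records per block when each record also needs one directory entry.
int recsPerBlockDir(int blockSize, int recSize, int dirEntrySize)
{
	assert(recSize + dirEntrySize > 0);
	return blockSize / (recSize + dirEntrySize);
}
```

With 6 fields, 4+8+8+1 = 21 fixed bytes and 15+5 = 20 average variable bytes, avgVarRecordSize(6, 21, 20) gives 52, and recsPerBlockDir(1024, 52, 1) gives 19, which leads to ceil(20,000/19) = 1053 blocks.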