COMP9315 23T1

Assignment 2
Relational Operators and Memory Buffer Management

Last updated: Monday 24th April 12:28pm
Most recent changes are shown in red ... older changes are shown in brown.

Recent Updates

[2023-04-07 22:40]

A minor update for test4.

[2023-04-07 15:00]

We have prepared four additional test cases. You can download the updated code from the same address. Note that the updated version is only about test cases. No code is changed. Please let us know if you find some errors in the source code.

We also provides more marking details. In real applications, you cannot make any assumption about the input. Therefore, around 10--15 test cases (including the provided ones) will be used in marking. If you implement the buffer correctly, your output should pass all these cases. To facilitate the verification of your solution in marking, db.c may be modified at two places. First, information in log_read_page, log_release_page, log_open_file, and log_close_file will be printed as described in code comments. Second, as you can see in init_db, the current version simply assigns page id by increasing an integer from 0. In marking, a different way will be used to produce page id. That means the only way you get the page id is from the first INT64 in each page, and it can be any possible value. The only guaranteed thing is that each page id is unique for each table (i.e., page id of pages from different tables may be same).

We also provide some tips to do assignment. As we introduced in lectures, you do not need to consider the data in current memory buffers when analyzing IOs (i.e., you can assume the memory buffer is empty). A reference can be found in cost models. For example, when join two tables via nested for loop, you need to choose one table for outer for loop to minimize IOs. Some table pages may already in the memory buffer, but you do not need to consider them. The clock-sweep strategy is an important in this assignment. To help you do the assignment, we present the following pseudocode about the process when requesting a page from the buffer. You can compare it with your implementation.

... initialization ...
nvb = 0
foreach slot in buffer[]: slot.pin=0, slot.usage=0

... request a page from buffer ...
function request_page(page)
    if(page in buffer)
        buffer[page index].usage++
        buffer[page index].pin=1
        return the page
    while(true)
        if(buffer[nvb].pin == 0 && buffer[nvb].usage == 0)
            read page from disk to buffer[nvb]
            buffer[nvb].pin = 1
            buffer[nvb].usage = 1
            nvb = (nvb+1) % buffer_size
            return the page
        else
            if(buffer[nvb].usage > 0) buffer[nvb].usage --
            nvb = (nvb+1) % buffer_size

... notify buffer stop using the page ...
function release_page(page)
    buffer[page index].pin = 0

[2023-04-03]

We provide some tips for doing assignment and marking. To simplify the assignment, you do not need to leave an output buffer for both select and join. For example, assume that N represent the number of buffer slots. To conduct a nested for-loop join, you can use N-1 slots to store N-1 pages of one table and use 1 slot to store one page of the other table (note that in lecture notes, you need to allocate one slot as an output page). You can also use additional space (e.g., arrays) for sorting or building hash table. All additional space use in each join need to be freed. The other tip for marking the join operation is that you need to consider the performance of the join order. For instance, in the nested for-loop join, using a different table in the outer loop may produce different read IOs. Therefore, you need to choose the reasonable table in the outer for-loop. As we discussed in lectures, when analyzing the performance (number of IOs) of an operation, you do not need to consider the data in the memory buffer slots. If you are not clear about any requirement of the assignment, feel free to ask questions on the forum.

[2023-03-28]

Fix the bug of writing unexpected values in db.c. The bug would not affect you, and the update is just for rigor.

Aims

This assignment aims to give you

an understanding of how data is organized inside a DBMS
practice in implementing simple relational operators
practice in implementing the memory buffer

The goal is to implement two relational operators, selection and join. Given the memory buffer slots and the page size, you need to implement your own memory buffer to read/write data files from hard drive.

Summary

Deadline: Monday 17 April, 9pm
Pre-requisites:: Buffer Pools, File Management, and Relational Operations
Late Penalty:: 5% of the max assessment mark per-day reduction, for up to 5 days
Marks:: This assignment contributes 20 marks toward your total mark for this course.
Submission:: Moodle > Assignment > Assignment 2

Make sure that you read this assignment specification carefully and completely before starting work on the assignment. Questions which indicate that you haven't done this will simply get the response "Please read the spec". We will provide an extension for all students if there is a major code update in the last three days. Let us know if there is any error in the assignment specification or in the source code.

Introduction

The course is DBMS implementation. It is time to implement a database. The database is somehow a software or a system to retrieve data from files in hard drive. Users communicate with the database via SQL, and SQL is intepreted into the combination of several relational operators. In this assignment, you are required to implement two typical relational operators, select and join. To implement the opeartors, you mush read data from hard drive and manage the memory buffer given that there is a limited number of buffer slots. Related lecture notes can be found as follows. Undoubtedly, a real model database involves a great amount of features and techniques. We simulate a mini database for this assignment. To simplify the problem, we make the folowing assumptions.

The server contains only one database.
The database never updates.
All attributes are integers, i.e., the size of each tuple is fixed.
Each table takes one file, i.e., there is no overflow files.
The select condition is always the equality test for a single attribute.

Get Ready

You do not need the PostgreSQL server for this assignment. All tasks are about C programming. That means you may do the assignment in any Linux environemnt (windows is not supported since linux system functions are used). Remember to test your code on vxdb server before submission since we will test your code there. The following process assumes that you are on the vxdb server. You may change some paths according your local environment. You should start by getting the template codes.

... assume you are on the vxdb server ...
$ cd /localstorage/$USER
$ unzip /web/cs9315/23T1/assignment/2/ass2.zip

You should see a new directory called ass2 and the following files inside.

ass2/
|--- README.md         // log for recent updates
|--- db.c              // global database information
|--- db.h              // definitions for all data types
|--- main.c            // main entry
|--- run.sh            // script to run the code
|--- ro.c              // relational operators
|--- ro.h              // definitions for ro.c
|--- Makefile          // compile rules
|--- test1/            // directory containing testing files
     |--- data_1.txt   // testing data
     |--- query_1.txt  // testing queries
     |--- expected_log_1.txt // expected results

Next, you can complile and run the code.

$ cd /localstorage/$USER/ass2
$ make
... should no error here ...
$ ./run.sh
... execture the program and examine the output ...

After running the code, a log file called log.txt will be created under the test1/ directory.

Code structure

A great amount of comments are provided in each file. Read all codes to fully understand the data structure. You can start from reading db.h and then main.c. For instance, db.h contains a set of important data types to represent the database, the table, and the global configuration. Certain code snippets will give you inspirations to complete the task. We briefly describe what main.c does below.

load several input arguments.
init_db() reads an input .txt data file and produce all table files in a certain folder.
init() is an interface for your implementation.
run() loads and inteprets queries from a .txt query file and invoke your implementations of sel() and join() to process queries. Query results will be written into a log file.
release() is an interface for your implementation to release your data structure used for query processing.

An example of table file structure. The following figure gives an example of the file structure for each table stored in hard drive. Each table file is named as the table's oid, which is exactly same as PostgreSQL. The oid of a table can be identified from Table.oid (see db.h for details). Since we are dealing with fixed-size tuples, we do not need to record the meta information in each table. For each page, we first add an INT64 to represent the page id. Then, we sequentially add each tuples until the sapce is not enough to hold a tuple. We add 0 to the end and get the whole page. The example shows a table file for three attributes. You can identify the details by reading the implementation of init_db() in db.c.

Find more details by reading the code. All following content in this specification assumes that you have already read the code and at least understood the role of each file.

Tasks

In this assignment, you need to implement two functions, sel() and join(), which represent select and join relational operators, respectively. You are also free to add any additional functions and files. In your implementation, you are given the page size (Conf.page_size), the number of available memory buffer slots (Conf.buf_slots), and the maximum allowed number of opend files (Conf.file_limit) at the same time.

To support the two operators, you need to write your own code for memory buffer management based on the input configuration, which means the number of data pages cannot exceed the number of buffer slots. You are only required to implement the clock sweep policy for memory buffer replacement, which is also used by Postgresql. To verified your process of buffer memory for marking, you need to invoke log_read_page() every time you read a page from the hard drive and invoke log_release_page() every time you release a page from the memory buffer. The input argument for both functions are the page id. You can check what we do in these functions in db.c.

As introduced in lectures, opening a file is relatively costly. To improve the performance, we can maintain several opened file pointers. When reading data from a file, you can directly used the file pointer if the file has already been opened and maintained. Given the Conf.file_limit, you need to close a file pointer when the number of opened files reaches the limit. You are free to use any replacement stratey for opened file pointers. To verified your process of file pointers management, you need to invoke log_open_file() every time you open a file and log_close_file() every time you close a file. The input argument for both functions are the table oid. You can check what we do in these functions in db.c. You will be reduced 3/20 marks if maintaining opened file pointers is not implemented (i.e., open a file when you need, and close the file when finish reading).

Task 1: select

The function for select operator is presented as follows. You can see it in ro.h. For simplicity, we only consider the equality testing for a single attribute in the condition of select. Instead of giving an attribute name, we directly provide the idx argument, which the offset of the attribute in the tuple. idx starts from 0. cond_val is the value for equality testing of the (idx+1)-th attribute value. Clearly, table_name represents the table name. The returned value is a _Table pointer. See db.h for details of _Table. Below is an example of output.

_Table* sel(const UINT idx, const INT cond_val, const char* table_name);

# a table example, named "T1"
# the table contains three attributes for each tuple
1 2 3
5 8 13
21 34 55
2 8 6

# the following result table is derived after running sel(1,8,"T1")
5 8 13
2 8 6

Task 2: join

The function for (inner) join operator is presented as follows. You can see it in ro.h. Similar to sel(), we directly provide the attribute offset of each table for joining. table1_name and table2_name are names for two tables, respectively. idx1 and idx2 are the offsets of two attributes in table1_name and table2_name, respectively. The returned value is a _Table pointer. Below is an example of output.

_Table* join(const UINT idx1, const char* table1_name, const UINT idx2, const char* table2_name);

# a table example, named "T1"
# the table contains three attributes for each tuple
1 2 3
5 8 13
21 34 55
5 8 6

# a table example, named "T2"
# the table contains three attributes for each tuple
1 2 3
4 5 6
7 8 9
10 11 12

# the following result table is derived after running join(0, "T1", 1, "T2")
5 8 13 4 5 6
5 8 6 4 5 6

Assume we aim to join two tables named T1 and T2. T1.npages and T2.npages denote the number of pages of two tables, respectively. For the strategy of join, you are only require to implement the naive nested for-loop join when the number of buffer slots is not enough to hold two tables, i.e., Conf.buf_slots < T1.npages + T2.npages. The number of pages for a table can be calculated by Conf.page_size and Table.ntuples. If the buffer slots are enough (i.e., Conf.buf_slots >= T1.npages + T2.npages), you can load all data in memory and perform an in-memory join. In this case, you need to implement either sort-merge join or hash join.

Submission & Evaluation

In this assignment, you are required to complete ro.c and ro.h. You may add some additional .c and .h files, and if that happens, you definitely need to update Makefile to make your new files work. Therefore, you need to submit at least three files, ro.c, ro.h, and Makefile. You need to submit all files on Moodle. You do not need to upload all green files marked in the file structure since we will add them to your code in testing. That means you are free to change them for developing and debugging, but all of them will be overwritten even you submit them. Specifically, below is the process to test your code.

download your solution.
copy db.h, db.c and main.c to the folder (~~the same ones~~ as you can see in ass2/). Overwrite files with the same name.
execute make to compile the code.
copy run.sh and test/ to the folder, which provide different and much more test cases.
execute ./run.sh to load data and process queries.
compared the output log file with the expected result.

Have Fun~