COMP9315 Week 03 Monday Lecture

❖ Things To Note

no Quiz this week
prepare for Assignment 1 (Prac Exercise P04)
- complex.c is useful, but is not everything
have "fixed" the SET in testing expected output
auto-marking will have more tests than run_test.py
submission via give and Webcms3 are now set up

❖ Assignment 1

pname.c

struct PersonName ... internal representation of names
pname_in() ... reads a string and converts to a PersonName
pname_out() ... converts a PersonName to a string
pname_family(), pname_given(), pname_show()
pname_eq(), pname_lt(), etc. etc. etc.

pname.source

defines a "bridge" between SQL and C functions
defines function interfaces and operator properties
defines operator classes for indexing (btree and hash)

❖ Assignment 1 (cont)

How it fits together:

Makefile converts pname.c into pname.so
which can be linked into the PostgreSQL server
Makefile converts pname.source into pname.sql
by filling in the full file/path names
load SQL interface by loading pname.sql into a database
which gives the type, all functions, all operators and indexes
can remove the type via drop type PersonName cascade

❖ Assignment 1 (cont)

Creating and using PersonName objects:

❖ Assignment 1 (cont)

Incorrect memory usage:

❖ Assignment 1 (cont)

Correct memory usage:

❖ Tuples

❖ Tuples

Each page contains a collection of tuples

What do tuples contain? How are they structured internally?

❖ Records vs Tuples

A table is defined by a collection of attributes (schema), e.g.

create table Employee (
   id integer primary key,  name varchar(20),
   job  varchar(10),  dept number(4)
);

Tuple = collection of attribute values for such a schema, e.g.

(33357462, 'Neil Young', 'Musician', 0277)

Record = sequence of bytes, containing data for one tuple, e.g.

Bytes need to be interpreted relative to schema to get tuple

❖ Operations on Records

Common operation on records ... access record via RecordId:

Record get_record(Relation rel, RecordId rid) {
    (pid,tid) = rid;
    Page *buf = request_page(rel, pid);
    return get_bytes(buf, tid);
}

Gives a sequence of bytes, which needs to be interpreted, e.g.

Relation rel = ... // relation schema
Record rec = get_record(rel,rid);
Tuple t = makeTuple(rel,rec);

Once we have a tuple, we can access individual attributes/fields

❖ Operations on Tuples

Once we have a record, we need to interpret it as a tuple ...

Tuple t = makeTuple(Relation rel, Record rec)

convert record to tuple data structure for relation rel

Once we have a tuple, we want to examine its contents ...

Typ getTypField(Tuple t, int fieldNum)

extract the fieldNum'th field from a Tuple as a value of type Typ

E.g. int x = getIntField(t,1), char *s = getStrField(t,2)

❖ Scanning

Access methods typically involve iterators, e.g.

Scan s = start_scan(Relation r, ...)

commence a scan of relation r
Scan may include condition to implement WHERE-clause
Scan holds data on progress through file (e.g. current page)

Tuple next_tuple(Scan s)

returns next Tuple after last accessed one
returns NULL if no more Tuples left in the relation

❖ Example Query

Example: simple scan of a table ...

select name from Employee

implemented as:

DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t;  // current tuple
while ((t = next_tuple(s)) != NULL)
{
   char *name = getStrField(t,2);
   printf("%s\n", name);
}

❖ Exercise: Implement next_tuple()

Consider the following possible Scan data structure

typedef struct {
   Relation rel;
   Page     *curPage;  // Page buffer
   int      curPID;    // current pid
   int      curTID;    // current tid
} ScanData;

Assume tuples are indexed 0..nTuples(p)-1

Assume pages are indexed 0..nPages(rel)-1

Implement the Tuple next_tuple(Scan) function

P.S. What's in a Relation object?

❖ Fixed-length Records

Encoding scheme for fixed-length records:

record format (length + offsets) stored in catalogue
data values stored in fixed-size slots in data pages

Since the record format is used frequently at query time, it should be kept in memory.

❖ Variable-length Records

Some encoding schemes for variable-length records:

Prefix each field by length
Terminate fields by delimiter
Array of offsets

❖ Converting Records to Tuples

A Record is an array of bytes (byte[])

representing the data values from a typed Tuple

A Tuple is a collection of named,typed values (cf. C struct)

Information on how to interpret the bytes as typed values

will be contained in schema data in DBMS catalogue
may be stored in the header for the data file
may be stored partly in the record and partly in the schema

For variable-length records, some formatting info ...

must be stored in the record or in the page directory

❖ Converting Records to Tuples (cont)

DBMSs typically define a fixed set of field types, e.g.

DATE, FLOAT, INTEGER, NUMBER(n), VARCHAR(n), ...

This determines implementation-level data types:

DATE          time_t
FLOAT         float,double
INTEGER       int,long
NUMBER(n)     int[] (?)
VARCHAR(n)    char[]

❖ Converting Records to Tuples (cont)

A Tuple could be defined as

a list of field descriptors for a record instance
(where a FieldDesc gives (offset,length,type) information)
along with a reference to the Record data

typedef struct {
  ushort    nfields;   // number of fields/attrs
  ushort    data_off;  // offset in struct for data
  Record    data;      // pointer to record in buffer
  FieldDesc fields[];  // field descriptions (flexible array, must be last)
} Tuple;

Fields are derived from relation descriptor + record instance data.

❖ Converting Records to Tuples (cont)

Tuple data could be

a pointer to bytes stored elsewhere in memory

❖ Converting Records to Tuples (cont)

Or, tuple data could be ...

appended to Tuple struct (used widely in PostgreSQL)

❖ Exercise: How big is a FieldDesc?

FieldDesc = (offset,length,type), where

offset = offset of field within record data
length = length (in bytes) of field
type = data type of field

If pages are 8KB in size, how many bits are needed for each?

E.g.

❖ PostgreSQL Tuples

Definitions: include/postgres.h, include/access/*tup*.h

Functions: backend/access/common/*tup*.c e.g.

HeapTuple heap_form_tuple(desc,values[],isnull[])
heap_deform_tuple(tuple,desc,values[],isnull[])

PostgreSQL defines tuples via:

a contiguous chunk of memory
starting with a header giving e.g. #fields, nulls
followed by the data values (as sequence of Datum)

❖ PostgreSQL Tuples (cont)

Tuple structure:

[Diagram:Pics/storage/pg-tuple-struct.png]

❖ PostgreSQL Tuples (cont)

Tuple-related data types:

// representation of a data value
typedef uintptr_t Datum;

The actual data value:

may be stored in the Datum (e.g. int)
may have a header with length (for varlen attributes)
may be stored in a TOAST file

❖ PostgreSQL Tuples (cont)

Tuple-related data types: (cont)

// TupleDescData: schema-related information for HeapTuples

typedef struct tupleDescData
{
  int         natts;          // number of attributes in the tuple 
  Oid         tdtypeid;       // composite type ID for tuple type 
  int32       tdtypmod;       // typmod for tuple type 
  int         tdrefcount;     // reference count (-1 if not counting)
  TupleConstr *constr;        // constraints, or NULL if none 
  FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
  // attrs[N] is description of attribute number N+1 
} *TupleDesc;

❖ PostgreSQL Tuples (cont)

HeapTupleData contains information about a stored tuple

typedef HeapTupleData *HeapTuple;

typedef struct HeapTupleData
{
  uint32           t_len;  // length of *t_data 
  ItemPointerData t_self;  // SelfItemPointer 
  Oid         t_tableOid;  // table the tuple came from 
  HeapTupleHeader t_data;  // -> tuple header and data 
} HeapTupleData;

HeapTupleHeader is a pointer to a location in a buffer

❖ PostgreSQL Tuples (cont)

PostgreSQL stores a single block of data for each tuple

containing a tuple header, followed by data byte[]

typedef struct HeapTupleHeaderData // simplified
{
  HeapTupleFields t_heap;
  ItemPointerData t_ctid;      // TID of this tuple or newer version
  uint16          t_infomask2; // #attributes + flags
  uint16          t_infomask;  // flags e.g. has_null, has_varwidth
  uint8           t_hoff;      // sizeof header incl. bitmap+padding
  // above is fixed size (23 bytes) for all heap tuples
  bits8           t_bits[1];   // bitmap of NULLs, variable length
  // OID goes here if HEAP_HASOID is set in t_infomask
  // actual data follows at end of struct
} HeapTupleHeaderData;

❖ PostgreSQL Tuples (cont)

Tuple-related data types: (cont)

typedef struct HeapTupleFields  // simplified
{
  TransactionId t_xmin;  // inserting xact ID
  TransactionId t_xmax;  // deleting or locking xact ID
  union {
    CommandId   t_cid;   // inserting or deleting command ID
    TransactionId t_xvac;// old-style VACUUM FULL xact ID 
  } t_field3;
} HeapTupleFields;

Note that not all system fields from stored tuple appear

oid is stored after the tuple header, if used
both xmin/xmax are stored, but only one of cmin/cmax

❖ Relational Operations and Cost Models

❖ DBMS Architecture (revisited)

Implementation of relational operations in DBMS:

[Diagram:Pics/scansortproj/dbmsarch-relop.png]

❖ Relational Operations

DBMS core = relational engine, with implementations of

selection, projection, join, set operations
scanning, sorting, grouping, aggregation, ...

In this part of the course:

examine methods for implementing each operation
develop cost models for each implementation
characterise when each method is most effective

Terminology reminder:

tuple = collection of data values under some schema ≅ record
page = block = collection of tuples + management data = i/o unit
relation = table ≅ file = collection of tuples

❖ Cost Analyses

When showing the cost of operations, don't include T_r and T_w:

for queries, simply count number of pages read (or written)
for updates, use n_r and n_w to distinguish reads/writes

When comparing two methods for same query

ignore the cost of writing the result (same for both)

In counting reads and writes, assume minimal buffering

each request_page() causes a read
each release_page() causes a write (if page is dirty)

❖ Cost Analyses (cont)

Two "dimensions of variation":

which relational operation (e.g. Sel, Proj, Join, Sort, ...)
which access-method (e.g. file struct: heap, indexed, hashed, ...)

Each query method involves an operator and a file structure:

e.g. primary-key selection on hashed file
e.g. primary-key selection on indexed file
e.g. join on ordered heap files (sort-merge join)
e.g. join on hashed files (hash join)
e.g. two-dimensional range query on R-tree indexed file

As well as query costs, consider update costs (insert/delete).

❖ Cost Analyses (cont)

SQL vs DBMS engine

select ... from R where C
- find relevant tuples (satisfying C) in file(s) of R
insert into R values(...)
- place new tuple in some page of a file of R
delete from R where C
- find relevant tuples and "remove" from file(s) of R
update R set ... where C
- find relevant tuples in file(s) of R and "change" them

❖ Query Types

Type    SQL                                   RelAlg            a.k.a.
Scan    select * from R                       R                 -
Proj    select x,y from R                     Proj[x,y]R        -
Sort    select * from R order by x            Sort[x]R          ord
Sel₁    select * from R where id = k          Sel[id=k]R        one
Sel_n   select * from R where a = k           Sel[a=k]R         -
Join₁   select * from R,S where R.id = S.r    R Join[id=r] S    -

Different query classes exhibit different query processing behaviours.

❖ File Structures

When describing file structures

use a large box to represent a page
use either a small box or tup_i (or rec_i) to represent a tuple
sometimes refer to tuples via their key
- mostly, key corresponds to the notion of "primary key"
- sometimes, key means "search key" in selection condition

[Diagram:Pics/scansortproj/one-page.png]

❖ File Structures (cont)

Consider three simple file structures:

heap file ... tuples added to any page which has space
sorted file ... tuples arranged in file in key order
hash file ... tuples placed in pages using hash function

All files are composed of b primary blocks/pages

[Diagram:Pics/scansortproj/file-struct0.png]

Some records in each page may be marked as "deleted".

❖ Exercise: Operation Costs

For each of the following file structures

determine #page-reads + #page-writes for each operation

You can assume the existence of a file header containing

values for r, R, b, B, c
index of first page with free space (and a free list)

Assume also

each page contains a header and directory as well as tuples
no buffering (worst case scenario)

❖ Operation Costs Example

Heap file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-heap.png]

❖ Operation Costs Example (cont)

Sorted file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-sort1.png]

❖ Operation Costs Example (cont)

Hashed file with b = 3, c = 4, h(k) = k%3

[Diagram:Pics/scansortproj/file-hash.png]

❖ Scanning

❖ Scanning

Consider the query:

select * from Rel;

Operational view:

for each page P in file of relation Rel {
   for each tuple t in page P {
      add tuple t to result set
   }
}

Cost: read every data page once

Cost = b (b page reads + ≤b page writes)

❖ Scanning (cont)

Scan implementation when file has overflow pages, e.g.

[Diagram:Pics/scansortproj/file-struct1.png]

❖ Scanning (cont)

In this case, the implementation changes to:

for each page P in file of relation Rel {
    for each tuple t in page P {
        add tuple t to result set
    }
    for each overflow page V of page P {
        for each tuple t in page V {
            add tuple t to result set
}   }   }

Cost: read each data and overflow page once

Cost = b + b_Ov

where b_Ov = total number of overflow pages

❖ Selection via Scanning

Consider a "one" query (at most one matching tuple) like:

select * from Employee where id = 762288;

In an unordered file, search for matching tuple requires:

[Diagram:Pics/scansortproj/file-search.png]

Guaranteed at most one answer; but could be in any page.

❖ Selection via Scanning (cont)

Overview of scan process:

for each page P in relation Employee {
    for each tuple t in page P {
        if (t.id == 762288) return t
}   }

Cost analysis for a "one" search in an unordered file:

best case: read one page, find tuple
worst case: read all b pages, find in last (or don't find)
average case: read half of the pages (b/2)

Page Costs: Cost_avg = b/2, Cost_min = 1, Cost_max = b

❖ Exercise: Cost of Search in Hashed File

Consider the hashed file structure b = 10, c = 4, h(k) = k%10

[Diagram:Pics/scansortproj/hash-file-10.png]

Describe how the following queries

select * from R where k = 51;
select * from R where k > 50;

might be solved in a file structure like the above (h(k) = k%b).

Estimate the minimum and maximum cost (as #pages read)

❖ Relation Copying

Consider an SQL statement like:

create table T as (select * from S);

Effectively, copies data from one table to another.

Process:

s = start scan of S
make empty relation T
while (t = next_tuple(s)) {
    insert tuple t into relation T
}

❖ Relation Copying (cont)

Possible that T is smaller than S

may be unused free space in S where tuples were removed
if T is built by simple append, will be compact

[Diagram:Pics/scansortproj/copy-files.png]

❖ Relation Copying (cont)

In terms of existing relation/page/tuple operations:

Relation in;       // relation handle (incl. files)
Relation out;      // relation handle (incl. files)
int ipid,opid,tid; // page and record indexes
Record rec;        // current record (tuple)
Page ibuf,obuf;    // input/output file buffers

in = openRelation("S", READ);
out = openRelation("T", NEW|WRITE);
clear(obuf);  opid = 0;
for (ipid = 0; ipid < nPages(in); ipid++) {
    get_page(in, ipid, ibuf);
    for (tid = 0; tid < nTuples(ibuf); tid++) {
        rec = get_record(ibuf, tid);
        if (!hasSpace(obuf,rec)) {
            put_page(out, opid++, obuf);
            clear(obuf);
        }
        insert_record(obuf,rec);
}   }
if (nTuples(obuf) > 0) put_page(out, opid, obuf);

❖ Exercise: Cost of Relation Copy

Analyse cost for relation copying:

if both input and output are heap files
if input is sorted and output is heap file
if input is heap file and output is sorted

Assume ...

r records in input file, c records/page
b_in = number of pages in input file
some pages in input file are not full
all pages in output file are full (except the last)

Give cost in terms of #pages read + #pages written
