C: Costs, Implementing Scan, Sort, Projection

❖ Relational Operations

DBMS core = relational engine, with implementations of

selection, projection, join, set operations
scanning, sorting, grouping, aggregation, ...

In this part of the course:

examine methods for implementing each operation
develop cost models for each implementation
characterise when each method is most effective

Terminology reminder:

tuple = collection of data values under some schema ≅ record
page = block = collection of tuples + management data = i/o unit
relation = table ≅ file = collection of tuples

❖ Cost Analyses

When showing the cost of operations, don't include T_r and T_w:

for queries, simply count number of pages read (or written)
for updates, use n_r and n_w to distinguish reads/writes

When comparing two methods for same query

ignore the cost of writing the result (same for both)

In counting reads and writes, assume minimal buffering

each request_page() causes a read
each release_page() causes a write (if page is dirty)
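
The counting convention above can be sketched in C as follows. This is only a minimal model: the names `n_r`, `n_w`, `request_page()` and `release_page()` follow the slide, but the `Page` struct, the in-memory stand-in for the file, the signatures and the helpers `scan_all()` and `update_all()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4                 /* hypothetical file size (b = 4) */

typedef struct {
    int  data;                   /* stand-in for the page contents */
    bool dirty;                  /* set when the page is modified  */
} Page;

static Page file_pages[NPAGES];  /* stand-in for the file on disk  */
static int n_r = 0;              /* pages read so far              */
static int n_w = 0;              /* pages written so far           */

/* Minimal buffering: every request goes to disk, so count one read. */
Page *request_page(int pid)
{
    assert(pid >= 0 && pid < NPAGES);
    n_r++;
    file_pages[pid].dirty = false;
    return &file_pages[pid];
}

/* Releasing a page costs one write only if the page was modified. */
void release_page(Page *p)
{
    if (p->dirty) {
        n_w++;
        p->dirty = false;
    }
}

/* Read-only pass over the file: b reads, no writes. */
void scan_all(void)
{
    for (int pid = 0; pid < NPAGES; pid++) {
        Page *p = request_page(pid);
        release_page(p);
    }
}

/* Update pass that modifies every page: b reads, b writes. */
void update_all(void)
{
    for (int pid = 0; pid < NPAGES; pid++) {
        Page *p = request_page(pid);
        p->data++;
        p->dirty = true;
        release_page(p);
    }
}
```

Under these assumptions, a query that only scans reports its cost through `n_r`, while an update also accumulates `n_w`.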

❖ Cost Analyses (cont)

Two "dimensions of variation":

which relational operation (e.g. Sel, Proj, Join, Sort, ...)
which access-method (e.g. file struct: heap, indexed, hashed, ...)

Each query method involves an operator and a file structure:

e.g. primary-key selection on hashed file
e.g. primary-key selection on indexed file
e.g. join on ordered heap files (sort-merge join)
e.g. join on hashed files (hash join)
e.g. two-dimensional range query on R-tree indexed file

As well as query costs, consider update costs (insert/delete).

❖ Cost Analyses (cont)

SQL vs DBMS engine

select ... from R where C
- find relevant tuples (satisfying C) in file(s) of R
insert into R values(...)
- place new tuple in some page of a file of R
delete from R where C
- find relevant tuples and "remove" from file(s) of R
update R set ... where C
- find relevant tuples in file(s) of R and "change" them

❖ Query Types

Type    SQL                                   RelAlg           a.k.a.

Scan    select * from R                       R                -
Proj    select x,y from R                     Proj[x,y]R       -
Sort    select * from R order by x            Sort[x]R         ord
Sel₁    select * from R where id = k          Sel[id=k]R       one
Sel_n   select * from R where a = k           Sel[a=k]R        -
Join₁   select * from R,S where R.id = S.r    R Join[id=r] S   -

Different query classes exhibit different query processing behaviours.

❖ File Structures

When describing file structures

use a large box to represent a page
use either a small box or tup_i (or rec_i) to represent a tuple
sometimes refer to tuples via their key
- mostly, key corresponds to the notion of "primary key"
- sometimes, key means "search key" in selection condition

[Diagram:Pics/scansortproj/one-page.png]

❖ File Structures (cont)

Consider three simple file structures:

heap file ... tuples added to any page which has space
sorted file ... tuples arranged in file in key order
hash file ... tuples placed in pages using hash function

All files are composed of b primary blocks/pages, each holding up to c tuples

[Diagram:Pics/scansortproj/file-struct0.png]

Some records in each page may be marked as "deleted".

❖ Operation Costs Example

Heap file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-heap.png]

❖ Operation Costs Example (cont)

Sorted file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-sort1.png]

❖ Operation Costs Example (cont)

Hashed file with b = 3, c = 4, h(k) = k%3

[Diagram:Pics/scansortproj/file-hash.png]

❖ Scanning

Consider the query:

select * from Rel;

Operational view:

for each page P in file of relation Rel {
   for each tuple t in page P {
      add tuple t to result set
   }
}

Cost: read every data page once

Cost = b (b page reads + ≤b page writes)

❖ Scanning (cont)

Scan implementation when file has overflow pages, e.g.

[Diagram:Pics/scansortproj/file-struct1.png]

❖ Scanning (cont)

In this case, the implementation changes to:

for each page P in file of relation Rel {
    for each tuple t in page P {
        add tuple t to result set
    }
    for each overflow page V of page P {
        for each tuple t in page V {
            add tuple t to result set
}   }   }

Cost: read each data and overflow page once

Cost = b + b_Ov

where b_Ov = total number of overflow pages
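
The overflow-aware scan can be sketched in C on an in-memory model of the file. The `Page` layout (an `ovflow` pointer chaining overflow pages), the capacity `C` and the function name `scan_with_overflow()` are assumptions for illustration; the function returns the number of pages visited, which should come out as b + b_Ov.

```c
#include <assert.h>
#include <stddef.h>

#define C 4                          /* hypothetical page capacity (tuples) */

typedef struct Page {
    int          ntuples;            /* tuples stored in this page          */
    int          tuples[C];          /* a tuple is one int, for illustration */
    struct Page *ovflow;             /* next overflow page, or NULL         */
} Page;

/* Scan b primary pages plus their overflow chains.
 * Copies every tuple to result[] and returns the number of pages read. */
int scan_with_overflow(Page *primary, int b, int *result, int *nresult)
{
    assert(primary != NULL && b >= 0);
    int pages_read = 0;
    *nresult = 0;
    for (int i = 0; i < b; i++) {
        /* visit the primary page, then each overflow page chained to it */
        for (Page *p = &primary[i]; p != NULL; p = p->ovflow) {
            pages_read++;
            for (int t = 0; t < p->ntuples; t++)
                result[(*nresult)++] = p->tuples[t];
        }
    }
    return pages_read;
}
```

For a file with two primary pages where the first has a chain of two overflow pages, this sketch reads 2 + 2 = 4 pages.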

❖ Selection via Scanning

Consider a "one" query (at most one matching tuple) like:

select * from Employee where id = 762288;

In an unordered file, search for matching tuple requires:

[Diagram:Pics/scansortproj/file-search.png]

Guaranteed at most one answer; but could be in any page.

❖ Selection via Scanning (cont)

Overview of scan process:

for each page P in relation Employee {
    for each tuple t in page P {
        if (t.id == 762288) return t
}   }

Cost analysis for a "one" search in an unordered file:

best case: read one page, find tuple
worst case: read all b pages, find in last (or don't find)
average case: read half of the pages (b/2)

Page Costs: Cost_avg = b/2, Cost_min = 1, Cost_max = b
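
The best and worst cases can be checked with a small C sketch that counts page reads during the search. The `Tuple` and `Page` types, the capacity `C` and the function `scan_select_one()` are hypothetical, and the file is modelled as an in-memory array of pages.

```c
#include <assert.h>
#include <stddef.h>

#define C 4                      /* hypothetical page capacity */

typedef struct { int id; } Tuple;
typedef struct { int ntuples; Tuple tuples[C]; } Page;

/* Search an unordered file of b pages for the tuple with the given id.
 * Returns a pointer to the tuple (or NULL if absent) and stores the
 * number of pages read in *pages_read. */
Tuple *scan_select_one(Page *file, int b, int id, int *pages_read)
{
    assert(file != NULL && pages_read != NULL);
    *pages_read = 0;
    for (int p = 0; p < b; p++) {
        (*pages_read)++;                       /* one read per page visited */
        for (int t = 0; t < file[p].ntuples; t++)
            if (file[p].tuples[t].id == id)
                return &file[p].tuples[t];     /* stop at the first match */
    }
    return NULL;                               /* not found: all b pages read */
}
```

With b = 4, a key stored in the first page costs 1 read, while a key in the last page or a missing key costs 4 reads.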

❖ Relation Copying

Consider an SQL statement like:

create table T as (select * from S);

Effectively, copies data from one table to another.

Process:

s = start scan of S
make empty relation T
while (t = next_tuple(s)) {
    insert tuple t into relation T
}

❖ Relation Copying (cont)

Possible that T is smaller than S

may be unused free space in S where tuples were removed
if T is built by simple append, will be compact

[Diagram:Pics/scansortproj/copy-files.png]

❖ Relation Copying (cont)

In terms of existing relation/page/tuple operations:

Relation in;       // relation handle (incl. files)
Relation out;      // relation handle (incl. files)
int ipid,opid,tid; // page and record indexes
Record rec;        // current record (tuple)
Page ibuf,obuf;    // input/output file buffers

in = openRelation("S", READ);
out = openRelation("T", NEW|WRITE);
clear(obuf);  opid = 0;
for (ipid = 0; ipid < nPages(in); ipid++) {
    get_page(in, ipid, ibuf);
    for (tid = 0; tid < nTuples(ibuf); tid++) {
        rec = get_record(ibuf, tid);
        if (!hasSpace(obuf,rec)) {
            put_page(out, opid++, obuf);
            clear(obuf);
        }
        insert_record(obuf,rec);
}   }
if (nTuples(obuf) > 0) put_page(out, opid, obuf);

❖ Scanning Relations

Example: simple scan of a table ...

select name from Employee

implemented as:

DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t;  // current tuple
while ((t = next_tuple(s)) != NULL)
{
   char *name = getStrField(t,2);
   printf("%s\n", name);
}

❖ Scanning Relations (cont)

Consider the following simple Scan data structure

typedef struct {
   Relation rel;
   Page     *curPage;  // Page buffer
   int      curPID;    // current pid
   int      curTID;    // current tid
} ScanData;
typedef ScanData *Scan;

Scan start_scan(Relation rel)
{
	Scan new = malloc(sizeof(ScanData));
	new->rel = rel;
	new->curPage = get_page(rel,0);
	new->curPID = 0;
	new->curTID = 0;
	return new;
}

❖ Scanning Relations (cont)

Implementation of next_tuple() function:

Tuple next_tuple(Scan s)
{
	while (s->curTID == nTuples(s->curPage)) {
		// finished cur page, get next (loop also skips empty pages)
		if (s->curPID == nPages(s->rel)-1)
			return NULL;  // already on last page: scan complete
		s->curPID++;
		s->curPage = get_page(s->rel, s->curPID);
		s->curTID = 0;
	}
	Record r = get_record(s->curPage, s->curTID);
	s->curTID++;
	return makeTuple(s->rel, r);
}

❖ Scanning in PostgreSQL

Scanning defined in: backend/access/heap/heapam.c

Implements iterator data/operations:

HeapScanDesc ... struct containing iteration state
scan = heap_beginscan(rel,...,nkeys,keys)
tup = heap_getnext(scan, direction)
heap_endscan(scan) ... frees up scan struct
res = HeapKeyTest(tuple,...,nkeys,keys)
... applies the ScanKey tests to a tuple, i.e. checks whether it is a result tuple

❖ Scanning in PostgreSQL (cont)


typedef struct HeapScanDescData *HeapScanDesc;

typedef struct HeapScanDescData
{
  // scan parameters 
  Relation      rs_rd;        // heap relation descriptor 
  Snapshot      rs_snapshot;  // snapshot ... tuple visibility 
  int           rs_nkeys;     // number of scan keys 
  ScanKey       rs_key;       // array of scan key descriptors 
  ...
  // state set up at initscan time 
  PageNumber    rs_npages;    // number of pages to scan 
  PageNumber    rs_startpage; // page # to start at 
  ...
  // scan current state, initially set to invalid 
  HeapTupleData rs_ctup;      // current tuple in scan
  PageNumber    rs_cpage;     // current page # in scan
  Buffer        rs_cbuf;      // current buffer in scan
   ...
} HeapScanDescData;

❖ Scanning in other File Structures

Above examples are for heap files

simple, unordered, maybe indexed, no hashing

Other file structures (access methods) in PostgreSQL:

btree, hash, gist, gin
each implements:
- startscan, getnext, endscan
- insert, delete (update=delete+insert)
- other file-specific operators

❖ The Sort Operation

Sorting is explicit in queries only in the order by clause

select * from Students order by name;

Sorting is used internally in other operations:

eliminating duplicate tuples for projection
ordering files to enhance select efficiency
implementing various styles of join
forming tuple groups in group by

Sort methods such as quicksort are designed for in-memory data.

For large data on disks, need external sorts such as merge sort.

❖ Two-way Merge Sort

Example:

[Diagram:Pics/scansortproj/two-way-ex2.png]

❖ Two-way Merge Sort (cont)

Requires three in-memory buffers:

[Diagram:Pics/scansortproj/two-way-buf.png]

Assumption: cost of Merge operation on two in-memory buffers ≅ 0.

❖ Comparison for Sorting

Above assumes that we have a function to compare tuples.

Needs to understand ordering on different data types.

Need a function tupCompare(r1,r2,f) (cf. C's strcmp)

int tupCompare(r1,r2,f)
{
   if (r1.f < r2.f) return -1;
   if (r1.f > r2.f) return 1;
   return 0;
}

Assume =, <, > are available for all attribute types.

❖ Comparison for Sorting (cont)

In reality, need to sort on multiple attributes and ASC/DESC, e.g.

-- example multi-attribute sort
select * from Students
order by age desc, year_enrolled

Sketch of multi-attribute sorting function

int tupCompare(r1,r2,criteria)
{
   foreach (f,ord) in criteria {
      if (ord == ASC) {
         if (r1.f < r2.f) return -1;
         if (r1.f > r2.f) return 1;
      }
      else {
         if (r1.f > r2.f) return -1;
         if (r1.f < r2.f) return 1;
      }
   }
   return 0;
}
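
The multi-attribute sketch above can be made concrete in C. This is one possible rendering, not the course's implementation: the `Student` struct, the `Field`, `Order` and `Criterion` types, the extra `ncriteria` parameter and the insertion-sort driver `sortStudents()` are all assumptions introduced so that the example is self-contained.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    char name[16];
    int  age;
    int  year_enrolled;
} Student;

typedef enum { F_NAME, F_AGE, F_YEAR } Field;   /* which attribute */
typedef enum { ASC, DESC } Order;               /* sort direction  */
typedef struct { Field f; Order ord; } Criterion;

/* Compare one attribute of two tuples: negative, zero or positive. */
static int fieldCompare(const Student *r1, const Student *r2, Field f)
{
    switch (f) {
    case F_NAME: return strcmp(r1->name, r2->name);
    case F_AGE:  return (r1->age > r2->age) - (r1->age < r2->age);
    case F_YEAR: return (r1->year_enrolled > r2->year_enrolled)
                      - (r1->year_enrolled < r2->year_enrolled);
    }
    return 0;
}

/* Compare on each criterion in turn; the first difference decides. */
int tupCompare(const Student *r1, const Student *r2,
               const Criterion *criteria, int ncriteria)
{
    for (int i = 0; i < ncriteria; i++) {
        int c = fieldCompare(r1, r2, criteria[i].f);
        if (c != 0) {
            int sign = (c > 0) - (c < 0);        /* normalise to -1 or 1 */
            return (criteria[i].ord == ASC) ? sign : -sign;
        }
    }
    return 0;   /* equal on all sort attributes */
}

/* Simple in-memory insertion sort driven by tupCompare(). */
void sortStudents(Student *a, int n, const Criterion *criteria, int nc)
{
    assert(n >= 0);
    for (int i = 1; i < n; i++) {
        Student key = a[i];
        int j = i - 1;
        while (j >= 0 && tupCompare(&a[j], &key, criteria, nc) > 0) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}
```

Passing the criteria `{F_AGE, DESC}, {F_YEAR, ASC}` corresponds to `order by age desc, year_enrolled` in the query above.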

❖ Cost of Two-way Merge Sort

For a file containing b data pages:

require ceil(log₂b) passes to sort,
each pass requires b page reads, b page writes

Gives total cost: 2.b.ceil(log₂b)

Example: Relation with r=10⁵ tuples and c=50 tuples per page ⇒ b=2000 pages.

Number of passes for sort: ceil(log₂2000) = 11

Reads/writes entire file 11 times! Can we do better?
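
The pass count and total cost can be checked with a short C sketch. The function names `twoWayPasses()` and `twoWayCost()` are hypothetical; the pass count is computed by repeated doubling of the run length rather than with floating-point logarithms, which avoids rounding issues.

```c
#include <assert.h>

/* Smallest p such that 2^p >= b, i.e. ceil(log2(b)), using integers only. */
int twoWayPasses(long b)
{
    assert(b >= 1);
    int passes = 0;
    long runLen = 1;              /* pages per sorted run */
    while (runLen < b) {
        runLen *= 2;              /* each pass doubles the run length */
        passes++;
    }
    return passes;
}

/* Total page accesses: each pass reads and writes all b pages. */
long twoWayCost(long b)
{
    return 2 * b * twoWayPasses(b);
}
```

For b = 2000 this gives 11 passes and 2 × 2000 × 11 = 44000 page accesses.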

❖ n-Way Merge Sort

Initial pass uses: B total buffers

[Diagram:Pics/scansortproj/n-way-buf-pass0.png]

Reads B pages at a time, sorts in memory, writes out in order

❖ n-Way Merge Sort (cont)

Merge passes use: B-1 = n input buffers, 1 output buffer

[Diagram:Pics/scansortproj/n-way-buf.png]

❖ n-Way Merge Sort (cont)

Method:


// Produce B-page-long runs
for each group of B pages in Rel {
    read B pages into memory buffers
    sort group in memory
    write B pages out to Temp
}
// Merge runs until everything sorted
numberOfRuns = ⌈b/B⌉
while (numberOfRuns > 1) {
    // n-way merge, where n=B-1
    for each group of n runs in Temp {
        merge into a single run via input buffers
        write run to newTemp via output buffer
    }
    numberOfRuns = ⌈numberOfRuns/n⌉
    Temp = newTemp // swap input/output files
}
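
The method can be simulated in memory with the following C sketch. It makes several simplifying assumptions: each array element stands for one unit of data (think of it as one page), `runLen0` plays the role of B, `n` plays the role of B-1, and the two arrays `data` and `tmp` stand in for Temp and newTemp. The names `nWayMergeSort()`, `mergeRuns()` and the `MAXWAY` limit are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXWAY 64                     /* hypothetical cap on merge fan-in */

static int cmpInt(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge up to n adjacent sorted runs of length runLen, covering
 * src[lo..hi), into dst[lo..hi). */
static void mergeRuns(const int *src, int *dst, size_t lo, size_t hi,
                      size_t runLen, int n)
{
    size_t pos[MAXWAY], end[MAXWAY];  /* cursor and limit per input run */
    int k = 0;
    for (size_t s = lo; s < hi && k < n; s += runLen, k++) {
        pos[k] = s;
        end[k] = (s + runLen < hi) ? s + runLen : hi;
    }
    for (size_t out = lo; out < hi; out++) {
        int best = -1;
        for (int i = 0; i < k; i++)   /* pick smallest head among the runs */
            if (pos[i] < end[i] &&
                (best < 0 || src[pos[i]] < src[pos[best]]))
                best = i;
        dst[out] = src[pos[best]++];
    }
}

/* Sort data[0..len) using initial runs of runLen0 items and n-way merges.
 * Returns the number of passes (including pass 0), or -1 on malloc failure. */
int nWayMergeSort(int *data, size_t len, size_t runLen0, int n)
{
    assert(runLen0 >= 1 && n >= 2 && n <= MAXWAY);
    if (len == 0) return 0;
    int *tmp = malloc(len * sizeof(int));
    if (tmp == NULL) return -1;
    int *src = data, *dst = tmp;
    int passes = 1;

    /* Pass 0: sort each group of runLen0 items in memory */
    for (size_t lo = 0; lo < len; lo += runLen0) {
        size_t cnt = (lo + runLen0 < len) ? runLen0 : len - lo;
        qsort(data + lo, cnt, sizeof(int), cmpInt);
    }
    /* Merge passes: each pass merges groups of n runs into longer runs */
    for (size_t runLen = runLen0; runLen < len; runLen *= n, passes++) {
        for (size_t lo = 0; lo < len; lo += runLen * n) {
            size_t hi = (lo + runLen * n < len) ? lo + runLen * n : len;
            mergeRuns(src, dst, lo, hi, runLen, n);
        }
        int *t = src; src = dst; dst = t;     /* swap input/output files */
    }
    if (src != data)
        memcpy(data, src, len * sizeof(int));
    free(tmp);
    return passes;
}
```

A real external sort would read and write pages through B buffers instead of indexing arrays, but the run structure and the number of passes are the same.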

❖ Cost of n-Way Merge Sort

Consider file where b = 4096, B = 16 total buffers:

pass 0 produces 256 × 16-page sorted runs
pass 1
- performs 15-way merge of groups of 16-page sorted runs
- produces 18 × 240-page sorted runs (17 full runs, 1 short run)
pass 2
- performs 15-way merge of groups of 240-page sorted runs
- produces 2 × 3600-page sorted runs (1 full run, 1 short run)
pass 3
- performs 15-way merge of groups of 3600-page sorted runs
- produces 1 × 4096-page sorted run

(cf. two-way merge sort, which needs ceil(log₂4096) = 12 passes for this file)

❖ Cost of n-Way Merge Sort (cont)

Generalising from previous example ...

For b data pages and B buffers

first pass: reads/writes b pages, gives b₀ = ⌈b/B⌉ runs
then need ⌈log_n b₀⌉ passes until sorted, where n = B-1
each pass reads and writes b pages (i.e. 2.b page accesses)

Cost = 2.b.(1 + ⌈log_n b₀⌉), where b₀ = ⌈b/B⌉ and n = B-1
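
The formula can be evaluated with a small C sketch. The function names `nWayPasses()` and `nWayCost()` are hypothetical; the number of merge passes is obtained by repeatedly applying ⌈runs/n⌉ until one run remains, which yields the same value as the ceiling of the base-n logarithm of b₀ without using floating point.

```c
#include <assert.h>

/* Number of passes (including pass 0) to sort b pages with B buffers. */
int nWayPasses(long b, long B)
{
    assert(b >= 1 && B >= 3);
    long n = B - 1;                       /* merge fan-in */
    long runs = (b + B - 1) / B;          /* b0 = ceil(b/B) after pass 0 */
    int passes = 1;                       /* pass 0 */
    while (runs > 1) {
        runs = (runs + n - 1) / n;        /* each merge pass: ceil(runs/n) */
        passes++;
    }
    return passes;
}

/* Total page accesses: every pass reads and writes all b pages. */
long nWayCost(long b, long B)
{
    return 2 * b * nWayPasses(b, B);
}
```

For b = 4096 and B = 16 this gives 4 passes (pass 0 plus three merge passes) and 2 × 4096 × 4 = 32768 page accesses.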

❖ Sorting in PostgreSQL

Sort uses a merge-sort (from Knuth) similar to above:

backend/utils/sort/tuplesort.c
include/utils/sortsupport.h

Tuples are mapped to SortTuple structs for sorting:

each SortTuple contains pointer to tuple and sort key
no need to deal with actual Tuples during sort
unless multiple attributes used in sort

If all data fits into memory, sort using qsort().

If memory fills while reading, form "runs" and do disk-based sort.

❖ Sorting in PostgreSQL (cont)

Disk-based sort has phases:

divide input into sorted runs using HeapSort
merge using N buffers, one output buffer
N = as many buffers as workMem allows

Described in terms of "tapes" ("tape" ≅ sorted run)

Implementation of "tapes": backend/utils/sort/logtape.c

❖ Sorting in PostgreSQL (cont)

Sorting comparison operators are obtained via catalog (in Type.o):

// gets pointer to function via pg_operator
struct Tuplesortstate { ... SortTupleComparator ... };

// returns negative, zero, positive
ApplySortComparator(Datum datum1, bool isnull1,
                    Datum datum2, bool isnull2,
                    SortSupport sort_helper);

Flags indicate: ascending/descending, nulls-first/last.

ApplySortComparator() is PostgreSQL's version of tupCompare()

❖ The Projection Operation

Consider the query:

select distinct name,age from Employee;

If the Employee relation has four tuples such as:

(94002, John, Sales, Manager,   32)
(95212, Jane, Admin, Manager,   39)
(96341, John, Admin, Secretary, 32)
(91234, Jane, Admin, Secretary, 21)

then the result of the projection is:

(Jane, 21)   (Jane, 39)   (John, 32)

Note that duplicate tuples (e.g. (John,32)) are eliminated.

❖ The Projection Operation (cont)

The projection operation needs to:

scan the entire relation as input
- already seen how to do scanning
remove unwanted attributes in output tuples
- implementation depends on tuple internal structure
- essentially, make a new tuple with fewer attributes
  and where the values may be computed from existing attributes
eliminate any duplicates produced (if distinct)
- two approaches: sorting or hashing

❖ Sort-based Projection

Requires a temporary file/relation (Temp)

for each tuple T in Rel {
    T' = mkTuple([attrs],T)
    write T' to Temp
}

sort Temp on [attrs]

for each tuple T in Temp {
    if (T == Prev) continue
    write T to Result
    Prev = T
}
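
An in-memory C sketch of these three steps, applied to the Employee example, might look like this. The struct layouts, the field widths and the function `projectDistinct()` are assumptions; the standard `qsort()` stands in for the external sort, and arrays stand in for the Temp and Result relations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {                 /* full Employee tuple */
    int  id;
    char name[16];
    char dept[16];
    char job[16];
    int  age;
} Employee;

typedef struct {                 /* projected tuple: (name, age) */
    char name[16];
    int  age;
} NameAge;

/* Order projected tuples by name, then age, so duplicates are adjacent. */
static int cmpNameAge(const void *a, const void *b)
{
    const NameAge *x = a, *y = b;
    int c = strcmp(x->name, y->name);
    if (c != 0) return c;
    return (x->age > y->age) - (x->age < y->age);
}

/* select distinct name,age: project, sort, then drop adjacent duplicates.
 * Writes the distinct tuples to result[] and returns how many there are,
 * or -1 if memory for the temporary relation cannot be allocated. */
int projectDistinct(const Employee *rel, int n, NameAge *result)
{
    assert(n >= 0);
    NameAge *temp = malloc((n > 0 ? n : 1) * sizeof(NameAge));
    if (temp == NULL) return -1;

    for (int i = 0; i < n; i++) {            /* projection phase */
        strcpy(temp[i].name, rel[i].name);
        temp[i].age = rel[i].age;
    }
    qsort(temp, n, sizeof(NameAge), cmpNameAge);

    int nout = 0;
    for (int i = 0; i < n; i++) {            /* duplicate elimination */
        if (nout > 0 && cmpNameAge(&temp[i], &result[nout - 1]) == 0)
            continue;                        /* same as previous: skip */
        result[nout++] = temp[i];
    }
    free(temp);
    return nout;
}
```

On the four Employee tuples shown earlier, this yields the three tuples (Jane, 21), (Jane, 39), (John, 32).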

❖ Cost of Sort-based Projection

The costs involved are (assuming B=n+1 buffers for sort):

scanning original relation Rel: b_R (with c_R)
writing Temp relation: b_T (smaller tuples, c_T > c_R, sorted)
sorting Temp relation:
2.b_T.(1+ceil(log_n b₀)) where b₀ = ceil(b_T/B)
scanning Temp, removing duplicates: b_T
writing the result relation: b_Out (maybe fewer tuples)

Cost = sum of above = b_R + b_T + 2.b_T.(1+ceil(log_n b₀)) + b_T + b_Out
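
As a worked check of the formula, the following C sketch adds up the five components. The function name `sortProjectionCost()` is hypothetical, and the sizes used in the note after the code are made-up values chosen only to exercise the arithmetic.

```c
#include <assert.h>

/* Cost of sort-based projection in page accesses, given
 * bR = pages in Rel, bT = pages in Temp, bOut = pages in the result,
 * and B = total buffers available to the sort (n = B-1). */
long sortProjectionCost(long bR, long bT, long bOut, long B)
{
    assert(bT >= 1 && B >= 3);
    long n = B - 1;
    long runs = (bT + B - 1) / B;          /* b0 = ceil(bT/B) */
    long mergePasses = 0;                  /* ceil(log_n(b0)) */
    while (runs > 1) {
        runs = (runs + n - 1) / n;
        mergePasses++;
    }
    long sortCost = 2 * bT * (1 + mergePasses);
    /* scan Rel + write Temp + sort Temp + scan Temp + write result */
    return bR + bT + sortCost + bT + bOut;
}
```

With the hypothetical sizes b_R = 1000, b_T = 250, b_Out = 100 and B = 16, the sort needs b₀ = 16 runs and 2 merge passes, so the total is 1000 + 250 + 1500 + 250 + 100 = 3100 page accesses.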

❖ Hash-based Projection

Partitioning phase:

[Diagram:Pics/scansortproj/hash-project.png]

❖ Hash-based Projection (cont)

Duplicate elimination phase:

[Diagram:Pics/scansortproj/hash-project2.png]

❖ Hash-based Projection (cont)

Algorithm for both phases:

for each tuple T in relation Rel {
    T' = mkTuple([attrs],T)
    H = h1(T', n)
    B = buffer for partition[H]
    if (B full) write and clear B
    insert T' into B
}
for each partition P in 0..n-1 {
    for each tuple T in partition P {
        H = h2(T, n)
        B = buffer for hash value H
        if (T not in B) insert T into B
        // assumes B never gets full
    }
    write and clear all buffers
}
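
Both phases can be sketched in C on in-memory data. The sketch assumes the input tuples are already projected (the `mkTuple` step), uses arrays in place of partition files, and uses a small linear-probing table in place of the per-hash-value buffers. The hash functions `h1` and `h2`, the constants `NPART` and `NSLOTS`, and the function `hashProjectDistinct()` are all hypothetical choices for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NPART   4        /* hypothetical number of partitions (n)      */
#define NSLOTS  64       /* hypothetical slots in in-memory hash table */

typedef struct { char name[16]; int age; } NameAge;

static unsigned hashStr(const char *s, unsigned seed)
{
    unsigned h = seed;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h;
}

/* h1 chooses a partition; h2 (different seed) chooses a table slot. */
static unsigned h1(const NameAge *t)
{
    return (hashStr(t->name, 17u) + (unsigned)t->age) % NPART;
}
static unsigned h2(const NameAge *t)
{
    return (hashStr(t->name, 5381u) ^ (unsigned)t->age) % NSLOTS;
}

static int sameTuple(const NameAge *a, const NameAge *b)
{
    return a->age == b->age && strcmp(a->name, b->name) == 0;
}

/* Distinct projection of in[0..n). Returns the number of tuples written
 * to out[], or -1 on allocation failure. Requires n < NSLOTS so that the
 * in-memory table can never fill up. */
int hashProjectDistinct(const NameAge *in, int n, NameAge *out)
{
    assert(n >= 0 && n < NSLOTS);
    size_t cap = (n > 0) ? (size_t)n : 1;       /* room per partition */
    NameAge *buf = malloc(NPART * cap * sizeof(NameAge));
    if (buf == NULL) return -1;
    int cnt[NPART] = {0};
    int nout = 0;

    /* Phase 1: partition on h1; duplicates land in the same partition */
    for (int i = 0; i < n; i++) {
        unsigned p = h1(&in[i]);
        buf[p * cap + cnt[p]++] = in[i];
    }
    /* Phase 2: per partition, hash on h2 and keep only first occurrences */
    for (int p = 0; p < NPART; p++) {
        const NameAge *slot[NSLOTS] = {0};      /* table cleared per partition */
        for (int i = 0; i < cnt[p]; i++) {
            const NameAge *t = &buf[p * cap + i];
            unsigned s = h2(t);
            while (slot[s] != NULL && !sameTuple(slot[s], t))
                s = (s + 1) % NSLOTS;           /* linear probing */
            if (slot[s] == NULL) {              /* not seen before */
                slot[s] = t;
                out[nout++] = *t;
            }
        }
    }
    free(buf);
    return nout;
}
```

Unlike the sort-based method, the output order here depends on the hash functions, so the result is a set of distinct tuples in no particular order.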

❖ Cost of Hash-based Projection

The total cost is the sum of the following:

scanning original relation R: b_R
writing partitions: b_P (b_R vs b_P ?)
re-reading partitions: b_P
writing the result relation: b_Out

Cost = b_R + 2b_P + b_Out

To ensure that n is larger than the largest partition ...

use hash functions (h1,h2) with uniform spread
allocate at least sqrt(b_R)+1 buffers
if insufficient buffers, significant re-reading overhead
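
The cost formula and the buffer rule of thumb can be evaluated with a short C sketch. The function names `hashProjectionCost()` and `minBuffers()` are hypothetical, and the square root is rounded up using integer arithmetic, which is an assumption about how the rule should be applied when b_R is not a perfect square.

```c
#include <assert.h>

/* Cost = bR + 2*bP + bOut (page reads and writes). */
long hashProjectionCost(long bR, long bP, long bOut)
{
    assert(bR >= 0 && bP >= 0 && bOut >= 0);
    return bR + 2 * bP + bOut;
}

/* Smallest k with k*k >= bR, plus one: the sqrt(bR)+1 rule of thumb. */
long minBuffers(long bR)
{
    assert(bR >= 0);
    long k = 0;
    while (k * k < bR)
        k++;
    return k + 1;
}
```

For the made-up sizes b_R = 1000, b_P = 250 and b_Out = 100, the cost is 1000 + 500 + 100 = 1600 page accesses, and the rule suggests at least 33 buffers.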

❖ Projection on Primary Key

No duplicates, so the above approaches are not required.

Method:

bR = nPages(Rel)
for i in 0 .. bR-1 {
   P = read page i
   for j in 0 .. nTuples(P)-1 {
      T = getTuple(P,j)
      T' = mkTuple(pk, T)
      if (outBuf is full) write and clear
      append T' to outBuf
   }
}
if (nTuples(outBuf) > 0) write

❖ Index-only Projection

Can do projection without accessing data file iff ...

relation is indexed on (A₁,A₂,...A_n) (indexes described later)
projected attributes are a prefix of (A₁,A₂,...A_n)

Basic idea:

scan through index file (which is already sorted on attributes)
duplicates are already adjacent in index, so easy to skip

Cost analysis ...

index has b_i pages (where b_i ≪ b_R)
Cost = b_i reads + b_Out writes

❖ Comparison of Projection Methods

Difficult to compare, since they make different assumptions:

index-only: needs an appropriate index
hash-based: needs buffers and good hash functions
sort-based: needs only buffers ⇒ use as default

Best case scenario for each (assuming n+1 in-memory buffers):

index-only: b_i + b_Out ≪ b_R + b_Out
hash-based: b_R + 2.b_P + b_Out
sort-based: b_R + b_T + 2.b_T.ceil(log_n b₀) + b_T + b_Out

We normally omit b_Out, since each method produces the same result

❖ Projection in PostgreSQL

Code for projection forms part of execution iterators:

backend/executor/execQual.c

Functions involved with projection:

ExecProject(projInfo,...) ... extracts projected data
check_sql_fn_retval(...) ... makes new tuple via TargetList
ExecStoreTuple(newTuple,...) ... save tuple in buffer

plus many many others ...
