C: Costs, Implementing Scan, Sort, Projection

COMP9315 24T1 ♢ Lectures Part C ♢ [0/51]
❖ DBMS Architecture (revisited)

Implementation of relational operations in DBMS:

[Diagram:Pics/scansortproj/dbmsarch-relop.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [1/51]
❖ Relational Operations

DBMS core = relational engine, with implementations of

In this part of the course: Terminology reminder:
COMP9315 24T1 ♢ Lectures Part C ♢ [2/51]
❖ Cost Analyses

When showing the cost of operations, don't include Tr and Tw:

When comparing two methods for same query In counting reads and writes, assume minimal buffering
COMP9315 24T1 ♢ Lectures Part C ♢ [3/51]
❖ Cost Analyses (cont)

Two "dimensions of variation":

Each query method involves an operator and a file structure: As well as query costs, consider update costs (insert/delete).
COMP9315 24T1 ♢ Lectures Part C ♢ [4/51]
❖ Cost Analyses (cont)

SQL vs DBMS engine

COMP9315 24T1 ♢ Lectures Part C ♢ [5/51]
❖ Query Types

Type SQL RelAlg a.k.a.
Scan select * from R R -
Proj select x,y from R Proj[x,y]R -
Sort select * from R
order by x
Sort[x]R ord
Sel1 select * from R
where id = k
Sel[id=k]R one
Seln select * from R
where a = k
Sel[a=k]R -
Join1 select * from R,S
where R.id = S.r
R Join[id=r] S -

Different query classes exhibit different query processing behaviours.
COMP9315 24T1 ♢ Lectures Part C ♢ [6/51]
❖ File Structures

When describing file structures

[Diagram:Pics/scansortproj/one-page.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [7/51]
❖ File Structures (cont)

Consider three simple file structures:

All files are composed of b primary blocks/pages

[Diagram:Pics/scansortproj/file-struct0.png]

Some records in each page may be marked as "deleted".

COMP9315 24T1 ♢ Lectures Part C ♢ [8/51]
❖ Operation Costs Example

Heap file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-heap.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [9/51]
❖ Operation Costs Example (cont)

Sorted file with b = 4, c = 4:

[Diagram:Pics/scansortproj/file-sort1.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [10/51]
❖ Operation Costs Example (cont)

Hashed file with b = 3, c = 4, h(k) = k%3

[Diagram:Pics/scansortproj/file-hash.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [11/51]
❖ Scanning

Consider the query:

select * from Rel;

Operational view:

for each page P in file of relation Rel {
   for each tuple t in page P {
      add tuple t to result set
   }
}

Cost: read every data page once

Cost = b     (b page reads + ≤b page writes)

COMP9315 24T1 ♢ Lectures Part C ♢ [12/51]
❖ Scanning (cont)

Scan implementation when file has overflow pages, e.g.

[Diagram:Pics/scansortproj/file-struct1.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [13/51]
❖ Scanning (cont)

In this case, the implementation changes to:

for each page P in file of relation T {
    for each tuple t in page P {
        add tuple t to result set
    }
    for each overflow page V of page P {
        for each tuple t in page V {
            add tuple t to result set
}   }   }

Cost: read each data and overflow page once

Cost = b + bOv

where bOv = total number of overflow pages

COMP9315 24T1 ♢ Lectures Part C ♢ [14/51]
❖ Selection via Scanning

Consider a one query like:

select * from Employee where id = 762288;

In an unordered file, search for matching tuple requires:


[Diagram:Pics/scansortproj/file-search.png]


Guaranteed at most one answer; but could be in any page.

COMP9315 24T1 ♢ Lectures Part C ♢ [15/51]
❖ Selection via Scanning (cont)

Overview of scan process:

for each page P in relation Employee {
    for each tuple t in page P {
        if (t.id == 762288) return t
}   }

Cost analysis for one searching in unordered file

Page Costs:   Costavg = b/2     Costmin = 1     Costmax = b
COMP9315 24T1 ♢ Lectures Part C ♢ [16/51]
❖ Relation Copying

Consider an SQL statement like:

create table T as (select * from S);

Effectively, copies data from one table to another.

Process:

s = start scan of S
make empty relation T
while (t = next_tuple(s)) {
    insert tuple t into relation T
}

COMP9315 24T1 ♢ Lectures Part C ♢ [17/51]
❖ Relation Copying (cont)

Possible that T is smaller than S

[Diagram:Pics/scansortproj/copy-files.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [18/51]
❖ Relation Copying (cont)

In terms of existing relation/page/tuple operations:

Relation in;       // relation handle (incl. files)
Relation out;      // relation handle (incl. files)
int ipid,opid,tid; // page and record indexes
Record rec;        // current record (tuple)
Page ibuf,obuf;    // input/output file buffers

in = openRelation("S", READ);
out = openRelation("T", NEW|WRITE);
clear(obuf);  opid = 0;
for (ipid = 0; ipid < nPages(in); ipid++) {
    get_page(in, ipid, ibuf);
    for (tid = 0; tid < nTuples(ibuf); tid++) {
        rec = get_record(ibuf, tid);
        if (!hasSpace(obuf,rec)) {
            put_page(out, opid++, obuf);
            clear(obuf);
        }
        insert_record(obuf,rec);
}   }
if (nTuples(obuf) > 0) put_page(out, opid, obuf);

COMP9315 24T1 ♢ Lectures Part C ♢ [19/51]
❖ Scanning Relations

Example: simple scan of a table ...

select name from Employee

implemented as:

DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t;  // current tuple
while ((t = next_tuple(s)) != NULL)
{
   char *name = getStrField(t,2);
   printf("%s\n", name);
}

COMP9315 24T1 ♢ Lectures Part C ♢ [20/51]
❖ Scanning Relations (cont)

Consider the following simple Scan data structure

typedef struct {
   Relation rel;
   Page     *curPage;  // Page buffer
   int      curPID;    // current pid
   int      curTID;    // current tid
} ScanData;
typedef ScanData *Scan;

Scan start_scan(Relation rel)
{
	Scan new = malloc(ScanData);
	new->rel = rel;
	new->curPage = get_page(rel,0);
	new->curPID = 0;
	new->curTID = 0;
	return new;
}

COMP9315 24T1 ♢ Lectures Part C ♢ [21/51]
❖ Scanning Relations (cont)

Implementation of next_tuple() function:

Tuple next_tuple(Scan s)
{
	if (s->curTID == nTuples(s->curPage)) {
		// finished cur page, get next
		if (s->curPID == nPages(s->rel))
			return NULL;
		s->curPID++;
		s->curPage = get_page(s->rel, s->curPID);
		s->curTID =0;
	}
	Record r = get_record(s->curPage, s->curTID);
	s->curTID++;
	return makeTuple(s->rel, r);
}

COMP9315 24T1 ♢ Lectures Part C ♢ [22/51]
❖ Scanning in PostgreSQL


Scanning defined in: backend/access/heap/heapam.c

Implements iterator data/operations:

COMP9315 24T1 ♢ Lectures Part C ♢ [23/51]
❖ Scanning in PostgreSQL (cont)


typedef HeapScanDescData *HeapScanDesc;

typedef struct HeapScanDescData
{
  // scan parameters 
  Relation      rs_rd;        // heap relation descriptor 
  Snapshot      rs_snapshot;  // snapshot ... tuple visibility 
  int           rs_nkeys;     // number of scan keys 
  ScanKey       rs_key;       // array of scan key descriptors 
  ...
  // state set up at initscan time 
  PageNumber    rs_npages;    // number of pages to scan 
  PageNumber    rs_startpage; // page # to start at 
  ...
  // scan current state, initally set to invalid 
  HeapTupleData rs_ctup;      // current tuple in scan
  PageNumber    rs_cpage;     // current page # in scan
  Buffer        rs_cbuf;      // current buffer in scan
   ...
} HeapScanDescData;

COMP9315 24T1 ♢ Lectures Part C ♢ [24/51]
❖ Scanning in other File Structures


Above examples are for heap files

Other access file structures in PostgreSQL:
COMP9315 24T1 ♢ Lectures Part C ♢ [25/51]
❖ The Sort Operation

Sorting is explicit in queries only in the order by clause

select * from Students order by name;

Sorting is used internally in other operations:

Sort methods such as quicksort are designed for in-memory data.

For large data on disks, need external sorts such as merge sort.

COMP9315 24T1 ♢ Lectures Part C ♢ [26/51]
❖ Two-way Merge Sort

Example:

[Diagram:Pics/scansortproj/two-way-ex2.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [27/51]
❖ Two-way Merge Sort (cont)

Requires three in-memory buffers:

[Diagram:Pics/scansortproj/two-way-buf.png]


Assumption: cost of Merge operation on two in-memory buffers ≅ 0.

COMP9315 24T1 ♢ Lectures Part C ♢ [28/51]
❖ Comparison for Sorting

Above assumes that we have a function to compare tuples.

Needs to understand ordering on different data types.

Need a function tupCompare(r1,r2,f) (cf. C's strcmp)

int tupCompare(r1,r2,f)
{
   if (r1.f < r2.f) return -1;
   if (r1.f > r2.f) return 1;
   return 0;
}

Assume =, <, > are available for all attribute types.

COMP9315 24T1 ♢ Lectures Part C ♢ [29/51]
❖ Comparison for Sorting (cont)

In reality, need to sort on multiple attributes and ASC/DESC, e.g.

-- example multi-attribute sort
select * from Students
order by age desc, year_enrolled

Sketch of multi-attribute sorting function

int tupCompare(r1,r2,criteria)
{
   foreach (f,ord) in criteria {
      if (ord == ASC) {
         if (r1.f < r2.f) return -1;
         if (r1.f > r2.f) return 1;
      }
      else {
         if (r1.f > r2.f) return -1;
         if (r1.f < r2.f) return 1;
      }
   }
   return 0;
}

COMP9315 24T1 ♢ Lectures Part C ♢ [30/51]
❖ Cost of Two-way Merge Sort

For a file containing b data pages:

Gives total cost:   2.b.ceil(log2b)

Example: Relation with r=105 and c=50     b=2000 pages.

Number of passes for sort:   ceil(log22000)  =  11

Reads/writes entire file 11 times!    Can we do better?

COMP9315 24T1 ♢ Lectures Part C ♢ [31/51]
❖ n-Way Merge Sort

Initial pass uses: B total buffers

[Diagram:Pics/scansortproj/n-way-buf-pass0.png]

Reads B pages at a time, sorts in memory, writes out in order

COMP9315 24T1 ♢ Lectures Part C ♢ [32/51]
❖ n-Way Merge Sort (cont)

Merge passes use:   B-1 = n  input buffers,   1  output buffer

[Diagram:Pics/scansortproj/n-way-buf.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [33/51]
❖ n-Way Merge Sort (cont)

Method:


// Produce B-page-long runs
for each group of B pages in Rel {
    read B pages into memory buffers
    sort group in memory
    write B pages out to Temp
}
// Merge runs until everything sorted
numberOfRuns = ⌈b/B⌉
while (numberOfRuns > 1) {
    // n-way merge, where n=B-1
    for each group of n runs in Temp {
        merge into a single run via input buffers
        write run to newTemp via output buffer
    }
    numberOfRuns = ⌈numberOfRuns/n⌉
    Temp = newTemp // swap input/output files
}

COMP9315 24T1 ♢ Lectures Part C ♢ [34/51]
❖ Cost of n-Way Merge Sort

Consider file where b = 4096, B = 16 total buffers:

(cf. two-way merge sort which needs 11 passes)
COMP9315 24T1 ♢ Lectures Part C ♢ [35/51]
❖ Cost of n-Way Merge Sort (cont)

Generalising from previous example ...

For b data pages and B buffers


Cost = 2.b.(1 + ⌈lognb0⌉),   where b0 = ⌈b/B⌉ and n = B-1
COMP9315 24T1 ♢ Lectures Part C ♢ [36/51]
❖ Sorting in PostgreSQL

Sort uses a merge-sort (from Knuth) similar to above:

Tuples are mapped to SortTuple structs for sorting: If all data fits into memory, sort using qsort().

If memory fills while reading, form "runs" and do disk-based sort.

COMP9315 24T1 ♢ Lectures Part C ♢ [37/51]
❖ Sorting in PostgreSQL (cont)

Disk-based sort has phases:

Described in terms of "tapes" ("tape" sorted run)

Implementation of "tapes": backend/utils/sort/logtape.c

COMP9315 24T1 ♢ Lectures Part C ♢ [38/51]
❖ Sorting in PostgreSQL (cont)

Sorting comparison operators are obtained via catalog (in Type.o):

// gets pointer to function via pg_operator
struct Tuplesortstate { ... SortTupleComparator ... };

// returns negative, zero, positive
ApplySortComparator(Datum datum1, bool isnull1,
                    Datum datum2, bool isnull2,
                    SortSupport sort_helper);

Flags indicate: ascending/descending, nulls-first/last.

ApplySortComparator() is PostgreSQL's version of tupCompare()

COMP9315 24T1 ♢ Lectures Part C ♢ [39/51]
❖ The Projection Operation

Consider the query:

select distinct name,age from Employee;

If the Employee relation has four tuples such as:

(94002, John, Sales, Manager,   32)
(95212, Jane, Admin, Manager,   39)
(96341, John, Admin, Secretary, 32)
(91234, Jane, Admin, Secretary, 21)

then the result of the projection is:

(Jane, 21)   (Jane, 39)   (John, 32)

Note that duplicate tuples (e.g. (John,32)) are eliminated.

COMP9315 24T1 ♢ Lectures Part C ♢ [40/51]
❖ The Projection Operation (cont)

The projection operation needs to:

  1. scan the entire relation as input
    • already seen how to do scanning

  2. remove unwanted attributes in output tuples
    • implementation depends on tuple internal structure
    • essentially, make a new tuple with fewer attributes
      and where the values may be computed from existing attributes

  3. eliminate any duplicates produced   (if distinct)
    • two approaches: sorting or hashing
COMP9315 24T1 ♢ Lectures Part C ♢ [41/51]
❖ Sort-based Projection

Requires a temporary file/relation (Temp)

for each tuple T in Rel {
    T' = mkTuple([attrs],T)
    write T' to Temp
}

sort Temp on [attrs]

for each tuple T in Temp {
    if (T == Prev) continue
    write T to Result
    Prev = T
}

COMP9315 24T1 ♢ Lectures Part C ♢ [42/51]
❖ Cost of Sort-based Projection

The costs involved are (assuming B=n+1 buffers for sort):

Cost = sum of above = bR + bT + 2.bT.(1+ceil(lognb0)) + bT + bOut
COMP9315 24T1 ♢ Lectures Part C ♢ [43/51]
❖ Hash-based Projection

Partitioning phase:

[Diagram:Pics/scansortproj/hash-project.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [44/51]
❖ Hash-based Projection (cont)

Duplicate elimination phase:

[Diagram:Pics/scansortproj//hash-project2.png]

COMP9315 24T1 ♢ Lectures Part C ♢ [45/51]
❖ Hash-based Projection (cont)

Algorithm for both phases:

for each tuple T in relation Rel {
    T' = mkTuple([attrs],T)
    H = h1(T', n)
    B = buffer for partition[H]
    if (B full) write and clear B
    insert T' into B
}
for each partition P in 0..n-1 {
    for each tuple T in partition P {
        H = h2(T, n)
        B = buffer for hash value H
        if (T not in B) insert T into B
        // assumes B never gets full
    }
    write and clear all buffers
}

COMP9315 24T1 ♢ Lectures Part C ♢ [46/51]
❖ Cost of Hash-based Projection

The total cost is the sum of the following:

Cost = bR + 2bP + bOut

To ensure that n is larger than the largest partition ...

COMP9315 24T1 ♢ Lectures Part C ♢ [47/51]
❖ Projection on Primary Key

No duplicates, so the above approaches are not required.

Method:

bR = nPages(Rel)
for i in 0 .. bR-1 {
   P = read page i
   for j in 0 .. nTuples(P) {
      T = getTuple(P,j)
      T' = mkTuple(pk, T)
      if (outBuf is full) write and clear
      append T' to outBuf
   }
}
if (nTuples(outBuf) > 0) write

COMP9315 24T1 ♢ Lectures Part C ♢ [48/51]
❖ Index-only Projection

Can do projection without accessing data file iff ...

Basic idea:

Cost analysis ...
COMP9315 24T1 ♢ Lectures Part C ♢ [49/51]
❖ Comparison of Projection Methods

Difficult to compare, since they make different assumptions:

Best case scenario for each (assuming n+1 in-memory buffers):

We normally omit bOut, since each method produces the same result
COMP9315 24T1 ♢ Lectures Part C ♢ [50/51]
❖ Projection in PostgreSQL

Code for projection forms part of execution iterators:

Functions involved with projection: plus many many others ...
COMP9315 24T1 ♢ Lectures Part C ♢ [51/51]


Produced: 30 Apr 2024