Scan, Sort, Project
Implementing Relational Operations |
Relational Operations | 2/93 |
DBMS core = relational engine, with implementations of
... Relational Operations | 3/93 |
Implementation of relational operations in DBMS:
... Relational Operations | 4/93 |
All relational operations return a set of tuples.
Can represent a typical operation programmatically as:
ResultSet = {}// initially an empty set while (t = nextRelevantTuple()) {// format tuple according to projection t' = formatResultTuple(t,Projection)// add next relevant tuple to result set ResultSet = ResultSet ∪ t' } return ResultSet
All of the hard work is in the nextRelevantTuple()
... Relational Operations | 5/93 |
nextRelevantTuple()
nextRelevantTuple()
ResultSet
ResultSet
... Relational Operations | 6/93 |
There are three "dimensions of variation" in this system:
Query Types | 7/93 |
Queries fall into a number of classes:
SQL | RelAlg | a.k.a. | ||||
select * from R |
R | - | ||||
select from R |
Proj[x,y]R | - | ||||
select * from R order by |
Sort[x]R | ord |
... Query Types | 8/93 |
SQL | RelAlg | a.k.a. | ||||
select * from R where id = |
Sel[id=k]R | one | ||||
select * from R where a = |
Sel[a=k]R | - | ||||
select * from R where a= and b= |
Sel[a=j ∧ b=k]R | pmr | ||||
select * from R where a> and a< |
Sel[a>j ∧ a<k]R | rng | ||||
select * from R where a> and a< and b> and b< |
Sel[...]R | space |
... Query Types | 9/93 |
SQL | RelAlg | a.k.a. | ||||
select * from R,S where R.id = S.r |
R Join[id=r] S | - | ||||
select * from R,S where R. =S. and R. =S. |
R Join[v=w ∧ x=y] S | - | ||||
select * from R,S where R. S. |
R Join[...] S | - | ||||
select * from R where R.* |
R ≅ Obj | sim |
Cost Models |
Cost Models | 11/93 |
An important aspect of this course is
Rather, we attempt to develop cost models
... Cost Models | 12/93 |
Assumptions in our cost models:
... Cost Models | 13/93 |
In developing cost models, we also assume:
... Cost Models | 14/93 |
Typical values for measures used in cost models:
Symbol | E.g. Value | |||
r | 106 | |||
R | 128 bytes | |||
b | 105 | |||
B | 8192 bytes | |||
c | 60 | |||
Tr,Tw | 10 msec | |||
- | ≅ 0 | |||
answers for query q |
bq | ≥ 0 |
... Cost Models | 15/93 |
With buffer pool, request_page()
Instead, we assume no buffer pool (worst-case cost analysis)
Use either readPage()
get_page()
// Assume data types for Relation, Page get_page(Relation r, int pid, Page buf) { buf = readPage(r.file, pid); } Page readPage(File f, int pid) { Page buf = newPageBuffer(); lseek(f, pid*PAGE_SIZE, SEEK_SET); read(f, buf, PAGE_SIZE); return buf; }
Example file structures | 16/93 |
When describing file structures
... Example file structures | 17/93 |
Consider three simple file structures:
Some records in each page may be marked as "deleted".
... Example file structures | 18/93 |
Heap file with b = 4, c = 4:
... Example file structures | 19/93 |
Sorted file with b = 4, c = 4:
... Example file structures | 20/93 |
Hashed file with b = 3, c = 4, h(k) = k%3
... Example file structures | 21/93 |
Indexed file with b = 4, c = 4, bi = 2, ci = 8:
Scanning |
Scanning | 23/93 |
Consider the query:
select * from T;
Conceptually:
for each tuple t in relation T { add tuple t to result set }
... Scanning | 24/93 |
Implemented via iteration over file containing T:
for each page P in file of relation T { for each tuple t in page P { add tuple t to result set } }
Cost: read every data page once
Cost = b.Tr
... Scanning | 25/93 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle int p;// input file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } }
... Scanning | 26/93 |
Scan implementation when file has overflow pages, e.g.
... Scanning | 27/93 |
In this case, the implementation changes to:
for each page P in file of relation T { for each tuple t in page P { add tuple t to result set } for each overflow page V of page P { for each tuple t in page V { add tuple t to result set } } }
Cost: read each data and overflow page once
Cost = (b + bOv).Tr
where bOv = total number of overflow pages
... Scanning | 28/93 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle File ovf;// overflow file handle int p;// input file page number int ovp;// overflow file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) ovf = openFile(ovFileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); while (ovp != NO_PAGE) { buf = readPage(ovf,ovp); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); } }
Cost: read data+ovflow page Cost = (b+bov).Tr
Selection via Scanning | 29/93 |
Consider a one query like:
select * from Employee where id = 762288;
In an unordered file, search for matching record requires:
Guaranteed at most one answer; could be in any page.
... Selection via Scanning | 30/93 |
In terms of file operations (assuming var delcarations as before):
inf = openFile(fileName("Employee"), READ); for (p = 0; p < nPages(inf); p++) buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); if (getField(t,"id") == 762288) return t; } }
For different selection condition, simply replace
(getField(t,"id")==762288)
... Selection via Scanning | 31/93 |
Cost analysis for one searching in unordered file
File Copying | 32/93 |
Consider an SQL statement like:
create table T as (select * from S);
Effectively, copies data from one file to another.
Conceptually:
make empty relation T for each tuple t in relation S { append tuple t to relation T }
... File Copying | 33/93 |
In terms of previously defined relation/page/tuple operations:
Relation in;// relation handle (incl. files) Relation out; // relation handle (incl. files) int ipid,opid; // input/output page indexes int tid;// record/tuple index on current page Record rec;// current record (tuple) Page ibuf,obuf;// input/output file buffers in = openRelation("S", READ); out = openRelation("T", NEW|WRITE); clear(obuf); opid = 0; for (ipid = 0; ipid < nPages(in); ipid++) { get_page(in, ipid, ibuf); for (tid = 0; tid < nTuples(ibuf); tid++) { rec = get_record(ibuf, tid); if (!hasSpace(obuf,rec)) { put_page(out, opid++, obuf); clear(obuf); } insert_record(obuf,rec); } } if (nTuples(obuf) > 0) put_page(out, opid, obuf);
Exercise 1: Cost of Relation Copy | 34/93 |
Analyse cost for relation copying:
Give cost in terms of #pages read + #pages written
Iterators | 35/93 |
Higher-levels of DBMS are given a view of scanning as:
cursor = initScan(relName,condition); while (tup = getNextTuple(cursor)) { process tup } endScan(cursor);
Also known as iterator.
... Iterators | 36/93 |
Implementation of simple scan iterator (via file operations):
typedef struct { File inf;// data file handle Buffer buf;// input buffer int curp;// current page number int curi;// current record number Expr cond;// representation of condition } Cursor;
... Iterators | 37/93 |
Implementation of simple scan iterator (continued):
Cursor *initScan(char *rel, char *cond) { Cursor *c; c = malloc(sizeof(Cursor)); c->inf = openFile(fileName(rel),READ); c->buf = readPage(c->inf,0); c->curp = 0; c->curi = 0; c->cond = makeTestableCondition(cond); return c; } void endScan(Course *c) { closeFile(c->inf); freeExpr(c->cond); free(c); }
... Iterators | 38/93 |
Implementation of simple scan iterator (continued):
Tuple getNextTuple(Cursor *c) { getNextTuple: if (c->curi < nTuples(c->buf)) return getTuple(c->buf, c->curi++); else {// no more tuples in this page; get next page c->curp++; if (c->curp == nPages(c->inf)) return NULL;// no more pages else { c->buf = readPage(c->inf,c->curp); c->curi = 0; goto getNextTuple; } } }
... Iterators | 39/93 |
Implementation of full iterator interface via file operations:
typedef struct { File inf;// data file handle File ovf;// overflow file handle Buffer buf;// input buffer int curp;// current page number int curop;// current ovflow page number int curi;// current record number Expr cond;// representation of condition } Cursor;
... Iterators | 40/93 |
Implementation of full iterator interface (continued):
Cursor *initScan(char *rel, char *cond) { Cursor *c; c = malloc(sizeof(Cursor)); c->inf = openFile(fileName(rel),READ); c->ovf = openFile(ovFileName(rel),READ); c->buf = readPage(c->inf,0); c->curp = 0; c->curop = NO_PAGE; c->curi = 0; c->cond = makeTestableCondition(cond) return c; } void endScan(Course *c) { closeFile(c->inf); if (c->ovf) closeFile(c->ovf); freeExpr(c->cond); free(c); }
... Iterators | 41/93 |
Implementation of scanning interface (continued):
Tuple getNextTuple(Cursor *c) { getNextTuple: if (c->curi < nTuples(c->buf)) return getTuple(c->buf, c->curi++); else {// no more tuples in this page; get next page if (c->curop == NO_PAGE) { c->curop = ovflow(c->buf); if (c->curop != NO_PAGE) {// start ovflow chain scan getNextOvPage: c->buf = readPage(c->ovf,c->curop); c->curi = 0; goto getNextTuple; } else { getNextDataPage: c->curp++; if (c->curp == nPages(c->inf)) return NULL;// no more pages else { c->buf = readPage(c->inf,c->curp); c->curi = 0; goto getNextTuple; } } } else {// continue ovflow chain scan c->curop = ovflow(c->buf); if (c->curop == NO_PAGE) goto getNextDataPage; else goto getNextOvPage; } } }
Scanning in PostgreSQL | 42/93 |
Scanning defined in: /backend/access/heap/heapam.c
Implements iterator data/operations:
HeapScanDesc
scan = heap_beginscan(rel,...,nkeys,keys)
... uses initscan()
tup = heap_getnext(scan, direction)
... uses heapgettup()
heap_endscan(scan)
scan
res = HeapKeyTest(tuple,...,nkeys,keys)
... performs ScanKeys
... Scanning in PostgreSQL | 43/93 |
typedef struct HeapScanDescData {// scan parameters Relation rs_rd;// heap relation descriptor Snapshot rs_snapshot;// snapshot ... tuple visibility int rs_nkeys;// number of scan keys ScanKey rs_key;// array of scan key descriptors ...// state set up at initscan time PageNumber rs_npages;// number of pages to scan PageNumber rs_startpage;// page # to start at ...// scan current state, initally set to invalid HeapTupleData rs_ctup;// current tuple in scan PageNumber rs_cpage;// current page # in scan Buffer rs_cbuf;// current buffer in scan ... } HeapScanDescData;
Scanning in other File Structures | 44/93 |
Above examples are for heap files
btree
hash
gist
gin
Sorting |
The Sort Operation | 46/93 |
Sorting is explicit in queries only in the order by
select * from Students order by name;
More important, sorting is used internally in other operations:
group by
External Sorting | 47/93 |
Sort methods such as quicksort are designed for in-memory data.
For data on disks, need external sorting techniques.
The standard external sorting method (merge sort) works by
... External Sorting | 48/93 |
Sorting tuples within pages
Two-way Merge Sort | 49/93 |
Requires three in-memory buffers:
Assumption: cost of merge on two buffers ≅ 0.
... Two-way Merge Sort | 50/93 |
Two-way merge-sort method:
read each page into buffer, sort it, write it numberOfRuns = b; runLength = 1; while (numberOfRuns > 1) { for each pair of adjacent runs { merge the pair of runs to output, by - read pages from runs into input buffers, one page at a time - apply merge algorithm to transfer tuples to output buffer - flush output buffer when full and when merge finished } numberOfRuns = numberOfRuns / 2 runLength = runLength * 2 }
... Two-way Merge Sort | 51/93 |
Example:
... Two-way Merge Sort | 52/93 |
Two-way merge-sort method (improved):
numberOfRuns = b; runLength = 1; while (numberOfRuns > 1) { for each pair of adjacent runs { merge the pair of runs to output, by - read pages from runs into input buffers, one page at a time - if (runLength == 1) sort contents of each input buffer - apply merge algorithm to transfer tuples to output buffer - flush output buffer when full and when merge finished } numberOfRuns = numberOfRuns / 2 runLength = runLength * 2 }
Avoids first pass to sort contents of individual pages.
... Two-way Merge Sort | 53/93 |
Consider file where b = 2k:
... Two-way Merge Sort | 54/93 |
Example:
Merging Two Sorted Pages | 55/93 |
Method using operations on files and buffers:
// Pre: buffers B1,B2; outfile position op // Post: tuples from B1,B2 output in order i1 = i2 = 0; clear(Out); R1 = getTuple(B1,i1); R2 = getTuple(B2,i2); while (i1 < nTuples(B1) && i2 < nTuples(B2)) { if (lessThan(R1,R2)) { addTuple(R1,Out); i1++; R1 = getTuple(B1,i1); } else { addTuple(R2,Out); i2++; R2 = getTuple(B2,i2); } if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } for (i1=i1; i1 < nTuples(B1); i1++) { addTuple(getTuple(B1,i1), Out); if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } for (i2=i2; i2 < nTuples(B2); i2++) { addTuple(getTuple(B2,i2), Out); if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } if (nTuples(Out) > 0) writePage(outf,op,Out);
Merging Runs vs Merging Pages | 56/93 |
In the above, we merged two input buffers.
In general, we need to merge sorted "runs" of pages.
The only difference that this makes to the above method:
R1 = getTuple(B1,i1);
becomes
if (i1 == nTuples(B1)) { B1 = readPage(inf,ip++); i1 = 0; } R1 = getTuple(B1,i1);
Comparison for Sorting | 57/93 |
Above assumes that we have a function to compare tuples.
Mechanism needs to be generic, to handle all of:
select * from Employee order by eid; select * from Employee order by name; select * from Employee order by age;
Envisage a function tupCompare(r1,r2,f)
strcmp
r1
r2
f
r1.f < r2.f
r1.f > r2.f
r1.f == r2.f
... Comparison for Sorting | 58/93 |
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
-- example multi-attribute sort select * from Students order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria) { foreach (f,ord) in criteria { if (ord == ASC) { if (r1.f < r2.f) return -1; if (r1.f > r2.f) return 1; } else { if (r1.f > r2.f) return -1; if (r1.f < r2.f) return 1; } } return 0; }
Cost of Two-way Merge Sort | 59/93 |
For a file containing b data pages:
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ⌈log22000⌉ = 11
Reads/writes entire file 11 times! Can we do better?
n-Way Merge Sort | 60/93 |
Initial pass uses: B total buffers
... n-Way Merge Sort | 61/93 |
Merge passes use: n input buffers, 1 output buffer
... n-Way Merge Sort | 62/93 |
Method:
// Produce B-page-long runs for each group of B pages in Rel { read pages into memory buffers sort group in memory write pages out to Temp }// Merge runs until everything sorted // n-way merge, where n=B-1 numberOfRuns = ⌈b/B⌉ while (numberOfRuns > 1) { for each group of n runs in Temp { merge into a single run via input buffers write run to newTemp via output buffer } numberOfRuns = ⌈numberOfRuns/n⌉ Temp = newTemp// swap input/output files }
... n-Way Merge Sort | 63/93 |
Method for merging n runs (n input buffers, 1 output buffer):
for i = 1..n { read first page of run[i] into a buffer[i] set current tuple cur[i] to first tuple in buffer[i] } while (more than 1 run still has tuples) { s = find buffer with smallest tuple as cur[i] copy tuple cur[i] to output buffer if (output buffer full) { write it and clear it} advance cur[i] to next tuple if (no more tuples in buffer[i]) { if (no more pages in run[i]) mark run[i] as complete else { read next page of run[i] into buffer[i] set cur[i] to first tuple in buffer[i] } } } copy tuples in non-empty buffer to output
Cost of n-Way Merge Sort | 64/93 |
Consider file where b = 4096, B = 16 total buffers:
... Cost of n-Way Merge Sort | 65/93 |
Generalising from previous example ...
For b data pages and B buffers
... Cost of n-Way Merge Sort | 66/93 |
Costs (number of passes) for varying b and B (n=B-1):
B=3 | B=16 | B=128 | ||||
7 | 2 | 1 | ||||
10 | 3 | 2 | ||||
13 | 4 | 2 | ||||
17 | 5 | 3 | ||||
20 | 5 | 3 |
In the above, we assume that
Sorting in PostgreSQL | 67/93 |
Sort uses a merge-sort (from Knuth) similar to above:
SortTuple
qsort()
If memory fills while reading, form "runs" and do disk-based sort.
... Sorting in PostgreSQL | 68/93 |
Disk-based sort has phases:
workMem
Effectively, a "tape" is a sorted run.
Implementation of "tapes": backend/utils/sort/logtape.c
... Sorting in PostgreSQL | 69/93 |
Sorting comparison operators are obtained via catalog (in Type.o
// gets pointer to function via pg_operator struct Tuplesortstate { ... SortTupleComparator ... };// returns negative, zero, positive ApplySortComparator(Datum datum1, bool isnull1, Datum datum2, bool isnull2, SortSupport sort_helper);
Flags indicate: ascending/descending, nulls-first/last.
ApplySortComparator()
tupCompare()
Implementing Projection |
The Projection Operation | 71/93 |
Consider the query:
select distinct name,age from Employee;
If the Employee
(94002, John, Sales, Manager, 32) (95212, Jane, Admin, Manager, 39) (96341, John, Admin, Secretary, 32) (91234, Jane, Admin, Secretary, 21)
then the result of the projection is:
(Jane, 21) (Jane, 39) (John, 32)
Note that duplicate tuples (e.g. (John,32)
... The Projection Operation | 72/93 |
The projection operation needs to:
(straightforward, whichever file organisation is used)
(straightforward, manipulating internal record structure)
(not as simple as other operations ...)
Removing Attributes | 73/93 |
Projecting attributes involves creating a new tuple, using only some values from the original tuple.
Precisely how to achieve this depends on tuple internals.
Removing attributes from fixed-length tuples:
... Removing Attributes | 74/93 |
Removing attributes from variable-length tuples:
Sort-based Projection | 75/93 |
Overview of the method:
Rel
Temp
... Sort-based Projection | 76/93 |
The method, in detail:
// Inputs: relName, attrList inf = openFile(fileName(relName),READ); tempf = openFile(tmpName,CREATE); clear(outbuf); j = 0; for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { tup = getTuple(buf,i); newtup = project(tup,attrList); addTuple(newtup,outbuf); if (isFull(outbuf)) { writePage(tempf,j++,outbuf); clear(outbuf); } } } mergeSort(tempf);
... Sort-based Projection | 77/93 |
tempf = openFile(tmpName,READ); outf = openFile(result,CREATE); clear(outbuf); prev = EMPTY; j = 0; for (p = 0; p < nPages(tempf); p++) { buf = readPage(tempf,p); for (i = 0; i < nTuples(buf); i++) { tup = getTuple(buf,i); if (tupCompare(tup,prev) != 0) { addTuple(tup,outbuf); if (isFull(outbuf)) { writePage(outf,j++,outbuf); clear(outbuf); } prev = tup; } } }
Cost of Sort-based Projection | 78/93 |
The costs involved are (assuming B=n+1 buffers for sort):
Rel
Temp
Temp
Temp
Note that we often ignore cost of writing the result; especially when comparing different algorithms for the same relational operation.
Improving Sort-based Projection | 79/93 |
Some approaches for improving the cost:
Hash-based Projection | 80/93 |
Overview of the method:
Rel
Hash Functions | 81/93 |
Hash function h(tuple,range)
mod
... Hash Functions | 82/93 |
Usual approach in hash function:
unsigned int hash(char *val, int b) { char *cp; unsigned int v, sum = 0; for (c = val; *c != '\0'; c++) { v = *c + (*(c+1) << 8); sum += (sum + 2153*v) % 19937; } return(sum % b); }
Hash-based Projection | 83/93 |
Partitioning phase:
... Hash-based Projection | 84/93 |
Algorithm for partitioning phase:
for each page P in relation Rel { for each tuple t in page P { t' = project(t, attrList) H = h1(t', B-1) write t' to partition[H] } }
Each partition could be implemented as a simple data file.
... Hash-based Projection | 85/93 |
Duplicate elimination phase:
... Hash-based Projection | 86/93 |
Algorithm for duplicate elimination phase:
for each partition P in 0..B-2 { for each tuple t in partition P { H = h2(t, B-1) if (!(t occurs in buffer[H])) append t to buffer H } output contents of all buffers clear all buffers }
Cost of Hash-based Projection | 87/93 |
The total cost is the sum of the following:
Rel
... Cost of Hash-based Projection | 88/93 |
If the largest partition had more than B-1 pages
Index-only Projection | 89/93 |
Under the conditions:
Basic idea:
... Index-only Projection | 90/93 |
Method:
for each entry I in index file { tup = project(I.key, attrList) if (tupCompare(tup,prev) != 0) { addTuple(outbuf,tup) if (isFull(outbuf)) { writePage(outf,op++,outbuf); clear(outbuf); } prev = tup; } }
"for each index entry": loop over index pages and loop over entries in each page
Cost of Index-only Projection | 91/93 |
Assume that the index (see details later):
Index
Comparison of Projection Methods | 92/93 |
Difficult to compare, since they make different assumptions:
Best case scenario for each (assuming B+1 in-memory buffers):
Projection in PostgreSQL | 93/93 |
Code for projection forms part of execution iterators:
ExecProject(projInfo,...)
ExecTargetList(...)
ExecStoreTuple(newTuple,...)
Produced: 24 Jun 2019