Scan, Sort, Project
| Implementing Relational Operations |
| Relational Operations | 2/93 |
DBMS core = relational engine, with implementations of
| ... Relational Operations | 3/93 |
Implementation of relational operations in DBMS:
| ... Relational Operations | 4/93 |
All relational operations return a set of tuples.
Can represent a typical operation programmatically as:
ResultSet = {} // initially an empty set
while (t = nextRelevantTuple()) {
// format tuple according to projection
t' = formatResultTuple(t,Projection)
// add next relevant tuple to result set
ResultSet = ResultSet ∪ t'
}
return ResultSet
All of the hard work is in the nextRelevantTuple()
| ... Relational Operations | 5/93 |
nextRelevantTuple()
nextRelevantTuple()ResultSetResultSet
| ... Relational Operations | 6/93 |
There are three "dimensions of variation" in this system:
| Query Types | 7/93 |
Queries fall into a number of classes:
| SQL | RelAlg | a.k.a. | ||||
select * from R |
R | - | ||||
select from R |
Proj[x,y]R | - | ||||
select * from Rorder by |
Sort[x]R | ord |
| ... Query Types | 8/93 |
| SQL | RelAlg | a.k.a. | ||||
select * from Rwhere id = |
Sel[id=k]R | one | ||||
select * from Rwhere a = |
Sel[a=k]R | - | ||||
select * from Rwhere a= and b= |
Sel[a=j ∧ b=k]R | pmr | ||||
select * from Rwhere a> and a< |
Sel[a>j ∧ a<k]R | rng | ||||
select * from Rwhere a> and a<and b> and b< |
Sel[...]R | space |
| ... Query Types | 9/93 |
| SQL | RelAlg | a.k.a. | ||||
select * from R,Swhere R.id = S.r |
R Join[id=r] S | - | ||||
select * from R,Swhere R.=S. and R.=S. |
R Join[v=w ∧ x=y] S | - | ||||
select * from R,Swhere R.S. |
R Join[...] S | - | ||||
select * from Rwhere R.* |
R ≅ Obj | sim |
| Cost Models |
| Cost Models | 11/93 |
An important aspect of this course is
Rather, we attempt to develop cost models
| ... Cost Models | 12/93 |
Assumptions in our cost models:
| ... Cost Models | 13/93 |
In developing cost models, we also assume:
| ... Cost Models | 14/93 |
Typical values for measures used in cost models:
| Symbol | E.g. Value | |||
| r | 106 | |||
| R | 128 bytes | |||
| b | 105 | |||
| B | 8192 bytes | |||
| c | 60 | |||
| Tr,Tw | 10 msec | |||
| - | ≅ 0 | |||
answers for query q |
bq | ≥ 0 |
| ... Cost Models | 15/93 |
With buffer pool, request_page()
Instead, we assume no buffer pool (worst-case cost analysis)
Use either readPage()get_page()
// Assume data types for Relation, Page
get_page(Relation r, int pid, Page buf)
{
buf = readPage(r.file, pid);
}
Page readPage(File f, int pid)
{
Page buf = newPageBuffer();
lseek(f, pid*PAGE_SIZE, SEEK_SET);
read(f, buf, PAGE_SIZE);
return buf;
}
| Example file structures | 16/93 |
When describing file structures
| ... Example file structures | 17/93 |
Consider three simple file structures:
Some records in each page may be marked as "deleted".
| ... Example file structures | 18/93 |
Heap file with b = 4, c = 4:
| ... Example file structures | 19/93 |
Sorted file with b = 4, c = 4:
| ... Example file structures | 20/93 |
Hashed file with b = 3, c = 4, h(k) = k%3
| ... Example file structures | 21/93 |
Indexed file with b = 4, c = 4, bi = 2, ci = 8:
| Scanning |
| Scanning | 23/93 |
Consider the query:
select * from T;
Conceptually:
for each tuple t in relation T {
add tuple t to result set
}
| ... Scanning | 24/93 |
Implemented via iteration over file containing T:
for each page P in file of relation T {
for each tuple t in page P {
add tuple t to result set
}
}
Cost: read every data page once
Cost = b.Tr
| ... Scanning | 25/93 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle int p;// input file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } }
| ... Scanning | 26/93 |
Scan implementation when file has overflow pages, e.g.
| ... Scanning | 27/93 |
In this case, the implementation changes to:
for each page P in file of relation T {
for each tuple t in page P {
add tuple t to result set
}
for each overflow page V of page P {
for each tuple t in page V {
add tuple t to result set
} } }
Cost: read each data and overflow page once
Cost = (b + bOv).Tr
where bOv = total number of overflow pages
| ... Scanning | 28/93 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle File ovf;// overflow file handle int p;// input file page number int ovp;// overflow file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) ovf = openFile(ovFileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); while (ovp != NO_PAGE) { buf = readPage(ovf,ovp); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); } }
Cost: read data+ovflow page Cost = (b+bov).Tr
| Selection via Scanning | 29/93 |
Consider a one query like:
select * from Employee where id = 762288;
In an unordered file, search for matching record requires:
Guaranteed at most one answer; could be in any page.
| ... Selection via Scanning | 30/93 |
In terms of file operations (assuming var delcarations as before):
inf = openFile(fileName("Employee"), READ);
for (p = 0; p < nPages(inf); p++)
buf = readPage(inf,p);
for (i = 0; i < nTuples(buf); i++) {
t = getTuple(buf,i);
if (getField(t,"id") == 762288)
return t;
} }
For different selection condition, simply replace
(getField(t,"id")==762288)
| ... Selection via Scanning | 31/93 |
Cost analysis for one searching in unordered file
| File Copying | 32/93 |
Consider an SQL statement like:
create table T as (select * from S);
Effectively, copies data from one file to another.
Conceptually:
make empty relation T
for each tuple t in relation S {
append tuple t to relation T
}
| ... File Copying | 33/93 |
In terms of previously defined relation/page/tuple operations:
Relation in;// relation handle (incl. files) Relation out; // relation handle (incl. files) int ipid,opid; // input/output page indexes int tid;// record/tuple index on current page Record rec;// current record (tuple) Page ibuf,obuf;// input/output file buffers in = openRelation("S", READ); out = openRelation("T", NEW|WRITE); clear(obuf); opid = 0; for (ipid = 0; ipid < nPages(in); ipid++) { get_page(in, ipid, ibuf); for (tid = 0; tid < nTuples(ibuf); tid++) { rec = get_record(ibuf, tid); if (!hasSpace(obuf,rec)) { put_page(out, opid++, obuf); clear(obuf); } insert_record(obuf,rec); } } if (nTuples(obuf) > 0) put_page(out, opid, obuf);
| Exercise 1: Cost of Relation Copy | 34/93 |
Analyse cost for relation copying:
Give cost in terms of #pages read + #pages written
| Iterators | 35/93 |
Higher-levels of DBMS are given a view of scanning as:
cursor = initScan(relName,condition);
while (tup = getNextTuple(cursor)) {
process tup
}
endScan(cursor);
Also known as iterator.
| ... Iterators | 36/93 |
Implementation of simple scan iterator (via file operations):
typedef struct {
File inf; // data file handle
Buffer buf; // input buffer
int curp; // current page number
int curi; // current record number
Expr cond; // representation of condition
} Cursor;
| ... Iterators | 37/93 |
Implementation of simple scan iterator (continued):
Cursor *initScan(char *rel, char *cond)
{
Cursor *c;
c = malloc(sizeof(Cursor));
c->inf = openFile(fileName(rel),READ);
c->buf = readPage(c->inf,0);
c->curp = 0;
c->curi = 0;
c->cond = makeTestableCondition(cond);
return c;
}
void endScan(Course *c)
{
closeFile(c->inf);
freeExpr(c->cond);
free(c);
}
| ... Iterators | 38/93 |
Implementation of simple scan iterator (continued):
Tuple getNextTuple(Cursor *c)
{
getNextTuple:
if (c->curi < nTuples(c->buf))
return getTuple(c->buf, c->curi++);
else {
// no more tuples in this page; get next page
c->curp++;
if (c->curp == nPages(c->inf))
return NULL; // no more pages
else {
c->buf = readPage(c->inf,c->curp);
c->curi = 0;
goto getNextTuple;
}
}
}
| ... Iterators | 39/93 |
Implementation of full iterator interface via file operations:
typedef struct {
File inf; // data file handle
File ovf; // overflow file handle
Buffer buf; // input buffer
int curp; // current page number
int curop; // current ovflow page number
int curi; // current record number
Expr cond; // representation of condition
} Cursor;
| ... Iterators | 40/93 |
Implementation of full iterator interface (continued):
Cursor *initScan(char *rel, char *cond)
{
Cursor *c;
c = malloc(sizeof(Cursor));
c->inf = openFile(fileName(rel),READ);
c->ovf = openFile(ovFileName(rel),READ);
c->buf = readPage(c->inf,0);
c->curp = 0;
c->curop = NO_PAGE;
c->curi = 0;
c->cond = makeTestableCondition(cond)
return c;
}
void endScan(Course *c)
{
closeFile(c->inf);
if (c->ovf) closeFile(c->ovf);
freeExpr(c->cond);
free(c);
}
| ... Iterators | 41/93 |
Implementation of scanning interface (continued):
Tuple getNextTuple(Cursor *c)
{
getNextTuple:
if (c->curi < nTuples(c->buf))
return getTuple(c->buf, c->curi++);
else {
// no more tuples in this page; get next page
if (c->curop == NO_PAGE) {
c->curop = ovflow(c->buf);
if (c->curop != NO_PAGE) {
// start ovflow chain scan
getNextOvPage:
c->buf = readPage(c->ovf,c->curop);
c->curi = 0;
goto getNextTuple;
}
else {
getNextDataPage:
c->curp++;
if (c->curp == nPages(c->inf))
return NULL; // no more pages
else {
c->buf = readPage(c->inf,c->curp);
c->curi = 0;
goto getNextTuple;
}
}
}
else {
// continue ovflow chain scan
c->curop = ovflow(c->buf);
if (c->curop == NO_PAGE)
goto getNextDataPage;
else
goto getNextOvPage;
}
}
}
| Scanning in PostgreSQL | 42/93 |
Scanning defined in: /backend/access/heap/heapam.c
Implements iterator data/operations:
HeapScanDesc
scan = heap_beginscan(rel,...,nkeys,keys)
... uses initscan()
tup = heap_getnext(scan, direction)
... uses heapgettup()
heap_endscan(scan)scan
res = HeapKeyTest(tuple,...,nkeys,keys)
... performs ScanKeys
| ... Scanning in PostgreSQL | 43/93 |
typedef struct HeapScanDescData
{
// scan parameters
Relation rs_rd; // heap relation descriptor
Snapshot rs_snapshot; // snapshot ... tuple visibility
int rs_nkeys; // number of scan keys
ScanKey rs_key; // array of scan key descriptors
...
// state set up at initscan time
PageNumber rs_npages; // number of pages to scan
PageNumber rs_startpage; // page # to start at
...
// scan current state, initally set to invalid
HeapTupleData rs_ctup; // current tuple in scan
PageNumber rs_cpage; // current page # in scan
Buffer rs_cbuf; // current buffer in scan
...
} HeapScanDescData;
| Scanning in other File Structures | 44/93 |
Above examples are for heap files
btreehashgistgin
| Sorting |
| The Sort Operation | 46/93 |
Sorting is explicit in queries only in the order by
select * from Students order by name;
More important, sorting is used internally in other operations:
group by
| External Sorting | 47/93 |
Sort methods such as quicksort are designed for in-memory data.
For data on disks, need external sorting techniques.
The standard external sorting method (merge sort) works by
| ... External Sorting | 48/93 |
Sorting tuples within pages
| Two-way Merge Sort | 49/93 |
Requires three in-memory buffers:
Assumption: cost of merge on two buffers ≅ 0.
| ... Two-way Merge Sort | 50/93 |
Two-way merge-sort method:
read each page into buffer, sort it, write it
numberOfRuns = b; runLength = 1;
while (numberOfRuns > 1) {
for each pair of adjacent runs {
merge the pair of runs to output, by
- read pages from runs into input
buffers, one page at a time
- apply merge algorithm to transfer
tuples to output buffer
- flush output buffer when full and
when merge finished
}
numberOfRuns = numberOfRuns / 2
runLength = runLength * 2
}
| ... Two-way Merge Sort | 51/93 |
Example:
| ... Two-way Merge Sort | 52/93 |
Two-way merge-sort method (improved):
numberOfRuns = b; runLength = 1;
while (numberOfRuns > 1) {
for each pair of adjacent runs {
merge the pair of runs to output, by
- read pages from runs into input
buffers, one page at a time
- if (runLength == 1)
sort contents of each input buffer
- apply merge algorithm to transfer
tuples to output buffer
- flush output buffer when full and
when merge finished
}
numberOfRuns = numberOfRuns / 2
runLength = runLength * 2
}
Avoids first pass to sort contents of individual pages.
| ... Two-way Merge Sort | 53/93 |
Consider file where b = 2k:
| ... Two-way Merge Sort | 54/93 |
Example:
| Merging Two Sorted Pages | 55/93 |
Method using operations on files and buffers:
// Pre: buffers B1,B2; outfile position op
// Post: tuples from B1,B2 output in order
i1 = i2 = 0; clear(Out);
R1 = getTuple(B1,i1); R2 = getTuple(B2,i2);
while (i1 < nTuples(B1) && i2 < nTuples(B2)) {
if (lessThan(R1,R2))
{ addTuple(R1,Out); i1++; R1 = getTuple(B1,i1); }
else
{ addTuple(R2,Out); i2++; R2 = getTuple(B2,i2); }
if (isFull(Out))
{ writePage(outf,op++,Out); clear(Out); }
}
for (i1=i1; i1 < nTuples(B1); i1++) {
addTuple(getTuple(B1,i1), Out);
if (isFull(Out))
{ writePage(outf,op++,Out); clear(Out); }
}
for (i2=i2; i2 < nTuples(B2); i2++) {
addTuple(getTuple(B2,i2), Out);
if (isFull(Out))
{ writePage(outf,op++,Out); clear(Out); }
}
if (nTuples(Out) > 0) writePage(outf,op,Out);
| Merging Runs vs Merging Pages | 56/93 |
In the above, we merged two input buffers.
In general, we need to merge sorted "runs" of pages.
The only difference that this makes to the above method:
R1 = getTuple(B1,i1);
becomes
if (i1 == nTuples(B1)) {
B1 = readPage(inf,ip++); i1 = 0;
}
R1 = getTuple(B1,i1);
| Comparison for Sorting | 57/93 |
Above assumes that we have a function to compare tuples.
Mechanism needs to be generic, to handle all of:
select * from Employee order by eid; select * from Employee order by name; select * from Employee order by age;
Envisage a function tupCompare(r1,r2,f)strcmp
r1r2fr1.f < r2.fr1.f > r2.fr1.f == r2.f
| ... Comparison for Sorting | 58/93 |
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
-- example multi-attribute sort select * from Students order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria)
{
foreach (f,ord) in criteria {
if (ord == ASC) {
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
}
else {
if (r1.f > r2.f) return -1;
if (r1.f < r2.f) return 1;
}
}
return 0;
}
| Cost of Two-way Merge Sort | 59/93 |
For a file containing b data pages:
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ⌈log22000⌉ = 11
Reads/writes entire file 11 times! Can we do better?
| n-Way Merge Sort | 60/93 |
Initial pass uses: B total buffers
| ... n-Way Merge Sort | 61/93 |
Merge passes use: n input buffers, 1 output buffer
| ... n-Way Merge Sort | 62/93 |
Method:
// Produce B-page-long runs for each group of B pages in Rel { read pages into memory buffers sort group in memory write pages out to Temp }// Merge runs until everything sorted // n-way merge, where n=B-1 numberOfRuns = ⌈b/B⌉ while (numberOfRuns > 1) { for each group of n runs in Temp { merge into a single run via input buffers write run to newTemp via output buffer } numberOfRuns = ⌈numberOfRuns/n⌉ Temp = newTemp// swap input/output files }
| ... n-Way Merge Sort | 63/93 |
Method for merging n runs (n input buffers, 1 output buffer):
for i = 1..n {
read first page of run[i] into a buffer[i]
set current tuple cur[i] to first tuple in buffer[i]
}
while (more than 1 run still has tuples) {
s = find buffer with smallest tuple as cur[i]
copy tuple cur[i] to output buffer
if (output buffer full) { write it and clear it}
advance cur[i] to next tuple
if (no more tuples in buffer[i]) {
if (no more pages in run[i])
mark run[i] as complete
else {
read next page of run[i] into buffer[i]
set cur[i] to first tuple in buffer[i]
} } }
copy tuples in non-empty buffer to output
| Cost of n-Way Merge Sort | 64/93 |
Consider file where b = 4096, B = 16 total buffers:
| ... Cost of n-Way Merge Sort | 65/93 |
Generalising from previous example ...
For b data pages and B buffers
| ... Cost of n-Way Merge Sort | 66/93 |
Costs (number of passes) for varying b and B (n=B-1):
| B=3 | B=16 | B=128 | ||||
| 7 | 2 | 1 | ||||
| 10 | 3 | 2 | ||||
| 13 | 4 | 2 | ||||
| 17 | 5 | 3 | ||||
| 20 | 5 | 3 |
In the above, we assume that
| Sorting in PostgreSQL | 67/93 |
Sort uses a merge-sort (from Knuth) similar to above:
SortTupleqsort()If memory fills while reading, form "runs" and do disk-based sort.
| ... Sorting in PostgreSQL | 68/93 |
Disk-based sort has phases:
workMemEffectively, a "tape" is a sorted run.
Implementation of "tapes": backend/utils/sort/logtape.c
| ... Sorting in PostgreSQL | 69/93 |
Sorting comparison operators are obtained via catalog (in Type.o
// gets pointer to function via pg_operator struct Tuplesortstate { ... SortTupleComparator ... };// returns negative, zero, positive ApplySortComparator(Datum datum1, bool isnull1, Datum datum2, bool isnull2, SortSupport sort_helper);
Flags indicate: ascending/descending, nulls-first/last.
ApplySortComparator()tupCompare()
| Implementing Projection |
| The Projection Operation | 71/93 |
Consider the query:
select distinct name,age from Employee;
If the Employee
(94002, John, Sales, Manager, 32) (95212, Jane, Admin, Manager, 39) (96341, John, Admin, Secretary, 32) (91234, Jane, Admin, Secretary, 21)
then the result of the projection is:
(Jane, 21) (Jane, 39) (John, 32)
Note that duplicate tuples (e.g. (John,32)
| ... The Projection Operation | 72/93 |
The projection operation needs to:
(straightforward, whichever file organisation is used)
(straightforward, manipulating internal record structure)
(not as simple as other operations ...)
| Removing Attributes | 73/93 |
Projecting attributes involves creating a new tuple, using only some values from the original tuple.
Precisely how to achieve this depends on tuple internals.
Removing attributes from fixed-length tuples:
| ... Removing Attributes | 74/93 |
Removing attributes from variable-length tuples:
| Sort-based Projection | 75/93 |
Overview of the method:
RelTemp
| ... Sort-based Projection | 76/93 |
The method, in detail:
// Inputs: relName, attrList inf = openFile(fileName(relName),READ); tempf = openFile(tmpName,CREATE); clear(outbuf); j = 0; for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { tup = getTuple(buf,i); newtup = project(tup,attrList); addTuple(newtup,outbuf); if (isFull(outbuf)) { writePage(tempf,j++,outbuf); clear(outbuf); } } } mergeSort(tempf);
| ... Sort-based Projection | 77/93 |
tempf = openFile(tmpName,READ);
outf = openFile(result,CREATE);
clear(outbuf); prev = EMPTY; j = 0;
for (p = 0; p < nPages(tempf); p++) {
buf = readPage(tempf,p);
for (i = 0; i < nTuples(buf); i++) {
tup = getTuple(buf,i);
if (tupCompare(tup,prev) != 0) {
addTuple(tup,outbuf);
if (isFull(outbuf)) {
writePage(outf,j++,outbuf);
clear(outbuf);
}
prev = tup;
}
}
}
| Cost of Sort-based Projection | 78/93 |
The costs involved are (assuming B=n+1 buffers for sort):
RelTempTempTempNote that we often ignore cost of writing the result; especially when comparing different algorithms for the same relational operation.
| Improving Sort-based Projection | 79/93 |
Some approaches for improving the cost:
| Hash-based Projection | 80/93 |
Overview of the method:
Rel
| Hash Functions | 81/93 |
Hash function h(tuple,range)
mod
| ... Hash Functions | 82/93 |
Usual approach in hash function:
unsigned int hash(char *val, int b)
{
char *cp;
unsigned int v, sum = 0;
for (c = val; *c != '\0'; c++) {
v = *c + (*(c+1) << 8);
sum += (sum + 2153*v) % 19937;
}
return(sum % b);
}
| Hash-based Projection | 83/93 |
Partitioning phase:
| ... Hash-based Projection | 84/93 |
Algorithm for partitioning phase:
for each page P in relation Rel {
for each tuple t in page P {
t' = project(t, attrList)
H = h1(t', B-1)
write t' to partition[H]
} }
Each partition could be implemented as a simple data file.
| ... Hash-based Projection | 85/93 |
Duplicate elimination phase:
| ... Hash-based Projection | 86/93 |
Algorithm for duplicate elimination phase:
for each partition P in 0..B-2 {
for each tuple t in partition P {
H = h2(t, B-1)
if (!(t occurs in buffer[H]))
append t to buffer H
}
output contents of all buffers
clear all buffers
}
| Cost of Hash-based Projection | 87/93 |
The total cost is the sum of the following:
Rel
| ... Cost of Hash-based Projection | 88/93 |
If the largest partition had more than B-1 pages
| Index-only Projection | 89/93 |
Under the conditions:
Basic idea:
| ... Index-only Projection | 90/93 |
Method:
for each entry I in index file {
tup = project(I.key, attrList)
if (tupCompare(tup,prev) != 0) {
addTuple(outbuf,tup)
if (isFull(outbuf)) {
writePage(outf,op++,outbuf);
clear(outbuf);
}
prev = tup;
} }
"for each index entry": loop over index pages and loop over entries in each page
| Cost of Index-only Projection | 91/93 |
Assume that the index (see details later):
Index
| Comparison of Projection Methods | 92/93 |
Difficult to compare, since they make different assumptions:
Best case scenario for each (assuming B+1 in-memory buffers):
| Projection in PostgreSQL | 93/93 |
Code for projection forms part of execution iterators:
ExecProject(projInfo,...)ExecTargetList(...)ExecStoreTuple(newTuple,...)Produced: 24 Jun 2019