Week 04 Lectures
Tuples |
Tuples | 2/84 |
Each page contains a collection of tuples
What do tuples contain? How are they structured internally?
Records vs Tuples | 3/84 |
A table is defined by a schema, e.g.
create table Employee ( id integer primary key, name varchar(20) not null, job varchar(10), dept number(4) references Dept(id) );
where a schema is a collection of attributes (name,type,constraints)
Schema information (meta-data) is stored in the DB catalog
... Records vs Tuples | 4/84 |
Tuple = collection of attribute values based on a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
Record = sequence of bytes, containing data for one tuple, e.g.
Bytes need to be interpreted relative to schema to get tuple
Converting Records to Tuples | 5/84 |
A Record
byte[]
Tuple
Tuple
struct
... Converting Records to Tuples | 6/84 |
Information on how to interpret bytes in a record ...
Operations on Records | 7/84 |
Common operation on records ... access record via RecordId
Record get_record(Relation rel, RecordId rid) { (pid,tid) = rid; Page buf = get_page(rel, pid); return get_bytes(rel, buf, tid); }
Cannot use a Record
Tuple
Relation rel = ...// relation schema Record rec = get_record(rel, rid) Tuple t = mkTuple(rel, rec)
Once we have a Tuple
Operations on Tuples | 8/84 |
Once we have a record, we need to interpret it as a tuple ...
Tuple t = mkTuple(rel, rec)
rel
Once we have a tuple, we want to examines its contents ...
Typ getTypField(Tuple t, int i)
i
Tuple
int x = getIntField(t,1)
char *s = getStrField(t,2)
Fixed-length Records | 9/84 |
A possible encoding scheme for fixed-length records:
Since record format is frequently used at query time, cache in memory.
Variable-length Records | 10/84 |
Possible encoding schemes for variable-length records:
Data Types | 11/84 |
DBMSs typically define a fixed set of base types, e.g.
DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
This determines implementation-level data types for field values:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER( ) |
int[] |
|
VARCHAR( ) |
char[] |
PostgreSQL allows new base types to be added
Field Descriptors | 12/84 |
A Tuple
FieldDesc
Record
typedef struct { ushort nfields;// number of fields/attrs ushort data_off;// offset in struct for data FieldDesc fields[];// field descriptions Record data;// pointer to record in buffer } Tuple;
Fields are derived from relation descriptor + record instance data.
... Field Descriptors | 13/84 |
Tuple data
... Field Descriptors | 14/84 |
Or, tuple data
Tuple struct
Exercise 1: How big is a FieldDesc | 15/84 |
FieldDesc
E.g.
PostgreSQL Tuples | 16/84 |
Definitions: include/postgres.h
include/access/*tup*.h
Functions: backend/access/common/*tup*.c
HeapTuple heap_form_tuple(desc,values[],isnull[])
heap_deform_tuple(tuple,desc,values[],isnull[])
Datum
... PostgreSQL Tuples | 17/84 |
Tuple structure:
... PostgreSQL Tuples | 18/84 |
Tuple-related data types: (cont)
// TupleDesc: schema-related information for HeapTuples typedef struct tupleDesc { int natts;// # attributes in tuple Oid tdtypeid;// composite type ID for tuple type int32 tdtypmod;// typmod for tuple type bool tdhasoid;// does tuple have oid attribute? int tdrefcount;// reference count TupleConstr *constr;(-1 if not counting) // constraints, or NULL if none FormData_pg_attribute attrs[];// attrs[N] is a pointer to description of attribute N+1 } *TupleDesc;
... PostgreSQL Tuples | 19/84 |
Tuple-related data types: (cont)
// FormData_pg_attribute: // schema-related information for one attribute typedef struct FormData_pg_attribute { Oid attrelid;// OID of reln containing attr NameData attname;// name of attribute Oid atttypid;// OID of attribute's data type int16 attlen;// attribute length int32 attndims;// # dimensions if array type bool attnotnull;// can attribute have NULL value .....// and many other fields } FormData_pg_attribute;
For details, see include/catalog/pg_attribute.h
... PostgreSQL Tuples | 20/84 |
HeapTupleData
typedef HeapTupleData *HeapTuple; typedef struct HeapTupleData { uint32 t_len;// length of *t_data ItemPointerData t_self;// SelfItemPointer Oid t_tableOid;// table the tuple came from HeapTupleHeader t_data;// -> tuple header and data } HeapTupleData;
HeapTupleHeader
... PostgreSQL Tuples | 21/84 |
PostgreSQL stores a single block of data for tuple
byte[]
typedef struct HeapTupleHeaderData// simplified { HeapTupleFields t_heap; ItemPointerData t_ctid;// TID of newer version uint16 t_infomask2;// #attributes + flags uint16 t_infomask;// flags e.g. has_null uint8 t_hoff;// sizeof header incl. t_bits // above is fixed size (23 bytes) for all heap tuples bits8 t_bits[1];// bitmap of NULLs, var.len. // OID goes here if HEAP_HASOID is set in t_infomask // actual data follows at end of struct } HeapTupleHeaderData;
... PostgreSQL Tuples | 22/84 |
Some of the bits in t_infomask
#define HEAP_HASNULL 0x0001/* has null attribute(s) */ #define HEAP_HASVARWIDTH 0x0002/* has variable-width attribute(s) */ #define HEAP_HASEXTERNAL 0x0004/* has external stored attribute(s) */ #define HEAP_HASOID_OLD 0x0008/* has an object-id field */
Location of NULL
... PostgreSQL Tuples | 23/84 |
Tuple-related data types: (cont)
typedef struct HeapTupleFields// simplified { TransactionId t_xmin;// inserting xact ID TransactionId t_xmax;// deleting or locking xact ID union { CommandId t_cid;// inserting or deleting command ID TransactionId t_xvac;// old-style VACUUM FULL xact ID } t_field3; } HeapTupleFields;
Note that not all system fields from stored tuple appear
oid
xmin
xmax
cmin
cmax
PostgreSQL Attribute Values | 24/84 |
Values of attributes in PostgreSQL tuples are packaged as Datum
// representation of a data value typedef uintptr_t Datum;
The actual data value:
Datum
int
... PostgreSQL Attribute Values | 25/84 |
Attribute values can be extracted as Datum
HeapTuple
Datum heap_getattr( HeapTuple tup,// tuple (in memory) int attnum,// which attribute TupleDesc tupDesc,// field descriptors bool *isnull// flag to record NULL )
isnull
NULL
attnum
For details, see include/access/htup_details.h
... PostgreSQL Attribute Values | 26/84 |
Values of Datum
// DatumGetBool: // Returns boolean value of a Datum. #define DatumGetBool(X) ((bool) ((X) != 0))// BoolGetDatum: // Returns Datum representation for a boolean. #define BoolGetDatum(X) ((Datum) ((X) ? 1 : 0))
For details, see include/postgres.h
Implementing Relational Operations |
DBMS Architecture (revisited) | 28/84 |
Implementation of relational operations in DBMS:
Relational Operations | 29/84 |
DBMS core = relational engine, with implementations of
... Relational Operations | 30/84 |
Two "dimensions of variation":
... Relational Operations | 31/84 |
SQL vs DBMS engine
select ... from R where C
insert into R values(...)
delete from R where C
update R set ... where C
Cost Models |
Cost Models | 33/84 |
An important aspect of this course is
... Cost Models | 34/84 |
Since time cost is affected by many factors
For comparing methods, page cost is better
Addtional assumption: every page request leads to some i/o
... Cost Models | 35/84 |
In developing cost models, we also assume:
... Cost Models | 36/84 |
Our cost models are "rough" (based on assumptions)
But do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
Query Types | 37/84 |
SQL | RelAlg | a.k.a. | ||||
select * from R |
R | - | ||||
select from R |
Proj[x,y]R | - | ||||
select * from R order by |
Sort[x]R | ord | ||||
select * from R where id = |
Sel[id=k]R | one | ||||
select * from R where a = |
Sel[a=k]R | - | ||||
select * from R,S where R.id = S.r |
R Join[id=r] S | - |
Example File Structures | 38/84 |
When describing file structures
... Example File Structures | 39/84 |
Consider three simple file structures:
Some records in each page may be marked as "deleted".
Exercise 2: Operation Costs | 40/84 |
For each of the following file structures
You can assume the existence of a file header containing
Scanning |
Scanning | 42/84 |
Consider the query:
select * from Rel;
Operational view:
for each page P in file of relation Rel { for each tuple t in page P { add tuple t to result set } }
Cost: read every data page once
Time Cost = b.Tr, Page Cost = b
... Scanning | 43/84 |
Scan implementation when file has overflow pages, e.g.
... Scanning | 44/84 |
In this case, the implementation changes to:
for each page P in data file of relation Rel { for each tuple t in page P { add tuple t to result set } for each overflow page V of page P { for each tuple t in page V { add tuple t to result set } } }
Cost: read each data page and each overflow page once
Cost = b + bOv
where bOv = total number of overflow pages
Selection via Scanning | 45/84 |
Consider a one query like:
select * from Employee where id = 762288;
In an unordered file, search for matching tuple requires:
Guaranteed at most one answer; but could be in any page.
... Selection via Scanning | 46/84 |
Overview of scan process:
for each page P in relation Employee { for each tuple t in page P { if (t.id == 762288) return t } }
Cost analysis for one searching in unordered file
Exercise 3: Cost of Search in Hashed File | 47/84 |
Consider the hashed file structure b = 10, c = 4, h(k) = k%10
Describe how the following queries
select * from R where k = 51; select * from R where k > 50;
might be solved in a file structure like the above (h(k) = k%b).
Estimate the minimum and maximum cost (as #pages read)
Iterators | 48/84 |
Access methods typically involve iterators, e.g.
Scan s = start_scan(Relation r, ...)
r
Scan
WHERE
Scan
Tuple next_tuple(Scan s)
Tuple
NULL
Tuples
Example Query | 49/84 |
Example: simple scan of a table ...
select name from Employee
implemented as:
DB db = openDatabase("myDB"); Relation r = openRelation(db,"Employee",READ); Scan s = start_scan(r); Tuple t;// current tuple while ((t = next_tuple(s)) != NULL) { char *name = getStrField(t,2); printf("%s\n", name); }
Exercise 4: Implement next_tuple() | 50/84 |
Consider the following possible Scan
typedef struct { Relation rel; Page *curPage;// Page buffer int curPID;// current pid int curTID;// current tid } ScanData;
Assume tuples are indexed 0..nTuples(p)
Assume pages are indexed 0..nPages(rel)
Implement the Tuple next_tuple(Scan)
P.S. What's in a Relation
Relation Copying | 51/84 |
Consider an SQL statement like:
create table T as (select * from S);
Effectively, copies data from one table to a new table.
Process:
make empty relation T s = start scan of S while (t = next_tuple(s)) { insert tuple t into relation T }
... Relation Copying | 52/84 |
Possible that T
S
S
T
... Relation Copying | 53/84 |
In terms of existing relation/page/tuple operations:
Relation in;// relation handle (incl. files) Relation out;// relation handle (incl. files) int ipid,opid,tid;// page and record indexes Record rec;// current record (tuple) Page ibuf,obuf;// input/output file buffers in = openRelation("S", READ); out = openRelation("T", NEW|WRITE); clear(obuf); opid = 0; for (ipid = 0; ipid < nPages(in); ipid++) { ibuf = get_page(in, ipid); for (tid = 0; tid < nTuples(ibuf); tid++) { rec = get_record(ibuf, tid); if (!hasSpace(obuf,rec)) { put_page(out, opid++, obuf); clear(obuf); } insert_record(obuf,rec); } } if (nTuples(obuf) > 0) put_page(out, opid, obuf);
Exercise 5: Cost of Relation Copy | 54/84 |
Analyse cost for relation copying:
Give cost in terms of #pages read + #pages written
Scanning in PostgreSQL | 55/84 |
Scanning defined in: backend/access/heap/heapam.c
Implements iterator data/operations:
HeapScanDesc
scan = heap_beginscan(rel,...,nkeys,keys)
tup = heap_getnext(scan, direction)
heap_endscan(scan)
scan
res = HeapKeyTest(tuple,...,nkeys,keys)
ScanKeys
... Scanning in PostgreSQL | 56/84 |
typedef HeapScanDescData *; typedef struct HeapScanDescData { HeapScanDesc
// scan parameters Relation rs_rd;// heap relation descriptor Snapshot rs_snapshot;// snapshot ... tuple visibility int rs_nkeys;// number of scan keys ScanKey rs_key;// array of scan key descriptors ...// state set up at initscan time PageNumber rs_npages;// number of pages to scan PageNumber rs_startpage;// page # to start at ...// scan current state, initally set to invalid HeapTupleData rs_ctup;// current tuple in scan PageNumber rs_cpage;// current page # in scan Buffer rs_cbuf;// current buffer in scan ... } HeapScanDescData;
Scanning in other File Structures | 57/84 |
Above examples are for heap files
btree
hash
gist
gin
Sorting |
The Sort Operation | 59/84 |
Sorting is explicit in queries only in the order by
select * from Students order by name;
Sorting is used internally in other operations:
group by
For large data on disks, need external sorts such as merge sort.
Two-way Merge Sort | 60/84 |
Example:
... Two-way Merge Sort | 61/84 |
Requires three in-memory buffers:
Assumption: cost of Merge operation on two in-memory buffers ≅ 0.
Comparison for Sorting | 62/84 |
Above assumes that we have a function to compare tuples.
Needs to understand ordering on different data types.
Need a function tupCompare(r1,r2,f)
strcmp
int tupCompare(r1,r2,f) { if (r1.f < r2.f) return -1; if (r1.f > r2.f) return 1; return 0; }
Assume =, <, > are available for all attribute types.
... Comparison for Sorting | 63/84 |
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
-- example multi-attribute sort select * from Students order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria) { foreach (f,ord) in criteria { if (ord == ASC) { if (r1.f < r2.f) return -1; if (r1.f > r2.f) return 1; } else { if (r1.f > r2.f) return -1; if (r1.f < r2.f) return 1; } } return 0; }
Cost of Two-way Merge Sort | 64/84 |
For a file containing b data pages:
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ceil(log22000) = 11
Reads/writes entire file 11 times! Can we do better?
n-Way Merge Sort | 65/84 |
Initial pass uses: B total buffers
Reads B pages at a time, sorts in memory, writes out in order
... n-Way Merge Sort | 66/84 |
Merge passes use: n input buffers, 1 output buffer
... n-Way Merge Sort | 67/84 |
Method:
// Produce B-page-long runs for each group of B pages in Rel { read B pages into memory buffers sort group in memory write B pages out to Temp }// Merge runs until everything sorted numberOfRuns = ⌈b/B⌉ while (numberOfRuns > 1) {// n-way merge, where n=B-1 for each group of n runs in Temp { merge into a single run via input buffers write run to newTemp via output buffer } numberOfRuns = ⌈numberOfRuns/n⌉ Temp = newTemp// swap input/output files }
Cost of n-Way Merge Sort | 68/84 |
Consider file where b = 4096, B = 16 total buffers:
... Cost of n-Way Merge Sort | 69/84 |
Generalising from previous example ...
For b data pages and B buffers
Exercise 6: Cost of n-Way Merge Sort | 70/84 |
How many reads+writes to sort the following:
Sorting in PostgreSQL | 71/84 |
Sort uses a merge-sort (from Knuth) similar to above:
SortTuple
qsort()
If memory fills while reading, form "runs" and do disk-based sort.
... Sorting in PostgreSQL | 72/84 |
Disk-based sort has phases:
workMem
Implementation of "tapes": backend/utils/sort/logtape.c
... Sorting in PostgreSQL | 73/84 |
Sorting comparison operators are obtained via catalog (in Type.o
// gets pointer to function via pg_operator struct Tuplesortstate { ... SortTupleComparator ... };// returns negative, zero, positive ApplySortComparator(Datum datum1, bool isnull1, Datum datum2, bool isnull2, SortSupport sort_helper);
Flags in SortSupport
ApplySortComparator()
tupCompare()
Implementing Projection |
The Projection Operation | 75/84 |
Consider the query:
select distinct name,age from Employee;
If the Employee
(94002, John, Sales, Manager, 32) (95212, Jane, Admin, Manager, 39) (96341, John, Admin, Secretary, 32) (91234, Jane, Admin, Secretary, 21)
then the result of the projection is:
(Jane, 21) (Jane, 39) (John, 32)
Note that duplicate tuples (e.g. (John,32)
... The Projection Operation | 76/84 |
The projection operation needs to:
distinct
Sort-based Projection | 77/84 |
Requires a temporary file/relation (Temp
for each tuple T in Rel { T' = mkTuple([attrs],T) write T' to Temp } sort Temp on [attrs] for each tuple T in Temp { if (T == Prev) continue write T to Result Prev = T }
Exercise 7: Cost of Sort-based Projection | 78/84 |
Consider a table R(x,y,z) with tuples:
Page 0: (1,1,'a') (11,2,'a') (3,3,'c') Page 1: (13,5,'c') (2,6,'b') (9,4,'a') Page 2: (6,2,'a') (17,7,'a') (7,3,'b') Page 3: (14,6,'a') (8,4,'c') (5,2,'b') Page 4: (10,1,'b') (15,5,'b') (12,6,'b') Page 5: (4,2,'a') (16,9,'c') (18,8,'c')
SQL: create T as (select distinct y from R)
Assuming:
R
T
Cost of Sort-based Projection | 79/84 |
The costs involved are (assuming B=n+1 buffers for sort):
Rel
Temp
Temp
Temp
Hash-based Projection | 80/84 |
Partitioning phase:
... Hash-based Projection | 81/84 |
Duplicate elimination phase:
... Hash-based Projection | 82/84 |
Algorithm for both phases:
for each tuple T in relation Rel { T' = mkTuple([attrs],T) H = h1(T', n) B = buffer for partition[H] if (B full) write and clear B insert T' into B } for each partition P in 0..n-1 { for each tuple T in partition P { H = h2(T, n) B = buffer for hash value H if (T not in B) insert T into B// assumes B never gets full } write and clear all buffers }
Exercise 8: Cost of Hash-based Projection | 83/84 |
Consider a table R(x,y,z) with tuples:
Page 0: (1,1,'a') (11,2,'a') (3,3,'c') Page 1: (13,5,'c') (2,6,'b') (9,4,'a') Page 2: (6,2,'a') (17,7,'a') (7,3,'b') Page 3: (14,6,'a') (8,4,'c') (5,2,'b') Page 4: (10,1,'b') (15,5,'b') (12,6,'b') Page 5: (4,2,'a') (16,9,'c') (18,8,'c')-- and then the same tuples repeated for pages 6-11
SQL: create T as (select distinct y from R)
Assuming:
R
T
Cost of Hash-based Projection | 84/84 |
The total cost is the sum of the following:
R
To ensure that n is larger than the largest partition ...
Produced: 10 Mar 2020