Week 04 Lectures
| Tuples |
| Tuples | 2/84 |
Each page contains a collection of tuples
What do tuples contain? How are they structured internally?
| Records vs Tuples | 3/84 |
A table is defined by a schema, e.g.
create table Employee ( id integer primary key, name varchar(20) not null, job varchar(10), dept number(4) references Dept(id) );
where a schema is a collection of attributes (name,type,constraints)
Schema information (meta-data) is stored in the DB catalog
| ... Records vs Tuples | 4/84 |
Tuple = collection of attribute values based on a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
Record = sequence of bytes, containing data for one tuple, e.g.
Bytes need to be interpreted relative to schema to get tuple
| Converting Records to Tuples | 5/84 |
A Recordbyte[]
TupleTuplestruct
| ... Converting Records to Tuples | 6/84 |
Information on how to interpret bytes in a record ...
| Operations on Records | 7/84 |
Common operation on records ... access record via RecordId
Record get_record(Relation rel, RecordId rid) {
(pid,tid) = rid;
Page buf = get_page(rel, pid);
return get_bytes(rel, buf, tid);
}
Cannot use a RecordTuple
Relation rel = ...// relation schema Record rec = get_record(rel, rid) Tuple t = mkTuple(rel, rec)
Once we have a Tuple
| Operations on Tuples | 8/84 |
Once we have a record, we need to interpret it as a tuple ...
Tuple t = mkTuple(rel, rec)
rel
Once we have a tuple, we want to examines its contents ...
Typ getTypField(Tuple t, int i)
iTupleint x = getIntField(t,1)char *s = getStrField(t,2)
| Fixed-length Records | 9/84 |
A possible encoding scheme for fixed-length records:
Since record format is frequently used at query time, cache in memory.
| Variable-length Records | 10/84 |
Possible encoding schemes for variable-length records:
| Data Types | 11/84 |
DBMSs typically define a fixed set of base types, e.g.
DATEFLOATINTEGERNUMBER()VARCHAR()This determines implementation-level data types for field values:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER() |
int[] |
|
VARCHAR() |
char[] |
PostgreSQL allows new base types to be added
| Field Descriptors | 12/84 |
A Tuple
FieldDescRecord
typedef struct {
ushort nfields; // number of fields/attrs
ushort data_off; // offset in struct for data
FieldDesc fields[]; // field descriptions
Record data; // pointer to record in buffer
} Tuple;
Fields are derived from relation descriptor + record instance data.
| ... Field Descriptors | 13/84 |
Tuple data
| ... Field Descriptors | 14/84 |
Or, tuple data
Tuple struct
Exercise 1: How big is a FieldDesc | 15/84 |
FieldDesc
E.g.
| PostgreSQL Tuples | 16/84 |
Definitions: include/postgres.hinclude/access/*tup*.h
Functions: backend/access/common/*tup*.c
HeapTuple heap_form_tuple(desc,values[],isnull[])heap_deform_tuple(tuple,desc,values[],isnull[])Datum
| ... PostgreSQL Tuples | 17/84 |
Tuple structure:
| ... PostgreSQL Tuples | 18/84 |
Tuple-related data types: (cont)
// TupleDesc: schema-related information for HeapTuples typedef struct tupleDesc { int natts;// # attributes in tuple Oid tdtypeid;// composite type ID for tuple type int32 tdtypmod;// typmod for tuple type bool tdhasoid;// does tuple have oid attribute? int tdrefcount;// reference count TupleConstr *constr;(-1 if not counting) // constraints, or NULL if none FormData_pg_attribute attrs[];// attrs[N] is a pointer to description of attribute N+1 } *TupleDesc;
| ... PostgreSQL Tuples | 19/84 |
Tuple-related data types: (cont)
// FormData_pg_attribute: // schema-related information for one attribute typedef struct FormData_pg_attribute { Oid attrelid;// OID of reln containing attr NameData attname;// name of attribute Oid atttypid;// OID of attribute's data type int16 attlen;// attribute length int32 attndims;// # dimensions if array type bool attnotnull;// can attribute have NULL value .....// and many other fields } FormData_pg_attribute;
For details, see include/catalog/pg_attribute.h
| ... PostgreSQL Tuples | 20/84 |
HeapTupleData
typedef HeapTupleData *HeapTuple;
typedef struct HeapTupleData
{
uint32 t_len; // length of *t_data
ItemPointerData t_self; // SelfItemPointer
Oid t_tableOid; // table the tuple came from
HeapTupleHeader t_data; // -> tuple header and data
} HeapTupleData;
HeapTupleHeader
| ... PostgreSQL Tuples | 21/84 |
PostgreSQL stores a single block of data for tuple
byte[]typedef struct HeapTupleHeaderData// simplified { HeapTupleFields t_heap; ItemPointerData t_ctid;// TID of newer version uint16 t_infomask2;// #attributes + flags uint16 t_infomask;// flags e.g. has_null uint8 t_hoff;// sizeof header incl. t_bits // above is fixed size (23 bytes) for all heap tuples bits8 t_bits[1];// bitmap of NULLs, var.len. // OID goes here if HEAP_HASOID is set in t_infomask // actual data follows at end of struct } HeapTupleHeaderData;
| ... PostgreSQL Tuples | 22/84 |
Some of the bits in t_infomask
#define HEAP_HASNULL 0x0001
/* has null attribute(s) */
#define HEAP_HASVARWIDTH 0x0002
/* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL 0x0004
/* has external stored attribute(s) */
#define HEAP_HASOID_OLD 0x0008
/* has an object-id field */
Location of NULL
| ... PostgreSQL Tuples | 23/84 |
Tuple-related data types: (cont)
typedef struct HeapTupleFields// simplified { TransactionId t_xmin;// inserting xact ID TransactionId t_xmax;// deleting or locking xact ID union { CommandId t_cid;// inserting or deleting command ID TransactionId t_xvac;// old-style VACUUM FULL xact ID } t_field3; } HeapTupleFields;
Note that not all system fields from stored tuple appear
oidxminxmaxcmincmax
| PostgreSQL Attribute Values | 24/84 |
Values of attributes in PostgreSQL tuples are packaged as Datum
// representation of a data value typedef uintptr_t Datum;
The actual data value:
Datumint
| ... PostgreSQL Attribute Values | 25/84 |
Attribute values can be extracted as DatumHeapTuple
Datum heap_getattr(
HeapTuple tup, // tuple (in memory)
int attnum, // which attribute
TupleDesc tupDesc, // field descriptors
bool *isnull // flag to record NULL
)
isnullNULL
attnum
For details, see include/access/htup_details.h
| ... PostgreSQL Attribute Values | 26/84 |
Values of Datum
// DatumGetBool: // Returns boolean value of a Datum. #define DatumGetBool(X) ((bool) ((X) != 0))// BoolGetDatum: // Returns Datum representation for a boolean. #define BoolGetDatum(X) ((Datum) ((X) ? 1 : 0))
For details, see include/postgres.h
| Implementing Relational Operations |
| DBMS Architecture (revisited) | 28/84 |
Implementation of relational operations in DBMS:
| Relational Operations | 29/84 |
DBMS core = relational engine, with implementations of
| ... Relational Operations | 30/84 |
Two "dimensions of variation":
| ... Relational Operations | 31/84 |
SQL vs DBMS engine
select ... from R where Cinsert into R values(...)delete from R where Cupdate R set ... where C
| Cost Models |
| Cost Models | 33/84 |
An important aspect of this course is
| ... Cost Models | 34/84 |
Since time cost is affected by many factors
For comparing methods, page cost is better
Addtional assumption: every page request leads to some i/o
| ... Cost Models | 35/84 |
In developing cost models, we also assume:
| ... Cost Models | 36/84 |
Our cost models are "rough" (based on assumptions)
But do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
| Query Types | 37/84 |
| SQL | RelAlg | a.k.a. | ||||
select * from R |
R | - | ||||
select from R |
Proj[x,y]R | - | ||||
select * from Rorder by |
Sort[x]R | ord | ||||
select * from Rwhere id = |
Sel[id=k]R | one | ||||
select * from Rwhere a = |
Sel[a=k]R | - | ||||
select * from R,Swhere R.id = S.r |
R Join[id=r] S | - |
| Example File Structures | 38/84 |
When describing file structures
| ... Example File Structures | 39/84 |
Consider three simple file structures:
Some records in each page may be marked as "deleted".
| Exercise 2: Operation Costs | 40/84 |
For each of the following file structures
You can assume the existence of a file header containing
| Scanning |
| Scanning | 42/84 |
Consider the query:
select * from Rel;
Operational view:
for each page P in file of relation Rel {
for each tuple t in page P {
add tuple t to result set
}
}
Cost: read every data page once
Time Cost = b.Tr, Page Cost = b
| ... Scanning | 43/84 |
Scan implementation when file has overflow pages, e.g.
| ... Scanning | 44/84 |
In this case, the implementation changes to:
for each page P in data file of relation Rel {
for each tuple t in page P {
add tuple t to result set
}
for each overflow page V of page P {
for each tuple t in page V {
add tuple t to result set
} } }
Cost: read each data page and each overflow page once
Cost = b + bOv
where bOv = total number of overflow pages
| Selection via Scanning | 45/84 |
Consider a one query like:
select * from Employee where id = 762288;
In an unordered file, search for matching tuple requires:
Guaranteed at most one answer; but could be in any page.
| ... Selection via Scanning | 46/84 |
Overview of scan process:
for each page P in relation Employee {
for each tuple t in page P {
if (t.id == 762288) return t
} }
Cost analysis for one searching in unordered file
| Exercise 3: Cost of Search in Hashed File | 47/84 |
Consider the hashed file structure b = 10, c = 4, h(k) = k%10
Describe how the following queries
select * from R where k = 51; select * from R where k > 50;
might be solved in a file structure like the above (h(k) = k%b).
Estimate the minimum and maximum cost (as #pages read)
| Iterators | 48/84 |
Access methods typically involve iterators, e.g.
Scan s = start_scan(Relation r, ...)
rScanWHEREScan
Tuple next_tuple(Scan s)
TupleNULLTuples
| Example Query | 49/84 |
Example: simple scan of a table ...
select name from Employee
implemented as:
DB db = openDatabase("myDB");
Relation r = openRelation(db,"Employee",READ);
Scan s = start_scan(r);
Tuple t; // current tuple
while ((t = next_tuple(s)) != NULL)
{
char *name = getStrField(t,2);
printf("%s\n", name);
}
| Exercise 4: Implement next_tuple() | 50/84 |
Consider the following possible Scan
typedef struct {
Relation rel;
Page *curPage; // Page buffer
int curPID; // current pid
int curTID; // current tid
} ScanData;
Assume tuples are indexed 0..nTuples(p)
Assume pages are indexed 0..nPages(rel)
Implement the Tuple next_tuple(Scan)
P.S. What's in a Relation
| Relation Copying | 51/84 |
Consider an SQL statement like:
create table T as (select * from S);
Effectively, copies data from one table to a new table.
Process:
make empty relation T
s = start scan of S
while (t = next_tuple(s)) {
insert tuple t into relation T
}
| ... Relation Copying | 52/84 |
Possible that TS
ST
| ... Relation Copying | 53/84 |
In terms of existing relation/page/tuple operations:
Relation in;// relation handle (incl. files) Relation out;// relation handle (incl. files) int ipid,opid,tid;// page and record indexes Record rec;// current record (tuple) Page ibuf,obuf;// input/output file buffers in = openRelation("S", READ); out = openRelation("T", NEW|WRITE); clear(obuf); opid = 0; for (ipid = 0; ipid < nPages(in); ipid++) { ibuf = get_page(in, ipid); for (tid = 0; tid < nTuples(ibuf); tid++) { rec = get_record(ibuf, tid); if (!hasSpace(obuf,rec)) { put_page(out, opid++, obuf); clear(obuf); } insert_record(obuf,rec); } } if (nTuples(obuf) > 0) put_page(out, opid, obuf);
| Exercise 5: Cost of Relation Copy | 54/84 |
Analyse cost for relation copying:
Give cost in terms of #pages read + #pages written
| Scanning in PostgreSQL | 55/84 |
Scanning defined in: backend/access/heap/heapam.c
Implements iterator data/operations:
HeapScanDescscan = heap_beginscan(rel,...,nkeys,keys)tup = heap_getnext(scan, direction)heap_endscan(scan)scanres = HeapKeyTest(tuple,...,nkeys,keys)ScanKeys
| ... Scanning in PostgreSQL | 56/84 |
typedef HeapScanDescData *; typedef struct HeapScanDescData { HeapScanDesc// scan parameters Relation rs_rd;// heap relation descriptor Snapshot rs_snapshot;// snapshot ... tuple visibility int rs_nkeys;// number of scan keys ScanKey rs_key;// array of scan key descriptors ...// state set up at initscan time PageNumber rs_npages;// number of pages to scan PageNumber rs_startpage;// page # to start at ...// scan current state, initally set to invalid HeapTupleData rs_ctup;// current tuple in scan PageNumber rs_cpage;// current page # in scan Buffer rs_cbuf;// current buffer in scan ... } HeapScanDescData;
| Scanning in other File Structures | 57/84 |
Above examples are for heap files
btreehashgistgin
| Sorting |
| The Sort Operation | 59/84 |
Sorting is explicit in queries only in the order by
select * from Students order by name;
Sorting is used internally in other operations:
group byFor large data on disks, need external sorts such as merge sort.
| Two-way Merge Sort | 60/84 |
Example:
| ... Two-way Merge Sort | 61/84 |
Requires three in-memory buffers:
Assumption: cost of Merge operation on two in-memory buffers ≅ 0.
| Comparison for Sorting | 62/84 |
Above assumes that we have a function to compare tuples.
Needs to understand ordering on different data types.
Need a function tupCompare(r1,r2,f)strcmp
int tupCompare(r1,r2,f)
{
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
return 0;
}
Assume =, <, > are available for all attribute types.
| ... Comparison for Sorting | 63/84 |
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
-- example multi-attribute sort select * from Students order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria)
{
foreach (f,ord) in criteria {
if (ord == ASC) {
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
}
else {
if (r1.f > r2.f) return -1;
if (r1.f < r2.f) return 1;
}
}
return 0;
}
| Cost of Two-way Merge Sort | 64/84 |
For a file containing b data pages:
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ceil(log22000) = 11
Reads/writes entire file 11 times! Can we do better?
| n-Way Merge Sort | 65/84 |
Initial pass uses: B total buffers
Reads B pages at a time, sorts in memory, writes out in order
| ... n-Way Merge Sort | 66/84 |
Merge passes use: n input buffers, 1 output buffer
| ... n-Way Merge Sort | 67/84 |
Method:
// Produce B-page-long runs for each group of B pages in Rel { read B pages into memory buffers sort group in memory write B pages out to Temp }// Merge runs until everything sorted numberOfRuns = ⌈b/B⌉ while (numberOfRuns > 1) {// n-way merge, where n=B-1 for each group of n runs in Temp { merge into a single run via input buffers write run to newTemp via output buffer } numberOfRuns = ⌈numberOfRuns/n⌉ Temp = newTemp// swap input/output files }
| Cost of n-Way Merge Sort | 68/84 |
Consider file where b = 4096, B = 16 total buffers:
| ... Cost of n-Way Merge Sort | 69/84 |
Generalising from previous example ...
For b data pages and B buffers
| Exercise 6: Cost of n-Way Merge Sort | 70/84 |
How many reads+writes to sort the following:
| Sorting in PostgreSQL | 71/84 |
Sort uses a merge-sort (from Knuth) similar to above:
SortTupleqsort()If memory fills while reading, form "runs" and do disk-based sort.
| ... Sorting in PostgreSQL | 72/84 |
Disk-based sort has phases:
workMemImplementation of "tapes": backend/utils/sort/logtape.c
| ... Sorting in PostgreSQL | 73/84 |
Sorting comparison operators are obtained via catalog (in Type.o
// gets pointer to function via pg_operator struct Tuplesortstate { ... SortTupleComparator ... };// returns negative, zero, positive ApplySortComparator(Datum datum1, bool isnull1, Datum datum2, bool isnull2, SortSupport sort_helper);
Flags in SortSupport
ApplySortComparator()tupCompare()
| Implementing Projection |
| The Projection Operation | 75/84 |
Consider the query:
select distinct name,age from Employee;
If the Employee
(94002, John, Sales, Manager, 32) (95212, Jane, Admin, Manager, 39) (96341, John, Admin, Secretary, 32) (91234, Jane, Admin, Secretary, 21)
then the result of the projection is:
(Jane, 21) (Jane, 39) (John, 32)
Note that duplicate tuples (e.g. (John,32)
| ... The Projection Operation | 76/84 |
The projection operation needs to:
distinct
| Sort-based Projection | 77/84 |
Requires a temporary file/relation (Temp
for each tuple T in Rel {
T' = mkTuple([attrs],T)
write T' to Temp
}
sort Temp on [attrs]
for each tuple T in Temp {
if (T == Prev) continue
write T to Result
Prev = T
}
| Exercise 7: Cost of Sort-based Projection | 78/84 |
Consider a table R(x,y,z) with tuples:
Page 0: (1,1,'a') (11,2,'a') (3,3,'c') Page 1: (13,5,'c') (2,6,'b') (9,4,'a') Page 2: (6,2,'a') (17,7,'a') (7,3,'b') Page 3: (14,6,'a') (8,4,'c') (5,2,'b') Page 4: (10,1,'b') (15,5,'b') (12,6,'b') Page 5: (4,2,'a') (16,9,'c') (18,8,'c')
SQL: create T as (select distinct y from R)
Assuming:
RT
| Cost of Sort-based Projection | 79/84 |
The costs involved are (assuming B=n+1 buffers for sort):
RelTempTempTemp
| Hash-based Projection | 80/84 |
Partitioning phase:
| ... Hash-based Projection | 81/84 |
Duplicate elimination phase:
| ... Hash-based Projection | 82/84 |
Algorithm for both phases:
for each tuple T in relation Rel {
T' = mkTuple([attrs],T)
H = h1(T', n)
B = buffer for partition[H]
if (B full) write and clear B
insert T' into B
}
for each partition P in 0..n-1 {
for each tuple T in partition P {
H = h2(T, n)
B = buffer for hash value H
if (T not in B) insert T into B
// assumes B never gets full
}
write and clear all buffers
}
| Exercise 8: Cost of Hash-based Projection | 83/84 |
Consider a table R(x,y,z) with tuples:
Page 0: (1,1,'a') (11,2,'a') (3,3,'c') Page 1: (13,5,'c') (2,6,'b') (9,4,'a') Page 2: (6,2,'a') (17,7,'a') (7,3,'b') Page 3: (14,6,'a') (8,4,'c') (5,2,'b') Page 4: (10,1,'b') (15,5,'b') (12,6,'b') Page 5: (4,2,'a') (16,9,'c') (18,8,'c')-- and then the same tuples repeated for pages 6-11
SQL: create T as (select distinct y from R)
Assuming:
RT
| Cost of Hash-based Projection | 84/84 |
The total cost is the sum of the following:
RTo ensure that n is larger than the largest partition ...
Produced: 10 Mar 2020