❖ Cost Models |
An important aspect of this course is
❖ Cost Models (cont) |
Since time cost is affected by many factors
For comparing methods, page cost is better
Estimating costs with multiple concurrent operations and buffering is difficult!!
❖ Cost Models (cont) |
Terminology:
❖ Cost Models (cont) |
In developing cost models, we make assumptions on how DBMSs store data:
❖ Cost Models (cont) |
Typical DBMS/table parameter values:
Symbol | E.g. Value
r | 10⁶
R | 128 bytes
b | 10⁵
B | 8192 bytes
c | 60
Tr, Tw | 10 msec
one page in memory | ≅ 0
❖ Cost Models (cont) |
Our cost models are "rough" (based on assumptions)
But they do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of piano tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
❖ Storage Technology |
Persistent storage is
❖ Storage Technology (cont) |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
❖ Storage Technology (cont) |
Comparison of HDD and SSD properties:
 | HDD | SSD
Cost/byte | ~ $13 / TB | ~ $35 / TB
Read latency | ~ 6ms | ~ 50µs
Write latency | ~ 6ms | ~ 900µs
R/W rate | 150MB/s | 450MB/s
Read unit | block (e.g. 1KB) | byte
Writing | write a block | write on empty block
Will SSDs ever completely replace HDDs?
❖ Storage Management (cont) |
Aims of storage management in DBMS:
❖ Storage Management |
Topics in storage management ...
❖ Views of Data in Query Evaluation (cont) |
Representing database objects during query evaluation:
DB
Rel
Page
Tuple
PageID = FileID+Offset
Offset
TupleID = PageID+Index
Index
❖ File Management |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
❖ File Management (cont) |
Typical file operations provided by the operating system:
fd = open(fileName,mode)          // open a named file for reading/writing/appending
close(fd)                         // close an open file, via its descriptor
nread = read(fd, buf, nbytes)     // attempt to read data from file into buffer
nwritten = write(fd, buf, nbytes) // attempt to write data from buffer to file
lseek(fd, offset, seek_type)      // move file pointer to relative/absolute file offset
fsync(fd)                         // flush contents of file buffers to disk
❖ DBMS File Organisation |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
❖ Single-file DBMS |
Consider a single file for the entire database (e.g. SQLite)
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
❖ Single-file DBMS (cont) |
Allocating space in Unix files is easy:
❖ Single-file Storage Manager |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
❖ Single-file Storage Manager (cont) |
Each file segment consists of a number of fixed-size blocks.
The following data/constant definitions are useful
#define PAGESIZE 2048  // bytes per page

typedef long PageId;   // PageId is block index
                       // pageOffset = PageId*PAGESIZE
typedef char *Page;    // pointer to page/block buffer
Typical PAGESIZE
❖ Single-file Storage Manager (cont) |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec {
   char *dbname;     // copy of database name
   int   fd;         // the database file
   SpaceMap  map;    // map of free/used areas
   NameTable names;  // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;    // copy of table name
   int   start;      // page index of start of table data
   int   npages;     // number of pages of table data
   ...
} *Rel;
❖ Example: Scanning a Relation |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));

for (int i = 0; i < r->npages; i++) {
   PageId pid = r->start+i;
   get_page(db, pid, buffer);
   for each tuple in buffer {
      get tuple data and extract name
      add (name) to result tuples
   }
}
❖ Single-File Storage Manager |
// start using DB, buffer meta-data
DB openDatabase(char *name) {
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceTable(db->fd);
   db->names = readNameTable(db->fd);
   return db;
}

// stop using DB and update all meta-data
void closeDatabase(DB db) {
   writeSpaceTable(db->fd,db->map);
   writeNameTable(db->fd,db->names);
   fsync(db->fd);
   close(db->fd);
   free(db->dbname);
   free(db);
}
❖ Single-File Storage Manager (cont) |
// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}

// stop using a relation
void closeRelation(Rel r) {
   free(r->relname);
   free(r);
}
❖ Single-File Storage Manager (cont) |
// assume that Page = byte[PAGESIZE]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}
❖ Single-File Storage Manager (cont) |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records

// allocate n new pages
PageId allocate_pages(DB db, int n) {
   if (no existing free chunks are large enough) {
      int endfile = lseek(db->fd, 0, SEEK_END);
      addNewEntry(db->map, endfile, n);
   } else {
      grab "worst fit" chunk
      split off unused section as new chunk
   }
}
❖ Single-File Storage Manager (cont) |
Similar complexity for freeing chunks
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
if (no adjacent free chunks) {
markUnused(db->map, p, n);
} else {
merge adjacent free chunks
compress mapping table
}
}
Changes take effect when closeDatabase() is called.
❖ Multiple-file Disk Manager |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
❖ Multiple-file Disk Manager (cont) |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
❖ Multiple-file Disk Manager (cont) |
Structure of PageId
If system uses one file per table, PageId
PageId
❖ PostgreSQL Storage Manager (cont) |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode {
    Oid spcNode;  // tablespace
    Oid dbNode;   // database
    Oid relNode;  // relation
} RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
❖ Relations as Files (cont) |
The relpath
RelFileNode
char *relpath(RelFileNode r)  // simplified
{
   char *path = malloc(ENOUGH_SPACE);

   if (r.spcNode == GLOBALTABLESPACE_OID) {
      /* Shared system relations live in PGDATA/global */
      Assert(r.dbNode == 0);
      sprintf(path, "%s/global/%u",
              DataDir, r.relNode);
   }
   else if (r.spcNode == DEFAULTTABLESPACE_OID) {
      /* The default tablespace is PGDATA/base */
      sprintf(path, "%s/base/%u/%u",
              DataDir, r.dbNode, r.relNode);
   }
   else {
      /* All other tablespaces accessed via symlinks */
      sprintf(path, "%s/pg_tblspc/%u/%u/%u",
              DataDir, r.spcNode, r.dbNode, r.relNode);
   }
   return path;
}
❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)

File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?

void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);
Analogous to Unix syscalls open()
close()
read()
write()
lseek()
❖ File Descriptor Pool (cont) |
Virtual file descriptors (Vfd
VfdCache[0]
❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd {
    s_short fd;               // current FD, or VFD_CLOSED if none
    u_short fdstate;          // bitflags for VFD's state
    File    nextFree;         // link to next free VFD, if in freelist
    File    lruMoreRecently;  // doubly linked recency-of-use list
    File    lruLessRecently;
    long    seekPos;          // current logical file position
    char   *fileName;         // name of file, or NULL for unused VFD
                              // NB: fileName is malloc'd, and must be
                              //     free'd when closing the VFD
    int     fileFlags;        // open(2) flags for (re)opening the file
    int     fileMode;         // mode to pass to open(2)
} Vfd;
❖ File Manager (cont) |
PostgreSQL stores each table
PGDATA/pg_database.oid
❖ File Manager (cont) |
Data files (Oid, Oid.1, ...):
❖ File Manager (cont) |
Free space map (Oid_fsm):
VACUUM
DELETE
xmax
VACUUM
❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c)
PageID
PageID
typedef struct {
    RelFileNode rnode;     // which relation/file
    ForkNumber  forkNum;   // which fork (of reln)
    BlockNumber blockNum;  // which page/block
} BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
   Vfd vf;  off_t offset;
   (vf, offset) = findBlock(pageID)
   lseek(vf.fd, offset, SEEK_SET)
   vf.seekPos = offset;
   nread = read(vf.fd, buf, BLOCKSIZE)
   if (nread < BLOCKSIZE) ... we have a problem
}
BLOCKSIZE
❖ File Manager (cont) |
findBlock(BufferTag pageID) returns (Vfd, off_t)
{
   offset = pageID.blockNum * BLOCKSIZE
   fileName = relpath(pageID.rnode)
   if (pageID.forkNum > 0)
      fileName = fileName+"."+pageID.forkNum
   if (fileName is not in Vfd pool)
      fd = allocate new Vfd for fileName
   else
      fd = use Vfd from pool
   if (offset > fd.fileSize) {
      fd = allocate new Vfd for next fork
      offset = offset - fd.fileSize
   }
   return (fd, offset)
}
❖ Buffer Pool |
Aim of buffer pool:
❖ Buffer Pool (cont) |
Buffer pool operations: (both take single PageID
request_page(pid)
release_page(pid)
request_page()
getBlock()
release_page()
putBlock()
Buffer pool data structures:
Page frames[NBUFS]
FrameData directory[NBUFS]
Page
byte[BUFSIZE]
❖ Buffer Pool (cont) |
For each frame, we need to know: (FrameData
❖ Buffer Pool (cont) |
How scans are performed without Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   getBlock(pageID, buf);
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
}
Requires N page reads.
If we read it again, requires N page reads again.
❖ Buffer Pool (cont) |
How scans are performed with Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   bufID = request_page(pageID);
   buf = frames[bufID];
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
   release_page(pageID);
}
Requires N page reads the first time.
If we read it again, 0 ≤ page reads ≤ N
❖ Buffer Pool (cont) |
Implementation of request_page()
int request_page(PageID pid)
{
if (pid in Pool)
bufID = index for pid in Pool
else {
if (no free frames in Pool)
evict a page // free a frame
bufID = allocate free frame
directory[bufID].page = pid
directory[bufID].pin_count = 0
directory[bufID].dirty_bit = 0
}
directory[bufID].pin_count++
return bufID
}
❖ Buffer Pool (cont) |
The release_page(pid) operation:
The mark_page(pid) operation:
Note: doesn't actually write to disk; just indicates that page changed
flush_page(pid)
Note: not generally used by higher levels of DBMS
write_page
❖ Buffer Pool (cont) |
Evicting a page ...
❖ Page Replacement Policies |
Several schemes are commonly in use:
❖ Page Replacement Policies (cont) |
Cost benefit from buffer pool (with n frames) is determined by:
Example (a): sequential scan, n ≥ b
First scan costs b reads; subsequent scans are "free".
Example (b): sequential scan, MRU, n < b
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b
All scans cost b reads; known as sequential flooding.
❖ Effect of Buffer Management |
Consider a query to find customers who are also employees:
select c.name from Customer c, Employee e where c.ssn = e.ssn;
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}
❖ Effect of Buffer Management (cont) |
In terms of page-level operations, the algorithm looks like:
Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");

for (int i = 0; i < nPages(rC); i++) {
    PageID pid1 = makePageID(db,rC,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < nPages(rE); j++) {
        PageID pid2 = makePageID(db,rE,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}
❖ PostgreSQL Buffer Manager |
PostgreSQL buffer manager:
Definitions: src/include/storage/buf*.h
Functions: src/backend/storage/buffer/*.c
Buffer code is also used by backends who want a private buffer pool
❖ PostgreSQL Buffer Manager (cont) |
Buffer pool consists of:
BufferDescriptors
NBuffers
BufferDesc
BufferBlocks
NBuffers
Buffer
1..NBuffers
shared_buffers = 16MB # min 128KB, 16*8KB buffers
❖ PostgreSQL Buffer Manager (cont) |
PostgreSQL buffer descriptors:
typedef struct BufferDesc {
    BufferTag tag;      /* ID of page contained in buffer */
    int       buf_id;   /* buffer's index number (from 0) */
    /* state of the tag, containing: flags, refcount and usagecount */
    pg_atomic_uint32 state;
    int       wait_backend_pgprocno;  /* pin-count waiter */
    int       freeNext;               /* link in freelist chain */
    LWLock    content_lock;  /* to lock access to buffer contents */
} BufferDesc;
❖ PostgreSQL Buffer Manager (cont) |
include/storage/buf.h
Buffer
include/storage/bufmgr.h
include/storage/buf_internals.h
BufferDesc
backend/storage/buffer/*.c
Commentary: backend/storage/buffer/README
❖ Some terminology |
Terminology used in these slides ...
Record
Tuple
Record
Page
PageId
pid
pageOffsetInFile
pid * PAGESIZE
TupleId
tid
RecordId
PageId
TupleId
rid
recOffsetInPage
page.directory[tid].offset
Relation
❖ Page Formats |
A Page
byte[]
Want to interpret/manipulate it as a collection of Record
Typical operations on pages and records:
buf = request_page(rel,pid)
PageId
rec = get_record(buf,tid)
rid = insert_record(rel,pid,rec)
update_record(rel,rid,rec)
delete_record(rel,rid)
rid
(PageId,TupleId)
rel
❖ Page Formats (cont) |
Factors affecting Page
Page
Page
Page
Page
Page
❖ Page Formats (cont) |
For fixed-length records, use record slots.
❖ Page Formats (cont) |
For variable-length records, must use record directory
directory[i]
TupleId tid
❖ Page Formats (cont) |
Possibilities for handling free-space within block:
❖ Storage Utilisation |
How many records can fit in a page? (denoted c = capacity)
Depends on:
The capacity c must satisfy: HeaderSize + c*SlotSize + c*R ≤ PageSize
❖ Overflows |
Sometimes, it may not be possible to insert a record into a page:
If still insufficient space, we need an alternative solution ...
❖ Overflows (cont) |
File organisation determines how cases (2)..(4) are handled.
If records may be inserted anywhere that there is free space
❖ PostgreSQL Page Representation |
Functions: src/backend/storage/page/*.c
Definitions: src/include/storage/bufpage.h
Each page is 8KB (default BLCKSZ)
Also supports ~SQL-standard BLOBs (explicit large data items)
❖ PostgreSQL Page Representation (cont) |
Page-related data types:
// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16 LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData {
    unsigned lp_off:15,   // tuple offset from start of page
             lp_flags:2,  // unused, normal, redirect, dead
             lp_len:15;   // length of tuple (bytes)
} ItemIdData;
❖ PostgreSQL Page Representation (cont) |
Page-related data types: (cont)
typedef struct PageHeaderData  // simplified
{
    ...                             // transaction-related data
    uint16        pd_checksum;      // checksum
    uint16        pd_flags;         // flag bits (e.g. free, full, ...)
    LocationIndex pd_lower;         // offset to start of free space
    LocationIndex pd_upper;         // offset to end of free space
    LocationIndex pd_special;       // offset to start of special space
    uint16        pd_pagesize_version;
    ItemIdData    pd_linp[1];       // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;
❖ PostgreSQL Page Representation (cont) |
Operations on Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber PageAddItem(Page page,
Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
❖ PostgreSQL Page Representation (cont) |
PostgreSQL has two kinds of pages:
One important difference:
❖ TOAST Files |
Each data file has a corresponding TOAST file (if needed)
Tuples in data pages contain rid
TOAST = The Oversized Attribute Storage Technique
❖ Tuples |
Each page contains a collection of tuples
What do tuples contain? How are they structured internally?
❖ Records vs Tuples |
A table is defined by a collection of attributes (schema), e.g.
create table Employee ( id integer primary key, name varchar(20), job varchar(10), dept number(4) );
Tuple = collection of attribute values for such a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
Record = sequence of bytes, containing data for one tuple, e.g.
Bytes need to be interpreted relative to schema to get tuple
❖ Operations on Records |
Common operation on records ... access record via RecordId
Record get_record(Relation rel, RecordId rid)
{
    (pid,tid) = rid;
    Page *buf = request_page(rel, pid);
    return get_bytes(buf, tid);
}
Gives a sequence of bytes, which needs to be interpreted, e.g.
Relation rel = ... // relation schema
Record rec = get_record(rel,rid)
Tuple t = makeTuple(rel,rec)
Once we have a tuple, we can access individual attributes/fields
❖ Operations on Tuples |
Once we have a record, we need to interpret it as a tuple ...
Tuple t = makeTuple(Relation rel, Record rec)
rel
Once we have a tuple, we want to examine its contents ...
Typ getTypField(Tuple t, int fieldNum)
fno
Tuple
int x = getIntField(t,1)
char *s = getStrField(t,2)
❖ Scanning |
Access methods typically involve iterators, e.g.
Scan s = start_scan(Relation r, ...)
r
Scan
WHERE
Scan
Tuple next_tuple(Scan s)
Tuple
NULL
Tuples
❖ Example Query |
Example: simple scan of a table ...
select name from Employee
implemented as:
DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t; // current tuple
while ((t = next_tuple(s)) != NULL)
{
char *name = getStrField(t,2);
printf("%s\n", name);
}
❖ Fixed-length Records |
Encoding scheme for fixed-length records:
Since the record format is used frequently at query time, it should be held in memory.
❖ Variable-length Records |
Some encoding schemes for variable-length records:
❖ Converting Records to Tuples |
A Record
byte[]
Tuple
Tuple
struct
Information on how to interpret the bytes as typed values
❖ Converting Records to Tuples (cont) |
DBMSs typically define a fixed set of field types, e.g.
DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
This determines implementation-level data types:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER( ) |
int[] |
|
VARCHAR( ) |
char[] |
❖ Converting Records to Tuples (cont) |
A Tuple
FieldDesc
Record
typedef struct {
    ushort    nfields;   // number of fields/attrs
    ushort    data_off;  // offset in struct for data
    FieldDesc fields[];  // field descriptions
    Record    data;      // pointer to record in buffer
} Tuple;
Fields are derived from relation descriptor + record instance data.
❖ Converting Records to Tuples (cont) |
Tuple data
❖ Converting Records to Tuples (cont) |
Or, tuple data
Tuple struct
❖ PostgreSQL Tuples |
Definitions: include/postgres.h
include/access/*tup*.h
Functions: backend/access/common/*tup*.c
HeapTuple heap_form_tuple(desc,values[],isnull[])
heap_deform_tuple(tuple,desc,values[],isnull[])
Datum
❖ PostgreSQL Tuples (cont) |
Tuple-related data types:
// representation of a data value
typedef uintptr_t Datum;
The actual data value:
Datum
int
❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
// TupleDescData: schema-related information for HeapTuples
typedef struct tupleDescData {
    int         natts;       // number of attributes in the tuple
    Oid         tdtypeid;    // composite type ID for tuple type
    int32       tdtypmod;    // typmod for tuple type
    int         tdrefcount;  // reference count (-1 if not counting)
    TupleConstr *constr;     // constraints, or NULL if none
    FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
    // attrs[N] is description of attribute number N+1
} *TupleDesc;
❖ PostgreSQL Tuples (cont) |
HeapTupleData
typedef HeapTupleData *HeapTuple;

typedef struct HeapTupleData {
    uint32          t_len;       // length of *t_data
    ItemPointerData t_self;      // SelfItemPointer
    Oid             t_tableOid;  // table the tuple came from
    HeapTupleHeader t_data;      // -> tuple header and data
} HeapTupleData;
HeapTupleHeader
❖ PostgreSQL Tuples (cont) |
PostgreSQL stores a single block of data for tuple
byte[]
typedef struct HeapTupleHeaderData  // simplified
{
    HeapTupleFields t_heap;
    ItemPointerData t_ctid;       // TID of this tuple or newer version
    uint16          t_infomask2;  // #attributes + flags
    uint16          t_infomask;   // flags e.g. has_null, has_varwidth
    uint8           t_hoff;       // sizeof header incl. bitmap+padding
    // above is fixed size (23 bytes) for all heap tuples
    bits8           t_bits[1];    // bitmap of NULLs, variable length
    // OID goes here if HEAP_HASOID is set in t_infomask
    // actual data follows at end of struct
} HeapTupleHeaderData;
❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
typedef struct HeapTupleFields  // simplified
{
    TransactionId t_xmin;  // inserting xact ID
    TransactionId t_xmax;  // deleting or locking xact ID
    union {
        CommandId     t_cid;   // inserting or deleting command ID
        TransactionId t_xvac;  // old-style VACUUM FULL xact ID
    } t_field3;
} HeapTupleFields;
Note that not all system fields from the stored tuple appear here:
oid
xmin
xmax
cmin
cmax
Produced: 6 May 2024