❖ Cost Models |
An important aspect of this course is
❖ Cost Models (cont) |
Since time cost is affected by many factors
For comparing methods, page cost is better
Estimating costs with multiple concurrent operations and buffering is difficult!!
❖ Cost Models (cont) |
Terminology:
❖ Cost Models (cont) |
In developing cost models, we make assumptions on how DBMSs store data:
❖ Cost Models (cont) |
Typical DBMS/table parameter values:
| Symbol | E.g. Value | |||
| r | 106 | |||
| R | 128 bytes | |||
| b | 105 | |||
| B | 8192 bytes | |||
| c | 60 | |||
| Tr ,Tw | 10 msec | |||
one page in memory |
- | ≅ 0 |
❖ Cost Models (cont) |
Our cost models are "rough" (based on assumptions)
But do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of piano tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
❖ Storage Technology |
Persistent storage is
❖ Storage Technology (cont) |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
❖ Storage Technology (cont) |
Comparison of HDD and SSD properties:
| HDD | SSD | |
| Cost/byte | ~ $13 / TB | ~ $35 / TB |
| Read latency | ~ 6ms | ~ 50µs |
| Write latency | ~ 6ms | ~ 900µs | R/W rate | 150MB/s | 450MB/s |
| Read unit | block (e.g. 1KB) | byte |
| Writing | write a block | write on empty block |
Will SSDs ever completely replace HDDs?
❖ Storage Management (cont) |
Aims of storage management in DBMS:
❖ Storage Management |
Topics in storage management ...
❖ Views of Data in Query Evaluation (cont) |
Representing database objects during query evaluation:
DBRelPageTuplePageID = FileID+OffsetOffsetTupleID = PageID+IndexIndex❖ File Management |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
❖ File Management (cont) |
Typical file operations provided by the operating system:
fd = open(fileName,mode) // open a named file for reading/writing/appending close(fd) // close an open file, via its descriptor nread = read(fd, buf, nbytes) // attempt to read data from file into buffer nwritten = write(fd, buf, nbytes) // attempt to write data from buffer to file lseek(fd, offset, seek_type) // move file pointer to relative/absolute file offset fsync(fd) // flush contents of file buffers to disk
❖ DBMS File Organisation |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
❖ Single-file DBMS |
Consider a single file for the entire database (e.g. SQLite)
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
❖ Single-file DBMS (cont) |
Allocating space in Unix files is easy:
❖ Single-file Storage Manager |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
❖ Single-file Storage Manager (cont) |
Each file segment consists of a number fixed-size blocks
The following data/constant definitions are useful
#define PAGESIZE 2048 // bytes per page typedef long PageId; // PageId is block index // pageOffset=PageId*PAGESIZE typedef char *Page; // pointer to page/block buffer
Typical PAGESIZE
❖ Single-file Storage Manager (cont) |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec {
char *dbname; // copy of database name
int fd; // the database file
SpaceMap map; // map of free/used areas
NameTable names; // map names to areas + sizes
} *DB;
typedef struct Relrec {
char *relname; // copy of table name
int start; // page index of start of table data
int npages; // number of pages of table data
...
} *Rel;
❖ Example: Scanning a Relation |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < r->npages; i++) {
PageId pid = r->start+i;
get_page(db, pid, buffer);
for each tuple in buffer {
get tuple data and extract name
add (name) to result tuples
}
}
❖ Single-File Storage Manager |
// start using DB, buffer meta-data DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(db->fd); db->names = readNameTable(db->fd); return db; } // stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db->dbname); free(db); }
❖ Single-File Storage Manager (cont) |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname); // get relation data from map tables r->start = ...; r->npages = ...; return r; } // stop using a relation void closeRelation(Rel r) { free(r->relname); free(r); }
❖ Single-File Storage Manager (cont) |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); } // write page from memory buffer to file void put_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
❖ Single-File Storage Manager (cont) |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records // allocate n new pages PageId allocate_pages(DB db, int n) { if (no existing free chunks are large enough) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n); } else { grab "worst fit" chunk split off unused section as new chunk } }
❖ Single-File Storage Manager (cont) |
Similar complexity for freeing chunks
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
if (no adjacent free chunks) {
markUnused(db->map, p, n);
} else {
merge adjacent free chunks
compress mapping table
}
}
Changes take effect when closeDatabase()
❖ Multiple-file Disk Manager |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
❖ Multiple-file Disk Manager (cont) |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
❖ Multiple-file Disk Manager (cont) |
Structure of PageId
If system uses one file per table, PageId
PageId❖ PostgreSQL Storage Manager (cont) |
Components of storage subsystem:
RelFileNodestorage/smgrstorage/smgr/md.cstorage/filesmgr❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode {
Oid spcNode; // tablespace
Oid dbNode; // database
Oid relNode; // relation
} RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OIDdbNode == 0❖ Relations as Files (cont) |
The relpathRelFileNode
char *relpath(RelFileNode r) // simplified { char *path = malloc(ENOUGH_SPACE); if (r.spcNode == GLOBALTABLESPACE_OID) { /* Shared system relations live in PGDATA/global */ Assert(r.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, r.relNode); } else if (r.spcNode == DEFAULTTABLESPACE_OID) { /* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, r.dbNode, r.relNode); } else { /* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir r.spcNode, r.dbNode, r.relNode); } return path; }
❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()typedef char *FileName
Open files are referenced via: typedef int File
A File
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName,
int fileFlags, int fileMode);
// open a file in the database directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact);
// open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int FileRead(File file, char *buffer, int amount);
int FileWrite(File file, char *buffer, int amount);
int FileSync(File file);
long FileSeek(File file, long offset, int whence);
int FileTruncate(File file, long offset);
Analogous to Unix syscalls open()close()read()write()lseek()
❖ File Descriptor Pool (cont) |
Virtual file descriptors (Vfd
VfdCache[0]❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd
{
s_short fd; // current FD, or VFD_CLOSED if none
u_short fdstate; // bitflags for VFD's state
File nextFree; // link to next free VFD, if in freelist
File lruMoreRecently; // doubly linked recency-of-use list
File lruLessRecently;
long seekPos; // current logical file position
char *fileName; // name of file, or NULL for unused VFD
// NB: fileName is malloc'd, and must be free'd when closing the VFD
int fileFlags; // open(2) flags for (re)opening the file
int fileMode; // mode to pass to open(2)
} Vfd;
❖ File Manager (cont) |
PostgreSQL stores each table
PGDATA/pg_database.oid
❖ File Manager (cont) |
Data files (Oid, Oid.1, ...):
❖ File Manager (cont) |
Free space map (Oid_fsm):
VACUUMDELETExmaxVACUUM❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c
PageIDPageID
typedef struct
{
RelFileNode rnode; // which relation/file
ForkNumber forkNum; // which fork (of reln)
BlockNumber blockNum; // which page/block
} BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables // buffer obtained from Buffer pool getBlock(BufferTag pageID, Buffer buf) { Vfd vf; off_t offset; (vf, offset) = findBlock(pageID) lseek(vf.fd, offset, SEEK_SET) vf.seekPos = offset; nread = read(vf.fd, buf, BLOCKSIZE) if (nread < BLOCKSIZE) ... we have a problem }
BLOCKSIZE
❖ File Manager (cont) |
findBlock(BufferTag pageID) returns (Vfd, off_t)
{
offset = pageID.blockNum * BLOCKSIZE
fileName = relpath(pageID.rnode)
if (pageID.forkNum > 0)
fileName = fileName+"."+pageID.forkNum
if (fileName is not in Vfd pool)
fd = allocate new Vfd for fileName
else
fd = use Vfd from pool
if (offset > fd.fileSize) {
fd = allocate new Vfd for next fork
offset = offset - fd.fileSize
}
return (fd, offset)
}
❖ Buffer Pool |
Aim of buffer pool:
❖ Buffer Pool (cont) |
Buffer pool operations: (both take single PageID
request_page(pid)release_page(pid)request_page()getBlock()release_page()putBlock()Buffer pool data structures:
Page frames[NBUFS]FrameData directory[NBUFS]Pagebyte[BUFSIZE]❖ Buffer Pool (cont) |
For each frame, we need to know: (FrameData
❖ Buffer Pool (cont) |
How scans are performed without Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
pageID = makePageID(db,Rel,i);
getBlock(pageID, buf);
for (j = 0; j < nTuples(buf); j++)
process(buf, j)
}
Requires N
If we read it again, N
❖ Buffer Pool (cont) |
How scans are performed with Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
pageID = makePageID(db,Rel,i);
bufID = request_page(pageID);
buf = frames[bufID]
for (j = 0; j < nTuples(buf); j++)
process(buf, j)
release_page(pageID);
}
Requires N
If we read it again, 0 ≤ page reads ≤ N
❖ Buffer Pool (cont) |
Implementation of request_page()
int request_page(PageID pid)
{
if (pid in Pool)
bufID = index for pid in Pool
else {
if (no free frames in Pool)
evict a page // free a frame
bufID = allocate free frame
directory[bufID].page = pid
directory[bufID].pin_count = 0
directory[bufID].dirty_bit = 0
}
directory[bufID].pin_count++
return bufID
}
❖ Buffer Pool (cont) |
The release_page(pid)
The
The
mark_page(pid) operation:
Note: doesn't actually write to disk; just indicates that page changed
flush_page(pid)
Note: not generally used by higher levels of DBMS
write_page
❖ Buffer Pool (cont) |
Evicting a page ...
❖ Page Replacement Policies |
Several schemes are commonly in use:
❖ Page Replacement Policies (cont) |
Cost benefit from buffer pool (with n frames) is determined by:
First scan costs b reads; subsequent scans are "free".
Example (b): sequential scan, MRU, n < b
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b
All scans cost b reads; known as sequential flooding.
❖ Effect of Buffer Management |
Consider a query to find customers who are also employees:
select c.name from Customer c, Employee e where c.ssn = e.ssn;
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Customer {
for each tuple t2 in Employee {
if (t1.ssn == t2.ssn)
append (t1.name) to result set
}
}
❖ Effect of Buffer Management (cont) |
In terms of page-level operations, the algorithm looks like:
Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
PageID pid1 = makePageID(db,rC,i);
Page p1 = request_page(pid1);
for (int j = 0; j < nPages(rE); j++) {
PageID pid2 = makePageID(db,rE,j);
Page p2 = request_page(pid2);
// compare all pairs of tuples from p1,p2
// construct solution set from matching pairs
release_page(pid2);
}
release_page(pid1);
}
❖ PostgreSQL Buffer Manager |
PostgreSQL buffer manager:
Definitions: src/include/storage/buf*.h
Functions: src/backend/storage/buffer/*.c
Buffer code is also used by backends who want a private buffer pool
❖ PostgreSQL Buffer Manager (cont) |
Buffer pool consists of:
BufferDescriptors
NBuffersBufferDescBufferBlocksNBuffersBuffer1..NBuffers
shared_buffers = 16MB # min 128KB, 16*8KB buffers
❖ PostgreSQL Buffer Manager (cont) |
PostgreSQL buffer descriptors:
typedef struct BufferDesc
{
BufferTag tag; /* ID of page contained in buffer */
int buf_id; /* buffer's index number (from 0) */
/* state of the tag, containing:
flags, refcount and usagecount */
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* pin-count waiter */
int freeNext; /* link in freelist chain */
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
❖ PostgreSQL Buffer Manager (cont) |
include/storage/buf.h
Bufferinclude/storage/bufmgr.hinclude/storage/buf_internals.hBufferDescbackend/storage/buffer/*.c
Commentary: backend/storage/buffer/README
❖ Some terminology |
Terminology used in these slides ...
RecordTupleRecordPagePageIdpidpageOffsetInFilepid * PAGESIZETupleIdtidRecordIdPageIdTupleIdridrecOffsetInPagepage.directory[tid].offsetRelation❖ Page Formats |
A Pagebyte[]
Want to interpret/manipulate it as a collection of Record
Typical operations on pages and records:
buf = request_page(rel,pid)PageIdrec = get_record(buf,tid)rid = insert_record(rel,pid,rec)update_record(rel,rid,rec)delete_record(rel,rid)rid(PageId,TupleId)rel❖ Page Formats (cont) |
Factors affecting Page
PagePagePagePagePage❖ Page Formats (cont) |
For fixed-length records, use record slots.
❖ Page Formats (cont) |
For variable-length records, must use record directory
directory[i]TupleId tid❖ Page Formats (cont) |
Possibilities for handling free-space within block:
❖ Storage Utilisation |
How many records can fit in a page? (denoted c = capacity)
Depends on:
Given c, HeaderSize + c*SlotSize + c*R ≤ PageSize
❖ Overflows |
Sometimes, it may not be possible to insert a record into a page:
If still insufficient space, we need an alternative solution ...
❖ Overflows (cont) |
File organisation determines how cases (2)..(4) are handled.
If records may be inserted anywhere that there is free space
❖ PostgreSQL Page Representation |
Functions: src/backend/storage/page/*.c
Definitions: src/include/storage/bufpage.h
Each page is 8KB (default BLCKSZ
Also supports ~SQL-standard BLOBs (explicit large data items)
❖ PostgreSQL Page Representation (cont) |
Page-related data types:
// a Page is simply a pointer to start of buffer typedef Pointer Page; // indexes into the tuple directory typedef uint16 LocationIndex; // entries in tuple directory (line pointer array) typedef struct ItemIdData { unsigned lp_off:15, // tuple offset from start of page lp_flags:2, // unused,normal,redirect,dead lp_len:15; // length of tuple (bytes) } ItemIdData;
❖ PostgreSQL Page Representation (cont) |
Page-related data types: (cont)
typedef struct PageHeaderData (simplified) { ... // transaction-related data uint16 pd_checksum; // checksum uint16 pd_flags; // flag bits (e.g. free, full, ... LocationIndex pd_lower; // offset to start of free space LocationIndex pd_upper; // offset to end of free space LocationIndex pd_special; // offset to start of special space uint16 pd_pagesize_version; ItemIdData pd_linp[1]; // beginning of line pointer array } PageHeaderData; typedef PageHeaderData *PageHeader;
❖ PostgreSQL Page Representation (cont) |
Operations on Page
void PageInit(Page page, Size pageSize, ...)
Pagepd_lowerpd_upperOffsetNumber PageAddItem(Page page,
Item item, Size size, ...)Pagevoid PageRepairFragmentation(Page page)❖ PostgreSQL Page Representation (cont) |
PostgreSQL has two kinds of pages:
One important difference:
❖ TOAST Files |
Each data file has a corresponding TOAST file (if needed)
Tuples in data pages contain rid
TOAST = The Oversized Attribute Storage Technique
❖ Tuples |
Each page contains a collection of tuples
What do tuples contain? How are they structured internally?
❖ Records vs Tuples |
A table is defined by a collection of attributes (schema), e.g.
create table Employee ( id integer primary key, name varchar(20), job varchar(10), dept number(4) );
Tuple = collection of attribute values for such a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
Record = sequence of bytes, containing data for one tuple, e.g.
Bytes need to be interpreted relative to schema to get tuple
❖ Operations on Records |
Common operation on records ... access record via RecordId
Record get_record(Relation rel, RecordId rid) {
(pid,tid) = rid;
Page *buf = request_page(rel, pid);
return get_bytes(buf, tid);
}
Gives a sequence of bytes, which needs to be interpreted, e.g.
Relation rel = ... // relation schema
Record rec = get_record(rel,rid)
Tuple t = makeTuple(rel,rec)
Once we have a tuple, we can access individual attributes/fields
❖ Operations on Tuples |
Once we have a record, we need to interpret it as a tuple ...
Tuple t = makeTuple(Relation rel, Record rec)
rel
Once we have a tuple, we want to examine its contents ...
Typ getTypField(Tuple t, int fieldNum)
fnoTupleint x = getIntField(t,1)char *s = getStrField(t,2)❖ Scanning |
Access methods typically involve iterators, e.g.
Scan s = start_scan(Relation r, ...)
rScanWHEREScan
Tuple next_tuple(Scan s)
TupleNULLTuples❖ Example Query |
Example: simple scan of a table ...
select name from Employee
implemented as:
DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t; // current tuple
while ((t = next_tuple(s)) != NULL)
{
char *name = getStrField(t,2);
printf("%s\n", name);
}
❖ Fixed-length Records |
Encoding scheme for fixed-length records:
Since record format is frequently used at query time, should be in memory.
❖ Variable-length Records |
Some encoding schemes for variable-length records:
❖ Converting Records to Tuples |
A Recordbyte[]
TupleTuplestructInformation on how to interpret the bytes as typed values
❖ Converting Records to Tuples (cont) |
DBMSs typically define a fixed set of field types, e.g.
DATEFLOATINTEGERNUMBER()VARCHAR()This determines implementation-level data types:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER() |
int[] |
|
VARCHAR() |
char[] |
❖ Converting Records to Tuples (cont) |
A Tuple
FieldDescRecord
typedef struct {
ushort nfields; // number of fields/attrs
ushort data_off; // offset in struct for data
FieldDesc fields[]; // field descriptions
Record data; // pointer to record in buffer
} Tuple;
Fields are derived from relation descriptor + record instance data.
❖ Converting Records to Tuples (cont) |
Tuple data
❖ Converting Records to Tuples (cont) |
Or, tuple data
Tuple struct
❖ PostgreSQL Tuples |
Definitions: include/postgres.hinclude/access/*tup*.h
Functions: backend/access/common/*tup*.c
HeapTuple heap_form_tuple(desc,values[],isnull[])heap_deform_tuple(tuple,desc,values[],isnull[])Datum❖ PostgreSQL Tuples (cont) |
Tuple-related data types:
// representation of a data value
typedef uintptr_t Datum;
The actual data value:
Datumint❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
// TupleDescData: schema-related information for HeapTuples typedef struct tupleDescData { int natts; // number of attributes in the tuple Oid tdtypeid; // composite type ID for tuple type int32 tdtypmod; // typmod for tuple type int tdrefcount; // reference count (-1 if not counting) TupleConstr *constr; // constraints, or NULL if none FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER]; // attrs[N] is description of attribute number N+1 } *TupleDesc;
❖ PostgreSQL Tuples (cont) |
HeapTupleData
typedef HeapTupleData *HeapTuple;
typedef struct HeapTupleData
{
uint32 t_len; // length of *t_data
ItemPointerData t_self; // SelfItemPointer
Oid t_tableOid; // table the tuple came from
HeapTupleHeader t_data; // -> tuple header and data
} HeapTupleData;
HeapTupleHeader
❖ PostgreSQL Tuples (cont) |
PostgreSQL stores a single block of data for tuple
byte[]typedef struct HeapTupleHeaderData // simplified { HeapTupleFields t_heap; ItemPointerData t_ctid; // TID of this tuple or newer version uint16 t_infomask2; // #attributes + flags uint16 t_infomask; // flags e.g. has_null, has_varwidth uint8 t_hoff; // sizeof header incl. bitmap+padding // above is fixed size (23 bytes) for all heap tuples bits8 t_bits[1]; // bitmap of NULLs, variable length // OID goes here if HEAP_HASOID is set in t_infomask // actual data follows at end of struct } HeapTupleHeaderData;
❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
typedef struct HeapTupleFields // simplified { TransactionId t_xmin; // inserting xact ID TransactionId t_xmax; // deleting or locking xact ID union { CommandId t_cid; // inserting or deleting command ID TransactionId t_xvac;// old-style VACUUM FULL xact ID } t_field3; } HeapTupleFields;
Note that not all system fields from stored tuple appear
oidxminxmaxcmincmaxProduced: 6 May 2024