❖ Cost Models |
An important aspect of this course is
❖ Cost Models (cont) |
Since time cost is affected by many factors
For comparing methods, page cost is better
Estimating costs with multiple concurrent operations and buffering is difficult!!
❖ Cost Models (cont) |
Terminology:
❖ Cost Models (cont) |
In developing cost models, we make assumptions on how DBMSs store data:
❖ Cost Models (cont) |
Typical DBMS/table parameter values:
Symbol | E.g. Value
r | 10⁶
R | 128 bytes
b | 10⁵
B | 8192 bytes
c | 60
Tr, Tw | 10 msec
one page in memory | ≅ 0
❖ Cost Models (cont) |
Our cost models are "rough" (based on assumptions)
But they do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of piano tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
❖ Storage Technology |
Persistent storage is
❖ Storage Technology (cont) |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
❖ Storage Technology (cont) |
Comparison of HDD and SSD properties:
 | HDD | SSD
Cost/byte | ~ $13 / TB | ~ $35 / TB
Read latency | ~ 6ms | ~ 50µs
Write latency | ~ 6ms | ~ 900µs
R/W rate | 150MB/s | 450MB/s
Read unit | block (e.g. 1KB) | byte
Writing | write a block | write on empty block
Will SSDs ever completely replace HDDs?
❖ Storage Management (cont) |
Aims of storage management in DBMS:
❖ Storage Management |
Topics in storage management ...
❖ Views of Data in Query Evaluation (cont) |
Representing database objects during query evaluation:
DB
Rel
Page
Tuple
PageID = FileID+Offset
Offset
TupleID = PageID+Index
Index
❖ File Management |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
❖ File Management (cont) |
Typical file operations provided by the operating system:
fd = open(fileName,mode)          // open a named file for reading/writing/appending
close(fd)                         // close an open file, via its descriptor
nread = read(fd, buf, nbytes)     // attempt to read data from file into buffer
nwritten = write(fd, buf, nbytes) // attempt to write data from buffer to file
lseek(fd, offset, seek_type)      // move file pointer to relative/absolute file offset
fsync(fd)                         // flush contents of file buffers to disk
❖ DBMS File Organisation |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
❖ Single-file DBMS |
Consider a single file for the entire database (e.g. SQLite)
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
❖ Single-file DBMS (cont) |
Allocating space in Unix files is easy:
❖ Single-file Storage Manager |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
❖ Single-file Storage Manager (cont) |
Each file segment consists of a number of fixed-size blocks.
The following data/constant definitions are useful
#define PAGESIZE 2048  // bytes per page

typedef long PageId;   // PageId is block index
                       // pageOffset = PageId*PAGESIZE
typedef char *Page;    // pointer to page/block buffer
Typical PAGESIZE
❖ Single-file Storage Manager (cont) |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec {
   char *dbname;     // copy of database name
   int   fd;         // the database file
   SpaceMap  map;    // map of free/used areas
   NameTable names;  // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;    // copy of table name
   int   start;      // page index of start of table data
   int   npages;     // number of pages of table data
   ...
} *Rel;
❖ Example: Scanning a Relation |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));

for (int i = 0; i < r->npages; i++) {
   PageId pid = r->start+i;
   get_page(db, pid, buffer);
   for each tuple in buffer {
      get tuple data and extract name
      add (name) to result tuples
   }
}
❖ Single-File Storage Manager |
// start using DB, buffer meta-data
DB openDatabase(char *name) {
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceTable(db->fd);
   db->names = readNameTable(db->fd);
   return db;
}

// stop using DB and update all meta-data
void closeDatabase(DB db) {
   writeSpaceTable(db->fd,db->map);
   writeNameTable(db->fd,db->names);
   fsync(db->fd);
   close(db->fd);
   free(db->dbname);
   free(db);
}
❖ Single-File Storage Manager (cont) |
// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}

// stop using a relation
void closeRelation(Rel r) {
   free(r->relname);
   free(r);
}
❖ Single-File Storage Manager (cont) |
// assume that Page = byte[PAGESIZE]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}
❖ Single-File Storage Manager (cont) |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records

// allocate n new pages
PageId allocate_pages(DB db, int n) {
   if (no existing free chunks are large enough) {
      int endfile = lseek(db->fd, 0, SEEK_END);
      addNewEntry(db->map, endfile, n);
   } else {
      grab "worst fit" chunk
      split off unused section as new chunk
   }
}
❖ Single-File Storage Manager (cont) |
Similar complexity for freeing chunks
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
if (no adjacent free chunks) {
markUnused(db->map, p, n);
} else {
merge adjacent free chunks
compress mapping table
}
}
Changes take effect when closeDatabase() is called.
❖ Multiple-file Disk Manager |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
❖ Multiple-file Disk Manager (cont) |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
❖ Multiple-file Disk Manager (cont) |
Structure of PageId
If system uses one file per table, PageId
PageId
❖ PostgreSQL Storage Manager (cont) |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode {
    Oid spcNode;  // tablespace
    Oid dbNode;   // database
    Oid relNode;  // relation
} RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
❖ Relations as Files (cont) |
The relpath
RelFileNode
char *relpath(RelFileNode r)  // simplified
{
   char *path = malloc(ENOUGH_SPACE);

   if (r.spcNode == GLOBALTABLESPACE_OID) {
      /* Shared system relations live in PGDATA/global */
      Assert(r.dbNode == 0);
      sprintf(path, "%s/global/%u",
              DataDir, r.relNode);
   }
   else if (r.spcNode == DEFAULTTABLESPACE_OID) {
      /* The default tablespace is PGDATA/base */
      sprintf(path, "%s/base/%u/%u",
              DataDir, r.dbNode, r.relNode);
   }
   else {
      /* All other tablespaces accessed via symlinks */
      sprintf(path, "%s/pg_tblspc/%u/%u/%u",
              DataDir, r.spcNode, r.dbNode, r.relNode);
   }
   return path;
}
❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)

File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?

void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);
Analogous to Unix syscalls open()
close()
read()
write()
lseek()
❖ File Descriptor Pool (cont) |
Virtual file descriptors (Vfd
VfdCache[0]
❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd {
    s_short fd;               // current FD, or VFD_CLOSED if none
    u_short fdstate;          // bitflags for VFD's state
    File    nextFree;         // link to next free VFD, if in freelist
    File    lruMoreRecently;  // doubly linked recency-of-use list
    File    lruLessRecently;
    long    seekPos;          // current logical file position
    char   *fileName;         // name of file, or NULL for unused VFD
                              // NB: fileName is malloc'd, and must be
                              //     free'd when closing the VFD
    int     fileFlags;        // open(2) flags for (re)opening the file
    int     fileMode;         // mode to pass to open(2)
} Vfd;
❖ File Manager (cont) |
PostgreSQL stores each table
PGDATA/pg_database.oid
❖ File Manager (cont) |
Data files (Oid, Oid.1, ...):
❖ File Manager (cont) |
Free space map (Oid_fsm):
VACUUM
DELETE
xmax
VACUUM
❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c)
PageID
PageID
typedef struct {
    RelFileNode rnode;     // which relation/file
    ForkNumber  forkNum;   // which fork (of reln)
    BlockNumber blockNum;  // which page/block
} BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
   Vfd vf;  off_t offset;
   (vf, offset) = findBlock(pageID)
   lseek(vf.fd, offset, SEEK_SET)
   vf.seekPos = offset;
   nread = read(vf.fd, buf, BLOCKSIZE)
   if (nread < BLOCKSIZE) ... we have a problem
}
BLOCKSIZE
❖ File Manager (cont) |
findBlock(BufferTag pageID) returns (Vfd, off_t)
{
   offset = pageID.blockNum * BLOCKSIZE
   fileName = relpath(pageID.rnode)
   if (pageID.forkNum > 0)
      fileName = fileName+"."+pageID.forkNum
   if (fileName is not in Vfd pool)
      fd = allocate new Vfd for fileName
   else
      fd = use Vfd from pool
   if (offset > fd.fileSize) {
      fd = allocate new Vfd for next fork
      offset = offset - fd.fileSize
   }
   return (fd, offset)
}
❖ Buffer Pool |
Aim of buffer pool:
❖ Buffer Pool (cont) |
Buffer pool operations: (both take single PageID
request_page(pid)
release_page(pid)
request_page()
getBlock()
release_page()
putBlock()
Buffer pool data structures:
Page frames[NBUFS]
FrameData directory[NBUFS]
Page
byte[BUFSIZE]
❖ Buffer Pool (cont) |
For each frame, we need to know: (FrameData
❖ Buffer Pool (cont) |
How scans are performed without Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   getBlock(pageID, buf);
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
}
Requires N page reads.
If we read it again, requires N page reads again.
❖ Buffer Pool (cont) |
How scans are performed with Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   bufID = request_page(pageID);
   buf = frames[bufID];
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
   release_page(pageID);
}
Requires N page reads the first time.
If we read it again, 0 ≤ page reads ≤ N
❖ Buffer Pool (cont) |
Implementation of request_page()
int request_page(PageID pid)
{
if (pid in Pool)
bufID = index for pid in Pool
else {
if (no free frames in Pool)
evict a page // free a frame
bufID = allocate free frame
directory[bufID].page = pid
directory[bufID].pin_count = 0
directory[bufID].dirty_bit = 0
}
directory[bufID].pin_count++
return bufID
}
❖ Buffer Pool (cont) |
The release_page(pid) operation:
The mark_page(pid) operation:
Note: doesn't actually write to disk; just indicates that page changed
flush_page(pid)
Note: not generally used by higher levels of DBMS
write_page
❖ Buffer Pool (cont) |
Evicting a page ...
❖ Page Replacement Policies |
Several schemes are commonly in use:
❖ Page Replacement Policies (cont) |
Cost benefit from buffer pool (with n frames) is determined by:
Example (a): sequential scan, n ≥ b
First scan costs b reads; subsequent scans are "free".
Example (b): sequential scan, MRU, n < b
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b
All scans cost b reads; known as sequential flooding.
❖ Effect of Buffer Management |
Consider a query to find customers who are also employees:
select c.name from Customer c, Employee e where c.ssn = e.ssn;
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}
❖ Effect of Buffer Management (cont) |
In terms of page-level operations, the algorithm looks like:
Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");

for (int i = 0; i < nPages(rC); i++) {
    PageID pid1 = makePageID(db,rC,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < nPages(rE); j++) {
        PageID pid2 = makePageID(db,rE,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}
❖ PostgreSQL Buffer Manager |
PostgreSQL buffer manager:
Definitions: src/include/storage/buf*.h
Functions: src/backend/storage/buffer/*.c
Buffer code is also used by backends who want a private buffer pool
❖ PostgreSQL Buffer Manager (cont) |
Buffer pool consists of:
BufferDescriptors
NBuffers
BufferDesc
BufferBlocks
NBuffers
Buffer
1..NBuffers
shared_buffers = 16MB # min 128KB, 16*8KB buffers
❖ PostgreSQL Buffer Manager (cont) |
PostgreSQL buffer descriptors:
typedef struct BufferDesc {
    BufferTag tag;      /* ID of page contained in buffer */
    int       buf_id;   /* buffer's index number (from 0) */
    /* state of the tag, containing: flags, refcount and usagecount */
    pg_atomic_uint32 state;
    int       wait_backend_pgprocno;  /* pin-count waiter */
    int       freeNext;               /* link in freelist chain */
    LWLock    content_lock;  /* to lock access to buffer contents */
} BufferDesc;
❖ PostgreSQL Buffer Manager (cont) |
include/storage/buf.h
Buffer
include/storage/bufmgr.h
include/storage/buf_internals.h
BufferDesc
backend/storage/buffer/*.c
Commentary: backend/storage/buffer/README
❖ Some terminology |
Terminology used in these slides ...
Record
Tuple
Record
Page
PageId
pid
pageOffsetInFile
pid * PAGESIZE
TupleId
tid
RecordId
PageId
TupleId
rid
recOffsetInPage
page.directory[tid].offset
Relation
❖ Page Formats |
A Page
byte[]
Want to interpret/manipulate it as a collection of Record
Typical operations on pages and records:
buf = request_page(rel,pid)
PageId
rec = get_record(buf,tid)
rid = insert_record(rel,pid,rec)
update_record(rel,rid,rec)
delete_record(rel,rid)
rid
(PageId,TupleId)
rel
❖ Page Formats (cont) |
Factors affecting Page
Page
Page
Page
Page
Page
❖ Page Formats (cont) |
For fixed-length records, use record slots.
❖ Page Formats (cont) |
For variable-length records, must use record directory
directory[i]
TupleId tid
❖ Page Formats (cont) |
Possibilities for handling free-space within block:
❖ Storage Utilisation |
How many records can fit in a page? (denoted c = capacity)
Depends on:
The capacity c must satisfy: HeaderSize + c*SlotSize + c*R ≤ PageSize
❖ Overflows |
Sometimes, it may not be possible to insert a record into a page:
If still insufficient space, we need an alternative solution ...
❖ Overflows (cont) |
File organisation determines how cases (2)..(4) are handled.
If records may be inserted anywhere that there is free space
❖ PostgreSQL Page Representation |
Functions: src/backend/storage/page/*.c
Definitions: src/include/storage/bufpage.h
Each page is 8KB (default BLCKSZ)
Also supports ~SQL-standard BLOBs (explicit large data items)
❖ PostgreSQL Page Representation (cont) |
Page-related data types:
// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16 LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData {
    unsigned lp_off:15,   // tuple offset from start of page
             lp_flags:2,  // unused, normal, redirect, dead
             lp_len:15;   // length of tuple (bytes)
} ItemIdData;
❖ PostgreSQL Page Representation (cont) |
Page-related data types: (cont)
typedef struct PageHeaderData  // simplified
{
    ...                             // transaction-related data
    uint16        pd_checksum;      // checksum
    uint16        pd_flags;         // flag bits (e.g. free, full, ...)
    LocationIndex pd_lower;         // offset to start of free space
    LocationIndex pd_upper;         // offset to end of free space
    LocationIndex pd_special;       // offset to start of special space
    uint16        pd_pagesize_version;
    ItemIdData    pd_linp[1];       // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;
❖ PostgreSQL Page Representation (cont) |
Operations on Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber PageAddItem(Page page,
Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
❖ PostgreSQL Page Representation (cont) |
PostgreSQL has two kinds of pages:
One important difference:
❖ TOAST Files |
Each data file has a corresponding TOAST file (if needed)
Tuples in data pages contain rid
TOAST = The Oversized Attribute Storage Technique
❖ Tuples |
Each page contains a collection of tuples
What do tuples contain? How are they structured internally?
❖ Records vs Tuples |
A table is defined by a collection of attributes (schema), e.g.
create table Employee ( id integer primary key, name varchar(20), job varchar(10), dept number(4) );
Tuple = collection of attribute values for such a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
Record = sequence of bytes, containing data for one tuple, e.g.
Bytes need to be interpreted relative to schema to get tuple
❖ Operations on Records |
Common operation on records ... access record via RecordId
Record get_record(Relation rel, RecordId rid)
{
    (pid,tid) = rid;
    Page *buf = request_page(rel, pid);
    return get_bytes(buf, tid);
}
Gives a sequence of bytes, which needs to be interpreted, e.g.
Relation rel = ... // relation schema
Record rec = get_record(rel,rid)
Tuple t = makeTuple(rel,rec)
Once we have a tuple, we can access individual attributes/fields
❖ Operations on Tuples |
Once we have a record, we need to interpret it as a tuple ...
Tuple t = makeTuple(Relation rel, Record rec)
rel
Once we have a tuple, we want to examine its contents ...
Typ getTypField(Tuple t, int fieldNum)
fno
Tuple
int x = getIntField(t,1)
char *s = getStrField(t,2)
❖ Scanning |
Access methods typically involve iterators, e.g.
Scan s = start_scan(Relation r, ...)
r
Scan
WHERE
Scan
Tuple next_tuple(Scan s)
Tuple
NULL
Tuples
❖ Example Query |
Example: simple scan of a table ...
select name from Employee
implemented as:
DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t; // current tuple
while ((t = next_tuple(s)) != NULL)
{
char *name = getStrField(t,2);
printf("%s\n", name);
}
❖ Fixed-length Records |
Encoding scheme for fixed-length records:
Since the record format is used frequently at query time, it should be held in memory.
❖ Variable-length Records |
Some encoding schemes for variable-length records:
❖ Converting Records to Tuples |
A Record
byte[]
Tuple
Tuple
struct
Information on how to interpret the bytes as typed values
❖ Converting Records to Tuples (cont) |
DBMSs typically define a fixed set of field types, e.g.
DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
This determines implementation-level data types:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER( ) |
int[] |
|
VARCHAR( ) |
char[] |
❖ Converting Records to Tuples (cont) |
A Tuple
FieldDesc
Record
typedef struct {
    ushort    nfields;   // number of fields/attrs
    ushort    data_off;  // offset in struct for data
    FieldDesc fields[];  // field descriptions
    Record    data;      // pointer to record in buffer
} Tuple;
Fields are derived from relation descriptor + record instance data.
❖ Converting Records to Tuples (cont) |
Tuple data
❖ Converting Records to Tuples (cont) |
Or, tuple data
Tuple struct
❖ PostgreSQL Tuples |
Definitions: include/postgres.h
include/access/*tup*.h
Functions: backend/access/common/*tup*.c
HeapTuple heap_form_tuple(desc,values[],isnull[])
heap_deform_tuple(tuple,desc,values[],isnull[])
Datum
❖ PostgreSQL Tuples (cont) |
Tuple-related data types:
// representation of a data value
typedef uintptr_t Datum;
The actual data value:
Datum
int
❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
// TupleDescData: schema-related information for HeapTuples
typedef struct tupleDescData {
    int         natts;       // number of attributes in the tuple
    Oid         tdtypeid;    // composite type ID for tuple type
    int32       tdtypmod;    // typmod for tuple type
    int         tdrefcount;  // reference count (-1 if not counting)
    TupleConstr *constr;     // constraints, or NULL if none
    FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
    // attrs[N] is description of attribute number N+1
} *TupleDesc;
❖ PostgreSQL Tuples (cont) |
HeapTupleData
typedef HeapTupleData *HeapTuple;

typedef struct HeapTupleData {
    uint32          t_len;       // length of *t_data
    ItemPointerData t_self;      // SelfItemPointer
    Oid             t_tableOid;  // table the tuple came from
    HeapTupleHeader t_data;      // -> tuple header and data
} HeapTupleData;
HeapTupleHeader
❖ PostgreSQL Tuples (cont) |
PostgreSQL stores a single block of data for tuple
byte[]
typedef struct HeapTupleHeaderData  // simplified
{
    HeapTupleFields t_heap;
    ItemPointerData t_ctid;       // TID of this tuple or newer version
    uint16          t_infomask2;  // #attributes + flags
    uint16          t_infomask;   // flags e.g. has_null, has_varwidth
    uint8           t_hoff;       // sizeof header incl. bitmap+padding
    // above is fixed size (23 bytes) for all heap tuples
    bits8           t_bits[1];    // bitmap of NULLs, variable length
    // OID goes here if HEAP_HASOID is set in t_infomask
    // actual data follows at end of struct
} HeapTupleHeaderData;
❖ PostgreSQL Tuples (cont) |
Tuple-related data types: (cont)
typedef struct HeapTupleFields  // simplified
{
    TransactionId t_xmin;  // inserting xact ID
    TransactionId t_xmax;  // deleting or locking xact ID
    union {
        CommandId     t_cid;   // inserting or deleting command ID
        TransactionId t_xvac;  // old-style VACUUM FULL xact ID
    } t_field3;
} HeapTupleFields;
Note that not all system fields from the stored tuple appear here:
oid
xmin
xmax
cmin
cmax
Produced: 6 May 2024