B: Cost models, Storage management

COMP9315 24T1 ♢ Lectures Part B♢ [0/109]
❖ Cost Models

An important aspect of this course is

Cost can be measured in terms of Primary assumptions in our cost models: Page = fixed-size block of data; size determined by storage medium
COMP9315 24T1 ♢ Lectures Part B♢ [1/109]
❖ Cost Models (cont)

Since time cost is affected by many factors

we do not consider time cost in our analyses.

For comparing methods, page cost is better

Measures the number of page read/write requests made.

Estimating costs with multiple concurrent operations and  buffering is difficult!!

COMP9315 24T1 ♢ Lectures Part B♢ [2/109]
❖ Cost Models (cont)


Terminology:

COMP9315 24T1 ♢ Lectures Part B♢ [3/109]
❖ Cost Models (cont)

In developing cost models, we make assumptions on how DBMSs store data:

[Diagram:Pics/scansortproj/file-struct0.png]

COMP9315 24T1 ♢ Lectures Part B♢ [4/109]
❖ Cost Models (cont)

Typical DBMS/table parameter values:

Quantity Symbol E.g. Value
total # tuples r 106
record size R 128 bytes
total # pages b 105
page size B 8192 bytes
# tuples per page c 60
page read/write time Tr ,Tw 10 msec
cost to process
one page in memory
- ≅ 0
COMP9315 24T1 ♢ Lectures Part B♢ [5/109]
❖ Cost Models (cont)

Our cost models are "rough" (based on assumptions)

But do give an O(x)  feel for how expensive operations are.

Example "rough" estimation: how many piano tuners in Sydney?

Actual number of piano tuners in Yellow Pages = 120

Example borrowed from Alan Fekete at Sydney University.

COMP9315 24T1 ♢ Lectures Part B♢ [6/109]
❖ Storage Technology

Persistent storage is

Computational storage is Access cost HDD:RAM ≅ 100000:1, e.g.
COMP9315 24T1 ♢ Lectures Part B♢ [7/109]
❖ Storage Technology (cont)

Hard disks are well-established, cheap, high-volume, ...

Alternative bulk storage: SSD

Feasible for long-term, high-update environments?
COMP9315 24T1 ♢ Lectures Part B♢ [8/109]
❖ Storage Technology (cont)

Comparison of HDD and SSD properties:

  HDD SSD
Cost/byte ~ $13 / TB ~ $35 / TB
Read latency ~ 6ms ~ 50µs
Write latency ~ 6ms ~ 900µs
R/W rate 150MB/s 450MB/s
Read unit block (e.g. 1KB) byte
Writing write a block write on empty block


Will SSDs ever completely replace HDDs?

COMP9315 24T1 ♢ Lectures Part B♢ [9/109]
❖ Storage Management

Part of DBMS related to storage management:

[Diagram:Pics/storage/dbmsarch.png]

COMP9315 24T1 ♢ Lectures Part B♢ [10/109]
❖ Storage Management (cont)


Aims of storage management in DBMS:

COMP9315 24T1 ♢ Lectures Part B♢ [11/109]
❖ Storage Management

Topics in storage management ...

COMP9315 24T1 ♢ Lectures Part B♢ [12/109]
❖ Views of Data in Query Evaluation

[Diagram:Pics/storage/query-ops.png]

COMP9315 24T1 ♢ Lectures Part B♢ [13/109]
❖ Views of Data in Query Evaluation (cont)

Representing database objects during query evaluation:

Addressing in DBMSs:
COMP9315 24T1 ♢ Lectures Part B♢ [14/109]
❖ File Management


Aims of file management subsystem:

Builds higher-level operations on top of OS file operations.

COMP9315 24T1 ♢ Lectures Part B♢ [15/109]
❖ File Management (cont)

Typical file operations provided by the operating system:

fd = open(fileName,mode)
  // open a named file for reading/writing/appending
close(fd)
  // close an open file, via its descriptor 
nread = read(fd, buf, nbytes)
  // attempt to read data from file into buffer 
nwritten = write(fd, buf, nbytes)
  // attempt to write data from buffer to file
lseek(fd, offset, seek_type)
  // move file pointer to relative/absolute file offset
fsync(fd)
  // flush contents of file buffers to disk

COMP9315 24T1 ♢ Lectures Part B♢ [16/109]
❖ DBMS File Organisation

How is data for DB objects arranged in the file system?

Different DBMSs make different choices, e.g.

COMP9315 24T1 ♢ Lectures Part B♢ [17/109]
❖ Single-file DBMS

Consider a single file for the entire database (e.g. SQLite)

Objects are allocated to regions (segments) of the file.

[Diagram:Pics/storage/single-file.png]

If an object grows too large for allocated segment, allocate an extension.

What happens to allocated space when objects are removed?

COMP9315 24T1 ♢ Lectures Part B♢ [18/109]
❖ Single-file DBMS (cont)

Allocating space in Unix files is easy:

If the seek goes way beyond the end of the file: With the above, a disk/file manager is easy to implement.
COMP9315 24T1 ♢ Lectures Part B♢ [19/109]
❖ Single-file Storage Manager

Consider the following simple single-file DBMS layout:

[Diagram:Pics/storage/single-file-example.png]

E.g.

SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]

TableMap = [ ("employee",20,500), ("project",620,40) ]

COMP9315 24T1 ♢ Lectures Part B♢ [20/109]
❖ Single-file Storage Manager (cont)

Each file segment consists of a number fixed-size blocks

The following data/constant definitions are useful

#define PAGESIZE 2048   // bytes per page

typedef long PageId;    // PageId is block index
                        // pageOffset=PageId*PAGESIZE

typedef char *Page;     // pointer to page/block buffer


Typical PAGESIZE values:  1024,  2048,  4096,  8192

COMP9315 24T1 ♢ Lectures Part B♢ [21/109]
❖ Single-file Storage Manager (cont)

Storage Manager data structures for opened DBs & Tables

typedef struct DBrec {
   char *dbname;    // copy of database name
   int fd;          // the database file
   SpaceMap map;    // map of free/used areas 
   NameTable names; // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;   // copy of table name
   int   start;     // page index of start of table data
   int   npages;    // number of pages of table data
   ...
} *Rel;

COMP9315 24T1 ♢ Lectures Part B♢ [22/109]
❖ Example: Scanning a Relation

With the above disk manager, our example:

select name from Employee

might be implemented as something like

DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < r->npages; i++) {
   PageId pid = r->start+i;
   get_page(db, pid, buffer);
   for each tuple in buffer {
      get tuple data and extract name
      add (name) to result tuples
   }
}

COMP9315 24T1 ♢ Lectures Part B♢ [23/109]
❖ Single-File Storage Manager

// start using DB, buffer meta-data
DB openDatabase(char *name) { 
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceTable(db->fd);
   db->names = readNameTable(db->fd);
   return db;
}
// stop using DB and update all meta-data
void closeDatabase(DB db) {
   writeSpaceTable(db->fd,db->map);
   writeNameTable(db->fd,db->map);
   fsync(db->fd);
   close(db->fd);
   free(db->dbname);
   free(db);
}

COMP9315 24T1 ♢ Lectures Part B♢ [24/109]
❖ Single-File Storage Manager (cont)

// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}

// stop using a relation
void closeRelation(Rel r) {
   free(r->relname);
   free(r);
}

COMP9315 24T1 ♢ Lectures Part B♢ [25/109]
❖ Single-File Storage Manager (cont)

// assume that Page = byte[PageSize]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}

COMP9315 24T1 ♢ Lectures Part B♢ [26/109]
❖ Single-File Storage Manager (cont)

Managing contents of space mapping table can be complex:

// assume an array of (offset,length,status) records

// allocate n new pages
PageId allocate_pages(DB db, int n) {
   if (no existing free chunks are large enough) {
      int endfile = lseek(db->fd, 0, SEEK_END);
      addNewEntry(db->map, endfile, n);
   } else {
      grab "worst fit" chunk
      split off unused section as new chunk
   }
}

COMP9315 24T1 ♢ Lectures Part B♢ [27/109]
❖ Single-File Storage Manager (cont)

Similar complexity for freeing chunks

// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
   if (no adjacent free chunks) {
      markUnused(db->map, p, n);
   } else {
      merge adjacent free chunks
      compress mapping table
   }
}

Changes take effect when closeDatabase() executed.

COMP9315 24T1 ♢ Lectures Part B♢ [28/109]
❖ Multiple-file Disk Manager

Most DBMSs don't use a single large file for all data.

They typically provide:

Precise file structure varies between individual DBMSs.

Using multiple files (one file per relation) can be easier, e.g.

COMP9315 24T1 ♢ Lectures Part B♢ [29/109]
❖ Multiple-file Disk Manager (cont)

Example of single-file vs multiple-file:

[Diagram:Pics/storage/single-vs-multi.png]


Consider how you would compute file offset of page[i] in table[1] ...

COMP9315 24T1 ♢ Lectures Part B♢ [30/109]
❖ Multiple-file Disk Manager (cont)

Structure of PageId for data pages in such systems ...

If system uses one file per table, PageId contains:

If system uses several files per table, PageId contains:
COMP9315 24T1 ♢ Lectures Part B♢ [31/109]
❖ PostgreSQL Storage Manager

PostgreSQL uses the following file organisation ...

[Diagram:Pics/storage/pg-file-arch.png]

COMP9315 24T1 ♢ Lectures Part B♢ [32/109]
❖ PostgreSQL Storage Manager (cont)

Components of storage subsystem:

PostgreSQL has two basic kinds of files: Note: smgr designed for many storage devices; only disk handler provided
COMP9315 24T1 ♢ Lectures Part B♢ [33/109]
❖ Relations as Files

PostgreSQL identifies relation files via their OIDs.

The core data structure for this is RelFileNode:

typedef struct RelFileNode {
    Oid  spcNode;  // tablespace
    Oid  dbNode;   // database
    Oid  relNode;  // relation
} RelFileNode;

Global (shared) tables (e.g. pg_database) have

COMP9315 24T1 ♢ Lectures Part B♢ [34/109]
❖ Relations as Files (cont)

The relpath function maps RelFileNode to file:


char *relpath(RelFileNode r)  // simplified
{
   char *path = malloc(ENOUGH_SPACE);

   if (r.spcNode == GLOBALTABLESPACE_OID) {
      /* Shared system relations live in PGDATA/global */
      Assert(r.dbNode == 0);
      sprintf(path, "%s/global/%u",
              DataDir, r.relNode);
   }
   else if (r.spcNode == DEFAULTTABLESPACE_OID) {
      /* The default tablespace is PGDATA/base */
      sprintf(path, "%s/base/%u/%u",
              DataDir, r.dbNode, r.relNode);
   }
   else {
      /* All other tablespaces accessed via symlinks */
      sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir
              r.spcNode, r.dbNode, r.relNode);
   }
   return path;
}

COMP9315 24T1 ♢ Lectures Part B♢ [35/109]
❖ File Descriptor Pool

Unix has limits on the number of concurrently open files.

PostgreSQL maintains a pool of open file descriptors:

File names are simply strings: typedef char *FileName

Open files are referenced via: typedef int File

A File is an index into a table of "virtual file descriptors".

COMP9315 24T1 ♢ Lectures Part B♢ [36/109]
❖ File Descriptor Pool (cont)

Interface to file descriptor (pool):


File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);

Analogous to Unix syscalls open(), close(), read(), write(), lseek(), ...

COMP9315 24T1 ♢ Lectures Part B♢ [37/109]
❖ File Descriptor Pool (cont)

Virtual file descriptors (Vfd)

VfdCache[0] holds list head/tail pointers.
COMP9315 24T1 ♢ Lectures Part B♢ [38/109]
❖ File Descriptor Pool (cont)

Virtual file descriptor records (simplified):


typedef struct vfd
{
    s_short  fd;              // current FD, or VFD_CLOSED if none
    u_short  fdstate;         // bitflags for VFD's state
    File     nextFree;        // link to next free VFD, if in freelist
    File     lruMoreRecently; // doubly linked recency-of-use list
    File     lruLessRecently;
    long     seekPos;         // current logical file position
    char     *fileName;       // name of file, or NULL for unused VFD
    // NB: fileName is malloc'd, and must be free'd when closing the VFD
    int      fileFlags;       // open(2) flags for (re)opening the file
    int      fileMode;        // mode to pass to open(2)
} Vfd;

COMP9315 24T1 ♢ Lectures Part B♢ [39/109]
❖ File Manager

Reminder: PostgreSQL file organisation

[Diagram:Pics/storage/pg-file-arch.png]

COMP9315 24T1 ♢ Lectures Part B♢ [40/109]
❖ File Manager (cont)

PostgreSQL stores each table

[Diagram:Pics/storage/one-table-files.png]

COMP9315 24T1 ♢ Lectures Part B♢ [41/109]
❖ File Manager (cont)

Data files   (Oid, Oid.1, ...):


[Diagram:Pics/storage/heap-file.png]

COMP9315 24T1 ♢ Lectures Part B♢ [42/109]
❖ File Manager (cont)

Free space map   (Oid_fsm):

Visibility map   (Oid_vm):
COMP9315 24T1 ♢ Lectures Part B♢ [43/109]
❖ File Manager (cont)

The "magnetic disk storage manager" (storage/smgr/md.c)

PostgreSQL PageID values are structured:

typedef struct
{
    RelFileNode rnode;    // which relation/file
    ForkNumber  forkNum;  // which fork (of reln)
    BlockNumber blockNum; // which page/block 
} BufferTag;

COMP9315 24T1 ♢ Lectures Part B♢ [44/109]
❖ File Manager (cont)

Access to a block of data proceeds (roughly) as follows:

// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
   Vfd vf;  off_t offset;
   (vf, offset) = findBlock(pageID)
   lseek(vf.fd, offset, SEEK_SET)
   vf.seekPos = offset;
   nread = read(vf.fd, buf, BLOCKSIZE)
   if (nread < BLOCKSIZE) ... we have a problem
}

BLOCKSIZE is a global configurable constant (default: 8192)

COMP9315 24T1 ♢ Lectures Part B♢ [45/109]
❖ File Manager (cont)

findBlock(BufferTag pageID) returns (Vfd, off_t)
{
   offset = pageID.blockNum * BLOCKSIZE
   fileName = relpath(pageID.rnode)
   if (pageID.forkNum > 0)
      fileName = fileName+"."+pageID.forkNum
   if (fileName is not in Vfd pool)
      fd = allocate new Vfd for fileName
   else
      fd = use Vfd from pool
   if (offset > fd.fileSize) {
      fd = allocate new Vfd for next fork
      offset = offset - fd.fileSize
   }
   return (fd, offset)
}

COMP9315 24T1 ♢ Lectures Part B♢ [46/109]
❖ Buffer Pool

Aim of buffer pool:

Used by: Uses:
Note: we use the terms page and block interchangably
COMP9315 24T1 ♢ Lectures Part B♢ [47/109]
❖ Buffer Pool (cont)

[Diagram:Pics/storage/pool.png]

COMP9315 24T1 ♢ Lectures Part B♢ [48/109]
❖ Buffer Pool (cont)

Buffer pool operations:   (both take single PageID argument)

To some extent ...

Buffer pool data structures:

COMP9315 24T1 ♢ Lectures Part B♢ [49/109]
❖ Buffer Pool (cont)

[Diagram:Pics/storage/buffer-pool.png]

COMP9315 24T1 ♢ Lectures Part B♢ [50/109]
❖ Buffer Pool (cont)

For each frame, we need to know:   (FrameData)


Pages are referenced by PageID ...
COMP9315 24T1 ♢ Lectures Part B♢ [51/109]
❖ Buffer Pool (cont)

How scans are performed without Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   getBlock(pageID, buf);
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
}

Requires N page reads.

If we read it again, N page reads.

COMP9315 24T1 ♢ Lectures Part B♢ [52/109]
❖ Buffer Pool (cont)

How scans are performed with Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   bufID = request_page(pageID);
   buf = frames[bufID]
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
   release_page(pageID);
}

Requires N page reads on the first pass.

If we read it again, 0 ≤ page reads ≤ N

COMP9315 24T1 ♢ Lectures Part B♢ [53/109]
❖ Buffer Pool (cont)

Implementation of request_page()

int request_page(PageID pid)
{
   if (pid in Pool)
      bufID = index for pid in Pool
   else {
      if (no free frames in Pool)
         evict a page // free a frame
      bufID = allocate free frame
      directory[bufID].page = pid
      directory[bufID].pin_count = 0
      directory[bufID].dirty_bit = 0
   }
   directory[bufID].pin_count++
   return bufID
}

COMP9315 24T1 ♢ Lectures Part B♢ [54/109]
❖ Buffer Pool (cont)

The release_page(pid) operation:

Note: no effect on disk or buffer contents until replacement required

The mark_page(pid) operation:

Note: doesn't actually write to disk; just indicates that page changed

The flush_page(pid) operation:

Note: not generally used by higher levels of DBMS

COMP9315 24T1 ♢ Lectures Part B♢ [55/109]
❖ Buffer Pool (cont)


Evicting a page ...

If multiple frames can potentially be released
COMP9315 24T1 ♢ Lectures Part B♢ [56/109]
❖ Page Replacement Policies


Several schemes are commonly in use:

LRU / MRU require knowledge of when pages were last accessed
COMP9315 24T1 ♢ Lectures Part B♢ [57/109]
❖ Page Replacement Policies (cont)

Cost benefit from buffer pool (with n frames) is determined by:

Example (a): sequential scan, LRU or MRU, n ≥ b

First scan costs b  reads; subsequent scans are "free".

Example (b): sequential scan, MRU, n < b

First scan costs b  reads; subsequent scans cost b - n  reads.

Example (c): sequential scan, LRU, n < b

All scans cost b  reads; known as sequential flooding.

COMP9315 24T1 ♢ Lectures Part B♢ [58/109]
❖ Effect of Buffer Management

Consider a query to find customers who are also employees:

select c.name
from   Customer c, Employee e
where  c.ssn = e.ssn;

This might be implemented inside the DBMS via nested loops:

for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}

COMP9315 24T1 ♢ Lectures Part B♢ [59/109]
❖ Effect of Buffer Management (cont)

In terms of page-level operations, the algorithm looks like:

Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
    PageID pid1 = makePageID(db,rC,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < nPages(rE); j++) {
        PageID pid2 = makePageID(db,rE,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}

COMP9315 24T1 ♢ Lectures Part B♢ [60/109]
❖ PostgreSQL Buffer Manager

PostgreSQL buffer manager:

Buffers are located in a large region of shared memory.

Definitions:  src/include/storage/buf*.h

Functions:  src/backend/storage/buffer/*.c


Buffer code is also used by backends who want a private buffer pool

COMP9315 24T1 ♢ Lectures Part B♢ [61/109]
❖ PostgreSQL Buffer Manager (cont)

Buffer pool consists of:

BufferDescriptors

BufferBlocks Buffer = index values in above arrays Size of buffer pool is set in postgresql.conf, e.g.

shared_buffers = 16MB   # min 128KB, 16*8KB buffers

COMP9315 24T1 ♢ Lectures Part B♢ [62/109]
❖ PostgreSQL Buffer Manager (cont)

PostgreSQL buffer descriptors:

typedef struct BufferDesc
{
   BufferTag  tag;           /* ID of page contained in buffer */
   int        buf_id;        /* buffer's index number (from 0) */

   /* state of the tag, containing:
               flags, refcount and usagecount */
   pg_atomic_uint32 state;

   int         wait_backend_pgprocno;   /* pin-count waiter */
   int         freeNext;      /* link in freelist chain */
   LWLock      content_lock;  /* to lock access to buffer contents */
} BufferDesc;

COMP9315 24T1 ♢ Lectures Part B♢ [63/109]
❖ PostgreSQL Buffer Manager (cont)

[Diagram:Pics/storage/buffer-memory.png]

COMP9315 24T1 ♢ Lectures Part B♢ [64/109]
❖ PostgreSQL Buffer Manager (cont)

include/storage/buf.h

include/storage/bufmgr.h include/storage/buf_internals.h Code: backend/storage/buffer/*.c

Commentary: backend/storage/buffer/README

COMP9315 24T1 ♢ Lectures Part B♢ [65/109]
❖ Page/Tuple Management


[Diagram:Pics/storage/dbmsarch1.png]

COMP9315 24T1 ♢ Lectures Part B♢ [66/109]
❖ Some terminology

Terminology used in these slides ...

COMP9315 24T1 ♢ Lectures Part B♢ [67/109]
❖ Reminder: Views of Data

[Diagram:Pics/storage/views-of-data.png]


Each tuple is represented by a record in some page

COMP9315 24T1 ♢ Lectures Part B♢ [68/109]
❖ Page Formats

A Page is simply an array of bytes (byte[]).

Want to interpret/manipulate it as a collection of Records.

Typical operations on pages and records:


Note:  rid = (PageId,TupleId),  rel = open relation
COMP9315 24T1 ♢ Lectures Part B♢ [69/109]
❖ Page Formats (cont)

Factors affecting Page formats:

Implementation of Page operations critically depends on format.
COMP9315 24T1 ♢ Lectures Part B♢ [70/109]
❖ Page Formats (cont)

For fixed-length records, use record slots.

[Diagram:Pics/storage/fixed-length-slots.png]

COMP9315 24T1 ♢ Lectures Part B♢ [71/109]
❖ Page Formats (cont)

For variable-length records, must use record directory

An important aspect of using record directory Issue with variable-length records We refer to tuple index within directory as   TupleId tid
COMP9315 24T1 ♢ Lectures Part B♢ [72/109]
❖ Page Formats (cont)

Possibilities for handling free-space within block:

In practice, a combination is useful:
COMP9315 24T1 ♢ Lectures Part B♢ [73/109]
❖ Page Formats (cont)

Compacted free space ... before inserting record 7

[Diagram:Pics/storage/insert1b.png]

COMP9315 24T1 ♢ Lectures Part B♢ [74/109]
❖ Page Formats (cont)

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert1c.png]

COMP9315 24T1 ♢ Lectures Part B♢ [75/109]
❖ Page Formats (cont)

Fragmented free space ... before inserting record 7

[Diagram:Pics/storage/insert2b.png]

COMP9315 24T1 ♢ Lectures Part B♢ [76/109]
❖ Page Formats (cont)

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert2c.png]

COMP9315 24T1 ♢ Lectures Part B♢ [77/109]
❖ Storage Utilisation

How many records can fit in a page? (denoted c = capacity)

Depends on:

We typically consider average record size (R)

Given c,   HeaderSize + c*SlotSize + c*R  ≤  PageSize

COMP9315 24T1 ♢ Lectures Part B♢ [78/109]
❖ Overflows

Sometimes, it may not be possible to insert a record into a page:

  1. no free-space fragment large enough
  2. overall free-space in page is not large enough
  3. the record is larger than the page
  4. no more free directory slots in page
For case (1), can first try to compact free-space within the page.

If still insufficient space, we need an alternative solution ...

COMP9315 24T1 ♢ Lectures Part B♢ [79/109]
❖ Overflows (cont)

File organisation determines how cases (2)..(4) are handled.

If records may be inserted anywhere that there is free space

If file organisation determines record placement (e.g. hashed file) With overflow pages, rid structure may need modifying (rel,page,ovfl,rec)
COMP9315 24T1 ♢ Lectures Part B♢ [80/109]
❖ Overflows (cont)

Overflow pages for full buckets in a hashed file:

[Diagram:Pics/storage/ovflow-files.png]

COMP9315 24T1 ♢ Lectures Part B♢ [81/109]
❖ Overflows (cont)

Overflow file for very large records and BLOBs:


[Diagram:Pics/storage/ovflow-blobs.png]

COMP9315 24T1 ♢ Lectures Part B♢ [82/109]
❖ PostgreSQL Page Representation

Functions: src/backend/storage/page/*.c

Definitions: src/include/storage/bufpage.h

Each page is 8KB (default BLCKSZ) and contains:

Large data items are stored in separate (TOAST) files   (implicit)

Also supports ~SQL-standard BLOBs   (explicit large data items)

COMP9315 24T1 ♢ Lectures Part B♢ [83/109]
❖ PostgreSQL Page Representation (cont)

PostgreSQL page layout:

[Diagram:Pics/storage/pg-page-struct.png]

COMP9315 24T1 ♢ Lectures Part B♢ [84/109]
❖ PostgreSQL Page Representation (cont)

Page-related data types:


// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16  LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData
{
   unsigned   lp_off:15,  // tuple offset from start of page
              lp_flags:2, // unused,normal,redirect,dead
              lp_len:15;  // length of tuple (bytes)
} ItemIdData;

COMP9315 24T1 ♢ Lectures Part B♢ [85/109]
❖ PostgreSQL Page Representation (cont)

Page-related data types: (cont)


typedef struct PageHeaderData  (simplified)
{
  ...                        // transaction-related data
  uint16        pd_checksum; // checksum
  uint16        pd_flags;    // flag bits (e.g. free, full, ...
  LocationIndex pd_lower;    // offset to start of free space
  LocationIndex pd_upper;    // offset to end of free space
  LocationIndex pd_special;  // offset to start of special space
  uint16        pd_pagesize_version;
  ItemIdData    pd_linp[1];  // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;

COMP9315 24T1 ♢ Lectures Part B♢ [86/109]
❖ PostgreSQL Page Representation (cont)

Operations on Pages:

void PageInit(Page page, Size pageSize, ...)

OffsetNumber PageAddItem(Page page,
                    Item item, Size size, ...)
void PageRepairFragmentation(Page page)
COMP9315 24T1 ♢ Lectures Part B♢ [87/109]
❖ PostgreSQL Page Representation (cont)

PostgreSQL has two kinds of pages:

Both kinds of page have the same page layout.

One important difference:

COMP9315 24T1 ♢ Lectures Part B♢ [88/109]
❖ TOAST Files

Each data file has a corresponding TOAST file (if needed)

[Diagram:Pics/storage/toast-file.png]

Tuples in data pages contain rids for long values

TOAST = The Oversized Attribute Storage Technique

COMP9315 24T1 ♢ Lectures Part B♢ [89/109]
❖ Tuples

Each page contains a collection of tuples

[Diagram:Pics/storage/page-tuples.png]


What do tuples contain? How are they structured internally?

COMP9315 24T1 ♢ Lectures Part B♢ [90/109]
❖ Records vs Tuples

A table is defined by a collection of attributes (schema), e.g.

create table Employee (
   id integer primary key,  name varchar(20),
   job  varchar(10),  dept number(4)
);

Tuple = collection of attribute values for such a schema, e.g.

(33357462, 'Neil Young', 'Musician', 0277)

Record = sequence of bytes, containing data for one tuple, e.g.

[Diagram:Pics/storage/tuple-bytes.png]

Bytes need to be interpreted relative to schema to get tuple

COMP9315 24T1 ♢ Lectures Part B♢ [91/109]
❖ Operations on Records

Common operation on records ... access record via RecordId:

Record get_record(Relation rel, RecordId rid) {
    (pid,tid) = rid;
    Page *buf = request_page(rel, pid);
    return get_bytes(buf, tid);
}

Gives a sequence of bytes, which needs to be interpreted, e.g.

Relation rel = ... // relation schema
Record rec = get_record(rel,rid)
Tuple t = makeTuple(rel,rec)

Once we have a tuple, we can access individual attributes/fields

COMP9315 24T1 ♢ Lectures Part B♢ [92/109]
❖ Operations on Tuples

Once we have a record, we need to interpret it as a tuple ...

Tuple t = makeTuple(Relation rel, Record rec)


Once we have a tuple, we want to examine its contents ...

Typ   getTypField(Tuple t, int fieldNum)

E.g.   int x = getIntField(t,1),   char *s = getStrField(t,2)
COMP9315 24T1 ♢ Lectures Part B♢ [93/109]
❖ Scanning

Access methods typically involve iterators, e.g.


Scan s = start_scan(Relation r, ...)


Tuple next_tuple(Scan s)

COMP9315 24T1 ♢ Lectures Part B♢ [94/109]
❖ Example Query

Example: simple scan of a table ...

select name from Employee

implemented as:

DB db = openDatabase("myDB");
Relation r = openRel(db,"Employee");
Scan s = start_scan(r);
Tuple t;  // current tuple
while ((t = next_tuple(s)) != NULL)
{
   char *name = getStrField(t,2);
   printf("%s\n", name);
}

COMP9315 24T1 ♢ Lectures Part B♢ [95/109]
❖ Fixed-length Records

Encoding scheme for fixed-length records:

[Diagram:Pics/storage/fixed-length.png]

Since record format is frequently used at query time, should be in memory.

COMP9315 24T1 ♢ Lectures Part B♢ [96/109]
❖ Variable-length Records

Some encoding schemes for variable-length records:

COMP9315 24T1 ♢ Lectures Part B♢ [97/109]
❖ Converting Records to Tuples

A Record is an array of bytes (byte[])

A Tuple is a collection of named,typed values  (cf. C struct)

Information on how to interpret the bytes as typed values

For variable-length records, some formatting info ...
COMP9315 24T1 ♢ Lectures Part B♢ [98/109]
❖ Converting Records to Tuples (cont)

DBMSs typically define a fixed set of field types, e.g.

DATE,  FLOAT,  INTEGER,  NUMBER(n),  VARCHAR(n), ...

This determines implementation-level data types:

DATE time_t
FLOAT float,double
INTEGER int,long
NUMBER(n) int[] (?)
VARCHAR(n) char[]

COMP9315 24T1 ♢ Lectures Part B♢ [99/109]
❖ Converting Records to Tuples (cont)

A Tuple could be defined as

typedef struct {
  ushort    nfields;   // number of fields/attrs
  ushort    data_off;  // offset in struct for data
  FieldDesc fields[];  // field descriptions
  Record    data;      // pointer to record in buffer
} Tuple;

Fields are derived from relation descriptor + record instance data.

COMP9315 24T1 ♢ Lectures Part B♢ [100/109]
❖ Converting Records to Tuples (cont)

Tuple data could be

[Diagram:Pics/storage/rec8.png]

COMP9315 24T1 ♢ Lectures Part B♢ [101/109]
❖ Converting Records to Tuples (cont)

Or, tuple data could be ...

[Diagram:Pics/storage/rec9.png]

COMP9315 24T1 ♢ Lectures Part B♢ [102/109]
❖ PostgreSQL Tuples

Definitions: include/postgres.h,  include/access/*tup*.h

Functions: backend/access/common/*tup*.c  e.g.

PostgreSQL defines tuples via:
COMP9315 24T1 ♢ Lectures Part B♢ [103/109]
❖ PostgreSQL Tuples (cont)

Tuple structure:

[Diagram:Pics/storage/pg-tuple-struct.png]

COMP9315 24T1 ♢ Lectures Part B♢ [104/109]
❖ PostgreSQL Tuples (cont)

Tuple-related data types:

// representation of a data value
typedef uintptr_t Datum;

The actual data value:

COMP9315 24T1 ♢ Lectures Part B♢ [105/109]
❖ PostgreSQL Tuples (cont)

Tuple-related data types: (cont)

// TupleDescData: schema-related information for HeapTuples

typedef struct tupleDescData
{
  int         natts;          // number of attributes in the tuple 
  Oid         tdtypeid;       // composite type ID for tuple type 
  int32       tdtypmod;       // typmod for tuple type 
  int         tdrefcount;     // reference count (-1 if not counting)
  TupleConstr *constr;        // constraints, or NULL if none 
  FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
  // attrs[N] is description of attribute number N+1 
} *TupleDesc;

COMP9315 24T1 ♢ Lectures Part B♢ [106/109]
❖ PostgreSQL Tuples (cont)

HeapTupleData contains information about a stored tuple

typedef HeapTupleData *HeapTuple;

typedef struct HeapTupleData
{
  uint32           t_len;  // length of *t_data 
  ItemPointerData t_self;  // SelfItemPointer 
  Oid         t_tableOid;  // table the tuple came from 
  HeapTupleHeader t_data;  // -> tuple header and data 
} HeapTupleData;

HeapTupleHeader is a pointer to a location in a buffer

COMP9315 24T1 ♢ Lectures Part B♢ [107/109]
❖ PostgreSQL Tuples (cont)

PostgreSQL stores a single block of data for tuple

  • containing a tuple header, followed by data byte[]

    typedef struct HeapTupleHeaderData // simplified
    {
      HeapTupleFields t_heap;
      ItemPointerData t_ctid;      // TID of this tuple or newer version
      uint16          t_infomask2; // #attributes + flags
      uint16          t_infomask;  // flags e.g. has_null, has_varwidth
      uint8           t_hoff;      // sizeof header incl. bitmap+padding
      // above is fixed size (23 bytes) for all heap tuples
      bits8           t_bits[1];   // bitmap of NULLs, variable length
      // OID goes here if HEAP_HASOID is set in t_infomask
      // actual data follows at end of struct
    } HeapTupleHeaderData;
    

  • COMP9315 24T1 ♢ Lectures Part B♢ [108/109]
    ❖ PostgreSQL Tuples (cont)

    Tuple-related data types: (cont)

    typedef struct HeapTupleFields  // simplified
    {
      TransactionId t_xmin;  // inserting xact ID
      TransactionId t_xmax;  // deleting or locking xact ID
      union {
        CommandId   t_cid;   // inserting or deleting command ID
        TransactionId t_xvac;// old-style VACUUM FULL xact ID 
      } t_field3;
    } HeapTupleFields;
    

    Note that not all system fields from stored tuple appear

    COMP9315 24T1 ♢ Lectures Part B♢ [109/109]


    Produced: 6 May 2024