Week 03 Lectures


Storage Manager


Storage Management2/100

Levels of DBMS related to storage management:

[Diagram:Pics/storage/dbmsarch-small.png]


... Storage Management3/100

Aims of storage management in DBMS:


Views of Data in Query Evaluation4/100

[Diagram:Pics/storage/query-ops-small.png]


... Views of Data in Query Evaluation5/100

Representing database objects during query execution:

Addressing in DBMSs:


Storage Management6/100

Topics in storage management ...


Storage Technology


Storage Technology8/100

Persistent storage is

Computational storage is Access cost HDD:RAM ≅ 100000:1, e.g.


... Storage Technology9/100

Hard disks are well-established, cheap, high-volume, ...

Alternative bulk storage: SSD

Feasible for long-term, high-update environments?


... Storage Technology10/100

Comparison of HDD and SSD properties:

  HDD SDD
Cost/byte ~ 2c / GB ~ 13c / GB
Read latency ~ 10ms ~ 50µs
Write latency ~ 10ms ~ 900µs
Read unit block (e.g. 1KB) byte
Writing write a block write on empty block


Will SSDs ever replace HDDs for large-scale database storage?


Cost Models11/100

Throughout this course, we compare costs of DB operations

Important aspects in determining cost:

Complicating factors in determining costs: More details later ...


File Management12/100

Aims of file management subsystem:

Builds higher-level operations on top of OS file operations.


... File Management13/100

Typical file operations provided by the operating system:

fd = open(fileName,mode)
  // open a named file for reading/writing/appending
close(fd)
  // close an open file, via its descriptor 
nread = read(fd, buf, nbytes)
  // attempt to read data from file into buffer 
nwritten = write(fd, buf, nbytes)
  // attempt to write data from buffer to file
lseek(fd, offset, seek_type)
  // move file pointer to relative/absolute file offset
fsync(fd)
  // flush contents of file buffers to disk


DBMS File Organisation14/100

How is data for DB objects arranged in the file system?

Different DBMSs make different choices, e.g.


Single-file Storage Manager15/100

Consider the following simple single-file DBMS layout:

[Diagram:Pics/storage/single-file-example-small.png]

E.g.

SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]

NameMap = [ ("employee",20,350), ("project",620,40) ]


... Single-file Storage Manager16/100

Each file segment consists of a number fixed-size blocks

The following data/constant definitions are useful

#define PAGESIZE 4096   // bytes per page

typedef long PageId;    // PageId is block index
                        // pageOffset=PageId*PAGESIZE

typedef char *Page;     // pointer to page/block buffer


Typical PAGESIZE values:  1024,  2048,  4096,  8192


... Single-file Storage Manager17/100

Storage Manager data structures for opened DBs & Tables

typedef struct DBrec {
   char *dbname;   // copy of database name
   int fd;         // the database file
   SpaceMap map;   // map of free/used areas 
   NameMap names;  // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;  // copy of table name
   int   start;    // page index of start of table data
   int   npages;   // number of pages of table data
   ...
} *Rel;


Example: Scanning a Relation18/100

With the above disk manager, our example:

select name from Employee

might be implemented as something like

DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < r->npages; i++) {
   PageId pid = r->start+i;
   get_page(db, pid, buffer);
   for each tuple in buffer {
      get tuple data and extract name
      add (name) to result tuples
   }
}


Single-File Storage Manager19/100

// start using DB, buffer meta-data
DB openDatabase(char *name) { 
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceMap(db->fd);
   db->names = readNameMap(db->fd);
   return db;
}
// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}


... Single-File Storage Manager20/100

// assume that Page = byte[PageSize]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(Db db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}


Exercise 1: Relation Scan Cost21/100

Consider a table R(x,y,z) with 105 tuples, implemented as

Calculate the total time-cost for answering the query:

insert into S select * from R where x > 10;

if 50% of the tuples satisfy the condition.


DBMS Parameters22/100

Our view of relations in DBMSs:

[Diagram:Pics/file-struct/file-struct0-small.png]


... DBMS Parameters23/100

Typical DBMS/table parameter values:

Quantity Symbol E.g. Value
total # tuples r 106
record size R 128 bytes
total # pages b 105
page size B 8192 bytes
# tuples per page c 60
page read/write time Tr ,Tw 10 msec
cost to process
one page in memory
- ≅ 0


Multiple-file Disk Manager24/100

Most DBMSs don't use a single large file for all data.

They typically provide:

Precise file structure varies between individual DBMSs.

Using multiple files (one file per relation) can be easier, e.g.


... Multiple-file Disk Manager25/100

Example of single-file vs multiple-file:

[Diagram:Pics/storage/single-vs-multi-small.png]


Consider how you would compute file offset of page[i] in table[1] ...


... Multiple-file Disk Manager26/100

Structure of PageId for data pages in such systems ...

If system uses one file per table, PageId contains:

If system uses several files per table, PageId contains:


PostgreSQL Storage Manager27/100

PostgreSQL uses the following file organisation ...

[Diagram:Pics/storage/pg-file-arch-small.png]


... PostgreSQL Storage Manager28/100

Components of storage subsystem:

PostgreSQL has two basic kinds of files: Note: smgr designed for many storage devices; only disk handler provided


Relations as Files29/100

PostgreSQL identifies relation files via their OIDs.

The core data structure for this is RelFileNode:

typedef struct RelFileNode {
    Oid  spcNode;  // tablespace
    Oid  dbNode;   // database
    Oid  relNode;  // relation
} RelFileNode;

Global (shared) tables (e.g. pg_database) have


... Relations as Files30/100

The relpath function maps RelFileNode to file:


char *relpath(RelFileNode r)  // simplified
{
   char *path = malloc(ENOUGH_SPACE);

   if (r.spcNode == GLOBALTABLESPACE_OID) {
      /* Shared system relations live in PGDATA/global */
      Assert(r.dbNode == 0);
      sprintf(path, "%s/global/%u",
              DataDir, r.relNode);
   }
   else if (r.spcNode == DEFAULTTABLESPACE_OID) {
      /* The default tablespace is PGDATA/base */
      sprintf(path, "%s/base/%u/%u",
              DataDir, r.dbNode, r.relNode);
   }
   else {
      /* All other tablespaces accessed via symlinks */
      sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir
              r.spcNode, r.dbNode, r.relNode);
   }
   return path;
}


Exercise 2: PostgreSQL Files31/100

In your PostgreSQL server


File Descriptor Pool32/100

Unix has limits on the number of concurrently open files.

PostgreSQL maintains a pool of open file descriptors:

File names are simply strings: typedef char *FileName

Open files are referenced via: typedef int File

A File is an index into a table of "virtual file descriptors".

Defs: include/storage/fd.h
Code: backend/storage/file/fd.c


... File Descriptor Pool33/100

Interface to file descriptor (pool):


File PathNameOpenFile(char *fileName, int flags)
      // open a file with default pg.conf mode
File PathNameOpenFilePerm(char *fName, int flags, int mode)
     // open a file in the DB directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact)
     // open temp file flag: close at end of transaction?
void FileClose(File file)
void FileUnlink(File file)
int  FileRead(File file, char *buffer, int amount)
int  FileWrite(File file, char *buffer, int amount)
int  FileSync(File file)
long FileSeek(File file, long offset, int whence)
int  FileTruncate(File file, long offset)

Analogous to Unix syscalls open(), close(), read(), ...


... File Descriptor Pool34/100

Virtual file descriptor records (simplified):


typedef struct vfd
{
    s_short  fd;              // current FD, or VFD_CLOSED if none
    u_short  fdstate;         // bitflags for Vfd's state
    File     nextFree;        // link to next free Vfd, if in freelist
    File     lruMoreRecently; // doubly linked recency-of-use list
    File     lruLessRecently;
    long     seekPos;         // current logical file position
    char     *fileName;       // name of file, or NULL for unused Vfd
    // NB: fileName is malloc'd, and must be free'd when closing the Vfd
    int      fileFlags;       // open(2) flags for (re)opening the file
    int      fileMode;        // mode to pass to open(2)
} Vfd;


... File Descriptor Pool35/100

Virtual file descriptors (Vfd)

VfdCache[0] holds list head/tail pointers.


Exercise 3: Opening a Vfd36/100

Consider the following call to open a file

f = PathNameOpenFilePerm(
   "/srvr/jas/pgsql/data/base/13645/12348",
   O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
   0600
)


Sketch implementation of PathNameOpenFilePerm()


File Manager37/100

Reminder: PostgreSQL file organisation

[Diagram:Pics/storage/pg-file-arch-small.png]


... File Manager38/100

PostgreSQL stores each table

[Diagram:Pics/storage/one-table-files-small.png]


... File Manager39/100

Data files   (Oid, Oid.1, ...):


[Diagram:Pics/storage/heap-file-small.png]


... File Manager40/100

Free space map   (Oid_fsm):

Visibility map   (Oid_vm):


... File Manager41/100

The "magnetic disk storage manager" (storage/smgr/md.c)

PostgreSQL PageID values are structured:

typedef struct
{
    RelFileNode rnode;    // which relation/file
    ForkNumber  forkNum;  // which fork (of reln)
    BlockNumber blockNum; // which page/block 
} BufferTag;


... File Manager42/100

Access to a block of data proceeds (roughly) as follows:

// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
   File fid;  off_t offset;  int fd;
   (fid, offset) = findBlock(pageID)
   fd = VfdCache[fid].fd;
   lseek(fd, offset, SEEK_SET)
   VfdCache[fid].seekPos = offset;
   nread = read(fd, buf, BLOCKSIZE)
   if (nread < BLOCKSIZE) ... we have a problem
}

BLOCKSIZE is a global configurable constant (default: 8192)


... File Manager43/100

findBlock(BufferTag pageID) returns (Vfd, off_t)
{
   offset = pageID.blockNum * BLOCKSIZE
   fileName = relpath(pageID.rnode)
   if (pageID.forkNum > 0)
      fileName = fileName+"."+pageID.forkNum
   fid = PathNameOpenFIle(fileName, O_READ);
   fSize = VfdCache[fid].fileSize;
   if (offset > fSize) {
      fid = allocate new Vfd for next fork
      offset = offset - fd.fileSize
   }
   return (fd, offset)
}


Buffer Pool


Buffer Pool45/100

Aim of buffer pool:

Used by: Uses:
Note: we use the terms page and block interchangably


... Buffer Pool46/100

[Diagram:Pics/storage/pool-small.png]


... Buffer Pool47/100

Buffer pool operations:   (both take single PageID argument)

To some extent ...

Buffer pool data structures:


... Buffer Pool48/100

[Diagram:Pics/storage/buffer-pool-small.png]


... Buffer Pool49/100

For each frame, we need to know:   (FrameData)


Pages are referenced by PageID ...


... Buffer Pool50/100

How scans are performed without Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   getBlock(pageID, buf);
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
}

Requires N page reads.

If we read it again, N page reads.


... Buffer Pool51/100

How scans are performed with Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   bufID = request_page(pageID);
   buf = frames[bufID]
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
   release_page(pageID);
}

Requires N page reads on the first pass.

If we read it again, 0 ≤ page reads ≤ N


... Buffer Pool52/100

Buffer pool data structures:

typedef char Page[PAGESIZE];
typedef ... PageID;  // defined earlier

typedef struct _FrameData {
   PageID  pid;        // which page is in frame
   int     pin_count;  // how many processes using page
   int     dirty;      // page modified since loaded?
   Time    last_used;  // when page was last accessed
} FrameData;

Page frames[NBUFS];    // actual buffers
FrameData directory[NBUFS];


... Buffer Pool53/100

Implementation of request_page()

int request_page(PageID pid)
{
   bufID = findInPool(pid)
   if (bufId == NOT_FOUND) {
      if (no free frames in Pool) {
         bufID = findFrameToReplace()
         if (directory[bufID].dirty)
            old = directory[bufID].page
            put_page(old, frames[bufID])
      }
      bufID = index of freed frame
      directory[bufID].page = pid
      directory[bufID].pin_count = 0
      directory[bufID].dirty = 0
      get_page(pid, frames[bufID])
   }
   directory[bufID].pin_count++
   return bufID
}


... Buffer Pool54/100

The release_page(pid) operation:

Note: no effect on disk or buffer contents until replacement required

The mark_page(pid) operation:

Note: doesn't actually write to disk; indicates that page changed

The flush_page(pid) operation:

Note: not generally used by higher levels of DBMS


... Buffer Pool55/100

Evicting a page ...

If multiple frames can potentially be released


Page Replacement Policies56/100

Several schemes are commonly in use:

LRU / MRU require knowledge of when pages were last accessed


... Page Replacement Policies57/100

Cost benefit from buffer pool (with n frames) is determined by:

Example (a): sequential scan, LRU or MRU, n ≥ b

First scan costs b reads; subsequent scans are "free".

Example (b): sequential scan, MRU, n < b

First scan costs b reads; subsequent scans cost b - n reads.

Example (c): sequential scan, LRU, n < b

All scans cost b reads; known as sequential flooding.


Effect of Buffer Management 58/100

Consider a query to find customers who are also employees:

select c.name
from   Customer c, Employee e
where  c.ssn = e.ssn;

This might be implemented inside the DBMS via nested loops:

for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}


... Effect of Buffer Management 59/100

In terms of page-level operations, the algorithm looks like:

Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
    PageID pid1 = makePageID(db,rC,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < nPages(rE); j++) {
        PageID pid2 = makePageID(db,rE,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}


Exercise 4: Buffer Cost Benefit (i)60/100

Assume that:

Compute how many page reads occur ... For the last two, buffer pool has n=3 slots (n < bC and n < bE)


Exercise 5: Buffer Cost Benefit (ii)61/100

If the tables were larger, the above analysis would be tedious.

Write a C program to simulate buffer pool usage


PostgreSQL Buffer Manager62/100

PostgreSQL buffer manager:

Buffers are located in a large region of shared memory.

Definitions:  src/include/storage/buf*.h

Functions:  src/backend/storage/buffer/*.c

Commentary: backend/storage/buffer/README


Buffer code is also used by backends who want a private buffer pool


... PostgreSQL Buffer Manager63/100

Buffer pool consists of:

BufferDescriptors    (i.e. directory)

BufferBlocks    (i.e. frames) Buffer = index in above arrays Size of buffer pool is set in postgresql.conf, e.g.

shared_buffers = 16MB   # min 128KB, 16*8KB buffers


... PostgreSQL Buffer Manager64/100

[Diagram:Pics/storage/buffer-memory-small.png]


Buffer Pool Data Types65/100

typedef struct buftag {
   RelFileNode rnode;     /* physical relation identifier */
   ForkNumber  forkNum;
   BlockNumber blockNum;  /* relative to start of reln */
} BufferTag;

typedef struct BufferDesc { (simplified)
   BufferTag tag;        // ID of page contained in buffer
   int       buf_id;     // buffer's index number (from 0)
   Bits32    state;      // dirty, refcount, usage
   int       freeNext;   // link in freelist chain
   ...                   // others related to concurrency
} BufferDesc;


Buffer Pool Functions66/100

Buffer manager interface:

Buffer ReadBuffer(Relation r, BlockNumber n)

BufferDesc *BufferAlloc(
                        Relation r, ForkNumber f,
                        BlockNumber n, bool *found)


... Buffer Pool Functions67/100

Buffer manager interface (cont):

void ReleaseBuffer(Buffer buf)

void MarkBufferDirty(Buffer buf)


... Buffer Pool Functions68/100

Additional buffer manager functions:

Page BufferGetPage(Buffer buf)

BufferIsPinned(Buffer buf) CheckPointBuffers etc. etc. etc.


Clock-sweep Replacement Strategy69/100

PostgreSQL page replacement strategy: clock-sweep


Exercise 6: PostgreSQL Buffer Pool70/100

Consider an initally empty buffer pool with only 3 slots.

Show the state of the pool after each of the following:

Req R0, Req S0, Rel S0, Req S1, Rel S1, Req S2,
Rel S2, Rel R0, Req R1, Req S0, Rel S0, Req S1,
Rel S1, Req S2, Rel S2, Rel R1, Req R2, Req S0,
Rel S0, Req S1, Rel S1, Req S2, Rel S2, Rel R2

Treat BufferDesc entries as

(tag, usage_count, refcount, freeNext)

Assume freeList and nextVictim global variables.


Pages


Page/Tuple Management72/100


[Diagram:Pics/storage/dbmsarch1-small.png]


Pages73/100

Database applications view data as:

The disk and buffer manager provide the following view: Page format = how space/tuples are organised within a page


Page Formats74/100

Ultimately, a Page is simply an array of bytes (byte[]).

We want to interpret/manipulate it as a collection of Records.

Typical operations on Pages:

Note: rid contains (PageId,TupIndex), so no explicit pid needed


... Page Formats75/100

Factors affecting Page formats:

Implementation of Page operations critically depends on format.


... Page Formats76/100

For fixed-length records, use record slots.

[Diagram:Pics/storage/rec-slots-small.png]


Exercise 7: Fixed-length Records77/100

Give examples of table definitions

create table R ( ...);

What are the common features of each type of table?


Exercise 8: Inserting/Deleting Fixed-length Records78/100

For each of the following Page formats:

Implement


Page Formats79/100

For variable-length records, must use slot directory.

Possibilities for handling free-space within block:

In practice, a combination is useful:
Important aspect of using slot directory


... Page Formats80/100

Compacted free space:  

[Diagram:Pics/storage/free-list-small.png]

Note: "pointers" are implemented as word offsets within block.


... Page Formats81/100

Fragmented free space:  

[Diagram:Pics/storage/free-list1-small.png]


... Page Formats82/100

Initial page state (compacted free space) ...

[Diagram:Pics/storage/insert1a-small.png]


... Page Formats83/100

Before inserting record 7 (compacted free space) ...

[Diagram:Pics/storage/insert1b-small.png]


... Page Formats84/100

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert1c-small.png]


... Page Formats85/100

Initial page state (fragmented free space) ...

[Diagram:Pics/storage/insert2a-small.png]


... Page Formats86/100

Before inserting record 7 (fragmented free space) ...

[Diagram:Pics/storage/insert2b-small.png]


... Page Formats87/100

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert2c-small.png]


Exercise 9: Inserting Variable-length Records88/100

For both of the following page formats

  1. variable-length records, with compacted free space
  2. variable-length records, with fragmented free space
implement the insert() function.

Use the above page format, but also assume:


Storage Utilisation89/100

How many records can fit in a page? (denoted C = capacity)

Depends on:

We typically consider average record size (R)

Given C,   HeaderSize + C*SlotSize + C*R  ≤  PageSize


Exercise 10: Space Utilisation90/100

Consider the following page/record information:

Calculate C = average number of records per page.


Overflows91/100

Sometimes, it may not be possible to insert a record into a page:

  1. no free-space fragment large enough
  2. overall free-space is not large enough
  3. the record is larger than the page
  4. no more free directory slots in page
For case (1), can first try to compact free-space within the page.

If still insufficient space, we need an alternative solution ...


... Overflows92/100

File organisation determines how cases (2)..(4) are handled.

If records may be inserted anywhere that there is free space

If file organisation determines record placement (e.g. hashed file) With overflow pages, rid structure may need modifying (rel,page,ovfl,rec)


... Overflows93/100

Overflow files for very large records and BLOBs:

[Diagram:Pics/storage/ovflow-file-small.png]

Record-based handling of overflows:

[Diagram:Pics/storage/ovflow-record-small.png]

We discuss overflow pages in more detail when covering Hash Files.


PostgreSQL Page Representation94/100

Functions: src/backend/storage/page/*.c

Definitions: src/include/storage/bufpage.h

Each page is 8KB (default BLCKSZ) and contains:

Large data items are stored in separate (TOAST) files   (implicit)

Also supports ~SQL-standard BLOBs   (explicit large data items)


... PostgreSQL Page Representation95/100

PostgreSQL page layout:

[Diagram:Pics/storage/pg-page-struct-small.png]


... PostgreSQL Page Representation96/100

Page-related data types:


// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16  LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData
{
   unsigned   lp_off:15,    // tuple offset from start of page
              lp_flags:2,   // unused,normal,redirect,dead
              lp_len:15;    // length of tuple (bytes)
} ItemIdData;


... PostgreSQL Page Representation97/100

Page-related data types: (cont)


typedef struct PageHeaderData
{
   XLogRecPtr    pd_lsn;      // xact log record for last change
   uint16        pd_tli;      // xact log reference information
   uint16        pd_flags;    // flag bits (e.g. free, full, ...
   LocationIndex pd_lower;    // offset to start of free space
   LocationIndex pd_upper;    // offset to end of free space
   LocationIndex pd_special;  // offset to start of special space
   uint16        pd_pagesize_version;
   TransactionId pd_prune_xid;// is pruning useful in data page?
   ItemIdData    pd_linp[1];  // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;


... PostgreSQL Page Representation98/100

Operations on Pages:

void PageInit(Page page, Size pageSize, ...)

OffsetNumber PageAddItem(Page page,
                    Item item, Size size, ...)
void PageRepairFragmentation(Page page)


... PostgreSQL Page Representation99/100

PostgreSQL has two kinds of pages:

Both kinds of page have the same page layout.

One important difference:


Exercise 11: PostgreSQL Pages100/100

Draw diagrams of a PostgreSQL heap page

Show the values in the tuple header.

Assume that there is no special space in the page.


Produced: 3 Mar 2020