COMP9315 Week 02 Monday Lecture

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [0/51]
❖ Things To Note


COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [1/51]
❖ Cost Models

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [2/51]
❖ Cost Models

An important aspect of this course is

Cost can be measured in terms of Primary assumptions in our cost models: Page = fixed-size block of data; size determined by storage medium
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [3/51]
❖ Cost Models (cont)

Since time cost is affected by many factors

we do not consider time cost in our analyses.

For comparing methods, page cost is better

Measures the number of page read/write requests made.

Estimating costs with multiple concurrent operations and  buffering is difficult!!

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [4/51]
❖ Cost Models (cont)


Terminology:

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [5/51]
❖ Cost Models (cont)

In developing cost models, we make assumptions on how DBMSs store data:

[Diagram:Pics/scansortproj/file-struct0.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [6/51]
❖ Cost Models (cont)

Typical DBMS/table parameter values:

Quantity Symbol E.g. Value
total # tuples r 106
record size R 128 bytes
total # pages b 105
page size B 8192 bytes
# tuples per page c 60
page read/write time Tr ,Tw 10 msec
cost to process
one page in memory
- ≅ 0
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [7/51]
❖ Cost Models (cont)

Our cost models are "rough" (based on assumptions)

But do give an O(x)  feel for how expensive operations are.

Example "rough" estimation: how many piano tuners in Sydney?

Actual number of piano tuners in Yellow Pages = 120

Example borrowed from Alan Fekete at Sydney University.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [8/51]
❖ Exercise: Relation Scan Cost

Consider a table R(x,y,z) with 105 tuples, implemented as

Calculate the total time-cost for answering the query:

insert into S select * from R where x > 10;

if 50% of the tuples satisfy the condition.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [9/51]
❖ Storage Management

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [10/51]
❖ Storage Technology

Persistent storage is

Computational storage is Access cost HDD:RAM ≅ 100000:1, e.g.
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [11/51]
❖ Storage Technology (cont)

Hard disks are well-established, cheap, high-volume, ...

Alternative bulk storage: SSD

Feasible for long-term, high-update environments?
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [12/51]
❖ Storage Technology (cont)

Comparison of HDD and SSD properties:

  HDD SSD
Cost/byte ~ $13 / TB ~ $35 / TB
Read latency ~ 6ms ~ 50µs
Write latency ~ 6ms ~ 900µs
R/W rate 150MB/s 450MB/s
Read unit block (e.g. 1KB) byte
Writing write a block write on empty block


Will SSDs ever completely replace HDDs?

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [13/51]
❖ Storage Management

Part of DBMS related to storage management:

[Diagram:Pics/storage/dbmsarch.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [14/51]
❖ Storage Management (cont)


Aims of storage management in DBMS:

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [15/51]
❖ Storage Management

Topics in storage management ...

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [16/51]
❖ Views of Data in Query Evaluation

[Diagram:Pics/storage/query-ops.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [17/51]
❖ Views of Data in Query Evaluation (cont)

Representing database objects during query evaluation:

Addressing in DBMSs:
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [18/51]
❖ File Management


Aims of file management subsystem:

Builds higher-level operations on top of OS file operations.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [19/51]
❖ File Management (cont)

Typical file operations provided by the operating system:

fd = open(fileName,mode)
  // open a named file for reading/writing/appending
close(fd)
  // close an open file, via its descriptor 
nread = read(fd, buf, nbytes)
  // attempt to read data from file into buffer 
nwritten = write(fd, buf, nbytes)
  // attempt to write data from buffer to file
lseek(fd, offset, seek_type)
  // move file pointer to relative/absolute file offset
fsync(fd)
  // flush contents of file buffers to disk

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [20/51]
❖ DBMS File Organisation

How is data for DB objects arranged in the file system?

Different DBMSs make different choices, e.g.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [21/51]
❖ Single-file DBMS

Consider a single file for the entire database (e.g. SQLite)

Objects are allocated to regions (segments) of the file.

[Diagram:Pics/storage/single-file.png]

If an object grows too large for allocated segment, allocate an extension.

What happens to allocated space when objects are removed?

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [22/51]
❖ Single-file DBMS (cont)

Allocating space in Unix files is easy:

If the seek goes way beyond the end of the file: With the above, a disk/file manager is easy to implement.
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [23/51]
❖ Single-file Storage Manager

Consider the following simple single-file DBMS layout:

[Diagram:Pics/storage/single-file-example.png]

E.g.

SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]

TableMap = [ ("employee",20,500), ("project",620,40) ]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [24/51]
❖ Single-file Storage Manager (cont)

Each file segment consists of a number fixed-size blocks

The following data/constant definitions are useful

#define PAGESIZE 2048   // bytes per page

typedef long PageId;    // PageId is block index
                        // pageOffset=PageId*PAGESIZE

typedef char *Page;     // pointer to page/block buffer


Typical PAGESIZE values:  1024,  2048,  4096,  8192

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [25/51]
❖ Single-file Storage Manager (cont)

Storage Manager data structures for opened DBs & Tables

typedef struct DBrec {
   char *dbname;    // copy of database name
   int fd;          // the database file
   SpaceMap map;    // map of free/used areas 
   NameTable names; // map names to areas + sizes
} *DB;

typedef struct Relrec {
   char *relname;   // copy of table name
   int   start;     // page index of start of table data
   int   npages;    // number of pages of table data
   ...
} *Rel;

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [26/51]
❖ Example: Scanning a Relation

With the above disk manager, our example:

select name from Employee

might be implemented as something like

DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < r->npages; i++) {
   PageId pid = r->start+i;
   get_page(db, pid, buffer);
   for each tuple in buffer {
      get tuple data and extract name
      add (name) to result tuples
   }
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [27/51]
❖ Single-File Storage Manager

// start using DB, buffer meta-data
DB openDatabase(char *name) { 
   DB db = new(struct DBrec);
   db->dbname = strdup(name);
   db->fd = open(name,O_RDWR);
   db->map = readSpaceTable(db->fd);
   db->names = readNameTable(db->fd);
   return db;
}
// stop using DB and update all meta-data
void closeDatabase(DB db) {
   writeSpaceTable(db->fd,db->map);
   writeNameTable(db->fd,db->map);
   fsync(db->fd);
   close(db->fd);
   free(db->dbname);
   free(db);
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [28/51]
❖ Single-File Storage Manager (cont)

// set up struct describing relation
Rel openRelation(DB db, char *rname) {
   Rel r = new(struct Relrec);
   r->relname = strdup(rname);
   // get relation data from map tables
   r->start = ...;
   r->npages = ...;
   return r;
}

// stop using a relation
void closeRelation(Rel r) {
   free(r->relname);
   free(r);
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [29/51]
❖ Single-File Storage Manager (cont)

// assume that Page = byte[PageSize]
// assume that PageId = block number in file

// read page from file into memory buffer
void get_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   read(db->fd, buf, PAGESIZE);
}

// write page from memory buffer to file
void put_page(DB db, PageId p, Page buf) {
   lseek(db->fd, p*PAGESIZE, SEEK_SET);
   write(db->fd, buf, PAGESIZE);
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [30/51]
❖ Single-File Storage Manager (cont)

Managing contents of space mapping table can be complex:

// assume an array of (offset,length,status) records

// allocate n new pages
PageId allocate_pages(DB db, int n) {
   if (no existing free chunks are large enough) {
      int endfile = lseek(db->fd, 0, SEEK_END);
      addNewEntry(db->map, endfile, n);
   } else {
      grab "worst fit" chunk
      split off unused section as new chunk
   }
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [31/51]
❖ Single-File Storage Manager (cont)

Similar complexity for freeing chunks

// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
   if (no adjacent free chunks) {
      markUnused(db->map, p, n);
   } else {
      merge adjacent free chunks
      compress mapping table
   }
}

Changes take effect when closeDatabase() executed.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [32/51]
❖ Multiple-file Disk Manager

Most DBMSs don't use a single large file for all data.

They typically provide:

Precise file structure varies between individual DBMSs.

Using multiple files (one file per relation) can be easier, e.g.

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [33/51]
❖ Multiple-file Disk Manager (cont)

Example of single-file vs multiple-file:

[Diagram:Pics/storage/single-vs-multi.png]


Consider how you would compute file offset of page[i] in table[1] ...

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [34/51]
❖ Multiple-file Disk Manager (cont)

Structure of PageId for data pages in such systems ...

If system uses one file per table, PageId contains:

If system uses several files per table, PageId contains:
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [35/51]
❖ PostgreSQL Storage Manager

PostgreSQL uses the following file organisation ...

[Diagram:Pics/storage/pg-file-arch.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [36/51]
❖ PostgreSQL Storage Manager (cont)

Components of storage subsystem:

PostgreSQL has two basic kinds of files: Note: smgr designed for many storage devices; only disk handler provided
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [37/51]
❖ Relations as Files

PostgreSQL identifies relation files via their OIDs.

The core data structure for this is RelFileNode:

typedef struct RelFileNode {
    Oid  spcNode;  // tablespace
    Oid  dbNode;   // database
    Oid  relNode;  // relation
} RelFileNode;

Global (shared) tables (e.g. pg_database) have

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [38/51]
❖ Relations as Files (cont)

The relpath function maps RelFileNode to file:


char *relpath(RelFileNode r)  // simplified
{
   char *path = malloc(ENOUGH_SPACE);

   if (r.spcNode == GLOBALTABLESPACE_OID) {
      /* Shared system relations live in PGDATA/global */
      Assert(r.dbNode == 0);
      sprintf(path, "%s/global/%u",
              DataDir, r.relNode);
   }
   else if (r.spcNode == DEFAULTTABLESPACE_OID) {
      /* The default tablespace is PGDATA/base */
      sprintf(path, "%s/base/%u/%u",
              DataDir, r.dbNode, r.relNode);
   }
   else {
      /* All other tablespaces accessed via symlinks */
      sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir
              r.spcNode, r.dbNode, r.relNode);
   }
   return path;
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [39/51]
❖ Exercise: PostgreSQL Files

In your PostgreSQL server

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [40/51]
❖ File Descriptor Pool

Unix has limits on the number of concurrently open files.

PostgreSQL maintains a pool of open file descriptors:

File names are simply strings: typedef char *FileName

Open files are referenced via: typedef int File

A File is an index into a table of "virtual file descriptors".

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [41/51]
❖ File Descriptor Pool (cont)

Interface to file descriptor (pool):


File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);

Analogous to Unix syscalls open(), close(), read(), write(), lseek(), ...

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [42/51]
❖ File Descriptor Pool (cont)

Virtual file descriptors (Vfd)

VfdCache[0] holds list head/tail pointers.
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [43/51]
❖ File Descriptor Pool (cont)

Virtual file descriptor records (simplified):


typedef struct vfd
{
    s_short  fd;              // current FD, or VFD_CLOSED if none
    u_short  fdstate;         // bitflags for VFD's state
    File     nextFree;        // link to next free VFD, if in freelist
    File     lruMoreRecently; // doubly linked recency-of-use list
    File     lruLessRecently;
    long     seekPos;         // current logical file position
    char     *fileName;       // name of file, or NULL for unused VFD
    // NB: fileName is malloc'd, and must be free'd when closing the VFD
    int      fileFlags;       // open(2) flags for (re)opening the file
    int      fileMode;        // mode to pass to open(2)
} Vfd;

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [44/51]
❖ File Manager

Reminder: PostgreSQL file organisation

[Diagram:Pics/storage/pg-file-arch.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [45/51]
❖ File Manager (cont)

PostgreSQL stores each table

[Diagram:Pics/storage/one-table-files.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [46/51]
❖ File Manager (cont)

Data files   (Oid, Oid.1, ...):


[Diagram:Pics/storage/heap-file.png]

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [47/51]
❖ File Manager (cont)

Free space map   (Oid_fsm):

Visibility map   (Oid_vm):
COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [48/51]
❖ File Manager (cont)

The "magnetic disk storage manager" (storage/smgr/md.c)

PostgreSQL PageID values are structured:

typedef struct
{
    RelFileNode rnode;    // which relation/file
    ForkNumber  forkNum;  // which fork (of reln)
    BlockNumber blockNum; // which page/block 
} BufferTag;

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [49/51]
❖ File Manager (cont)

Access to a block of data proceeds (roughly) as follows:

// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
   Vfd vf;  off_t offset;
   (vf, offset) = findBlock(pageID)
   lseek(vf.fd, offset, SEEK_SET)
   vf.seekPos = offset;
   nread = read(vf.fd, buf, BLOCKSIZE)
   if (nread < BLOCKSIZE) ... we have a problem
}

BLOCKSIZE is a global configurable constant (default: 8192)

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [50/51]
❖ File Manager (cont)

findBlock(BufferTag pageID) returns (Vfd, off_t)
{
   offset = pageID.blockNum * BLOCKSIZE
   fileName = relpath(pageID.rnode)
   if (pageID.forkNum > 0)
      fileName = fileName+"."+pageID.forkNum
   if (fileName is not in Vfd pool)
      fd = allocate new Vfd for fileName
   else
      fd = use Vfd from pool
   if (offset > fd.fileSize) {
      fd = allocate new Vfd for next fork
      offset = offset - fd.fileSize
   }
   return (fd, offset)
}

COMP9315 24T1 ♢ Week 2 Monday Lecture ♢ [51/51]


Produced: 19 Feb 2024