COMP9315 Week 02 Thursday Lecture

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [0/54]
❖ Things to Note

Help Session ... Friday (tomorrow) 11.30am - 4.00pm ... K17-410

Scheduled power outage in K17 ... Fri 8pm - Sat 10 pm

Quiz 1 is due by Fri 8pm (7.30 to be safe)    (Beware Q1)

Assignment 1 discussion in Monday lecture

Separate links to Monday and Thursday lectures in Moodle (sigh)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [1/54]
❖ Buffer Pool

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [2/54]
❖ Buffer Pool

Aim of buffer pool:

Used by: Uses:
Note: we use the terms page and block interchangably
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [3/54]
❖ Buffer Pool (cont)

[Diagram:Pics/storage/pool.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [4/54]
❖ Buffer Pool (cont)

Buffer pool operations:   (both take single PageID argument)

To some extent ...

Buffer pool data structures:

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [5/54]
❖ Buffer Pool (cont)

[Diagram:Pics/storage/buffer-pool.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [6/54]
❖ Buffer Pool (cont)

For each frame, we need to know:   (FrameData)


Pages are referenced by PageID ...
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [7/54]
❖ Buffer Pool (cont)

How scans are performed without Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   getBlock(pageID, buf);
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
}

Requires N page reads.

If we read it again, N page reads.

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [8/54]
❖ Buffer Pool (cont)

How scans are performed with Buffer Pool:

Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
   pageID = makePageID(db,Rel,i);
   bufID = request_page(pageID);
   buf = frames[bufID]
   for (j = 0; j < nTuples(buf); j++)
      process(buf, j)
   release_page(pageID);
}

Requires N page reads on the first pass.

If we read it again, 0 ≤ page reads ≤ N

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [9/54]
❖ Buffer Pool (cont)

Implementation of request_page()

int request_page(PageID pid)
{
   if (pid in Pool)
      bufID = index for pid in Pool
   else {
      if (no free frames in Pool)
         evict a page // free a frame
      bufID = allocate free frame
      directory[bufID].page = pid
      directory[bufID].pin_count = 0
      directory[bufID].dirty_bit = 0
   }
   directory[bufID].pin_count++
   return bufID
}

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [10/54]
❖ Buffer Pool (cont)

The release_page(pid) operation:

Note: no effect on disk or buffer contents until replacement required

The mark_page(pid) operation:

Note: doesn't actually write to disk; just indicates that page changed

The flush_page(pid) operation:

Note: not generally used by higher levels of DBMS

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [11/54]
❖ Buffer Pool (cont)


Evicting a page ...

If multiple frames can potentially be released
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [12/54]
❖ Page Replacement Policies


Several schemes are commonly in use:

LRU / MRU require knowledge of when pages were last accessed
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [13/54]
❖ Page Replacement Policies (cont)

Cost benefit from buffer pool (with n frames) is determined by:

Example (a): sequential scan, LRU or MRU, n ≥ b

First scan costs b  reads; subsequent scans are "free".

Example (b): sequential scan, MRU, n < b

First scan costs b  reads; subsequent scans cost b - n  reads.

Example (c): sequential scan, LRU, n < b

All scans cost b  reads; known as sequential flooding.

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [14/54]
❖ Exercise:


Consider the following scenarios:

Show state of buffer pool during two sequential scans of file.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [15/54]
❖ Effect of Buffer Management

Consider a query to find customers who are also employees:

select c.name
from   Customer c, Employee e
where  c.ssn = e.ssn;

This might be implemented inside the DBMS via nested loops:

for each tuple t1 in Customer {
    for each tuple t2 in Employee {
        if (t1.ssn == t2.ssn)
            append (t1.name) to result set
    }
}

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [16/54]
❖ Effect of Buffer Management (cont)

In terms of page-level operations, the algorithm looks like:

Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
    PageID pid1 = makePageID(db,rC,i);
    Page p1 = request_page(pid1);
    for (int j = 0; j < nPages(rE); j++) {
        PageID pid2 = makePageID(db,rE,j);
        Page p2 = request_page(pid2);
        // compare all pairs of tuples from p1,p2
        // construct solution set from matching pairs
        release_page(pid2);
    }
    release_page(pid1);
}

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [17/54]
❖ Exercise: Buffer Cost Benefit (i)

Assume that:

Compute how many page reads occur ... For the last two, buffer pool has n=3 slots (n < bC and n < bE)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [18/54]
❖ Exercise: Buffer Cost Benefit (ii)

If the tables were larger, the above analysis would be tedious.

Write a C program to simulate buffer pool usage

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [19/54]
❖ PostgreSQL Buffer Manager

PostgreSQL buffer manager:

Buffers are located in a large region of shared memory.

Definitions:  src/include/storage/buf*.h

Functions:  src/backend/storage/buffer/*.c


Buffer code is also used by backends who want a private buffer pool

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [20/54]
❖ PostgreSQL Buffer Manager (cont)

Buffer pool consists of:

BufferDescriptors

BufferBlocks Buffer = index values in above arrays Size of buffer pool is set in postgresql.conf, e.g.

shared_buffers = 16MB   # min 128KB, 16*8KB buffers

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [21/54]
❖ PostgreSQL Buffer Manager (cont)

PostgreSQL buffer descriptors:

typedef struct BufferDesc
{
   BufferTag  tag;           /* ID of page contained in buffer */
   int        buf_id;        /* buffer's index number (from 0) */

   /* state of the tag, containing:
               flags, refcount and usagecount */
   pg_atomic_uint32 state;

   int         wait_backend_pgprocno;   /* pin-count waiter */
   int         freeNext;      /* link in freelist chain */
   LWLock      content_lock;  /* to lock access to buffer contents */
} BufferDesc;

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [22/54]
❖ PostgreSQL Buffer Manager (cont)

[Diagram:Pics/storage/buffer-memory.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [23/54]
❖ PostgreSQL Buffer Manager (cont)

include/storage/buf.h

include/storage/bufmgr.h include/storage/buf_internals.h Code: backend/storage/buffer/*.c

Commentary: backend/storage/buffer/README

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [24/54]
❖ Pages

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [25/54]
❖ Page/Tuple Management


[Diagram:Pics/storage/dbmsarch1.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [26/54]
❖ Some terminology

Terminology used in these slides ...

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [27/54]
❖ Reminder: Views of Data

[Diagram:Pics/storage/views-of-data.png]


Each tuple is represented by a record in some page

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [28/54]
❖ Page Formats

A Page is simply an array of bytes (byte[]).

Want to interpret/manipulate it as a collection of Records.

Typical operations on pages and records:


Note:  rid = (PageId,TupleId),  rel = open relation
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [29/54]
❖ Exercise: get_record(rel,rid)

Give an implementation of a function

Record get_record(Relation rel, RecordId rid)

which takes two parameters

and returns the record corresponding to that rid
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [30/54]
❖ Exercise: get_record(rel,rid) (cont)

Factors affecting Page formats:

Implementation of Page operations critically depends on format.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [31/54]
❖ Exercise: Fixed-length Records (i)

How records are managed in Pages ...

Give examples of table definitions

create table R (...);


What are the common features of each type of table?

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [32/54]
❖ Exercise: Fixed-length Records (i) (cont)

For fixed-length records, use record slots.

[Diagram:Pics/storage/fixed-length-slots.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [33/54]
❖ Exercise: Fixed-length Records (ii)

For the two fixed-length record page formats ...

Implement

Ignore buffer pool (i.e. use get_page() and put_page())
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [34/54]
❖ Page Formats

For variable-length records, must use record directory

An important aspect of using record directory Issue with variable-length records We refer to tuple index within directory as   TupleId tid
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [35/54]
❖ Page Formats (cont)

Possibilities for handling free-space within block:

In practice, a combination is useful:
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [36/54]
❖ Page Formats (cont)

Compacted free space ... before inserting record 7

[Diagram:Pics/storage/insert1b.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [37/54]
❖ Page Formats (cont)

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert1c.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [38/54]
❖ Page Formats (cont)

Fragmented free space ... before inserting record 7

[Diagram:Pics/storage/insert2b.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [39/54]
❖ Page Formats (cont)

After inserting record 7 (80 bytes) ...

[Diagram:Pics/storage/insert2c.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [40/54]
❖ Exercise: Inserting Variable-length Records

For both of the following page formats

  1. variable-length records, with compacted free space
  2. variable-length records, with fragmented free space
implement the insert() function.

Use the above page format, but also assume:

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [41/54]
❖ Storage Utilisation

How many records can fit in a page? (denoted c = capacity)

Depends on:

We typically consider average record size (R)

Given c,   HeaderSize + c*SlotSize + c*R  ≤  PageSize

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [42/54]
❖ Exercise: Space Utilisation

Consider the following page/record information:

Calculate c = average number of records per page.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [43/54]
❖ Overflows

Sometimes, it may not be possible to insert a record into a page:

  1. no free-space fragment large enough
  2. overall free-space in page is not large enough
  3. the record is larger than the page
  4. no more free directory slots in page
For case (1), can first try to compact free-space within the page.

If still insufficient space, we need an alternative solution ...

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [44/54]
❖ Overflows (cont)

File organisation determines how cases (2)..(4) are handled.

If records may be inserted anywhere that there is free space

If file organisation determines record placement (e.g. hashed file) With overflow pages, rid structure may need modifying (rel,page,ovfl,rec)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [45/54]
❖ Overflows (cont)

Overflow pages for full buckets in a hashed file:

[Diagram:Pics/storage/ovflow-files.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [46/54]
❖ Overflows (cont)

Overflow file for very large records and BLOBs:


[Diagram:Pics/storage/ovflow-blobs.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [47/54]
❖ PostgreSQL Page Representation

Functions: src/backend/storage/page/*.c

Definitions: src/include/storage/bufpage.h

Each page is 8KB (default BLCKSZ) and contains:

Large data items are stored in separate (TOAST) files   (implicit)

Also supports ~SQL-standard BLOBs   (explicit large data items)

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [48/54]
❖ PostgreSQL Page Representation (cont)

PostgreSQL page layout:

[Diagram:Pics/storage/pg-page-struct.png]

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [49/54]
❖ PostgreSQL Page Representation (cont)

Page-related data types:


// a Page is simply a pointer to start of buffer
typedef Pointer Page;

// indexes into the tuple directory
typedef uint16  LocationIndex;

// entries in tuple directory (line pointer array)
typedef struct ItemIdData
{
   unsigned   lp_off:15,  // tuple offset from start of page
              lp_flags:2, // unused,normal,redirect,dead
              lp_len:15;  // length of tuple (bytes)
} ItemIdData;

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [50/54]
❖ PostgreSQL Page Representation (cont)

Page-related data types: (cont)


typedef struct PageHeaderData  (simplified)
{
  ...                        // transaction-related data
  uint16        pd_checksum; // checksum
  uint16        pd_flags;    // flag bits (e.g. free, full, ...
  LocationIndex pd_lower;    // offset to start of free space
  LocationIndex pd_upper;    // offset to end of free space
  LocationIndex pd_special;  // offset to start of special space
  uint16        pd_pagesize_version;
  ItemIdData    pd_linp[1];  // beginning of line pointer array
} PageHeaderData;

typedef PageHeaderData *PageHeader;

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [51/54]
❖ PostgreSQL Page Representation (cont)

Operations on Pages:

void PageInit(Page page, Size pageSize, ...)

OffsetNumber PageAddItem(Page page,
                    Item item, Size size, ...)
void PageRepairFragmentation(Page page)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [52/54]
❖ PostgreSQL Page Representation (cont)

PostgreSQL has two kinds of pages:

Both kinds of page have the same page layout.

One important difference:

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [53/54]
❖ TOAST Files

Each data file has a corresponding TOAST file (if needed)

[Diagram:Pics/storage/toast-file.png]

Tuples in data pages contain rids for long values

TOAST = The Oversized Attribute Storage Technique

COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [54/54]


Produced: 22 Feb 2024