COMP9315 Week 02 Thursday Lecture
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [0/54]
Help Session ... Friday (tomorrow) 11.30am - 4.00pm ... K17-410
- fix your PostgreSQL servers so you can do the assignment
Scheduled power outage in K17 ... Fri 8pm - Sat 10 pm
- all CSE servers down for the duration (incl. Webcms, vxdb, vlab)
- K17 is locked down for the duration
Quiz 1 is due by Fri 8pm (7.30 to be safe)
Assignment 1 discussion in Monday lecture
- make sure you look at the assignment before then
Separate links to Monday and Thursday lectures in Moodle (sigh)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [1/54]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [2/54]
Aim of buffer pool:
- hold pages read from database files, for possible re-use
Used by:
- access methods which read/write data pages
- e.g. sequential scan, indexed retrieval, hashing
Uses:
- file manager functions to access data files
Note: we use the terms
page and
block interchangably
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [3/54]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [4/54]
Buffer pool operations: (both take single PageID
argument)
-
request_page(pid)
, release_page(pid)
, ...
To some extent ...
-
request_page()
replaces getBlock()
-
release_page()
replaces putBlock()
Buffer pool data structures:
-
Page frames[NBUFS]
-
FrameData directory[NBUFS]
-
Page
is byte[BUFSIZE]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [5/54]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [6/54]
For each frame, we need to know: (FrameData
)
- which Page it contains, or whether empty/free
- whether it has been modified since loading (dirty bit)
- how many transactions are currently using it (pin count)
- time-stamp for most recent access (assists with replacement)
Pages are referenced by PageID ...
- PageID = BufferTag = (rnode, forkNum, blockNum)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [7/54]
How scans are performed without Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
pageID = makePageID(db,Rel,i);
getBlock(pageID, buf);
for (j = 0; j < nTuples(buf); j++)
process(buf, j)
}
Requires N
page reads.
If we read it again, N
page reads.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [8/54]
How scans are performed with Buffer Pool:
Buffer buf;
int N = numberOfBlocks(Rel);
for (i = 0; i < N; i++) {
pageID = makePageID(db,Rel,i);
bufID = request_page(pageID);
buf = frames[bufID]
for (j = 0; j < nTuples(buf); j++)
process(buf, j)
release_page(pageID);
}
Requires N
page reads on the first pass.
If we read it again, 0 ≤ page reads ≤ N
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [9/54]
Implementation of request_page()
int request_page(PageID pid)
{
if (pid in Pool)
bufID = index for pid in Pool
else {
if (no free frames in Pool)
evict a page
bufID = allocate free frame
directory[bufID].page = pid
directory[bufID].pin_count = 0
directory[bufID].dirty_bit = 0
}
directory[bufID].pin_count++
return bufID
}
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [10/54]
The release_page(pid)
operation:
- Decrement pin count for specified page
Note: no effect on disk or buffer contents until replacement required
The mark_page(pid)
operation:
- Set dirty bit on for specified page
Note: doesn't actually write to disk; just indicates that page changed
The flush_page(pid)
operation:
- Write the specified page to disk (using
write_page
)
Note: not generally used by higher levels of DBMS
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [11/54]
Evicting a page ...
- find frame(s) preferably satisfying
- pin count = 0 (i.e. nobody using it)
- dirty bit = 0 (not modified)
- if selected frame was modified, flush frame to disk
- flag directory entry as "frame empty"
If multiple frames can potentially be released
- need a policy to decide which is best choice
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [12/54]
❖ Page Replacement Policies | |
Several schemes are commonly in use:
- Least Recently Used (LRU)
- Most Recently Used (MRU)
- First in First Out (FIFO)
- Random
LRU / MRU require knowledge of when pages were last accessed
- how to keep track of "last access" time?
- base on request/release operations or on real page usage?
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [13/54]
❖ Page Replacement Policies (cont) | |
Cost benefit from buffer pool (with n frames) is determined by:
- number of available frames (more ⇒ better)
- replacement strategy vs page access pattern
Example (a): sequential scan, LRU or MRU,
n ≥ b
First scan costs b reads; subsequent scans are "free".
Example (b): sequential scan, MRU, n < b
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b
All scans cost b reads; known as sequential flooding.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [14/54]
Consider the following scenarios:
- initially empty buffer pool with n=4 frames
- file with b=3 pages
- file with b=5 pages
- LRU page replacement strategy
- MRU page replacement strategy
Show state of buffer pool during two sequential scans of file.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [15/54]
❖ Effect of Buffer Management | |
Consider a query to find customers who are also employees:
select c.name
from Customer c, Employee e
where c.ssn = e.ssn;
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Customer {
for each tuple t2 in Employee {
if (t1.ssn == t2.ssn)
append (t1.name) to result set
}
}
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [16/54]
❖ Effect of Buffer Management (cont) | |
In terms of page-level operations, the algorithm looks like:
Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
PageID pid1 = makePageID(db,rC,i);
Page p1 = request_page(pid1);
for (int j = 0; j < nPages(rE); j++) {
PageID pid2 = makePageID(db,rE,j);
Page p2 = request_page(pid2);
release_page(pid2);
}
release_page(pid1);
}
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [17/54]
❖ Exercise: Buffer Cost Benefit (i) | |
Assume that:
- the
Customer
relation has bC pages (e.g. 10)
- the
Employee
relation has bE pages (e.g. 4)
Compute how many page reads occur ...
- if we have only 2 buffers (i.e. effectively no buffer pool)
- if we have 20 buffers
- when a buffer pool with MRU replacement strategy is used
- when a buffer pool with LRU replacement strategy is used
For the last two, buffer pool has
n=3 slots
(n < bC and n < bE)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [18/54]
❖ Exercise: Buffer Cost Benefit (ii) | |
If the tables were larger, the above analysis would be tedious.
Write a C program to simulate buffer pool usage
- assuming a nested loop join as above
-
argv[1]
gives number of pages in "outer" table
-
argv[2]
gives number of pages in "inner" table
-
argv[3]
gives number of slots in buffer pool
-
argv[4]
gives replacement strategy (LRU,MRU,FIFO-Q)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [19/54]
❖ PostgreSQL Buffer Manager | |
PostgreSQL buffer manager:
- provides a shared pool of memory buffers for all backends
- all access methods get data from disk via buffer manager
Buffers are located in a large region of shared memory.
Definitions: src/include/storage/buf*.h
Functions: src/backend/storage/buffer/*.c
Buffer code is also used by backends who want a private buffer pool
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [20/54]
❖ PostgreSQL Buffer Manager (cont) | |
Buffer pool consists of:
BufferDescriptors
- shared fixed array (size
NBuffers
) of BufferDesc
BufferBlocks
- shared fixed array (size
NBuffers
) of 8KB frames
Buffer
= index values in above arrays
- indexes: global buffers
1..NBuffers
; local buffers negative
Size of buffer pool is set in
postgresql.conf, e.g.
shared_buffers = 16MB
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [21/54]
❖ PostgreSQL Buffer Manager (cont) | |
PostgreSQL buffer descriptors:
typedef struct BufferDesc
{
BufferTag tag;
int buf_id;
pg_atomic_uint32 state;
int wait_backend_pgprocno;
int freeNext;
LWLock content_lock;
} BufferDesc;
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [22/54]
❖ PostgreSQL Buffer Manager (cont) | |
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [23/54]
❖ PostgreSQL Buffer Manager (cont) | |
include/storage/buf.h
- basic buffer manager data types (e.g.
Buffer
)
include/storage/bufmgr.h
- definitions for buffer manager function interface
(i.e. functions that other parts of the system call to use buffer manager)
include/storage/buf_internals.h
- definitions for buffer manager internals (e.g.
BufferDesc
)
Code:
backend/storage/buffer/*.c
Commentary: backend/storage/buffer/README
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [24/54]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [25/54]
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [26/54]
Terminology used in these slides ...
Record
= sequence of bytes stored on disk (data for one tuple)
Tuple
= "interpretable" version of a Record
in memory
Page
= copy of page from file on disk
PageId
= index of Page within file = pid
pageOffsetInFile
= pid * PAGESIZE
TupleId
= index of record within page = tid
RecordId
= (PageId
, TupleId
) = rid
recOffsetInPage
= page.directory[tid].offset
Relation
= descriptor for open relation
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [27/54]
❖ Reminder: Views of Data | |
Each tuple is represented by a record in some page
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [28/54]
A Page
is simply an array of bytes (byte[]
).
Want to interpret/manipulate it as a collection of Record
s.
Typical operations on pages and records:
-
buf = request_page(rel,pid)
... get page via its PageId
-
rec = get_record(buf,tid)
... get record from buffer
-
rid = insert_record(rel,pid,rec)
... add new record
-
update_record(rel,rid,rec)
... update value of record
-
delete_record(rel,rid)
... remove record
Note: rid
= (PageId,TupleId)
, rel
= open relation
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [29/54]
❖ Exercise: get_record(rel,rid) | |
Give an implementation of a function
Record get_record(Relation rel, RecordId rid)
which takes two parameters
- an open relation descriptor (
rel
)
- a record id (
rid
)
and returns the record corresponding to that
rid
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [30/54]
❖ Exercise: get_record(rel,rid) (cont) | |
Factors affecting Page
formats:
- determined by record size flexibility (fixed, variable)
- how free space within
Page
is managed
- whether some data is stored outside
Page
- does
Page
have an associated overflow chain?
- are large data values stored elsewhere? (e.g. TOAST)
- can one tuple span multiple
Page
s?
Implementation of
Page
operations critically depends on format.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [31/54]
❖ Exercise: Fixed-length Records (i) | |
How records are managed in Pages ...
- depends on whether records are fixed-length or variable-length
Give examples of table definitions
- which result in fixed-length records
- which result in variable-length records
create table R (...);
What are the common features of each type of table?
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [32/54]
❖ Exercise: Fixed-length Records (i) (cont) | |
For fixed-length records, use record slots.
- insert: place new record in first available slot
- delete: mark slot as free, or set xmax
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [33/54]
❖ Exercise: Fixed-length Records (ii) | |
For the two fixed-length record page formats ...
Implement
- a suitable data structure to represent a
Page
- insertion ...
rid = insert_record(rel,pid,rec)
- deletion ...
delete_record(rel,rid)
Ignore buffer pool
(i.e. use get_page()
and put_page()
)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [34/54]
For variable-length records, must use record directory
-
directory[i]
gives location within page of i th record
An important aspect of using record directory
- location of tuple within page can change, tuple index does not change
Issue with variable-length records
- managing space withing the page (esp. after deletions)
- recording used and unused regions of the page
We refer to tuple index within directory as
TupleId tid
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [35/54]
Possibilities for handling free-space within block:
- compacted (one region of free space)
- fragmented (distributed free space)
In practice, a combination is useful:
- normally fragmented (cheap to maintain)
- compacted when needed (e.g. record won't fit)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [36/54]
Compacted free space ... before inserting record 7
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [37/54]
After inserting record 7 (80 bytes) ...
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [38/54]
Fragmented free space ... before inserting record 7
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [39/54]
After inserting record 7 (80 bytes) ...
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [40/54]
❖ Exercise: Inserting Variable-length Records | |
For both of the following page formats
- variable-length records, with compacted free space
- variable-length records, with fragmented free space
implement the
insert()
function.
Use the above page format, but also assume:
- page size is 1024 bytes
- tuples start on 4-byte boundaries
- references into page are all 8-bits (1 byte) long
- a function
recSize(rec)
gives size in bytes
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [41/54]
How many records can fit in a page? (denoted c = capacity)
Depends on:
- page size ... typical values: 1KB, 2KB, 4KB, 8KB
- record size ... typical values: 64B, 200B, app-dependent
- page header data ... typically: 4B - 32B
- slot directory ... depends on how many records
We typically consider
average record size (
R)
Given c, HeaderSize + c*SlotSize + c*R ≤ PageSize
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [42/54]
❖ Exercise: Space Utilisation | |
Consider the following page/record information:
- page size = 1KB = 1024 bytes = 210 bytes
- records:
(w:int,x:varchar(20),y:char(10),z:int)
- records are all aligned on 4-byte boundaries
-
x
field padded to ensure z
starts on 4-byte boundary
- each record has 4 field-offsets at start of record (each 1 byte)
-
char(10)
field rounded up to 12-bytes to preserve alignment
- maximum size of
x
values = 20 bytes; average size = 16 bytes
- page has 32-bytes of header information, starting at byte 0
- only insertions, no deletions or updates
Calculate
c = average number of records per page.
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [43/54]
Sometimes, it may not be possible to insert a record into a page:
- no free-space fragment large enough
- overall free-space in page is not large enough
- the record is larger than the page
- no more free directory slots in page
For case (1), can first try to compact free-space within the page.
If still insufficient space, we need an alternative solution ...
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [44/54]
File organisation determines how cases (2)..(4) are handled.
If records may be inserted anywhere that there is free space
- cases (2) and (4) can be handled by making a new page
- case (3) requires either spanned records or "overflow file"
If file organisation determines record placement (e.g. hashed file)
- cases (2) and (4) require an "overflow page"
- case (3) requires an "overflow file"
With overflow pages, rid structure may need modifying (rel,page,ovfl,rec)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [45/54]
Overflow pages for full buckets in a hashed file:
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [46/54]
Overflow file for very large records and BLOBs:
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [47/54]
❖ PostgreSQL Page Representation | |
Functions: src/backend/storage/page/*.c
Definitions: src/include/storage/bufpage.h
Each page is 8KB (default BLCKSZ
) and contains:
- header (free space pointers, flags, xact data)
- array of (offset,length) pairs for tuples in page
- free space region (between array and tuple data)
- actual tuples themselves (inserted from end towards start)
- (optionally) region for special data (e.g. index data)
Large data items are stored in separate (TOAST) files
(implicit)
Also supports ~SQL-standard BLOBs (explicit large data items)
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [48/54]
❖ PostgreSQL Page Representation (cont) | |
PostgreSQL page layout:
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [49/54]
❖ PostgreSQL Page Representation (cont) | |
Page-related data types:
typedef Pointer Page;
typedef uint16 LocationIndex;
typedef struct ItemIdData
{
unsigned lp_off:15,
lp_flags:2,
lp_len:15;
} ItemIdData;
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [50/54]
❖ PostgreSQL Page Representation (cont) | |
Page-related data types: (cont)
typedef struct PageHeaderData
{
...
uint16 pd_checksum;
uint16 pd_flags;
LocationIndex pd_lower;
LocationIndex pd_upper;
LocationIndex pd_special;
uint16 pd_pagesize_version;
ItemIdData pd_linp[1];
} PageHeaderData;
typedef PageHeaderData *PageHeader;
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [51/54]
❖ PostgreSQL Page Representation (cont) | |
Operations on Page
s:
void PageInit(Page page, Size pageSize, ...)
- initialize a
Page
buffer to empty page
- in particular, sets
pd_lower
and pd_upper
OffsetNumber PageAddItem(Page page,
Item item, Size size, ...)
- insert one tuple (or index entry) into a
Page
- fails if: not enough free space, too many tuples
void PageRepairFragmentation(Page page)
- compact tuple storage to give one large free space region
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [52/54]
❖ PostgreSQL Page Representation (cont) | |
PostgreSQL has two kinds of pages:
- heap pages which contain tuples
- index pages which contain index entries
Both kinds of page have the same page layout.
One important difference:
- index entries tend be a smaller than tuples
- can typically fit more index entries per page
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [53/54]
Each data file has a corresponding TOAST file (if needed)
Tuples in data pages contain rid
s for long values
TOAST = The Oversized Attribute Storage Technique
COMP9315 24T1 ♢ Week 2 Thursday Lecture ♢ [54/54]
Produced: 22 Feb 2024