Week 03 Lectures
Storage Manager |
Storage Management | 2/100 |
Levels of DBMS related to storage management:
... Storage Management | 3/100 |
Aims of storage management in DBMS:
Views of Data in Query Evaluation | 4/100 |
... Views of Data in Query Evaluation | 5/100 |
Representing database objects during query execution:
DB
Rel
Page
Tuple
PageID = FileID+Offset
Offset
TupleID = PageID+Index
Index
Storage Management | 6/100 |
Topics in storage management ...
Storage Technology |
Storage Technology | 8/100 |
Persistent storage is
... Storage Technology | 9/100 |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
... Storage Technology | 10/100 |
Comparison of HDD and SSD properties:
HDD | SDD | |
Cost/byte | ~ 2c / GB | ~ 13c / GB |
Read latency | ~ 10ms | ~ 50µs |
Write latency | ~ 10ms | ~ 900µs |
Read unit | block (e.g. 1KB) | byte |
Writing | write a block | write on empty block |
Will SSDs ever replace HDDs for large-scale database storage?
Cost Models | 11/100 |
Throughout this course, we compare costs of DB operations
Important aspects in determining cost:
File Management | 12/100 |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
... File Management | 13/100 |
Typical file operations provided by the operating system:
fd = open(fileName,mode)// open a named file for reading/writing/appending close(fd)// close an open file, via its descriptor nread = read(fd, buf, nbytes)// attempt to read data from file into buffer nwritten = write(fd, buf, nbytes)// attempt to write data from buffer to file lseek(fd, offset, seek_type)// move file pointer to relative/absolute file offset fsync(fd)// flush contents of file buffers to disk
DBMS File Organisation | 14/100 |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
Single-file Storage Manager | 15/100 |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
NameMap = [ ("employee",20,350), ("project",620,40) ]
... Single-file Storage Manager | 16/100 |
Each file segment consists of a number fixed-size blocks
The following data/constant definitions are useful
#define PAGESIZE 4096// bytes per page typedef long PageId;// PageId is block index // pageOffset=PageId*PAGESIZE typedef char *Page;// pointer to page/block buffer
Typical PAGESIZE
... Single-file Storage Manager | 17/100 |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec { char *dbname;// copy of database name int fd;// the database file SpaceMap map;// map of free/used areas NameMap names;// map names to areas + sizes } *DB; typedef struct Relrec { char *relname;// copy of table name int start;// page index of start of table data int npages;// number of pages of table data ... } *Rel;
Example: Scanning a Relation | 18/100 |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB"); Rel r = openRelation(db,"Employee"); Page buffer = malloc(PAGESIZE*sizeof(char)); for (int i = 0; i < r->npages; i++) { PageId pid = r->start+i; get_page(db, pid, buffer); for each tuple in buffer { get tuple data and extract name add (name) to result tuples } }
Single-File Storage Manager | 19/100 |
// start using DB, buffer meta-data DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceMap(db->fd); db->names = readNameMap(db->fd); return db; }// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname);// get relation data from map tables r->start = ...; r->npages = ...; return r; }
... Single-File Storage Manager | 20/100 |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); }// write page from memory buffer to file void put_page(Db db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
Exercise 1: Relation Scan Cost | 21/100 |
Consider a table R(x,y,z) with 105 tuples, implemented as
insert into S select * from R where x > 10;
if 50% of the tuples satisfy the condition.
DBMS Parameters | 22/100 |
Our view of relations in DBMSs:
... DBMS Parameters | 23/100 |
Typical DBMS/table parameter values:
Symbol | E.g. Value | |||
r | 106 | |||
R | 128 bytes | |||
b | 105 | |||
B | 8192 bytes | |||
c | 60 | |||
Tr ,Tw | 10 msec | |||
one page in memory |
- | ≅ 0 |
Multiple-file Disk Manager | 24/100 |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
... Multiple-file Disk Manager | 25/100 |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
... Multiple-file Disk Manager | 26/100 |
Structure of PageId
If system uses one file per table, PageId
PageId
PostgreSQL Storage Manager | 27/100 |
PostgreSQL uses the following file organisation ...
... PostgreSQL Storage Manager | 28/100 |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
Relations as Files | 29/100 |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode { Oid spcNode;// tablespace Oid dbNode;// database Oid relNode;// relation } RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
... Relations as Files | 30/100 |
The relpath
RelFileNode
char *relpath(RelFileNode r)// simplified { char *path = malloc(ENOUGH_SPACE); if (r.spcNode == GLOBALTABLESPACE_OID) {/* Shared system relations live in PGDATA/global */ Assert(r.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, r.relNode); } else if (r.spcNode == DEFAULTTABLESPACE_OID) {/* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, r.dbNode, r.relNode); } else {/* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir r.spcNode, r.dbNode, r.relNode); } return path; }
Exercise 2: PostgreSQL Files | 31/100 |
In your PostgreSQL server
$PGDATA
pizza
People
People
File Descriptor Pool | 32/100 |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
Defs: include/storage/fd.h
Code: backend/storage/file/fd.c
... File Descriptor Pool | 33/100 |
Interface to file descriptor (pool):
File PathNameOpenFile(char *fileName, int flags)// open a file with default pg.conf mode File PathNameOpenFilePerm(char *fName, int flags, int mode)// open a file in the DB directory ($PGDATA/base/...) File OpenTemporaryFile(bool interXact)// open temp file flag: close at end of transaction? void FileClose(File file) void FileUnlink(File file) int FileRead(File file, char *buffer, int amount) int FileWrite(File file, char *buffer, int amount) int FileSync(File file) long FileSeek(File file, long offset, int whence) int FileTruncate(File file, long offset)
Analogous to Unix syscalls open()
close()
read()
... File Descriptor Pool | 34/100 |
Virtual file descriptor records (simplified):
typedef struct vfd { s_short fd;// current FD, or VFD_CLOSED if none u_short fdstate;// bitflags for Vfd's state File nextFree;// link to next free Vfd, if in freelist File lruMoreRecently;// doubly linked recency-of-use list File lruLessRecently; long seekPos;// current logical file position char *fileName;// name of file, or NULL for unused Vfd // NB: fileName is malloc'd, and must be free'd when closing the Vfd int fileFlags;// open(2) flags for (re)opening the file int fileMode;// mode to pass to open(2) } Vfd;
... File Descriptor Pool | 35/100 |
Virtual file descriptors (Vfd
VfdCache[0]
Exercise 3: Opening a Vfd | 36/100 |
Consider the following call to open a file
f = PathNameOpenFilePerm( "/srvr/jas/pgsql/data/base/13645/12348", O_RDWR | O_CREAT | O_EXCL | PG_BINARY, 0600 )
Sketch implementation of PathNameOpenFilePerm()
File Manager | 37/100 |
Reminder: PostgreSQL file organisation
... File Manager | 38/100 |
PostgreSQL stores each table
PGDATA/pg_database.oid
... File Manager | 39/100 |
Data files (Oid, Oid.1, ...):
... File Manager | 40/100 |
Free space map (Oid_fsm):
VACUUM
DELETE
xmax
VACUUM
... File Manager | 41/100 |
The "magnetic disk storage manager" (storage/smgr/md.c
PageID
PageID
typedef struct { RelFileNode rnode;// which relation/file ForkNumber forkNum;// which fork (of reln) BlockNumber blockNum;// which page/block } BufferTag;
... File Manager | 42/100 |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables // buffer obtained from Buffer pool getBlock(BufferTag pageID, Buffer buf) { File fid; off_t offset; int fd; (fid, offset) = findBlock(pageID) fd = VfdCache[fid].fd; lseek(fd, offset, SEEK_SET) VfdCache[fid].seekPos = offset; nread = read(fd, buf, BLOCKSIZE) if (nread < BLOCKSIZE)... we have a problem }
BLOCKSIZE
... File Manager | 43/100 |
findBlock(BufferTag pageID) returns (Vfd, off_t) { offset = pageID.blockNum * BLOCKSIZE fileName = relpath(pageID.rnode) if (pageID.forkNum > 0) fileName = fileName+"."+pageID.forkNum fid = PathNameOpenFIle(fileName, O_READ); fSize = VfdCache[fid].fileSize; if (offset > fSize) { fid = allocate new Vfd for next fork offset = offset - fd.fileSize } return (fd, offset) }
Buffer Pool |
Buffer Pool | 45/100 |
Aim of buffer pool:
... Buffer Pool | 46/100 |
... Buffer Pool | 47/100 |
Buffer pool operations: (both take single PageID
request_page(pid)
release_page(pid)
request_page()
getBlock()
get_page()
release_page()
putBlock()
put_page()
Buffer pool data structures:
NBUFS
NBUFS
... Buffer Pool | 48/100 |
... Buffer Pool | 49/100 |
For each frame, we need to know: (FrameData
... Buffer Pool | 50/100 |
How scans are performed without Buffer Pool:
Buffer buf; int N = numberOfBlocks(Rel); for (i = 0; i < N; i++) { pageID = makePageID(db,Rel,i); getBlock(pageID, buf); for (j = 0; j < nTuples(buf); j++) process(buf, j) }
Requires N
If we read it again, N
... Buffer Pool | 51/100 |
How scans are performed with Buffer Pool:
Buffer buf; int N = numberOfBlocks(Rel); for (i = 0; i < N; i++) { pageID = makePageID(db,Rel,i); bufID = request_page(pageID); buf = frames[bufID] for (j = 0; j < nTuples(buf); j++) process(buf, j) release_page(pageID); }
Requires N
If we read it again, 0 ≤ page reads ≤ N
... Buffer Pool | 52/100 |
Buffer pool data structures:
typedef char Page[PAGESIZE]; typedef ... PageID;// defined earlier typedef struct _FrameData { PageID pid;// which page is in frame int pin_count;// how many processes using page int dirty;// page modified since loaded? Time last_used;// when page was last accessed } FrameData; Page frames[NBUFS];// actual buffers FrameData directory[NBUFS];
... Buffer Pool | 53/100 |
Implementation of request_page()
int request_page(PageID pid) { bufID = findInPool(pid) if (bufId == NOT_FOUND) { if (no free frames in Pool) { bufID = findFrameToReplace() if (directory[bufID].dirty) old = directory[bufID].page put_page(old, frames[bufID]) } bufID = index of freed frame directory[bufID].page = pid directory[bufID].pin_count = 0 directory[bufID].dirty = 0 get_page(pid, frames[bufID]) } directory[bufID].pin_count++ return bufID }
... Buffer Pool | 54/100 |
The release_page(pid)
The
The
Evicting a page ...
Several schemes are commonly in use:
Cost benefit from buffer pool (with n frames) is determined by:
First scan costs b reads; subsequent scans are "free".
Example (b): sequential scan, MRU, n < b
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b
All scans cost b reads; known as sequential flooding.
Consider a query to find customers who are also employees:
This might be implemented inside the DBMS via nested loops:
In terms of page-level operations, the algorithm looks like:
Assume that:
If the tables were larger, the above analysis would be tedious.
Write a C program to simulate buffer pool usage
PostgreSQL buffer manager:
Definitions:
Functions:
Commentary:
Buffer code is also used by backends who want a private buffer pool
Buffer pool consists of:
Buffer manager interface:
Buffer manager interface (cont):
Additional buffer manager functions:
PostgreSQL page replacement strategy: clock-sweep
Consider an initally empty buffer pool with only 3 slots.
Show the state of the pool after each of the following:
Treat
Assume
Database applications view data as:
Ultimately, a
We want to interpret/manipulate it as a collection of
Typical operations on
Factors affecting
For fixed-length records, use record slots.
Give examples of table definitions
What are the common features of each type of table?
For each of the following Page formats:
For variable-length records, must use slot directory.
Possibilities for handling free-space within block:
Compacted free space:
Note: "pointers" are implemented as word offsets within block.
Fragmented free space:
Initial page state (compacted free space) ...
Before inserting record 7 (compacted free space) ...
After inserting record 7 (80 bytes) ...
Initial page state (fragmented free space) ...
Before inserting record 7 (fragmented free space) ...
After inserting record 7 (80 bytes) ...
For both of the following page formats
Use the above page format, but also assume:
How many records can fit in a page? (denoted C = capacity)
Depends on:
Given C, HeaderSize + C*SlotSize + C*R ≤ PageSize
Consider the following page/record information:
Sometimes, it may not be possible to insert a record into a page:
If still insufficient space, we need an alternative solution ...
File organisation determines how cases (2)..(4) are handled.
If records may be inserted anywhere that there is free space
Overflow files for very large records and BLOBs:
Record-based handling of overflows:
We discuss overflow pages in more detail when covering Hash Files.
Functions:
Definitions:
Each page is 8KB (default
Also supports ~SQL-standard BLOBs (explicit large data items)
PostgreSQL page layout:
Page-related data types:
Page-related data types: (cont)
Operations on
PostgreSQL has two kinds of pages:
One important difference:
Draw diagrams of a PostgreSQL heap page
Assume that there is no special space in the page.
Produced: 3 Mar 2020
mark_page(pid)
operation:
Note: doesn't actually write to disk; indicates that page changed
flush_page(pid)
Note: not generally used by higher levels of DBMS
write()
... Buffer Pool 55/100
If multiple frames can potentially be released
Page Replacement Policies 56/100
LRU / MRU require knowledge of when pages were last accessed
... Page Replacement Policies 57/100
Example (a): sequential scan, LRU or MRU, n ≥ b
Effect of Buffer Management 58/100
select c.name
from Customer c, Employee e
where c.ssn = e.ssn;
for each tuple t1 in Customer {
for each tuple t2 in Employee {
if (t1.ssn == t2.ssn)
append (t1.name) to result set
}
}
... Effect of Buffer Management 59/100
Rel rC = openRelation("Customer");
Rel rE = openRelation("Employee");
for (int i = 0; i < nPages(rC); i++) {
PageID pid1 = makePageID(db,rC,i);
Page p1 = request_page(pid1);
for (int j = 0; j < nPages(rE); j++) {
PageID pid2 = makePageID(db,rE,j);
Page p2 = request_page(pid2);
Exercise 4: Buffer Cost Benefit (i) 60/100
Compute how many page reads occur ...
Customer
Employee
For the last two, buffer pool has n=3 slots (n < bC and n < bE)
Exercise 5: Buffer Cost Benefit (ii) 61/100
argv[1]
argv[2]
argv[3]
argv[4]
PostgreSQL Buffer Manager 62/100
Buffers are located in a large region of shared memory.
src/include/storage/buf*.h
src/backend/storage/buffer/*.c
backend/storage/buffer/README
... PostgreSQL Buffer Manager 63/100 BufferDescriptors
NBuffers
BufferDesc
BufferBlocks
NBuffers
BLCKSZ
Buffer
Size of buffer pool is set in postgresql.conf, e.g.
1..NBuffers
shared_buffers = 16MB
... PostgreSQL Buffer Manager 64/100
Buffer Pool Data Types 65/100
typedef struct buftag {
RelFileNode rnode;
BufferTag
BufferDesc
Buffer Pool Functions 66/100 Buffer ReadBuffer(Relation r, BlockNumber n)
r
Buffer
BufferDesc *BufferAlloc(
Relation r, ForkNumber f,
BlockNumber n, bool *found)
ReadBuffer
... Buffer Pool Functions 67/100 void ReleaseBuffer(Buffer buf)
ensures all activity on buffer is completed before returning
void MarkBufferDirty(Buffer buf)
... Buffer Pool Functions 68/100 Page BufferGetPage(Buffer buf)
BufferIsPinned(Buffer buf)
CheckPointBuffers
etc. etc. etc.
Clock-sweep Replacement Strategy 69/100
NextVictimBuffer
usage_count
NextVictimBuffer
Exercise 6: PostgreSQL Buffer Pool 70/100
Req R0, Req S0, Rel S0, Req S1, Rel S1, Req S2,
Rel S2, Rel R0, Req R1, Req S0, Rel S0, Req S1,
Rel S1, Req S2, Rel S2, Rel R1, Req R2, Req S0,
Rel S0, Req S1, Rel S1, Req S2, Rel S2, Rel R2
BufferDesc
(tag, usage_count, refcount, freeNext)
freeList
nextVictim
Pages
Page/Tuple Management 72/100
Pages 73/100
The disk and buffer manager provide the following view:
TupleId
RecordId
RID
TupleId
PageID
TupIndex
Page format = how space/tuples are organised within a page
PageID
Page Formats 74/100 Page
byte[]
Record
Page
Note: request_page(pid)
PageId
get_record(rid)
TupleId
rid = insert_record(pid,rec)
update_record(rid,rec)
delete_record(rid)
rid
(PageId,TupIndex)
pid
... Page Formats 75/100 Page
Implementation of Page
Page
Page
Page
Page
... Page Formats 76/100
Exercise 7: Fixed-length Records 77/100
create table R ( ...);
Exercise 8: Inserting/Deleting Fixed-length Records 78/100
Implement
Page
Page Formats 79/100
In practice, a combination is useful:
Important aspect of using slot directory
... Page Formats 80/100
... Page Formats 81/100
... Page Formats 82/100
... Page Formats 83/100
... Page Formats 84/100
... Page Formats 85/100
... Page Formats 86/100
... Page Formats 87/100
Exercise 9: Inserting Variable-length Records 88/100
implement the insert()
recSize(r)
Storage Utilisation 89/100
We typically consider average record size (R)
Exercise 10: Space Utilisation 90/100
Calculate C = average number of records per page.
(a:int,b:varchar(20),c:char(10),d:int)
c
d
char(10)
b
Overflows 91/100
For case (1), can first try to compact free-space within the page.
... Overflows 92/100
If file organisation determines record placement (e.g. hashed file)
With overflow pages, rid structure may need modifying (rel,page,ovfl,rec)
... Overflows 93/100
PostgreSQL Page Representation 94/100 src/backend/storage/page/*.c
src/include/storage/bufpage.h
BLCKSZ
Large data items are stored in separate (TOAST) files (implicit)
... PostgreSQL Page Representation 95/100
... PostgreSQL Page Representation 96/100
... PostgreSQL Page Representation 97/100
typedef struct PageHeaderData
{
XLogRecPtr pd_lsn;
... PostgreSQL Page Representation 98/100 Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber PageAddItem(Page page,
Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
... PostgreSQL Page Representation 99/100
Both kinds of page have the same page layout.
Exercise 11: PostgreSQL Pages 100/100
Show the values in the tuple header.
with lengths of 60, 80, and 70 bytes