❖ Things To Note |
give
❖ Cost Models |
An important aspect of this course is
❖ Cost Models (cont) |
Since time cost is affected by many factors
For comparing methods, page cost is better
Estimating costs with multiple concurrent operations and buffering is difficult!!
❖ Cost Models (cont) |
Terminology:
❖ Cost Models (cont) |
In developing cost models, we make assumptions on how DBMSs store data:
❖ Cost Models (cont) |
Typical DBMS/table parameter values:
Symbol | E.g. Value | |||
r | 106 | |||
R | 128 bytes | |||
b | 105 | |||
B | 8192 bytes | |||
c | 60 | |||
Tr ,Tw | 10 msec | |||
one page in memory |
- | ≅ 0 |
❖ Cost Models (cont) |
Our cost models are "rough" (based on assumptions)
But do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of piano tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
❖ Exercise: Relation Scan Cost |
Consider a table R(x,y,z) with 105 tuples, implemented as
insert into S select * from R where x > 10;
if 50% of the tuples satisfy the condition.
❖ Storage Technology |
Persistent storage is
❖ Storage Technology (cont) |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
❖ Storage Technology (cont) |
Comparison of HDD and SSD properties:
HDD | SSD | |
Cost/byte | ~ $13 / TB | ~ $35 / TB |
Read latency | ~ 6ms | ~ 50µs |
Write latency | ~ 6ms | ~ 900µs | R/W rate | 150MB/s | 450MB/s |
Read unit | block (e.g. 1KB) | byte |
Writing | write a block | write on empty block |
Will SSDs ever completely replace HDDs?
❖ Storage Management (cont) |
Aims of storage management in DBMS:
❖ Storage Management |
Topics in storage management ...
❖ Views of Data in Query Evaluation (cont) |
Representing database objects during query evaluation:
DB
Rel
Page
Tuple
PageID = FileID+Offset
Offset
TupleID = PageID+Index
Index
❖ File Management |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
❖ File Management (cont) |
Typical file operations provided by the operating system:
fd = open(fileName,mode) // open a named file for reading/writing/appending close(fd) // close an open file, via its descriptor nread = read(fd, buf, nbytes) // attempt to read data from file into buffer nwritten = write(fd, buf, nbytes) // attempt to write data from buffer to file lseek(fd, offset, seek_type) // move file pointer to relative/absolute file offset fsync(fd) // flush contents of file buffers to disk
❖ DBMS File Organisation |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
❖ Single-file DBMS |
Consider a single file for the entire database (e.g. SQLite)
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
❖ Single-file DBMS (cont) |
Allocating space in Unix files is easy:
❖ Single-file Storage Manager |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
❖ Single-file Storage Manager (cont) |
Each file segment consists of a number fixed-size blocks
The following data/constant definitions are useful
#define PAGESIZE 2048 // bytes per page typedef long PageId; // PageId is block index // pageOffset=PageId*PAGESIZE typedef char *Page; // pointer to page/block buffer
Typical PAGESIZE
❖ Single-file Storage Manager (cont) |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec { char *dbname; // copy of database name int fd; // the database file SpaceMap map; // map of free/used areas NameTable names; // map names to areas + sizes } *DB; typedef struct Relrec { char *relname; // copy of table name int start; // page index of start of table data int npages; // number of pages of table data ... } *Rel;
❖ Example: Scanning a Relation |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB"); Rel r = openRelation(db,"Employee"); Page buffer = malloc(PAGESIZE*sizeof(char)); for (int i = 0; i < r->npages; i++) { PageId pid = r->start+i; get_page(db, pid, buffer); for each tuple in buffer { get tuple data and extract name add (name) to result tuples } }
❖ Single-File Storage Manager |
// start using DB, buffer meta-data DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(db->fd); db->names = readNameTable(db->fd); return db; } // stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db->dbname); free(db); }
❖ Single-File Storage Manager (cont) |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname); // get relation data from map tables r->start = ...; r->npages = ...; return r; } // stop using a relation void closeRelation(Rel r) { free(r->relname); free(r); }
❖ Single-File Storage Manager (cont) |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); } // write page from memory buffer to file void put_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
❖ Single-File Storage Manager (cont) |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records // allocate n new pages PageId allocate_pages(DB db, int n) { if (no existing free chunks are large enough) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n); } else { grab "worst fit" chunk split off unused section as new chunk } }
❖ Single-File Storage Manager (cont) |
Similar complexity for freeing chunks
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
if (no adjacent free chunks) {
markUnused(db->map, p, n);
} else {
merge adjacent free chunks
compress mapping table
}
}
Changes take effect when closeDatabase()
❖ Multiple-file Disk Manager |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
❖ Multiple-file Disk Manager (cont) |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
❖ Multiple-file Disk Manager (cont) |
Structure of PageId
If system uses one file per table, PageId
PageId
❖ PostgreSQL Storage Manager (cont) |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode { Oid spcNode; // tablespace Oid dbNode; // database Oid relNode; // relation } RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
❖ Relations as Files (cont) |
The relpath
RelFileNode
char *relpath(RelFileNode r) // simplified { char *path = malloc(ENOUGH_SPACE); if (r.spcNode == GLOBALTABLESPACE_OID) { /* Shared system relations live in PGDATA/global */ Assert(r.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, r.relNode); } else if (r.spcNode == DEFAULTTABLESPACE_OID) { /* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, r.dbNode, r.relNode); } else { /* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir r.spcNode, r.dbNode, r.relNode); } return path; }
❖ Exercise: PostgreSQL Files |
In your PostgreSQL server
$PGDATA
pizza
People
People
❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName, int fileFlags, int fileMode); // open a file in the database directory ($PGDATA/base/...) File OpenTemporaryFile(bool interXact); // open temp file; flag: close at end of transaction? void FileClose(File file); void FileUnlink(File file); int FileRead(File file, char *buffer, int amount); int FileWrite(File file, char *buffer, int amount); int FileSync(File file); long FileSeek(File file, long offset, int whence); int FileTruncate(File file, long offset);
Analogous to Unix syscalls open()
close()
read()
write()
lseek()
❖ File Descriptor Pool (cont) |
Virtual file descriptors (Vfd
VfdCache[0]
❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd { s_short fd; // current FD, or VFD_CLOSED if none u_short fdstate; // bitflags for VFD's state File nextFree; // link to next free VFD, if in freelist File lruMoreRecently; // doubly linked recency-of-use list File lruLessRecently; long seekPos; // current logical file position char *fileName; // name of file, or NULL for unused VFD // NB: fileName is malloc'd, and must be free'd when closing the VFD int fileFlags; // open(2) flags for (re)opening the file int fileMode; // mode to pass to open(2) } Vfd;
❖ File Manager (cont) |
PostgreSQL stores each table
PGDATA/pg_database.oid
❖ File Manager (cont) |
Data files (Oid, Oid.1, ...):
❖ File Manager (cont) |
Free space map (Oid_fsm):
VACUUM
DELETE
xmax
VACUUM
❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c
PageID
PageID
typedef struct { RelFileNode rnode; // which relation/file ForkNumber forkNum; // which fork (of reln) BlockNumber blockNum; // which page/block } BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables // buffer obtained from Buffer pool getBlock(BufferTag pageID, Buffer buf) { Vfd vf; off_t offset; (vf, offset) = findBlock(pageID) lseek(vf.fd, offset, SEEK_SET) vf.seekPos = offset; nread = read(vf.fd, buf, BLOCKSIZE) if (nread < BLOCKSIZE) ... we have a problem }
BLOCKSIZE
❖ File Manager (cont) |
findBlock(BufferTag pageID) returns (Vfd, off_t) { offset = pageID.blockNum * BLOCKSIZE fileName = relpath(pageID.rnode) if (pageID.forkNum > 0) fileName = fileName+"."+pageID.forkNum if (fileName is not in Vfd pool) fd = allocate new Vfd for fileName else fd = use Vfd from pool if (offset > fd.fileSize) { fd = allocate new Vfd for next fork offset = offset - fd.fileSize } return (fd, offset) }
Produced: 19 Feb 2024