❖ Things To Note |
give❖ Cost Models |
An important aspect of this course is
❖ Cost Models (cont) |
Since time cost is affected by many factors
For comparing methods, page cost is better
Estimating costs with multiple concurrent operations and buffering is difficult!!
❖ Cost Models (cont) |
Terminology:
❖ Cost Models (cont) |
In developing cost models, we make assumptions on how DBMSs store data:
❖ Cost Models (cont) |
Typical DBMS/table parameter values:
| Symbol | E.g. Value | |||
| r | 106 | |||
| R | 128 bytes | |||
| b | 105 | |||
| B | 8192 bytes | |||
| c | 60 | |||
| Tr ,Tw | 10 msec | |||
one page in memory |
- | ≅ 0 |
❖ Cost Models (cont) |
Our cost models are "rough" (based on assumptions)
But do give an O(x) feel for how expensive operations are.
Example "rough" estimation: how many piano tuners in Sydney?
Actual number of piano tuners in Yellow Pages = 120
Example borrowed from Alan Fekete at Sydney University.
❖ Exercise: Relation Scan Cost |
Consider a table R(x,y,z) with 105 tuples, implemented as
insert into S select * from R where x > 10;
if 50% of the tuples satisfy the condition.
❖ Storage Technology |
Persistent storage is
❖ Storage Technology (cont) |
Hard disks are well-established, cheap, high-volume, ...
Alternative bulk storage: SSD
❖ Storage Technology (cont) |
Comparison of HDD and SSD properties:
| HDD | SSD | |
| Cost/byte | ~ $13 / TB | ~ $35 / TB |
| Read latency | ~ 6ms | ~ 50µs |
| Write latency | ~ 6ms | ~ 900µs | R/W rate | 150MB/s | 450MB/s |
| Read unit | block (e.g. 1KB) | byte |
| Writing | write a block | write on empty block |
Will SSDs ever completely replace HDDs?
❖ Storage Management (cont) |
Aims of storage management in DBMS:
❖ Storage Management |
Topics in storage management ...
❖ Views of Data in Query Evaluation (cont) |
Representing database objects during query evaluation:
DBRelPageTuplePageID = FileID+OffsetOffsetTupleID = PageID+IndexIndex❖ File Management |
Aims of file management subsystem:
Builds higher-level operations on top of OS file operations.
❖ File Management (cont) |
Typical file operations provided by the operating system:
fd = open(fileName,mode) // open a named file for reading/writing/appending close(fd) // close an open file, via its descriptor nread = read(fd, buf, nbytes) // attempt to read data from file into buffer nwritten = write(fd, buf, nbytes) // attempt to write data from buffer to file lseek(fd, offset, seek_type) // move file pointer to relative/absolute file offset fsync(fd) // flush contents of file buffers to disk
❖ DBMS File Organisation |
How is data for DB objects arranged in the file system?
Different DBMSs make different choices, e.g.
❖ Single-file DBMS |
Consider a single file for the entire database (e.g. SQLite)
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
❖ Single-file DBMS (cont) |
Allocating space in Unix files is easy:
❖ Single-file Storage Manager |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
❖ Single-file Storage Manager (cont) |
Each file segment consists of a number fixed-size blocks
The following data/constant definitions are useful
#define PAGESIZE 2048 // bytes per page typedef long PageId; // PageId is block index // pageOffset=PageId*PAGESIZE typedef char *Page; // pointer to page/block buffer
Typical PAGESIZE
❖ Single-file Storage Manager (cont) |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec {
char *dbname; // copy of database name
int fd; // the database file
SpaceMap map; // map of free/used areas
NameTable names; // map names to areas + sizes
} *DB;
typedef struct Relrec {
char *relname; // copy of table name
int start; // page index of start of table data
int npages; // number of pages of table data
...
} *Rel;
❖ Example: Scanning a Relation |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB");
Rel r = openRelation(db,"Employee");
Page buffer = malloc(PAGESIZE*sizeof(char));
for (int i = 0; i < r->npages; i++) {
PageId pid = r->start+i;
get_page(db, pid, buffer);
for each tuple in buffer {
get tuple data and extract name
add (name) to result tuples
}
}
❖ Single-File Storage Manager |
// start using DB, buffer meta-data DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(db->fd); db->names = readNameTable(db->fd); return db; } // stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db->dbname); free(db); }
❖ Single-File Storage Manager (cont) |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname); // get relation data from map tables r->start = ...; r->npages = ...; return r; } // stop using a relation void closeRelation(Rel r) { free(r->relname); free(r); }
❖ Single-File Storage Manager (cont) |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); } // write page from memory buffer to file void put_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
❖ Single-File Storage Manager (cont) |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records // allocate n new pages PageId allocate_pages(DB db, int n) { if (no existing free chunks are large enough) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n); } else { grab "worst fit" chunk split off unused section as new chunk } }
❖ Single-File Storage Manager (cont) |
Similar complexity for freeing chunks
// drop n pages starting from p
void deallocate_pages(DB db, PageId p, int n) {
if (no adjacent free chunks) {
markUnused(db->map, p, n);
} else {
merge adjacent free chunks
compress mapping table
}
}
Changes take effect when closeDatabase()
❖ Multiple-file Disk Manager |
Most DBMSs don't use a single large file for all data.
They typically provide:
Using multiple files (one file per relation) can be easier, e.g.
❖ Multiple-file Disk Manager (cont) |
Example of single-file vs multiple-file:
Consider how you would compute file offset of page[i] in table[1] ...
❖ Multiple-file Disk Manager (cont) |
Structure of PageId
If system uses one file per table, PageId
PageId❖ PostgreSQL Storage Manager (cont) |
Components of storage subsystem:
RelFileNodestorage/smgrstorage/smgr/md.cstorage/filesmgr❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode {
Oid spcNode; // tablespace
Oid dbNode; // database
Oid relNode; // relation
} RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OIDdbNode == 0❖ Relations as Files (cont) |
The relpathRelFileNode
char *relpath(RelFileNode r) // simplified { char *path = malloc(ENOUGH_SPACE); if (r.spcNode == GLOBALTABLESPACE_OID) { /* Shared system relations live in PGDATA/global */ Assert(r.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, r.relNode); } else if (r.spcNode == DEFAULTTABLESPACE_OID) { /* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, r.dbNode, r.relNode); } else { /* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir r.spcNode, r.dbNode, r.relNode); } return path; }
❖ Exercise: PostgreSQL Files |
In your PostgreSQL server
$PGDATApizzaPeoplePeople❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()typedef char *FileName
Open files are referenced via: typedef int File
A File
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName,
int fileFlags, int fileMode);
// open a file in the database directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact);
// open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int FileRead(File file, char *buffer, int amount);
int FileWrite(File file, char *buffer, int amount);
int FileSync(File file);
long FileSeek(File file, long offset, int whence);
int FileTruncate(File file, long offset);
Analogous to Unix syscalls open()close()read()write()lseek()
❖ File Descriptor Pool (cont) |
Virtual file descriptors (Vfd
VfdCache[0]❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd
{
s_short fd; // current FD, or VFD_CLOSED if none
u_short fdstate; // bitflags for VFD's state
File nextFree; // link to next free VFD, if in freelist
File lruMoreRecently; // doubly linked recency-of-use list
File lruLessRecently;
long seekPos; // current logical file position
char *fileName; // name of file, or NULL for unused VFD
// NB: fileName is malloc'd, and must be free'd when closing the VFD
int fileFlags; // open(2) flags for (re)opening the file
int fileMode; // mode to pass to open(2)
} Vfd;
❖ File Manager (cont) |
PostgreSQL stores each table
PGDATA/pg_database.oid
❖ File Manager (cont) |
Data files (Oid, Oid.1, ...):
❖ File Manager (cont) |
Free space map (Oid_fsm):
VACUUMDELETExmaxVACUUM❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c
PageIDPageID
typedef struct
{
RelFileNode rnode; // which relation/file
ForkNumber forkNum; // which fork (of reln)
BlockNumber blockNum; // which page/block
} BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables // buffer obtained from Buffer pool getBlock(BufferTag pageID, Buffer buf) { Vfd vf; off_t offset; (vf, offset) = findBlock(pageID) lseek(vf.fd, offset, SEEK_SET) vf.seekPos = offset; nread = read(vf.fd, buf, BLOCKSIZE) if (nread < BLOCKSIZE) ... we have a problem }
BLOCKSIZE
❖ File Manager (cont) |
findBlock(BufferTag pageID) returns (Vfd, off_t)
{
offset = pageID.blockNum * BLOCKSIZE
fileName = relpath(pageID.rnode)
if (pageID.forkNum > 0)
fileName = fileName+"."+pageID.forkNum
if (fileName is not in Vfd pool)
fd = allocate new Vfd for fileName
else
fd = use Vfd from pool
if (offset > fd.fileSize) {
fd = allocate new Vfd for next fork
offset = offset - fd.fileSize
}
return (fd, offset)
}
Produced: 19 Feb 2024