❖ PostgreSQL File Manager (cont) |
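Components of storage subsystem:
RelFileNode: mapping from relations to files
storage/smgr: storage manager (smgr) layer
storage/smgr/md.c: functions for managing files on disk
storage/file: file descriptor pool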
❖ Relations as Files |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode:
typedef struct RelFileNode {
    Oid spcNode;  // tablespace
    Oid dbNode;   // database
    Oid relNode;  // relation
} RelFileNode;
Global (shared) tables (e.g. pg_database) have spcNode == GLOBALTABLESPACE_OID and dbNode == 0.
❖ Relations as Files (cont) |
The relpath function maps a RelFileNode to a file path:
char *relpath(RelFileNode r)  // simplified
{
    char *path = malloc(ENOUGH_SPACE);

    if (r.spcNode == GLOBALTABLESPACE_OID) {
        /* Shared system relations live in PGDATA/global */
        Assert(r.dbNode == 0);
        sprintf(path, "%s/global/%u",
                DataDir, r.relNode);
    }
    else if (r.spcNode == DEFAULTTABLESPACE_OID) {
        /* The default tablespace is PGDATA/base */
        sprintf(path, "%s/base/%u/%u",
                DataDir, r.dbNode, r.relNode);
    }
    else {
        /* All other tablespaces accessed via symlinks */
        sprintf(path, "%s/pg_tblspc/%u/%u/%u",
                DataDir, r.spcNode, r.dbNode, r.relNode);
    }
    return path;
}
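The mapping above can be tried outside PostgreSQL with a small stand-alone sketch. It is not the PostgreSQL source: rel_path and its data_dir parameter are hypothetical names, the result is written into a caller-supplied buffer with snprintf rather than a malloc'd string, and the two tablespace OID values are assumed for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;

/* Assumed values, for illustration only */
#define DEFAULTTABLESPACE_OID 1663
#define GLOBALTABLESPACE_OID  1664

typedef struct RelFileNode {
    Oid spcNode;  /* tablespace */
    Oid dbNode;   /* database */
    Oid relNode;  /* relation */
} RelFileNode;

/*
 * Hypothetical helper: write the path for r into buf.
 * Returns 0 on success, -1 if the path does not fit in size bytes.
 */
static int rel_path(const char *data_dir, RelFileNode r,
                    char *buf, size_t size)
{
    int n;

    if (r.spcNode == GLOBALTABLESPACE_OID) {
        /* shared relations belong to no particular database */
        assert(r.dbNode == 0);
        n = snprintf(buf, size, "%s/global/%u", data_dir, r.relNode);
    } else if (r.spcNode == DEFAULTTABLESPACE_OID) {
        n = snprintf(buf, size, "%s/base/%u/%u",
                     data_dir, r.dbNode, r.relNode);
    } else {
        n = snprintf(buf, size, "%s/pg_tblspc/%u/%u/%u",
                     data_dir, r.spcNode, r.dbNode, r.relNode);
    }
    /* snprintf returns the length it wanted to write */
    return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

Returning an error when the buffer is too small avoids the overflow risk that sprintf into a fixed ENOUGH_SPACE buffer carries.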
❖ File Descriptor Pool |
Unix has limits on the number of concurrently open files.
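PostgreSQL maintains a pool of open file descriptors, which hides this limit from higher-level code and avoids repeated open() calls.
File names are plain strings: typedef char *FileName;
Open files are referenced via: typedef int File;
A File is an index into the table of virtual file descriptors (Vfd).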
❖ File Descriptor Pool (cont) |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName,
                      int fileFlags, int fileMode);
     // open a file in the database directory ($PGDATA/base/...)
File OpenTemporaryFile(bool interXact);
     // open temp file; flag: close at end of transaction?
void FileClose(File file);
void FileUnlink(File file);
int  FileRead(File file, char *buffer, int amount);
int  FileWrite(File file, char *buffer, int amount);
int  FileSync(File file);
long FileSeek(File file, long offset, int whence);
int  FileTruncate(File file, long offset);
Analogous to Unix syscalls open(), close(), read(), write() and lseek().
❖ File Descriptor Pool (cont) |
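Virtual file descriptors (Vfd) are stored in an array, VfdCache, and are also linked into a list ordered by recency of use.
VfdCache[0] is not a usable Vfd; it serves as the list header.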
❖ File Descriptor Pool (cont) |
Virtual file descriptor records (simplified):
typedef struct vfd
{
    s_short  fd;               // current FD, or VFD_CLOSED if none
    u_short  fdstate;          // bitflags for VFD's state
    File     nextFree;         // link to next free VFD, if in freelist
    File     lruMoreRecently;  // doubly linked recency-of-use list
    File     lruLessRecently;
    long     seekPos;          // current logical file position
    char     *fileName;        // name of file, or NULL for unused VFD
    // NB: fileName is malloc'd, and must be free'd when closing the VFD
    int      fileFlags;        // open(2) flags for (re)opening the file
    int      fileMode;         // mode to pass to open(2)
} Vfd;
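How the lruMoreRecently and lruLessRecently links could form a ring anchored at VfdCache[0] is sketched below. This is a simplified illustration, not the PostgreSQL code: the Vfd here keeps only the two link fields, the array has a fixed size NVFDS, and lru_insert, lru_delete, lru_touch and lru_victim are hypothetical names.

```c
#include <assert.h>

typedef int File;

/* Reduced Vfd: only the recency-of-use links */
typedef struct vfd {
    File lruMoreRecently;
    File lruLessRecently;
} Vfd;

#define NVFDS 8   /* assumed fixed pool size for this sketch */

/* Zero-initialised, so entry 0 links to itself: an empty ring */
static Vfd VfdCache[NVFDS];

/* Link file into the ring as the most recently used entry */
static void lru_insert(File file)
{
    Vfd *v = &VfdCache[file];

    v->lruMoreRecently = 0;
    v->lruLessRecently = VfdCache[0].lruLessRecently;
    VfdCache[0].lruLessRecently = file;
    VfdCache[v->lruLessRecently].lruMoreRecently = file;
}

/* Unlink file from the ring */
static void lru_delete(File file)
{
    Vfd *v = &VfdCache[file];

    VfdCache[v->lruLessRecently].lruMoreRecently = v->lruMoreRecently;
    VfdCache[v->lruMoreRecently].lruLessRecently = v->lruLessRecently;
}

/* Record a use of file: move it to the most recently used position */
static void lru_touch(File file)
{
    lru_delete(file);
    lru_insert(file);
}

/* Least recently used entry, or 0 if the ring is empty */
static File lru_victim(void)
{
    return VfdCache[0].lruMoreRecently;
}
```

When the process runs out of real file descriptors, the entry returned by lru_victim is the natural candidate to close; its fileName, fileFlags and seekPos are enough to reopen it later.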
❖ File Manager (cont) |
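PostgreSQL stores each table in the directory of its database, PGDATA/pg_database.oid (a directory named after the database's OID), often as several files (forks).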
❖ File Manager (cont) |
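Data files (Oid, Oid.1, ...): each is a sequence of fixed-size pages (BLOCKSIZE bytes); once a file reaches the maximum file size (MAXFILESIZE), the table continues in the next numbered file.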
❖ File Manager (cont) |
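Free space map (Oid_fsm): records where free space is available in the data pages.
Space only becomes free after VACUUM; DELETE simply marks a tuple as no longer in use by setting its xmax, and VACUUM later reclaims it.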
❖ File Manager (cont) |
The "magnetic disk storage manager" (storage/smgr/md.c) manages the mapping from a PageID to a file and an offset within that file.
PageID values are structured:
typedef struct
{
    RelFileNode rnode;     // which relation/file
    ForkNumber  forkNum;   // which fork (of reln)
    BlockNumber blockNum;  // which page/block
} BufferTag;
❖ File Manager (cont) |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables
// buffer obtained from Buffer pool
getBlock(BufferTag pageID, Buffer buf)
{
    Vfd vf;  off_t offset;
    (vf, offset) = findBlock(pageID)
    lseek(vf.fd, offset, SEEK_SET)
    vf.seekPos = offset;
    nread = read(vf.fd, buf, BLOCKSIZE)
    if (nread < BLOCKSIZE) ... we have a problem
}
BLOCKSIZE is a configurable constant (8192 bytes by default).
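The seek-then-read step can be sketched with plain POSIX calls on an ordinary file descriptor. This leaves out the Vfd pool and the buffer manager entirely; read_block and write_block are hypothetical helpers, and BLOCKSIZE is assumed to be 8192.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSIZE 8192   /* assumed block size */

/*
 * Read block blockNum of the open file fd into buf (BLOCKSIZE bytes).
 * Returns 0 on success, -1 on a seek error, read error or short read.
 */
static int read_block(int fd, unsigned int blockNum, char *buf)
{
    off_t   offset = (off_t) blockNum * BLOCKSIZE;
    ssize_t nread;

    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    nread = read(fd, buf, BLOCKSIZE);
    /* a short read means the block is missing or incomplete */
    return (nread == BLOCKSIZE) ? 0 : -1;
}

/* Write buf (BLOCKSIZE bytes) as block blockNum of the open file fd */
static int write_block(int fd, unsigned int blockNum, const char *buf)
{
    off_t   offset = (off_t) blockNum * BLOCKSIZE;
    ssize_t nwritten;

    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    nwritten = write(fd, buf, BLOCKSIZE);
    return (nwritten == BLOCKSIZE) ? 0 : -1;
}
```

Treating any short read as a failure mirrors the "we have a problem" branch: a page is only usable if all BLOCKSIZE bytes arrive.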
❖ File Manager (cont) |
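findBlock(BufferTag pageID) returns (Vfd, off_t)
{
    // byte position of the page within the whole fork
    offset = pageID.blockNum * BLOCKSIZE
    // each fork is split into files of at most MAXFILESIZE bytes
    segNum = offset / MAXFILESIZE
    fileName = relpath(pageID.rnode)
    if (pageID.forkNum > 0)
        // forkSuffix is a hypothetical helper giving e.g. "fsm"
        fileName = fileName+"_"+forkSuffix(pageID.forkNum)
    if (segNum > 0)
        fileName = fileName+"."+segNum   // e.g. Oid.1
    if (fileName is not in Vfd pool)
        vf = allocate new Vfd for fileName
    else
        vf = use Vfd from pool
    // make the offset relative to the start of that file
    offset = offset - (segNum * MAXFILESIZE)
    return (vf, offset)
}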
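The arithmetic behind locating a page in the data files Oid, Oid.1, ... can be written out as a small sketch. It assumes 8192-byte blocks and a 1GB maximum file size (both are assumptions here, not values taken from the slides); locate_block, segment_name and BlockLocation are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCKSIZE          8192u                     /* assumed */
#define MAXFILESIZE        (1024u * 1024u * 1024u)   /* assumed 1GB */
#define BLOCKS_PER_SEGMENT (MAXFILESIZE / BLOCKSIZE)

typedef struct {
    unsigned int segNum;  /* 0 for Oid, 1 for Oid.1, ... */
    long long    offset;  /* byte offset within that file */
} BlockLocation;

/* Split a block number into (file number, offset within file) */
static BlockLocation locate_block(unsigned int blockNum)
{
    BlockLocation loc;

    loc.segNum = blockNum / BLOCKS_PER_SEGMENT;
    loc.offset = (long long) (blockNum % BLOCKS_PER_SEGMENT) * BLOCKSIZE;
    return loc;
}

/*
 * Build the file name for a given file number: the base path for
 * file 0, base path plus ".N" otherwise.
 * Returns 0 on success, -1 if the name does not fit in size bytes.
 */
static int segment_name(const char *base, unsigned int segNum,
                        char *buf, size_t size)
{
    int n;

    if (segNum == 0)
        n = snprintf(buf, size, "%s", base);
    else
        n = snprintf(buf, size, "%s.%u", base, segNum);
    return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

With these sizes each file holds 131072 blocks, so block 131072 is the first block of Oid.1.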
Produced: 28 Feb 2021