COMP9315: Storage: Devices, Files, Pages, Tuples, Buffers, Catalogs
Storage Management |
Storage Management | 2/247 |
Aims of storage management in DBMS:
... Storage Management | 3/247 |
The storage manager provides mechanisms for:
DB
Rel
Page
Tuple
PageId
TupleId
... Storage Management | 4/247 |
Examples of references (addresses) used in DBMSs:
PageID
PageID = FileID + Offset
Offset
TupleID
TupleID = PageID + Index
Index
... Storage Management | 5/247 |
Levels of DBMS related to storage management:
... Storage Management | 6/247 |
Topics to be considered:
Views of Data | 7/247 |
Users and top-level query evaluator see data as
... Views of Data | 8/247 |
Relational operators and access methods see data as
... Views of Data | 9/247 |
File manager sees both DB objects and file store
... Views of Data | 10/247 |
Disk manager sees data as
On typical modern databases, handled by operating system filesystem.
Storage Manager Interface | 11/247 |
The storage manager provides higher levels of system
select name from Employee
is implemented as something like
DB db = openDatabase("myDB"); Rel r = openRel(db,"Employee"); Scan s = startScan(r); Tuple t; while ((t = nextTuple(s)) != NULL) { char *name = getField(t,"name"); printf("%s\n", name); }
... Storage Manager Interface | 12/247 |
The above shows several kinds of operations/mappings:
Data Flow in Query Evaluation | 13/247 |
... Data Flow in Query Evaluation | 14/247 |
Notes on typical implementation strategies:
int
PageId = (FileNum<24)||PageNum
get_page(r,i,buf)
get_page(pid,buf)
DB
Rel
structs
struct RelRec { int fd; int npages; int blksize; } typedef struct RelRec *Rel;
Files in DBMSs | 15/247 |
Data sets can be viewed at several levels of abstraction in DBMSs.
Logical view: a file is a named collection of data items (e.g. a table of tuples)
Abstract physical view: a file is a sequence of fixed-size data blocks.
Physical realisation: a collection of sectors scattered over ≥1 disks.
The abstraction used for managing this: PageId
... Files in DBMSs | 16/247 |
... Files in DBMSs | 17/247 |
Two possibilities for DBMS disk managers to handle data:
File System Interface | 18/247 |
Most access to data on disks in DBMSs is via a file system.
Typical operations provided by the operating system:
fd = open(fileName,mode)// open a named file for reading/writing/appending close(fd)// close an open file, via its descriptor nread = read(fd, buf, nbytes)// attempt to read data from file into buffer nwritten = write(fd, buf, nbytes)// attempt to write data from buffer to file lseek(fd, offset, seek_type)// move file pointer to relative/absolute file offset fsync(fd)// flush contents of file buffers to disk
Storage Technology | 19/247 |
At this point in memory technology development:
Computational Storage | 20/247 |
Characteristics of main memory (RAM):
load reg,byte_address store reg,byte_address
Cache memory has similar characteristics to RAM, but is
Bulk Data Storage | 21/247 |
Requirements for bulk data storage:
... Bulk Data Storage | 22/247 |
Several kinds of bulk data storage technology currently exist:
Magnetic Disks | 23/247 |
Classical/dominant bulk storage technology.
Characteristics:
Modest increase in speed; good reduction in cost.
Optical Disks | 24/247 |
Optical disks provides an alternative spinning disk storage technology.
Several varieties: CD-ROM, CD-R, CD-RW, DVD-RW
Compared to magnetic disks, CD's have
Flash Memory (SSD) | 25/247 |
Flash memory is a non-mechanical alternative to disk storage.
Compared to disks, flash memory has
... Flash Memory (SSD) | 26/247 |
Properties of flash memory require specialised file system
Example: updating data in flash storage
Comparing HDD and SSD | 27/247 |
Comparison of HDD and SSD properties:
HDD | SDD | |
Cost/byte | ~ 4c / GB | ~ 13c / GB |
Read latency | ~ 10ms | ~ 50µs |
Write latency | ~ 10ms | ~ 900µs |
Read unit | block (e.g. 1KB) | byte |
Writing | write a block | write on empty block |
Will SSDs ever replace HDDs?
Disk Management |
Disk Manager | 29/247 |
Aim:
... Disk Manager | 30/247 |
Basic disk management interface is simple:
void get_page(PageId p, Page buf)
PageId
Page
void put_page(PageId p, Page buf)
Page
PageId
PageId allocate_pages(int n)
n
void deallocate_page(PageId p, int n)
n
PageId
Disk Technology | 31/247 |
Disk architecture:
... Disk Technology | 32/247 |
Characteristics of disks:
read block at address (p,t,s) write block at address (p,t,s)
Disk Access Costs | 33/247 |
Access time includes:
... Disk Access Costs | 34/247 |
Example disk #1 characteristics:
... Disk Access Costs | 35/247 |
Time Tr to read one random block on disk #1:
Minimum Tr = 0 + 0 + 0.13 = 0.13 ms
Maximum Tr = 50 + 16.7 + 0.13 = 66.8 ms
Average Tr = 25 + (16.7/2) + 0.13 = 33.5 ms
... Disk Access Costs | 36/247 |
If operating system deals in 4KB blocks:
Tr(4-blocks) = 25 + (16.7/2) + 4×0.13 + 3×0.01 = 33.9 ms
Tr(1-block) = 25 + (16.7/2) + 0.13 = 33.5 ms
Note that the cost of reading 4KB is comparable to reading 1KB.
Sequential access reduces average block read cost significantly, but
... Disk Access Costs | 37/247 |
Example disk #2 characteristics:
If using 32-bit addresses, this leaves 8 bits (28=256 items/block).
Disk Characteristics | 38/247 |
Three important characteristics of disk subsystems:
... Disk Characteristics | 39/247 |
Increasing capacity:
Increasing Disk Capacity | 40/247 |
Compress data (e.g. LZ encoding)
+ more data fits on disk
- compression/expansion overhead
For large compressible data (e.g. text
For most relational data (e.g. int
char(8)
For high-performance memory caching, may never want to expand
(there is current research working on "computable" compressed data formats).
Improving Disk Access Costs | 41/247 |
Approach #1: Use knowledge of data access patterns.
E.g. two records frequently accessed together
⇒ put them in the same block (clustering)
E.g. records scanned sequentially
⇒ place them in "staggered" blocks, double-buffer
Arranging data to match access patterns can improve throughput by 10-20 times.
Approach #2: Avoid reading blocks for each item access.
E.g. buffer blocks in memory, assume likely re-use
Scheduled Disk Access | 42/247 |
Low-level disk manager (driver, controller):
Disk Layout | 43/247 |
If data sets are going to be frequently accessed in a pre-determined manner, arrange data on disk to minimise access time.
E.g. sequential scan
Modern systems generally don't, because of programmer complexity.
Unix has raw disk partitions: no file system, you write driver to manage disk.
Improving Writes | 44/247 |
Nonvolatile write buffers
Double Buffering | 45/247 |
Double-buffering exploits potential concurrency between disk and memory.
While reads/writes to disk are underway, other processing can be done.
With at least two buffers, can keep disk working full-time.
... Double Buffering | 46/247 |
Example: select sum(salary) from Employee
read A into buffer then process buffer content read B into buffer then process buffer content read C into buffer then process buffer content ...
Costs:
... Double Buffering | 47/247 |
Double-buffering approach:
read A into buffer1 process A in buffer1 and concurrently read B into buffer2 process B in buffer2 and concurrently read C into buffer1 ...
Costs:
General observation: use of multiple buffers can lead to substantial
cost savings.
We will see numerous examples where multiple memory buffers
are exploited.
Multiple Disk Systems | 48/247 |
Various strategies can be employed to improve capacity, performance and reliability when multiple disks are available.
RAID (redundant arrays on independent disks) defines a standard set of such techniques.
Essentially, multiple disks allow
(although there is obviously a trade-off between increased capacity and increased reliability via redundancy)
RAID Level 0 | 49/247 |
Uses striping to partition data for one file over several disks
E.g. for n disks, block i in the file is written to disk (i mod n)
Example: file with 6 data blocks striped onto 4 disks using (pid mod 4)
Increases capacity, improves data transfer rates, reduces reliability.
... RAID Level 0 | 50/247 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
disk = diskOf(PageId,ndisks) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect)
(We discuss later how the pid
RAID Level 1 | 51/247 |
Uses mirroring (or shadowing) to store multiple copies of each block.
Since disks can be read/written in parallel, transfer cost unchanged.
Multiple copies allows for single-disk failure with no data loss.
Example: file with 4 data blocks mirrored on two 2-disk partitions
Reduces capacity, improves reliability, no effect on data transfer rates.
... RAID Level 1 | 52/247 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
n = ndisksInPartition disk = diskOf(PageId,n) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect) writeDiskPage(disk+n, cyln, plat, sect)
RAID levels 2-6 | 53/247 |
The higher levels of raid incorporate various combinations of:
The differences are primarily in:
RAID level 6 can recover from smultaneous failures in two disks.
Disk Media Failure | 54/247 |
Rarely, a bit will be transferred to/from the disk incorrectly.
Error-correcting codes can check for and recover from this.
If recovery is not possible, the operation can simply be repeated.
If repeated reads/writes on the same block fail:
Database Objects | 55/247 |
DBMSs maintain various kinds of objects/information:
can be viewed as an super-object for all others | ||
global configuration information | ||
meta-information describing database contents | ||
named collections of tuples | ||
collections of typed field values | ||
access methods for efficient searching | ||
for handling rollback/recovery | ||
active elements |
... Database Objects | 56/247 |
The disk manager implements how DB objects are mapped to file system.
References to data objects typically reduce to e.g.
Offset
PageId
RecordID
PageId+Offset
Single-file DBMS | 57/247 |
One possible storage organisation is a single file for the entire database.
All objects are allocated to regions of this file.
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
... Single-file DBMS | 58/247 |
Allocating space in Unix files is easy:
Single-file Storage Manager | 59/247 |
Consider the following simple single-file DBMS layout:
E.g.
SpaceMap = [ (0,10,U), (10,10,U), (20,600,U), (620,100,U), (720,20,F) ]
TableMap = [ ("employee",20,500), ("project",620,40) ]
... Single-file Storage Manager | 60/247 |
Each file segment consists of a number fixed-size blocks
The following data/constant definitions are useful
#define PAGESIZE 2048// bytes per page typedef long PageId;// PageId is block index // pageOffset=PageId*PAGESIZE typedef char *Page;// pointer to page/block buffer
Typical PAGESIZE
... Single-file Storage Manager | 61/247 |
Storage Manager data structures for opened DBs & Tables
typedef struct DBrec { char *dbname;// copy of database name int fd;// the database file SpaceMap map;// map of free/used areas NameTable names;// map names to areas + sizes } *DB; typedef struct Relrec { char *relname;// copy of table name int start;// page index of start of table data int npages;// number of pages of table data ... } *Rel;
Example: Scanning a Relation | 62/247 |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB"); Rel r = openRelation(db,"Employee"); Page buffer = malloc(PAGESIZE*sizeof(char)); for (int i = 0; i < r->npages; i++) { PageId pid = r->start+i; get_page(db, pid, buffer); for each tuple in buffer { get tuple data and extract name add (name) to result tuples } }
Single-File Storage Manager | 63/247 |
// start using DB, buffer meta-data DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(db->fd); db->names = readNameTable(db->fd); return db; }// stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db->dbname); free(db); }
... Single-File Storage Manager | 64/247 |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname);// get relation data from map tables r->start = ...; r->npages = ...; return r; }// stop using a relation void closeRelation(Rel r) { free(r->relname); free(r); }
... Single-File Storage Manager | 65/247 |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); }// write page from memory buffer to file void put_page(Db db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
... Single-File Storage Manager | 66/247 |
Managing contents of space mapping table can be complex:
// assume an array of (offset,length,status) records // allocate n new pages PageId allocate_pages(int n) { if (no existing free chunks are large enough) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n); } else { grab "worst fit" chunk split off unused section as new chunk }// note that file itself is not changed }
... Single-File Storage Manager | 67/247 |
Similar complexity for freeing chunks
// drop n pages starting from p void deallocate_pages(PageId p, int n) { if (no adjacent free chunks) { markUnused(db->map, p, n); } else { merge adjacent free chunks compress mapping table }// note that file itself is not changed }
Changes take effect when closeDatabase()
Multiple-file Disk Manager | 68/247 |
Most DBMSs don't use a single large file for all data.
They typically provide:
... Multiple-file Disk Manager | 69/247 |
Using multiple files (one file per relation) can be easier
E.g. extending the size of a relation
... Multiple-file Disk Manager | 70/247 |
Structure of PageId
If system uses one file per table, PageId
PageId
Oracle File Structures | 71/247 |
Oracle uses five different kinds of files:
catalogue, tables, procedures | ||
update logs | ||
record system events | ||
configuration info | ||
off-line collected updates |
... Oracle File Structures | 72/247 |
There may be multiple instances of each kind of file:
SYSTEM
... Oracle File Structures | 73/247 |
Tablespaces are logical units of storage (cf directories).
Every database object resides in exactly one tablespace.
Units of storage within a tablespace:
fixed size unit of storage (cf 2KB page) | ||
specific number of contiguous data blocks | ||
set of extents allocated to a single database object |
Segments can span multiple data files; extents cannot.
To be confusing, tables are called datafiles internally in Oracle.
... Oracle File Structures | 74/247 |
Layout of data within Oracle file storage:
PostgreSQL Storage Manager | 75/247 |
PostgreSQL uses the following file organisation ...
... PostgreSQL Storage Manager | 76/247 |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
Relations as Files | 77/247 |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode { Oid spcNode;// tablespace Oid dbNode;// database Oid relNode;// relation } RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
... Relations as Files | 78/247 |
The relpath
RelFileNode
char *relpath(RelFileNode rnode)// simplified { char *path = malloc(ENOUGH_SPACE); if (rnode.spcNode == GLOBALTABLESPACE_OID) {/* Shared system relations live in PGDATA/global */ Assert(rnode.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, rnode.relNode); } else if (rnode.spcNode == DEFAULTTABLESPACE_OID) {/* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, rnode.dbNode, rnode.relNode); } else {/* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir rnode.spcNode, rnode.dbNode, rnode.relNode); } return path; }
File Descriptor Pool | 79/247 |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
Source: include/storage/fd.h
backend/storage/file/fd.c
... File Descriptor Pool | 80/247 |
Interface to file descriptor (pool):
File PathNameOpenFilePerm(char *fileName, int fileFlags, int fileMode);// open a file in the database directory ($PGDATA/base/...) File OpenTemporaryFile(bool interXact);// open temp file; flag: close at end of transaction? void FileClose(File file); void FileUnlink(File file); int FileRead(File file, char *buffer, int amount); int FileWrite(File file, char *buffer, int amount); int FileSync(File file); long FileSeek(File file, long offset, int whence); int FileTruncate(File file, long offset);
... File Descriptor Pool | 81/247 |
Virtual file descriptors (Vfd
VfdCache[0]
... File Descriptor Pool | 82/247 |
Virtual file descriptor records (simplified):
typedef struct vfd { s_short fd;// current FD, or VFD_CLOSED if none u_short fdstate;// bitflags for VFD's state File nextFree;// link to next free VFD, if in freelist File lruMoreRecently;// doubly linked recency-of-use list File lruLessRecently; long seekPos;// current logical file position char *fileName;// name of file, or NULL for unused VFD // NB: fileName is malloc'd, and must be free'd when closing the VFD int fileFlags;// open(2) flags for (re)opening the file int fileMode;// mode to pass to open(2) } Vfd;
File Manager | 83/247 |
The "magnetic disk storage manager"
PageId
PageId
typedef struct { RelFileNode rnode;// which relation ForkNumber forkNum;// which fork BlockNumber blockNum;// which block } BufferTag;
... File Manager | 84/247 |
PostgreSQL stores each table
PGDATA/pg_database.oid
... File Manager | 85/247 |
Data files (Oid, Oid.1, ...):
... File Manager | 86/247 |
Free space map (Oid_fsm):
VACUUM
DELETE
xmax
VACUUM
... File Manager | 87/247 |
Access to a block of data proceeds (roughly) as follows:
// pageID set from pg_catalog tables // buffer obtained from Buffer pool getBlock(BufferTag pageID, Buffer buf) { File fid; off_t offset; int fd; (fid, offset) = findBlock(pageID) fd = VfdCache[fid].fd; lseek(fd, offset, SEEK_SET) VfdCache[fid].seekPos = offset; nread = read(fd, buf, BLOCKSIZE) if (nread < BLOCKSIZE)... we have a problem }
BLOCKSIZE
... File Manager | 88/247 |
findBlock(BufferTag pageID) returns (Vfd, off_t) { offset = pageID.blockNum * BLOCKSIZE fileName = relpath(pageID.rnode) if (pageID.forkNum > 0) fileName = fileName+"."+pageID.forkNum fid = PathNameOpenFIle(fileName, O_READ); fSize = VfdCache[fid].fileSize; if (offset > fSize) { fid = allocate new Vfd for next fork offset = offset - fd.fileSize } return (fd, offset) }
Buffer Pool |
Buffer Manager | 90/247 |
Aim:
Buffer pool
... Buffer Manager | 91/247 |
Buffer pool interposed between access methods and disk manager
Access methods/page manager normally work via get_page() calls;
now work via calls to get_page_via_buffer_pool() (aka request_page())
... Buffer Manager | 92/247 |
Basic buffer pool interface
Page request_page(PageId p);
p
void release_page(PageId p);
p
void mark_page(PageId p);
p
void flush_page(PageId p);
p
void hold_page(PageId p);
p
allocate_page
deallocate_page
Buffer Pool Usage | 93/247 |
How scans are performed without Buffer Pool:
Buffer buf; int N = numberOfBlocks(Rel); for (i = 0; i < N; i++) { pageID = makePageID(db,Rel,i); getBlock(pageID, buf); for (j = 0; j < nTuples(buf); j++) process(buf, j) }
Requires N
If we read it again, N
... Buffer Pool Usage | 94/247 |
How scans are performed with Buffer Pool:
Buffer buf; int N = numberOfBlocks(Rel); for (i = 0; i < N; i++) { pageID = makePageID(db,Rel,i); bufID = request_page(pageID); buf = frames[bufID] for (j = 0; j < nTuples(buf); j++) process(buf, j) release_page(pageID); }
Requires N
If we read it again, 0 ≤ page reads ≤ N
Buffer Pool Data | 95/247 |
... Buffer Pool Data | 96/247 |
Buffer pool data structures:
... Buffer Pool Data | 97/247 |
... Buffer Pool Data | 98/247 |
In subsequent discussion, we assume:
Requesting Pages | 99/247 |
Call from client: request_page(pid)
If page pid
If page pid
... Requesting Pages | 100/247 |
Advantages:
Releasing Pages | 101/247 |
The release_page
If the page has been modified, must be written to disk before replaced.
Possible problem: changes not immediately reflected on disk
... Releasing Pages | 102/247 |
Advantages:
Buffer Manager Example #1 | 103/247 |
Self join: an example where buffer pool achieves major efficiency gains.
Consider a query to find pairs of employees with the same birthday:
select e1.name, e2.name from Employee e1, Employee e2 where e1.id < e2.id and e1.birthday = e2.birthday
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Employee e1 { for each tuple t2 in Employee e2 { if (t1.id < t2.id && t1.birthday == t2.birthday) append (t1.name,t2.name) to result set } }
... Buffer Manager Example #1 | 104/247 |
In terms of page-level operations, the algorithm looks like:
DB db = openDatabase("myDB"); Rel emp = openRel(db,"Employee"); int npages = nPages(emp); for (int i = 0; i < npages; i++) { PageId pid1 = makePageId(emp,i); Page p1 = request_page(pid1); for (int j = 0; j < npages; i++) { PageId pid2 = makePageId(emp,j); Page p2 = request_page(pid2);// compare all pairs of tuples from p1,p2 // construct solution set from matching pairs release_page(pid2); } release_page(pid1); }
... Buffer Manager Example #1 | 105/247 |
Consider a buffer pool with 200 frames and a relation with b ≤ 200 pages:
p1
p2
p2
Employee
... Buffer Manager Example #1 | 106/247 |
Now consider a buffer pool with 2 frames (the minimum required for the join):
p1
p2
p2
p2
... Buffer Manager Example #1 | 107/247 |
(... continued)
p2
p2
p1
p2
p1
Cf. 200-frame buffer vs 2-frame buffer ... if b=100, 100 reads vs 10000 reads.
Buffer Pool Implementation | 108/247 |
Buffer pool data structures:
typedef char Page[PAGESIZE]; typedef ... PageID;// defined earlier typedef struct _FrameData { PageID pid;// which page is in frame int pin_count;// how many processes using page int dirty;// page modified since loaded? Time last_used;// when page was last accessed } FrameData; Page frames[NBUFS];// actual buffers FrameData directory[NBUFS];
... Buffer Pool Implementation | 109/247 |
Implementation of request_page()
int request_page(PageID pid) { bufID = findInPool(pid) if (pid == NOT_FOUND) { if (no free frames in Pool) { bufID = findFrameToReplace() if (directory[bufID].dirty) old = directory[bufID].page put_page(old, frames[bufID]); } bufID = index of freed frame directory[bufID].page = pid directory[bufID].pin_count = 0 directory[bufID].dirty = 0 get_page(pid, frames[bufID]); } directory[bufID].pin_count++ return bufID }
Other Buffer Operations | 110/247 |
The release_page
The
The
Several schemes are commonly in use:
For DBMS, we can predict patterns of page access better
The cost benefits from a buffer pool (with n frames) is determined by:
First scan costs b reads; subsequent scans are ``free''.
Example (b): sequential scan, MRU, n < b, no competition
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b, no competition
All scans cost b reads; known as sequential flooding.
How to determine when a page in the buffer was last accessed?
Could simply use the time of the last
But this doesn't reflect real accesses to page.
For more realism, could use last
Or could introduce operations for examining and modifying pages in pool:
Standard join: an example where replacement policy can have large impact.
Consider a query to find customers who are also employees:
This might be implemented inside the DBMS via nested loops:
Assume that:
Works well with MRU strategy:
Note: assumes that both
Works less well with LRU strategy:
PostgreSQL buffer manager:
Buffers are located in a large region of shared memory.
Functions:
Definitions:
Buffer pool consists of:
Definitions related to buffer manager:
Buffer manager data types:
Buffer manager interface:
Buffer manager interface (cont):
Additional buffer manager functions:
Important internal buffer manager function:
PostgreSQL page replacement strategy: clock-sweep
The disk and buffer manager provide the following view:
The abstract view of a relation:
We use the following low-level abstractions:
We use the following high-level abstractions:
A table is defined by a collection of attributes (schema), e.g.
A tuple is a collection of attribute values for such a schema, e.g.
A record is a sequence of bytes, containing data for one tuple.
Aim:
Operations to access records from a page ...
Operations to make changes to records in a page ...
Typ
Also need operations to convert between
Recall previous example of simple scan of a relation:
implemented as:
Conceptually, the scanning implementation is simple:
The real implementation relies on the buffer manager:
And similarly the
The implementation of
A
Some DBMSs provide
PostgreSQL provides a unique OID for every row in the database.
Consider a DBMS like Oracle which uses a small number of large files.
Suitable
(Note: however you partition the bits, you can address at most 4 billion records)
Consider a DBMS like MiniSQL, which uses one data file per relation.
One possibility is a variation on the Oracle approach:
Functions for constructing/interrogating
Alternative implementation if details of file/page are hidden within
Records are stored within fixed-length pages.
Records may be fixed-length:
Encoding scheme for fixed-length records:
Since record format is frequently consulted at query time, it should be memory-resident.
Advantages of fixed-length records:
Handling attempts to insert values larger than available fields:
Some encoding schemes for variable-length records:
More encoding schemes for variable-length records:
Advantages of variable-length records:
How to handle record that does not fit into free space in page?
Two approaches:
Advantages of spanned records:
A
DBMSs typically define a fixed set of field types for use in schema.
E.g.
This determines the primitive types to be handled in the implementation:
To convert a
For the example relation:
This defines the schema
A
A
It also necessary to know how the
Assume the following
Assume also that lengths are 1-byte quantities
(no field longer than 256-bytes).
How the
Definitions:
Functions:
PostgreSQL defines tuples via:
Tuple structure:
Tuple-related data types:
The actual data value:
Tuple-related data types: (cont)
Operations on Tuples:
Ultimately, a
We want to interpret/manipulate it as a collection of
Typical operations on
Factors affecting
For fixed-length records, use record slots.
Insertion: place new record in first available slot.
Deletion: two possibilities for handling free record slots:
Problem with packed format and no slot directory
Problem with unpacked/bitmap format
For variable-length records, use slot directory.
Possibilities for handling free-space within block:
Compacted free space:
Note: "pointers" are implemented as word offsets within block.
Fragmented free space:
How many records can fit in a page?
(How long is a piece of string?)
Depends on: page size, (avg) record size, slot directory, ...
For a typical DBMS application
Example of determining space utilisation ...
Assumptions:
Max record size = 4(offsets) + 4 + 20 + 12 + 4 = 44 bytes
Minimum number of records = 1024/44 = 23
(assume all max size and no directory)
Average number of records = 1024/40 = 25
(assume no directory)
So, allow 32 directory slots (5-bit slot indexes), and 32 bytes for directory.
Number of records = Nr, where 44 × Nr + 32+4 ≤ 1024
Aim to maximise Nr, so Nr = 22
Notes: because there are 32 slots, could have up to 32 (small) records
If we switched to 8KB pages, then
So, allow 256 slots (8-bit slot indexes), and 352 bytes for directory (256*11bits)
Number of records = Nr, where 44 × Nr + 352 ≤ 8192
Aim to maximise Nr, so Nr = 178
Could reduce size of directory to allow more records ... but only so far.
Note: 11-bit directory entries also means that it's costly to access them.
Sometimes, it may not be possible to insert a record into a page:
If there is still insufficient space, we have one of the other cases.
How the other cases are handled depends on the file organisation:
Overflow files for very large records and BLOBs:
Page-based handling of overflows:
Useful for scan-all-records type operations.
Record-based handling of overflows:
Useful for locating specific record via rid.
Functions:
Definitions:
Each page is 8KB (default
PostgreSQL tuple page layout:
Page-related data types:
Page-related data types: (cont)
Operations on
PostgreSQL has two kinds of pages:
One important difference:
RDBMSs manage different kinds of objects
How are the different types of objects represented?
How do we go from a name (or OID) to bytes stored on disk?
Top-level "objects" in typical SQL standard databases:
catalog ... SQL terminology for a database
Consider what information the RDBMS needs about relations:
All of this information is stored in the system catalog.
(The "system catalog" is also called "data dictionary" or "system view")
In most RDBMSs, the catalog itself is also stored as tables.
Standard for catalogs in SQL:2003:
For complete details, see Section 37 of the PostgreSQL 11.3 documentation.
Most DBMSs also have their own internal catalog structure.
Would typically contain information such as:
Standard SQL
The catalog is manipulated by a range of SQL operations:
E.g. consider an SQL DDL operation such as:
This would produce a set of catalog changes something like ...
In PostgreSQL, the system catalog is available to users via:
The
A PostgreSQL installation typically has several databases.
Some catalog information is global, e.g.
Global installation data is recorded in shared tables
PostgreSQL tuples contain
In version 8, PostgreSQL merged notions of users/groups into roles.
Represented by two base tables:
View
View
Both tables are global (shared across all DBs in a cluster).
etc. etc.
etc. etc.
Note the use of multi-valued attribute (PostgreSQL extension)
Above the level of individual DB schemata, we have:
Keys are names (strings) and must be unique within cluster.
etc. etc.
Digression: access control lists (
PostgreSQL represents access via an array of access elements.
Each access element contains:
where Privileges is a string enumerating privileges, e.g.
Note that
Two pre-defined tablespaces:
Entries in multiple catalog tables are required for each user-level table.
Due to O-O heritage, base table for tables is called
The
Tuples in
etc.
Note: a complex type is automatically created for each table
(We discuss more details of the
Also has notion of large data being stored in a separate table
(so-called "TOAST" table).
An SQL DDL statement like
will cause entries to be made in the following tables:
The example leads to a series of database changes like
(Names are automatically generated from context (
For foreign-key constraints,
Foreign keys also introduce triggers to perform checking.
For column-specific constraints:
An SQL DDL statement like
will cause similar entries as before in catalogs, plus
The example leads to a series of database changes like
Stored procedures (functions) are defined as
Stored procedures are represented in the catalog via
etc.
Consider two alternative ways of defining a x2 function.
or
The above leads to a series of database changes like
Users can define their own aggregate functions (like
Requires definition of three components:
The aggregate's name is stored in the
Consider defining your own
Need to define a new aggregate:
and need to define functions to support aggregate ...
Users can define their own operators to use in expressions.
Operators are syntactic sugar for unary/binary functions.
Consider defining an operator for the
Operator definitions are stored in
Users can also define new data types, which includes
All attributes are references to
All data types need access methods for querying.
The following catalogs tables are involved in this:
All attributes are references to
Functions drive the query evaluation process.
All attributes are references to
Functions implement different aspects of updating data/index files.
Built-in access methods:
Produced: 22 Aug 2019
mark_page
operation:
Note: doesn't actually write to disk; indicates that frame needs to be written if used for replacement;
flush_page
Note: not generally used by higher levels of DBMS; they rely on request/release protocol.
write_page
Page Replacement Policies 111/247
LRU works for VM because of working set model
(recent past accesses determines future accesses)
(from our knowledge of how the relational operations are implemented)
... Page Replacement Policies 112/247
Example (a): sequential scan, LRU or MRU, n ≥ b, no competition
Page Access Times 113/247 request_page
PageId
request_page
release_page
examine_page(PageId, TupleId)
modify_page(PageId, TupleId, Tuple)
Buffer Manager Example #2 114/247
select c.name
from Customer c, Employee e
where c.ssn = e.ssn;
for each tuple t1 in Customer {
for each tuple t2 in Employee {
if (t1.ssn == t2.ssn)
append (t1.name) to result set
}
}
... Buffer Manager Example #2 115/247
Customer
Employee
... Buffer Manager Example #2 116/247
Total page reads = bC + bE + (bC-1) × (bE - (n-3)) = 20 + 10 + 19*(10-3) = 163
Customer
Employee
Customer
Customer
Employee
request_page
release_page
... Buffer Manager Example #2 117/247
Total page reads = bC + bC × bE = 20 + 20*10 = 220
Customer
Employee
Employee
Customer
Employee
PostgreSQL Buffer Manager 118/247
Same code used by backends which need a local buffer pool.
src/backend/storage/buffer/*.c
src/include/storage/buf*.h
... PostgreSQL Buffer Manager 119/247
Nbuffers
BufferDesc
Nbuffers
Buffer
BufferDesc
Buffer
shared_buffers = 16MB
... PostgreSQL Buffer Manager 120/247
... PostgreSQL Buffer Manager 121/247 include/storage/buf.h
Buffer
include/storage/bufmgr.h
(i.e. the functions that other parts of the system call to user buffer manager)
include/storage/buf_internals.h
Code in: BufferDesc
backend/storage/buffer/
Buffer Pool Data Objects 122/247 BufferDescriptors
Buffer
BufferDescriptors
BufMgrLock
BufferTag
... Buffer Pool Data Objects 123/247
Buffer Pool Functions 124/247 Buffer ReadBuffer(Relation r, BlockNumber n)
Actually a special case of r
(may need to remove an existing unpinned page and read data from file)
Buffer
ForkNumber
ReadBuffer_Common
... Buffer Pool Functions 125/247 void ReleaseBuffer(Buffer buf)
ensures all activity on buffer is completed before returning
void MarkBufferDirty(Buffer buf)
... Buffer Pool Functions 126/247 Page BufferGetPage(Buffer buf)
BufferIsPinned(Buffer buf)
CheckPointBuffers
etc. etc. etc.
... Buffer Pool Functions 127/247 BufferDesc *BufferAlloc(
Relation r, ForkNumber f,
BlockNumber n, bool *found)
ReadBuffer
ReadBuffer
Clock-sweep Replacement Strategy 128/247
For specialised kinds of access (e.g. sequential scan),
can allocate a private "buffer ring"
with different replacement strategy.
NextVictimBuffer
usage_count
NextVictimBuffer
Record/Tuple Management
Views of Data 130/247
Database applications view data as:
PageId
Standard terminology: records are also called tuples, items, rows, ...
RecordId
RID
... Views of Data 131/247
The physical representation of a relation:
... Views of Data 132/247 RecPage
byte[]
Record
... Views of Data 133/247 Relation
Tuple
Records vs Tuples 134/247
create table Employee (
id# integer primary key,
name varchar(20), -- or char(20)
job varchar(10), -- or char(10)
dept number(4)
);
(33357462, 'Neil Young', 'Musician', 0277)
Record Management 135/247
In other words, the record manager reconciles the views of a block:
Tuple
Record
RecordId
Tuple
RecPage
Assumptions (neither of which are essential):
Page-level Operations 136/247 Record get_record(RecordId rid)
rid
Record
Record first_record()
Record
Record next_record()
Record
null
... Page-level Operations 137/247 void update_record(RecordId rid, Record rec)
rid
rec
RecordId insert_record(Record rec)
rid
void delete_record(RecordId rid)
rid
Tuple-level Operations 138/247 get
Field(int fno)
Examples: fno
Tuple
getIntField(1)
getStringField(2)
void
set
Field(int fno,
val)
Examples: fno
Tuple
val
setIntField(1,42)
setStringField(2,"abc")
Record
Tuple
Relation-level Operations 139/247 Tuple get_tuple(RecordId rid)
rid
Tuple
Tuple first_tuple()
Tuple
Tuple next_tuple()
Plus operations to insert, delete and modify Tuple
null
Tuples
Tuples
Records
Example Query 140/247
select name from Employee
DB db = openDatabase("myDB");
Rel r = openRel(db,"Employee");
Scan s = startScan(r);
Tuple t;
while ((t = nextTuple(s)) != NULL)
{
char *name = getField(t,"name");
printf("%s\n", name);
}
... Example Query 141/247
... Example Query 142/247
struct ScanRec {
Rel curRel; PageId curPID; RecPage curPage;
};
typedef struct ScanRec *Scan;
Scan startScan(Rel r)
{
Scan s = malloc(sizeof(struct ScanRec));
s->curPID = firstPageId(r);
Buffer page = request_page(s->curPage);
s->curPage = start_page_scan(page);
return s;
}
... Example Query 143/247 nextTuple()
Tuple nextTuple(Scan s)
{
Record Identifiers 144/247 RecordID
RecordId
If multiple files for a relation, then also need:
(Or, more likely, use a PageId
ROWID
... Record Identifiers 145/247 RecordID
E.g. with 4KB pages and 16 bits available for page addressing
E.g. using indexes into a slot table to identify records within a page
(page addresses are all of the form 0x0000, 0x1000, 0x2000, 0x3000, ...
RecordId
Example RecordId
146/247 RecordId
Example:
... Example RecordId
147/247
Another possibility is
(Under this scheme, there will be multiple records in the DB with the same RecordId
rid
Manipulating RecordId
148/247 RecordId
typedef unsigned int RecordId;
RecordId makeRecordId(int file, int page, int slot) {
return (file << 28) | (page << 8) | (slot);
}
int fileNo(RecordId rid) { return (rid >> 28) & 0xF; }
int pageNo(RecordId rid) { return (rid >> 8) & 0xFFFFF; }
int slotNo(RecordId rid) { return rid & 0xFF; }
... Manipulating RecordId
149/247 PageId
typedef unsigned int PageId;
Record Formats 150/247
Records may be variable-length:
Fixed-length Records 151/247
... Fixed-length Records 152/247
Disadvantages of fixed-length records:
Note: if all records were close to specified maximum size,
this would be the most compact format.
... Fixed-length Records 153/247
Alignment considerations (for numeric fields) may require:
RecordId
varchar
Variable-length Records 154/247
... Variable-length Records 155/247
<employee>
<id#>33357462</id#> <dept>0277</dept>
<name>Neil Young</name>
<job>Musician</job>
</employee>
Tuple
Record
... Variable-length Records 156/247
Disadvantages of variable-length records:
Spanned Records 157/247
... Spanned Records 158/247
Disadvantages of spanned records:
More common strategy than spanning:
Converting Records to Tuples 159/247 Record
The information on how to interpret the bytes
byte[]
Tuple
For variable-length records, further formatting information is stored in the record itself.
... Converting Records to Tuples 160/247 DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
DATE
time_t
FLOAT
float,double
INTEGER
int,long
NUMBER(
)
int[]
VARCHAR(
)
char[]
Defining Tuple
161/247 Record
Tuple
This leads to two structs: FieldDesc
RelnDesc
typedef struct {
short offset;
... Defining Tuple
162/247
FieldDesc fields[] = malloc(4*sizeof(FieldDesc);
fields[0] = FieldDesc(0,4,INTEGER);
fields[1] = FieldDesc(4,20,VARCHAR);
fields[2] = FieldDesc(24,10,CHAR);
fields[3] = FieldDesc(34,4,NUMBER);
... Defining Tuple
163/247 Tuple
Record
typedef struct {
Record data;
... Defining Tuple
164/247 Tuple
Record
RelnDesc
Record
Record
... Defining Tuple
165/247 Record
Tuple
Tuple mkTuple(RelnDesc schema, Record record)
{
int i, pos = 0;
int size = sizeof(Tuple) +
(nfields-1)*sizeof(FieldDesc);
Tuple *t = malloc(size);
t->data = record;
t->nfields = schema.nfields;
for (i=0; i < schema.nfields; i++) {
int len = record[pos++];
t->fields[i].offset = pos;
t->fields[i].length = len;
PostgreSQL Tuples 166/247 src/include/access/*tup*.h
src/backend/access/common/*tup*.c
Datum
... PostgreSQL Tuples 167/247
... PostgreSQL Tuples 168/247
Datum
int
... PostgreSQL Tuples 169/247
typedef struct HeapTupleFields
... PostgreSQL Tuples 170/247
typedef struct tupleDesc
{
int natts;
... PostgreSQL Tuples 171/247
Page Formats 172/247 Page
byte[]
Record
Page
get(rid)
TupleId
first()
Page
next()
Page
insert(rec)
Page
update(rid,rec)
delete(rid)
Page
... Page Formats 173/247 Page
Implementation of Page
Page
Page
Page
Page
... Page Formats 174/247
... Page Formats 175/247
Could add a slot directory to overcome this, but wastes space.
(e.g. 4KB page requires 12-bit offset (10-bit if word-aligned), 256 slots requires 8-bit index)
... Page Formats 176/247
In practice, a combination is useful:
... Page Formats 177/247
... Page Formats 178/247
Storage Utilisation 179/247
... Storage Utilisation 180/247
(integer,varchar(20),char(10),number(4))
char(10)
PageId
... Storage Utilisation 181/247
... Storage Utilisation 182/247
Minimum number of records = 8192/44 = 186
(assume all max size and no directory)
Overflows 183/247
The first case can initially be handled by compacting the free-space.
... Overflows 184/247
... Overflows 185/247
... Overflows 186/247
PageId
... Overflows 187/247
PostgreSQL Page Representation 188/247 src/backend/storage/page/*.c
src/include/storage/bufpage.h
BLCKSZ
Large data items are stored in separate (TOAST) files.
... PostgreSQL Page Representation 189/247
... PostgreSQL Page Representation 190/247
... PostgreSQL Page Representation 191/247
typedef struct PageHeaderData
{
...
uint16 pd_flags;
... PostgreSQL Page Representation 192/247 Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber
PageAddItem(Page page, Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
... PostgreSQL Page Representation 193/247
Both kinds of page have the same page layout.
Representing Database Objects
Database Objects 195/247
Many objects have names (and, in PostgreSQL, all have OIDs).
... Database Objects 196/247
schema ... collection of DB object definitions
tablespace ... collection of DB files
Schema.Relation
PostgreSQL also has cluster: a server managing a set of DBs.
... Database Objects 197/247
Similarly for other DBMS objects
(e.g. views, functions, triggers, ...)
... Database Objects 198/247 INFORMATION_SCHEMA
Schemata(catalog_name, schema_name, schema_owner, ...)
Tables(table_catalog, table_schema, table_name, table_type, ...)
Columns(table_catalog, table_schema, table_name, column_name,
ordinal_position, column_default, is_nullable, data_type, ...)
Views(table_catalog, table_schema, table_name, view_definition,
check_option, is_updatable, is_insertable_into)
Role_table_grants(grantor, grantee, privilege_type, is_grantable,
table_catalog, table_schema, table_name, ...)
System Catalog 199/247
Users(id:int, name:string, ...)
Databases(id:int, name:string, owner:ref(User), ...)
Schemas(id:int, name:string, owner:ref(User), ...)
Types(id:int, name:string, defn:string, size:int, ...)
Tables(id:int, name:string, owner:ref(User),
inSchema:ref(Schema), ...)
Attributes(id, name:string, table:ref(Table),
type:ref(Type), pkey:bool, ...)
INFORMATION_SCHEMA
... System Catalog 200/247
where Object is one of table, view, function, trigger, schema, ...
create
as
drop
alter
grant
on
create table ABC (
x integer primary key,
y integer
);
... System Catalog 201/247
userID := current_user();
schemaID := current_schema();
tabID := nextval('tab_id_seq');
select into intID id
from Types where name='integer';
insert into Tables(id,name,owner,inSchema,...)
values (tabID, 'abc', userID, schema, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
values (attrID, 'x', tabID, intID, true, ...)
attrID := nextval('attr_id_seq');
insert into Attributes(id,name,table,type,pkey,...)
values (attrID, 'y', tabID, intID, false, ...)
... System Catalog 202/247
The low-level representation is available to sysadmins via:
psql
\d
information_schema
(e.g. select * from information_schema.tables;
pg_catalog
pg_tables
PostgreSQL Catalog 203/247 \d?
psql
\dt
list information about tables
\dv
list information about views
\df
list information about functions
\dp
list table access privileges
\dT
list information about data types
\dd
shows comments attached to DB objects
... PostgreSQL Catalog 204/247
Other catalog information is local to each database, e.g
PGDATA/pg_global
... PostgreSQL Catalog 205/247
Each kind of DB object has table(s) to describe it, e.g.
... PostgreSQL Catalog 206/247
OIDs are used as primary keys in many of the catalog tables.
create table
oid
unique identifying number for tuple (optional)
tableoid
which table this tuple belongs to
xmin/xmax
which transaction created/deleted tuple (for MVCC)
Representing Users/Groups 207/247 pg_authid
pg_auth_members
pg_shadow
pg_authid
pg_user
pg_shadow
CREATE
ALTER
DROP
USER
pg_authid
CREATE
ALTER
DROP
GROUP
pg_auth_members
... Representing Users/Groups 208/247 pg_authid
oid
unique integer key for this role
rolname
symbolic name for role (PostgreSQL identifier)
rolpassword
plain or md5-encrypted password
rolcreatedb
can create new databases
rolsuper
is a superuser (owns server process)
rolcatupdate
can update system catalogs
... Representing Users/Groups 209/247 pg_shadow
usename
symbolic user name (e.g.
'jas'
usesysid
integer key to reference user (
pg_authid.oid
passwd
plain or md5-encrypted password
usecreatdb
can create new databases
usesuper
is a superuser (owns server process)
usecatupd
can update system catalogs
... Representing Users/Groups 210/247 pg_group
groname
group name (e.g.
'developers'
grosysid
integer key to reference group
grolist[]
array containing group members
(vector of refs to pg_authid.oid
Representing High-level Objects 211/247
These tables are global to each PostgreSQL cluster.
pg_database
pg_namespace
pg_tablespace
... Representing High-level Objects 212/247 pg_database
datname
database name (e.g.
'mydb'
datdba
database owner (refs
pg_authid.oid
datpath
where files for database are stored
(if not in the PGDATA directory)
datacl[]
access permissions
datistemplate
can be used to clone new databases
(e.g. template0
template1
... Representing High-level Objects 213/247 acl
UserName=Privileges/Grantor
group GroupName=Privileges/Grantor
jas=arwdRxt/jas,fred=r/jas,joe=rwad/jas
... Representing High-level Objects 214/247 pg_namespace
nspname
namespace name (e.g.
'public'
nspowner
namespace owner (refs
pg_authid.oid
nspacl[]
access permissions
nspname
... Representing High-level Objects 215/247 pg_tablespace
spcname
tablespace name (e.g.
'disk5'
spcowner
tablespace owner (refs
pg_authid.oid
spclocation
full filepath to tablespace directory
spcacl[]
access permissions
pg_default
pg_global
Representing Tables 216/247 pg_class
pg_class
CREATE TYPE AS
pg_class
pg_class
... Representing Tables 217/247 pg_class
relname
name of table (e.g.
employee
relnamespace
schema in which table defined
(refs pg_namespace.oid
reltype
data type corresponding to table
(refs pg_type.oid
relowner
owner (refs
pg_authid.oid
reltuples
# tuples in table
relacl
access permissions
... Representing Tables 218/247 pg_class
relkind
what kind of object
'r' = ordinary table, 'i' = index, 'v' = view
'c' = composite type, 'S' = sequence, 's' = special
relnatts
# attributes in table
(how many entries in pg_attribute
relchecks
# of constraints on table
(how many entries in pg_constraint
relhasindex
table has/had an index?
relhaspkey
table has/had a primary key?
... Representing Tables 219/247 pg_type
typname
name of type (e.g.
'integer'
typnamespace
schema in which type defined
(refs pg_namespace.oid
typowner
owner (refs
pg_authid.oid
typtype
what kind of data type
'b' = base type, 'c' = complex (row) type, ...
(defines "type" for each tuple in table; also, type for functions returning SETOF
... Representing Tables 220/247 pg_type
typlen
how much storage used for values
(-1 for variable-length types, e.g. text
typalign
memory alignment for values
('c' = byte-boundary, 'i' = 4-byte-boundary, ...)
typrelid
table associated with complex type
(refs pg_class.oid
typstorage
where/how values are stored
('p' = in-tuple, 'e' = in external table, compressed?)
pg_type
... Representing Tables 221/247 pg_attribute
attname
name of attribute (e.g.
'empname'
attrelid
table this attribute belongs to
(refs pg_class.oid
attnum
attribute position (1..n, sys attrs are -ve)
atttypid
data type of this attribute
(refs pg_type.oid
(attrelid,attnum)
... Representing Tables 222/247 pg_attribute
attlen
storage space required by attribute
(copy of pg_type.typlen
atttypmod
storage space for var-length attributes
(e.g. 6+ATTR_HEADER_SIZE for char(6)
attalign
memory-alignment info (copy of
pg_type.typalign
attndims
number of dimensions if attr is an array
... Representing Tables 223/247 pg_attribute
attnotnull
attribute may not be null?
atthasdef
attribute has a default values
(value is held in pg_attrdef
attisdropped
attribute has been dropped from table
... Representing Tables 224/247
create table MyTable (
a int unique not null,
b char(6)
);
pg_class
pg_attribute
pg_type
... Representing Tables 225/247
rel_oid := new_oid(); user_id = current_user();
insert into
pg_class(oid,name,owner,kind,pages,tuples,...)
values (rel_oid, 'mytable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
insert into
pg_attribute(relid,name,typid,num,len,typmod,notnull...)
values (rel_oid, 'a', int_oid, 1, int_len, -1, true, ...)
select oid,typlen into char_oid,char_len
from pg_type where typname = 'char';
insert into
pg_attribute(relid,name,typid,num,len,typmod,notnull...)
values (rel_oid, 'b', char_oid, 2, -1, 6+4, false, ...)
insert into
pg_type(name,owner,len,type,relid,align,...)
values ('mytable', user_id, 4, 'c', rel_oid, 'i', ...)
... Representing Tables 226/247 pg_attrdef
adrelid
table that column belongs to
(refs pg_class.oid
adnum
which column in the table
(refs pg_attribute.attnum
adsrc
readable representation of default value
adbin
internal representation of default value
... Representing Tables 227/247 pg_constraint
conname
name of constraint (not unique)
connamespace
schema containing this constraint
contype
kind of constraint
'c' = check, 'u' = unique,
'p' = primary key, 'f' = foreign key
conrelid
which table (refs
pg_class.oid
conkey
which attributes
(vector of values from pg_attribute.attnum
consrc
check constraint expression
fkey,check
... Representing Tables 228/247 pg_constraint
confrelid
referenced table for foreign key
confkey
key attributes in foreign table
conkey
corresponding attributes in local table
consrc
readable check constraint expression
conbin
internal check constraint expression
... Representing Tables 229/247
create table MyOtherTable (
x int check (x > 0),
y int references MyTable(a),
z int default -1
);
pg_constraint
x
y
pg_attrdef
z
... Representing Tables 230/247
rel_oid := new_oid(); user_id = current_user();
insert into
pg_class(oid,name,owner,kind,pages,tuples,...)
values (rel_oid, 'myothertable', user_id, 'r', 0, 0, ...)
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
select oid into old_oid
from pg_class where relname='mytable';
Representing Functions 231/247
create function power(int x, int y) returns int
as $$
declare i int; product int := 1;
begin
for i in 1..y loop
product := product * x;
end loop;
return product;
end;
$$ language plpgsql;
pg_proc
pg_type
... Representing Functions 232/247 pg_proc
proname
name of function (e.g.
substr
pronamespace
schema in which function defined
(refs pg_namespace.oid
proowner
owner (refs
pg_authid.oid
proacl[]
access permissions
... Representing Functions 233/247 pg_proc
pronargs
how many arguments
prorettype
return type (refs
pg_type.oid
proargtypes[]
argument types (ref
pg_type.oid
proreset
returns set of values of
prorettype
proisagg
is function an aggregate?
proisstrict
returns null if any arg is null
provolatile
return value depends on side-effects?
('i' = immutable, 's'= stable, 'v' = volatile)
... Representing Functions 234/247 pg_proc
prolang
what language function written in
prosrc
source code if interpreted (e.g. PLpgSQL)
probin
additional info on how to invoke function
(interpretation is language-specific)
... Representing Functions 235/247
sq.c int square_in_c(int x) { return x * x; }
create function square(int) returns int
as '/path/to/sq.o', 'square_in_c' language 'C';
create function square(int) returns int
as $$
begin
return $1 * $1;
end;
$$ language plpgsql;
... Representing Functions 236/247
user_id := current_user();
select oid,typlen into int_oid,int_len
from pg_type where typname = 'int';
insert into
pg_proc(name,owner,rettype,nargs,argtypes,
prosrc,probin...)
values ('square', user_id, int_oid, 1, {int_oid},
'square_in_c', '/path/to/sq.o', ...)
... Representing Functions 237/247 max()
This information is stored in the pg_aggregate
pg_proc
... Representing Functions 238/247 average()
create aggregate average (
basetype = integer,
sfunc = int_avg_accum,
stype = int[],
finalfunc = int_avg_result,
initcond = '{0,0}'
);
... Representing Functions 239/247
create function
int_avg_accum(state int[], int) returns int[]
as $$
declare res int[2];
begin
res[1] := state[1] + $2; res[2] := res[2] + 1;
return res;
end;
$$ language plpgsql;
create function
int_avg_result(state int[]) returns int
as $$
begin
if (state[2] = 0) then return null; end if;
return (state[1] / state[2]);
end;
$$ language plpgsql;
... Representing Functions 240/247 power(x,y)
create operator ** (
procedure = power, leftarg = int, rightarg = int
);
pg_operator
Representing Types 241/247
Consider defining a 3-dimensional point type for spatial data:
create type point3d (
input = point3d_in,
... Representing Types 242/247 pg_type
typinput
text input conversion function
typoutput
text output conversion function
typreceive
binary input conversion function
typsend
binary output conversion function
pg_proc.oid
... Representing Types 243/247
pg_am
pg_opclass
pg_amop
pg_amproc
... Representing Types 244/247 pg_am
amname
name of access method (e.g.
btree
amowner
owner (refs
pg_authid.oid
amorder
strategy
operator for determining sort order
(0 if unsorted)
amcanunique
does AM support unique indexes?
ammulticol
does AM support multicolumn indexes?
amindexnulls
does AM support NULL index entries?
amconcurrent
does AM support concurrent updates?
... Representing Types 245/247 pg_am
amgettuple
"next valid tuple" function
ambeginscan
"start new scan" function
amrescan
"restart this scan" function
amendscan
"end this scan" function
amcostestimate
estimate cost of index scan
pg_proc.oid
... Representing Types 246/247 pg_am
aminsert
"insert this tuple" function
ambuild
"build new index" function
ambulkdelete
bulk delete function
amvacuum
cleanup
post-vacuum cleanup function
pg_proc.oid
... Representing Types 247/247
Some access methods introduce additional files (e.g. B-tree)
heap
btree
hash
rtree
GiST
SP-GiST
GIN