❖ Things To Note |
❖ Assignment 2 |
An implementation of multi-attribute linear hashing
❖ Assignment 2 (cont) |
Implementation involves three main commands
create
insert
select
stats
dump
gendata
main()
❖ Assignment 2 (cont) |
The commands are supported by a set of ADTs:
bits
chvec
hash
page
query
reln
tuple
util
.h
.c
❖ Assignment 2 (cont) |
As supplied, all code compiles
create
insert
select
Your task: complete the ADTs so that all commands work.
❖ Assignment 2 (cont) |
What to change:
tuple.c
query.c
reln.c
zip ass2.zip all changed files
give cs9315 ass2 ass2.zip
create.c
insert.c
select.c
❖ Indexing with Signatures |
Signature-based indexing:
Each tuple is associated with a signature
❖ Indexing with Signatures (cont) |
File organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement
❖ Signatures |
A signature "summarises" the data in one tuple
A tuple consists of N attribute values A1 .. An
A codeword cw(Ai) is
❖ Generating Codewords |
Generating a k-in-m codeword for attribute Ai
bits codeword(char *attr_value, int m, int k) { int nbits = 0; // count of set bits bits cword = 0; // assuming m <= 32 bits srandom(hash(attr_value)); while (nbits < k) { int i = random() % m; if (((1 << i) & cword) == 0) { cword |= (1 << i); nbits++; } } return cword; // m-bits with k 1-bits and m-k 0-bits }
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (simc) indexing scheme
OR
OR
OR
bits desc = 0 for (i = 1; i <= n; i++) { bits cw = codeword(A[i]) desc = desc | cw }
❖ SIMC Example |
Consider the following tuple (from bank deposit database)
Branch | AcctNo | Name | Amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 12, k = 2 )
Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(r) | 010011000111 |
❖ SIMC Queries |
To answer query q in SIMC
desc(q) is formed by OR of codewords for known attributes.
E.g. consider the query (Perryridge, ?, ?, ?)
Ai | cw(Ai) |
Perryridge | 010000000001 |
? |
000000000000 |
? |
000000000000 |
? |
000000000000 |
desc(q) | 010000000001 |
❖ SIMC Queries (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {}
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = pageOf(tupleID(i))
pagesToCheck = pagesToCheck ∪ pid
}
}
for each P in pagesToCheck {
Buf = getPage(f,P)
check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc)
((rdesc & qdesc) == qdesc)
❖ Example SIMC Query |
Consider the query and the example database:
Signature | Deposit Record |
010000000001 |
(Perryridge,?,?,?) |
100101001001 |
(Brighton,217,Green,750) |
010011000111 |
(Perryridge,102,Hayes,400) |
101001001001 |
(Downtown,101,Johnshon,512) |
101100000011 |
(Mianus,215,Smith,700) |
010101010101 |
(Clearview,117,Throggs,295) |
100101010011 |
(Redwood,222,Lindsay,695) |
Gives two matches: one true match, one false match.
❖ SIMC Parameters |
False match probablity pF = likelihood of a false match
How to reduce likelihood of false matches?
Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased hash collisions.
❖ SIMC Parameters (cont) |
How to determine "optimal" m and k?
m = ( 1/loge 2 )2 . n . loge ( 1/pF )
❖ Query Cost for SIMC |
Cost to answer pmr query: Costpmr = bD + bq
E.g. m=64, B=8192, r=104 ⇒ cD = 1024, bD=10
bq includes pages with rq matching tuples and rF false matches
Expected false matches = rF = (r - rq).pF ≅ r.pF if rq ≪ r
E.g. Worst bq = rq+rF, Best bq = 1, Avg bq = ceil(b(rq+rF)/r)
❖ Exercise: SIMC Query Cost |
Consider a SIMC-indexed database with the following properties
❖ Page-level SIMC |
SIMC has one descriptor per tuple ... potentially inefficient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor (PD) (clearly larger than tuple descriptor):
Typically, pages are 1..8KB ⇒ 1..9 PD/page (NPD).
❖ Exercise: Page-level SIMC Query Cost |
Consider a SIMC-indexed database with the following properties
❖ Bit-sliced SIMC (cont) |
At query time
matches = ~0 //all ones
for each bit i set to 1 in desc(q) {
slice = fetch bit-slice i
matches = matches & slice
}
for each bit i set to 1 in matches {
fetch page i
scan page for matching records
}
Effective because desc(q) typically has less than half bits set to 1
❖ Exercise: Bit-sliced SIMC Query Cost |
Consider a SIMC-indexed database with the following properties
❖ Join |
DBMSs are engines to store, combine and filter information.
Join (⋈) is the primary means of combining information.
Join is important and potentially expensive
Most common join condition: equijoin, e.g. (R.pk = S.fk)
Join varieties (natural, inner, outer, semi, anti) all behave similarly.
We consider three strategies for implementing join
❖ Join Example |
Consider a university database with the schema:
create table Student( id integer primary key, name text, ... ); create table Enrolled( stude integer references Student(id), subj text references Subject(code), ... ); create table Subject( code text primary key, title text, ... );
❖ Join Example (cont) |
List names of students in all subjects, arranged by subject.
SQL query to provide this information:
select E.subj, S.name from Student S, Enrolled E where S.id = E.stude order by E.subj, S.name;
And its relational algebra equivalent:
Student
Enrolled
❖ Join Example (cont) |
Some database statistics:
Sym | Meaning | Value |
rS | # student records | 20,000 |
rE | # enrollment records | 80,000 |
cS | Student |
20 |
cE | Enrolled |
40 |
bS | # data pages in Student |
1,000 |
bE | # data pages in Enrolled |
2,000 |
Also, in cost analyses below, N = number of memory buffers.
❖ Join Example (cont) |
Out
Sym | Meaning | Value |
rOut | # tuples in result | 80,000 |
COut | result records/page | 80 |
bOut | # data pages in result | 1,000 |
Notes:
Enrolled
subj
name
❖ Nested Loop Join |
Basic strategy (R.a ⋈ S.b):
Result = {} for each page i in R { pageR = getPage(R,i) for each page j in S { pageS = getPage(S,j) for each pair of tuples tR,tS from pageR,pageS { if (tR.a == tS.b) Result = Result ∪ (tR:tS) } } }
Needs input buffers for R and S, output buffer for "joined" tuples
Terminology: R is outer relation, S is inner relation
Cost = bR . bS ... ouch!
❖ Block Nested Loop Join |
Method (for N memory buffers):
(tR,tS)
❖ Block Nested Loop Join (cont) |
Best-case scenario: bR ≤ N-2
Typical-case scenario: bR > N-2
Cost = bR + bS . ceil(bR/N-2)
Note: always requires rR.rS checks of the join condition
❖ Exercise: Nested Loop Join Cost |
Compute the cost (# pages fetched) of (S ⋈ E)
Sym | Meaning | Value |
rS | # student records | 20,000 |
rE | # enrollment records | 80,000 |
cS | Student |
20 |
cE | Enrolled |
40 |
bS | # data pages in Student |
1,000 |
bE | # data pages in Enrolled |
2,000 |
for N = 22, 202, 2002 and different inner/outer combinations
❖ Exercise: Nested Loop Join Cost (cont) |
If the query in the above example was:
select j.code, j.title, s.name from Student s join Enrolled e on (s.id=e.student) join Subject j on (e.subj=j.code)
how would this change the previous analysis?
What join combinations are there?
Assume 2000 subjects, with cJ = 10
How large would the intermediate tuples be? What assumptions?
Compute the cost (# pages fetched, # pages written) for N = 202
❖ BNJ in Practice |
Why block nested loop join is actually useful in practice ...
Many queries have the form
select * from R,S where r.i=s.j and r.x=k
This would typically be evaluated as
Tmp = Sel[r.x=k](R) Res = Tmp Join[i=j] S
If Tmp
❖ Index Nested Loop Join |
A problem with nested-loop join:
Consider Join[i=j](R,S):
for each tuple r in relation R { use index to select tuples from S where s.j = r.i for each selected tuple s from S { add (r,s) to result } }
❖ Index Nested Loop Join (cont) |
This method requires:
Typical SelS = 1-2 (hashing) .. bq (unclustered index)
Trade-off: rR.SelS vs bR.bS, where bR ≪ rR and SelS ≪ bS
❖ Exercise: Index Nested Loop Join Cost |
Consider executing Join[i=j](S,T) with the following parameters:
❖ Sort-Merge Join |
Basic approach:
(r,s)
❖ Sort-Merge Join (cont) |
Method requires several cursors to scan sorted relations:
r
s
ss
❖ Sort-Merge Join (cont) |
Algorithm using query iterators/scanners:
Query ri, si; Tuple r,s; ri = startScan("SortedR"); si = startScan("SortedS"); while ((r = nextTuple(ri)) != NULL && (s = nextTuple(si)) != NULL) { // align cursors to start of next common run while (r != NULL && r.i < s.j) r = nextTuple(ri); if (r == NULL) break; while (s != NULL && r.i > s.j) s = nextTuple(si); if (s == NULL) break; // must have (r.i == s.j) here ...
❖ Sort-Merge Join (cont) |
... // remember start of current run in S TupleID startRun = scanCurrent(si) // scan common run, generating result tuples while (r != NULL && r.i == s.j) { while (s != NULL and s.j == r.i) { addTuple(outbuf, combine(r,s)); if (isFull(outbuf)) { writePage(outf, outp++, outbuf); clearBuf(outbuf); } s = nextTuple(si); } r = nextTuple(ri); setScan(si, startRun); } }
❖ Sort-Merge Join (cont) |
Buffer requirements:
❖ Sort-Merge Join (cont) |
Cost of sort-merge join.
Step 1: sort each relation (if not already sorted):
❖ Sort-Merge Join on Example |
Case 1: Join[id=stude](Student,Enrolled)
Cost | = | sort(S) + sort(E) + bS + bE |
= | 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS + bE | |
= | 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000 | |
= | 6000 + 12000 + 1000 + 2000 | |
= | 21,000 |
❖ Sort-Merge Join on Example (cont) |
Case 2: Join[id=stude](Student,Enrolled)
Cost = 2,000 + 1,000 = 3,000 (regardless of which relation is outer)
❖ Exercise: Sort-merge Join Cost |
Consider executing Join[i=j](S,T) with the following parameters:
Produced: 25 Mar 2024