❖ Week 09 Tuesday |
❖ Mapping SQL to RA |
Mapping SQL to relational algebra, e.g.
-- schema: R(a,b) S(c,d)
select a as x
from R join S on (b=c)
where d=100
-- or, equivalently --
select a as x
from R, S
where b=c and d=100
In general:
SELECT clause → projection, FROM clause → join/product,
WHERE clause → selection, AS → renaming
❖ Mapping SQL to RA (cont) |
Example mapping using above:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
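The four-step plan above can be sketched directly in Python, with each relation a list of dicts (the toy data below is an assumption for illustration, not from the slides):

```python
# Toy instances of R(a,b) and S(c,d) -- assumed values
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"c": 10, "d": 100}, {"c": 20, "d": 999}]

# Tmp1(a,b,c,d) = R Join[b=c] S   (theta-join as a filtered product)
tmp1 = [{**r, **s} for r in R for s in S if r["b"] == s["c"]]

# Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
tmp2 = [t for t in tmp1 if t["d"] == 100]

# Tmp3(a) = Proj[a](Tmp2)
tmp3 = [{"a": t["a"]} for t in tmp2]

# Res(x) = Rename[Res(x)](Tmp3)
res = [{"x": t["a"]} for t in tmp3]

print(res)  # -> [{'x': 1}]
```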
❖ Exercise: A Better SQL to RA Mapping |
On the previous slide, we translated this SQL query as follows:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
Suggest a better approach (where "better" = "lower cost")
Assumptions: |R| = 2000, |S| = 1000, |R⋈S| = 1000, |Res| = 100
❖ Another Mapping Example |
The query: Courses with more than 100 students in them?
Can be expressed in SQL as
select s.id, s.code
from   Course c, Subject s, Enrolment e
where  c.id = e.course and c.subject = s.id
group  by s.id, s.code
having count(*) > 100;
and might be compiled to
Result =
Project[id,code](
GroupSelect[size>100] (
GroupBy[id,code] (
Subject Join[s.id=c.subject]
(Course Join[c.id=e.course] Enrolment)
) ) )
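The GroupBy/GroupSelect/Project steps above can be sketched in Python. The joined rows and the lowered threshold (2 rather than 100) are assumptions made up to keep the example small:

```python
from collections import defaultdict

# Assumed output of the two joins: one (subject id, code) row per enrolment
joined = [(1, "COMP3311"), (1, "COMP3311"), (1, "COMP3311"),
          (2, "MATH1081")]

# GroupBy[id,code]: partition rows on the grouping attributes
groups = defaultdict(list)
for row in joined:
    groups[row].append(row)

# GroupSelect[size>2]: keep only groups above the size threshold
big = {k: v for k, v in groups.items() if len(v) > 2}

# Project[id,code]: one output tuple per surviving group
result = sorted(big.keys())
print(result)  # -> [(1, 'COMP3311')]
```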
❖ Exercise: Relational Operation Plan |
So far, we have been expressing RA evaluation as follows:
Tmp(a,b,c,...) = Op[x](R) or R Op[x] S
Each statement involves a single relational algebra operation.
Render the RA from the previous slide in this form.
Result =
Project[id,code](
GroupSelect[size>100] (
GroupBy[id,code] (
Subject Join[s.id=c.subject]
(Course Join[c.id=e.course] Enrolment)
) ) )
❖ Implementations of RA Ops |
Sorting: standard in-memory algorithms (quicksort, etc.) are not applicable to relations larger than memory; DBMSs use external merge sort instead.
❖ Query Cost Estimation |
The cost of evaluating a query is determined by the sizes of the relations involved, the algorithms chosen to implement each RA operation, and the amount of data transferred between memory and disk.
❖ Query Cost Estimation (cont) |
An execution plan is a sequence of concrete relational operations.
Consider execution plans for: σc (R ⋈d S ⋈e T)
tmp1 := hash_join[d](R,S)
tmp2 := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1 := btree_search[c](R)
tmp2 := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
Both produce same result, but have different costs.
❖ Exercise: Query Cost Estimation |
Estimate the cost of each plan, where |R|=100, |S|=1000, |T|=10000
tmp1 := hash_join[d](R,S)
tmp2 := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1 := sort_merge_join[e](S,T)
tmp2 := hash_join[d](R,tmp1)
result := linear_search[c](tmp2)
or
tmp1 := btree_search[c](R)
tmp2 := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
❖ Query Optimisation |
A DBMS query optimiser works as follows:
Input: relational algebra expression
Output: execution plan (sequence of RA ops)
bestCost = INF; bestPlan = none
while (more possible plans) {
plan = produce a new evaluation plan
cost = estimated_cost(plan)
if (cost < bestCost) {
bestCost = cost; bestPlan = plan
}
}
return bestPlan
Typically, there are very, very many possible plans
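The generate-and-test loop above can be made runnable. The plan space and cost model below are assumptions invented for illustration: a "plan" is just a join order over three relations, and the cost model charges the size of each intermediate result, assuming |A join B| = |A|*|B| / 1000:

```python
import itertools

# Assumed relation sizes (made-up values)
sizes = {"R": 100, "S": 1000, "T": 10000}

def estimated_cost(order):
    # toy cost model: sum of intermediate result sizes, joining left to right
    cost, inter = 0, sizes[order[0]]
    for rel in order[1:]:
        inter = inter * sizes[rel] // 1000   # size of next intermediate
        cost += inter
    return cost

best_cost, best_plan = float("inf"), None
for plan in itertools.permutations(sizes):   # "while (more possible plans)"
    cost = estimated_cost(plan)
    if cost < best_cost:
        best_cost, best_plan = cost, plan

print(best_plan, best_cost)  # joining the two smaller relations first wins
```

Even in this toy setting the plan space grows factorially with the number of relations, which is why real optimisers prune it heuristically.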
❖ DB Application Performance |
In order to make DB applications efficient, it is useful to know:
- the operation mix (which queries, updates, inserts and how frequently each one is performed)
- how the DBMS executes queries (data structures and algorithms for select, project, join, sort, ...)
- how query cost is measured (in terms of the amount of data transferred between memory and disk)
Efficiency is achieved by using indexes and by avoiding certain SQL query structures.
❖ DB Application Performance (cont) |
Application programmer choices that affect query cost: how queries are expressed, and which indexes are defined on the tables involved.
❖ DB Application Performance (cont) |
Whatever you do as a DB application programmer, the DBMS converts your query into an execution plan itself; you have no direct control over the optimisation process.
❖ DB Application Performance (cont) |
Example: query to find sales people earning more than $50K
select name from Employee
where salary > 50000 and
empid in (select empid from WorksIn
where dept = 'Sales')
A query evaluator would have to use the strategy
SalesEmps = (select empid from WorksIn where dept='Sales')
foreach e in Employee {
if (e.empid in SalesEmps && e.salary > 50000)
add e to result set
}
Needs to examine all employees, even if not in Sales
This is not a good expression of the query.
❖ DB Application Performance (cont) |
A different expression of the same query:
select name
from Employee join WorksIn using (empid)
where Employee.salary > 50000 and
WorksIn.dept = 'Sales'
Query evaluator might use the strategy
SalesEmps = (select * from WorksIn where dept='Sales')
foreach e in (Employee join SalesEmps) {
if (e.salary > 50000)
add e to result set
}
Only examines Sales employees, and uses a simpler test
This is a good expression of the query.
❖ DB Application Performance (cont) |
A very poor expression of the query (correlated subquery):
select name from Employee e
where salary > 50000 and
'Sales' in (select dept from WorksIn where empid=e.empid)
A query evaluator would be forced to use the strategy:
foreach e in Employee {
Depts = (select dept from WorksIn where empid=e.empid)
if ('Sales' in Depts && e.salary > 50000)
add e to result set
}
Needs to run a query for every employee ...
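The cost difference between the correlated and uncorrelated forms can be sketched by counting how often the inner query runs (the employee data below is an assumed toy set):

```python
# Assumed toy tables: 10 employees, even-numbered ones work in Sales
employees = [{"empid": i, "name": f"E{i}", "salary": 40000 + i * 2000}
             for i in range(10)]
works_in = [{"empid": i, "dept": "Sales" if i % 2 == 0 else "HR"}
            for i in range(10)]

inner_queries = 0

def depts_of(empid):
    # the correlated subquery: re-evaluated once per employee
    global inner_queries
    inner_queries += 1
    return {w["dept"] for w in works_in if w["empid"] == empid}

# correlated strategy: one inner query per employee
result = [e["name"] for e in employees
          if "Sales" in depts_of(e["empid"]) and e["salary"] > 50000]

# uncorrelated strategy: one inner query total, then a single scan
sales_emps = {w["empid"] for w in works_in if w["dept"] == "Sales"}
result2 = [e["name"] for e in employees
           if e["empid"] in sales_emps and e["salary"] > 50000]

print(inner_queries, result == result2)  # 10 inner queries vs 1
```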
❖ Exercise: Comparing Query Costs |
Measure the time costs of the following queries:
select count(*) from Movies where substring(title,1,1) = 'D';
select count(*) from Movies where title like 'D%';
select count(*) from Movies where title ilike 'd%';
select count(*) from Movies where title ~ '^D.*';
select count(*) from Movies where title ~* '^d.*';
select count(*) from Movies where title = 'District 9';
select count(*) from Movies where id = 111113;
select max(id) from Movies;
select max(title) from Movies;
select count(*)
from Movies m join Principals p on (m.id = p.movie)
join People n on (n.id = p.person)
where p.ord = 1;
❖ PostgreSQL Query Costs |
PostgreSQL provides the explain
EXPLAIN [ANALYZE] Query
Without ANALYZE, EXPLAIN shows the query plan and its estimated costs, but does not execute the query.
With ANALYZE, EXPLAIN also executes the query and reports actual times and row counts.
Note that runtimes may show considerable variation due to buffering.
If you simply want to know whether the runtime is acceptable, psql's \timing may be enough.
❖ Indexes |
Indexes provide efficient content-based access to tuples.
Can build indexes on any (combination of) attributes.
Defining indexes:
CREATE INDEX name ON table ( attr1, attr2, ... )
attri can be an arbitrary expression (e.g. upper(name))
CREATE INDEX can be qualified with UNIQUE (attribute values are distinct) and USING method (e.g. btree, hash)
❖ Indexes (cont) |
Indexes can significantly improve query costs.
Considerations in applying indexes:
-- use hashing for (unique) attributes in equality tests, e.g.
select * from Employee where id = 12345
-- use B-tree for attributes in range tests, e.g.
select * from Employee where age > 60
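The distinction can be sketched in Python, where a dict plays the role of a hash index (equality lookups only) and a sorted list with bisect approximates a B-tree's ordered access (range scans). All data values below are assumptions:

```python
import bisect

# Assumed toy table: (id, age) pairs
employees = [(12345, 25), (20000, 61), (30303, 70), (40404, 59)]

# "hash index" on id: constant-time equality lookup
by_id = {eid: (eid, age) for eid, age in employees}
hit = by_id[12345]

# "B-tree index" on age: keep entries sorted, then range-scan age > 60
by_age = sorted((age, eid) for eid, age in employees)
ages = [a for a, _ in by_age]
start = bisect.bisect_right(ages, 60)   # first entry with age > 60
over_60 = [eid for _, eid in by_age[start:]]

print(hit, over_60)
```

A hash structure cannot answer the range query without scanning everything, which is why the index method should match the query type.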
❖ Exercise: Indexes and Query Types |
Consider a table Students(id,name,age,...)
If we had the following mix of queries, which indexes would be worth creating?

Query                                            | Freq
-------------------------------------------------+------
select * from Students where id = zID            |  70%
select * from Students where name = Name         |  10%
select * from Students where name like '%str%'   |  15%
select * from Students where age > Age           |   5%
❖ Query Tuning |
Sometimes, a query can be re-phrased to affect performance:
select name from Employee where salary/365 > 100
-- fix by re-phrasing condition to (salary > 36500)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
-- above two are difficult to "fix"
select name from Employee
where dept in (select id from Dept where ...)
-- fix by using Employee join Dept on (e.dept=d.id)
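The first rewrite above can be sanity-checked in Python (toy salaries, assumed values; note Python's / is true division, unlike SQL's integer division on integer columns). The point of the rewrite is that salary > 36500 compares the stored column directly, so an index on salary can be used, whereas salary/365 > 100 applies an expression to every row:

```python
# Assumed toy salary values, including boundary cases
salaries = [36400, 36500, 36501, 50000, 100]

slow = [s for s in salaries if s / 365 > 100]   # expression on the column
fast = [s for s in salaries if s > 36500]       # index-usable comparison

print(slow == fast)  # both predicates select the same rows
```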
❖ Query Tuning (cont) |
Other tricks in query tuning (effectiveness is DBMS-dependent)
avoid select distinct unless duplicates must be eliminated (distinct generally forces a sort)
prefer joins on single numeric attributes to joins on strings, e.g.
select ... Employee e join Customer c on (e.name = c.name)
vs
select ... Employee e join Customer c on (e.ssn = c.ssn)
an or condition may prevent index use; a union of selects can be faster, e.g.
select name from Employee where Dept=1 or Dept=2
vs
(select name from Employee where Dept=1)
union
(select name from Employee where Dept=2)
❖ Exercise: More Query Costs |
Check whether the optimisations mentioned above actually work
-- movies longer than 3 hours
select title from Movies where runtime/60 > 3
-- or --
select title from Movies where runtime > 180

-- people who were a principal in some movie
select name from People where id in (select person from Principals)
-- or --
select p.name from People p
where exists (select * from Principals where person=p.id)
-- or --
select n.name from People n join Principals p on (n.id=p.person)
❖ Exercise: More Query Costs (cont) |
Check whether the optimisations mentioned above actually work
-- movies made in Albania or Iceland
select title from Movies where origin = 'AL' or origin = 'IS'
-- or --
(select title from Movies where origin = 'AL')
union
(select title from Movies where origin = 'IS')
-- or --
select title from Movies where origin in ('AL','IS')
See how the above queries are affected by adding an index on Movies.origin
❖ EXPLAIN Examples |
Note that PostgreSQL builds a query evaluation tree, rather than a linear plan; the indented EXPLAIN output on the following slides reflects this tree structure.
❖ EXPLAIN Examples (cont) |
Example: Select on indexed attribute
db=# explain select * from Students where id=100250;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using student_pkey on student
(cost=0.00..5.94 rows=1 width=17)
Index Cond: (id = 100250)
db=# explain analyze select * from Students where id=100250;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using student_pkey on student
(cost=0.00..5.94 rows=1 width=17)
(actual time=3.209..3.212 rows=1 loops=1)
Index Cond: (id = 100250)
Total runtime: 3.252 ms
❖ EXPLAIN Examples (cont) |
Example: Select on non-indexed attribute
db=# explain select * from Students where stype='local';
QUERY PLAN
----------------------------------------------------------
Seq Scan on student (cost=0.00..70.33 rows=18 width=17)
Filter: ((stype)::text = 'local'::text)
db=# explain analyze select * from Students
db-# where stype='local';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on student (cost=0.00..70.33 rows=18 width=17)
(actual time=0.061..7.784 rows=2512 loops=1)
Filter: ((stype)::text = 'local'::text)
Total runtime: 7.554 ms
❖ EXPLAIN Examples (cont) |
Example: Join on a primary key (indexed) attribute
db=# explain
db-# select s.sid,p.name
db-# from Students s join People p on s.id=p.id;
QUERY PLAN
-----------------------------------------------------------
Hash Join (cost=70.33..305.86 rows=3626 width=52)
Hash Cond: ("outer".id = "inner".id)
-> Seq Scan on person p
(cost=0.00..153.01 rows=3701 width=52)
-> Hash (cost=61.26..61.26 rows=3626 width=8)
-> Seq Scan on student s
(cost=0.00..61.26 rows=3626 width=8)
❖ EXPLAIN Examples (cont) |
Example: Join on a non-indexed attribute
db=# explain
db-# select s1.code, s2.code
db-# from Subjects s1 join Subjects s2 on s1.offerer=s2.offerer;
QUERY PLAN
----------------------------------------------------------------
Merge Join (cost=2744.12..18397.14 rows=1100342 width=18)
Merge Cond: (s1.offerer = s2.offerer)
-> Sort (cost=1372.06..1398.33 rows=10509 width=13)
Sort Key: s1.offerer
-> Seq Scan on subjects s1
(cost=0.00..670.09 rows=10509 width=13)
-> Sort (cost=1372.06..1398.33 rows=10509 width=13)
Sort Key: s2.offerer
-> Seq Scan on subjects s2
(cost=0.00..670.09 rows=10509 width=13)
❖ Transactions, Concurrency |
DBMSs provide valuable information resources in an environment that is shared by many concurrent users and subject to failures.
❖ Transactions, Concurrency (cont) |
Transaction processing: techniques for managing concurrent access to a database and for recovering from failures.
❖ Transactions |
A transaction is a single application-level operation, performed by a (possibly multi-statement) sequence of database operations.
To maintain integrity of the data, transactions must be:
Atomic, Consistent, Isolated, Durable (the ACID properties)
❖ Example Transaction |
Bank funds transfer
Accounts(id,name,balance,heldAt, ...)
Branches(id,name,address,assets, ...)
where Branches.assets should equal the sum of balances of accounts held at that branch
❖ Example Transaction (cont) |
Example function to implement bank transfer ...
create or replace function
transfer(N integer, Src text, Dest text)
returns integer
declare
sID integer; dID integer; avail integer;
begin
select id,balance into sID,avail
from Accounts where name=Src;
if (sID is null) then
raise exception 'Invalid source account %',Src;
end if;
select id into dID
from Accounts where name=Dest;
if (dID is null) then
raise exception 'Invalid dest account %',Dest;
end if;
...
❖ Example Transaction (cont) |
Example function to implement bank transfer
...
if (avail < N) then
raise exception 'Insufficient funds in %',Src;
end if;
-- total funds in system = NNNN
update Accounts set balance = balance-N
where id = sID;
-- funds temporarily "lost" from system
update Accounts set balance = balance+N
where id = dID;
-- funds restored to system; total funds = NNNN
return nextval('tx_id_seq');
end;
❖ Transaction Concepts |
A transaction must always terminate, either:
successfully (COMMIT), with all changes preserved, or unsuccessfully (ABORT), with the database left unchanged
❖ Transaction Concepts (cont) |
To describe transaction effects, we consider:
READ (transfer data from disk to memory), WRITE (transfer data from memory to disk), ABORT (terminate unsuccessfully), COMMIT (terminate successfully)
abbreviated as R(X), W(X), A, C
In SQL terms: SELECT performs a READ; INSERT performs a WRITE; UPDATE and DELETE perform a READ then a WRITE.
❖ Transaction Consistency |
Transactions typically have intermediate states that are invalid.
However, states before and after transaction must be valid.
Valid = consistent = satisfying all specified constraints on the data
❖ Transaction Consistency (cont) |
Transaction descriptions can be abstracted to sequences of reads and writes on shared data items.
❖ Serial Schedules |
Serial execution: T1 runs to completion before T2 starts, or vice versa, e.g.
T1: R(X) W(X) R(Y) W(Y)
T2:                     R(X) W(X)
or
T1:           R(X) W(X) R(Y) W(Y)
T2: R(X) W(X)
Serial execution guarantees a consistent final state if T1 and T2 are individually consistency-preserving.
❖ Concurrent Schedules |
Concurrent schedules interleave the operations of T1 and T2.
Some concurrent schedules are ok, e.g.
T1: R(X) W(X)           R(Y) W(Y)
T2:           R(X) W(X)
Other concurrent schedules cause anomalies, e.g.
T1: R(X)           W(X) R(Y) W(Y)
T2:      R(X) W(X)
Want the system to ensure that only valid schedules occur.
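The classic anomaly, a lost update, can be simulated in Python. The schedules and data values below are assumptions for illustration: each transaction reads X into a private copy, adds 10, and writes it back:

```python
def run(schedule):
    db = {"X": 100}                 # assumed initial value
    local = {}                      # each transaction's private copy of X
    for txn, op in schedule:
        if op == "R":
            local[txn] = db["X"]            # R(X)
        else:
            db["X"] = local[txn] + 10       # W(X): add 10 to the read value
    return db["X"]

serial      = [("T1","R"), ("T1","W"), ("T2","R"), ("T2","W")]
interleaved = [("T1","R"), ("T2","R"), ("T2","W"), ("T1","W")]

print(run(serial), run(interleaved))  # -> 120 110, one update is lost
```

In the interleaved schedule both transactions read X=100, so the final write overwrites the other transaction's update.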
❖ Serializability |
Serializable schedule: a concurrent schedule that produces the same final state as some serial schedule of the same transactions.
Abstracting this needs a notion of schedule equivalence.
Two common formulations of serializability: conflict serializability and view serializability.
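Conflict serializability can be tested mechanically: build a precedence graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj (same item, at least one write, different transactions), and check the graph for cycles. A sketch with assumed example schedules:

```python
def conflict_serializable(schedule):
    # schedule: list of (transaction, operation, item) triples
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i+1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))         # precedence edge t1 -> t2
    # acyclic iff we can repeatedly remove nodes with no incoming edges
    nodes = {t for e in edges for t in e}
    while nodes:
        sources = {n for n in nodes if all(b != n for _, b in edges)}
        if not sources:
            return False                    # cycle: not conflict-serializable
        nodes -= sources
        edges = {(a, b) for (a, b) in edges if a in nodes and b in nodes}
    return True

good = [("T1","R","X"), ("T1","W","X"), ("T2","R","X"), ("T2","W","X")]
bad  = [("T1","R","X"), ("T2","R","X"), ("T2","W","X"), ("T1","W","X")]
print(conflict_serializable(good), conflict_serializable(bad))  # -> True False
```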
Produced: 12 Apr 2023