❖ Week 09 Tuesday |
❖ Mapping SQL to RA |
Mapping SQL to relational algebra, e.g.
-- schema: R(a,b), S(c,d)
select a as x                       select a as x
from R join S on (b=c)      OR      from R, S
where d=100                         where b=c and d=100
In general:
   SELECT clause  →  projection (Proj);  AS  →  renaming (Rename)
   FROM clause    →  cross-product / join of the relations
   WHERE clause   →  selection (Sel)
❖ Mapping SQL to RA (cont) |
Example mapping using above:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
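The four-step mapping above can be sketched in plain Python (lists of dicts standing in for relations; the sample data is invented for illustration):

```python
# R(a,b) and S(c,d) as lists of dicts; all values are made up.
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
S = [{"c": 10, "d": 100}, {"c": 20, "d": 999}, {"c": 30, "d": 100}]

# Tmp1(a,b,c,d) = R Join[b=c] S
tmp1 = [{**r, **s} for r in R for s in S if r["b"] == s["c"]]
# Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
tmp2 = [t for t in tmp1 if t["d"] == 100]
# Tmp3(a) = Proj[a](Tmp2)
tmp3 = [{"a": t["a"]} for t in tmp2]
# Res(x) = Rename[Res(x)](Tmp3)
res = [{"x": t["a"]} for t in tmp3]

print(res)  # [{'x': 1}, {'x': 3}]
```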
❖ Exercise: A Better SQL to RA Mapping |
On the previous slide, we translated this SQL query as follows:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
Suggest a better approach (where "better" = "lower cost")
Assumptions: |R| = 2000, |S| = 1000, |R⋈S| = 1000, |Res| = 100
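One possible improvement is to push the selection on d into S before joining. A rough "tuples processed" comparison under the stated sizes; note that the selectivity of d=100 on S (assumed here to leave 100 tuples) is not given above:

```python
# Sizes from the slide: |R|=2000, |S|=1000, |R join S|=1000, |Res|=100
R_sz, S_sz, RjS_sz, Res_sz = 2000, 1000, 1000, 100
sel_S = 100  # ASSUMED size of Sel[d=100](S); not stated in the slide

# Original plan: join, then select, then project/rename
orig = (R_sz + S_sz)      # read both relations for the join
orig += RjS_sz            # scan join result for the selection
orig += Res_sz            # scan selected tuples for projection/rename

# Pushed-selection plan: Tmp1 = Sel[d=100](S); Tmp2 = R Join[b=c] Tmp1
better = S_sz             # scan S for the selection
better += (R_sz + sel_S)  # join R with the much smaller Tmp1
better += Res_sz          # projection/rename

print(orig, better)       # 4100 3200
```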
❖ Another Mapping Example |
The query: Courses with more than 100 students in them?
Can be expressed in SQL as
select s.id, s.code
from   Course c, Subject s, Enrolment e
where  c.id = e.course and c.subject = s.id
group  by s.id, s.code
having count(*) > 100;
and might be compiled to
Result = Project[id,code](
            GroupSelect[size>100](
               GroupBy[id,code](
                  Subject Join[s.id=c.subject]
                     (Course Join[c.id=e.course] Enrolment))))
❖ Exercise: Relational Operation Plan |
So far, we have been expressing RA evaluation as follows:
Tmp(a,b,c,...) = Op[x](R) or R Op[x] S
Each statement involves a single relational algebra operation.
Render the RA from the previous slide in this form.
Result = Project[id,code](
            GroupSelect[size>100](
               GroupBy[id,code](
                  Subject Join[s.id=c.subject]
                     (Course Join[c.id=e.course] Enrolment))))
❖ Implementations of RA Ops |
Sorting: in-memory algorithms (quicksort, etc.) are not applicable to relations larger than memory; DBMSs use external merge sort instead.
❖ Query Cost Estimation |
The cost of evaluating a query is determined by the sizes of the relations involved, the available access methods (indexes, file organisation), and the specific implementation chosen for each RA operation.
❖ Query Cost Estimation (cont) |
An execution plan is a sequence of concrete relational operations.
Consider execution plans for: σc (R ⋈d S ⋈e T)
tmp1   := hash_join[d](R,S)
tmp2   := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1   := btree_search[c](R)
tmp2   := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
Both produce the same result, but have different costs.
❖ Exercise: Query Cost Estimation |
Estimate the cost of each plan, where |R|=100, |S|=1000, |T|=10000
tmp1   := hash_join[d](R,S)
tmp2   := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1   := sort_merge_join[e](S,T)
tmp2   := hash_join[d](R,tmp1)
result := linear_search[c](tmp2)
or
tmp1   := btree_search[c](R)
tmp2   := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
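A rough way to attack this exercise is with standard textbook page-I/O formulas; the mapping from tuple counts to pages (here 100 tuples/page) and the formulas themselves are assumptions for illustration, not the course's official cost model:

```python
import math

def pages(ntuples, per_page=100):
    # ASSUMPTION: 100 tuples fit on one page
    return math.ceil(ntuples / per_page)

def hash_join_cost(bR, bS):
    # Grace hash join: partition both inputs, then probe => 3(bR+bS)
    return 3 * (bR + bS)

def sort_merge_cost(bR, bS):
    # external sort each input (2b passes per merge level), then merge
    sort = lambda b: 2 * b * (1 + math.ceil(math.log2(max(b, 2))))
    return sort(bR) + sort(bS) + bR + bS

R, S, T = pages(100), pages(1000), pages(10000)
print("hash_join(R,S)       =", hash_join_cost(R, S))
print("sort_merge_join(S,T) =", sort_merge_cost(S, T))
```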
❖ Query Optimisation |
A DBMS query optimiser works as follows:
Input:  relational algebra expression
Output: execution plan (sequence of RA ops)

bestCost = INF; bestPlan = none
while (more possible plans) {
   plan = produce a new evaluation plan
   cost = estimated_cost(plan)
   if (cost < bestCost)
      { bestCost = cost; bestPlan = plan }
}
return bestPlan
Typically, there are very, very many possible plans
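The loop above can be sketched concretely by enumerating left-deep join orders and scoring each with a cost function. The relation sizes and the toy cost model below are invented for illustration:

```python
from itertools import permutations

sizes = {"R": 100, "S": 1000, "T": 10000}   # invented sizes

def plan_cost(order):
    # Toy model: each join costs the sum of its input sizes;
    # ASSUME the join result is as big as the larger input.
    cost, cur = 0, sizes[order[0]]
    for rel in order[1:]:
        cost += cur + sizes[rel]
        cur = max(cur, sizes[rel])
    return cost

best_cost, best_plan = float("inf"), None
for order in permutations(sizes):           # "more possible plans"
    c = plan_cost(order)
    if c < best_cost:
        best_cost, best_plan = c, order

print(best_plan, best_cost)
```

Even here the plan space is n! join orders; real optimisers prune it heavily (e.g. dynamic programming over subsets) rather than enumerating exhaustively.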
❖ DB Application Performance |
In order to make DB applications efficient, it is useful to know:
   the pattern of database access
      (which queries, updates, inserts and how frequently each one is performed)
   how the DBMS implements relational operations
      (data structures and algorithms for select, project, join, sort, ...)
   how query cost is measured
      (in terms of the amount of data transferred between memory and disk)
Efficiency is achieved by using indexes and avoiding certain SQL query structures.
❖ DB Application Performance (cont) |
Application programmer choices that affect query cost:
❖ DB Application Performance (cont) |
Whatever you do as a DB application programmer
You have no control over the optimisation process
❖ DB Application Performance (cont) |
Example: query to find sales people earning more than $50K
select name
from   Employee
where  salary > 50000
       and empid in (select empid from WorksIn where dept = 'Sales')
A query evaluator would have to use the strategy
SalesEmps = (select empid from WorksIn where dept='Sales')
foreach e in Employee {
    if (e.empid in SalesEmps && e.salary > 50000)
        add e to result set
}
Needs to examine all employees, even if not in Sales
This is not a good expression of the query.
❖ DB Application Performance (cont) |
A different expression of the same query:
select name
from   Employee join WorksIn using (empid)
where  Employee.salary > 50000 and WorksIn.dept = 'Sales'
Query evaluator might use the strategy
SalesEmps = (select * from WorksIn where dept='Sales')
foreach e in (Employee join SalesEmps) {
    if (e.salary > 50000)
        add e to result set
}
Only examines Sales employees, and uses a simpler test
This is a good expression of the query.
❖ DB Application Performance (cont) |
A very poor expression of the query (correlated subquery):
select name
from   Employee e
where  salary > 50000
       and 'Sales' in (select dept from WorksIn where empid = e.empid)
A query evaluator would be forced to use the strategy:
foreach e in Employee {
    Depts = (select dept from WorksIn where empid = e.empid)
    if ('Sales' in Depts && e.salary > 50000)
        add e to result set
}
Needs to run a query for every employee ...
❖ Exercise: Comparing Query Costs |
Measure the time costs of the following queries:
select count(*) from Movies where substring(title,1,1) = 'D';
select count(*) from Movies where title like 'D%';
select count(*) from Movies where title ilike 'd%';
select count(*) from Movies where title ~ '^D.*';
select count(*) from Movies where title ~* '^d.*';
select count(*) from Movies where title = 'District 9';
select count(*) from Movies where id = 111113;
select max(id) from Movies;
select max(title) from Movies;
select count(*)
from   Movies m
       join Principals p on (m.id = p.movie)
       join People n on (n.id = p.person)
where  p.ord = 1;
❖ PostgreSQL Query Costs |
PostgreSQL provides the EXPLAIN statement for viewing query execution plans:
   EXPLAIN [ANALYZE] Query
Without ANALYZE, EXPLAIN shows the plan with estimated costs.
With ANALYZE, EXPLAIN also executes the query and reports actual times.
Note that runtimes may show considerable variation due to buffering.
If you simply want to know the runtime, psql's \timing may be sufficient.
❖ Indexes |
Indexes provide efficient content-based access to tuples.
Can build indexes on any (combination of) attributes.
Defining indexes:
   CREATE INDEX name ON table ( attr1, attr2, ... )
attri can be an arbitrary expression (e.g. upper(name))
CREATE UNIQUE INDEX additionally enforces that the indexed attributes contain no duplicates.
A USING clause selects the index type (e.g. USING hash); the default is a B-tree.
❖ Indexes (cont) |
Indexes can significantly improve query costs.
Considerations in applying indexes:
-- use hashing for (unique) attributes in equality tests, e.g.
select * from Employee where id = 12345
-- use B-tree for attributes in range tests, e.g.
select * from Employee where age > 60
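The effect of an index is easy to see in a plan. A small demonstration using SQLite (an assumption: the course uses PostgreSQL, but the principle is identical): an equality test on an indexed column switches from a full table scan to an index search.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Employee (id integer, age integer)")
db.executemany("insert into Employee values (?,?)",
               [(i, 20 + i % 50) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return " ".join(row[3] for row in db.execute("explain query plan " + sql))

before = plan("select * from Employee where id = 123")   # full table scan
db.execute("create index emp_id_idx on Employee (id)")
after = plan("select * from Employee where id = 123")    # index search

print(before)
print(after)
```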
❖ Exercise: Indexes and Query Types |
Consider a table Students(id,name,age,...)
If we had the following mix of queries
   Query                                           | Freq
   ------------------------------------------------+------
   select * from Students where id = zID           |  70%
   select * from Students where name = Name        |  10%
   select * from Students where name like '%str%'  |  15%
   select * from Students where age > Age          |   5%

which indexes (if any) would you create?
❖ Query Tuning |
Sometimes, a query can be re-phrased to affect performance:
select name from Employee where salary/365 > 100
-- fix by re-phrasing condition to (salary > 36500)

select name from Employee where name like '%ith%'
select name from Employee where birthday is null
-- above two are difficult to "fix"

select name from Employee
where  dept in (select id from Dept where ...)
-- fix by using Employee join Dept on (e.dept = d.id)
❖ Query Tuning (cont) |
Other tricks in query tuning (effectiveness is DBMS-dependent):
select distinct may force a sort; avoid distinct unless it is really needed.
Joining on string attributes is more expensive than joining on numbers, e.g.
   select ... Employee join Customer on (s.name = p.name)
vs
   select ... Employee join Customer on (s.ssn = p.ssn)
Conditions using or may prevent index use; re-phrasing as a union may help, e.g.
   select name from Employee where Dept=1 or Dept=2
vs
   (select name from Employee where Dept=1)
   union
   (select name from Employee where Dept=2)
❖ Exercise: More Query Costs |
Check whether the optimisations mentioned above actually work
-- movies longer than 3 hours
select title from Movies where runtime/60 > 3
-- or --
select title from Movies where runtime > 180

-- people who were a principal in some movie
select name from People
where  id in (select person from Principals)
-- or --
select p.name from People p
where  exists (select * from Principals where person = p.id)
-- or --
select n.name from People n join Principals p on (n.id = p.person)
❖ Exercise: More Query Costs (cont) |
Check whether the optimisations mentioned above actually work
-- movies made in Albania or Iceland
select title from Movies where origin = 'AL' or origin = 'IS'
-- or --
(select title from Movies where origin = 'AL')
union
(select title from Movies where origin = 'IS')
-- or --
select title from Movies where origin in ('AL','IS')
See how the above queries are affected by adding an index on Movies.origin
❖ EXPLAIN Examples |
Note that PostgreSQL builds a query evaluation tree, rather than a linear plan; the EXPLAIN examples on the following slides show such trees.
❖ EXPLAIN Examples (cont) |
Example: Select on indexed attribute
db=# explain select * from Students where id=100250;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using student_pkey on student
       (cost=0.00..5.94 rows=1 width=17)
   Index Cond: (id = 100250)

db=# explain analyze select * from Students where id=100250;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using student_pkey on student
       (cost=0.00..5.94 rows=1 width=17)
       (actual time=3.209..3.212 rows=1 loops=1)
   Index Cond: (id = 100250)
 Total runtime: 3.252 ms
❖ EXPLAIN Examples (cont) |
Example: Select on non-indexed attribute
db=# explain select * from Students where stype='local';
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
   Filter: ((stype)::text = 'local'::text)

db=# explain analyze select * from Students
db-# where stype='local';
                        QUERY PLAN
---------------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
       (actual time=0.061..7.784 rows=2512 loops=1)
   Filter: ((stype)::text = 'local'::text)
 Total runtime: 7.554 ms
❖ EXPLAIN Examples (cont) |
Example: Join on a primary key (indexed) attribute
db=# explain
db-# select s.sid,p.name
db-# from Students s join People p on s.id=p.id;
                        QUERY PLAN
-----------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
   Hash Cond: ("outer".id = "inner".id)
   ->  Seq Scan on person p
         (cost=0.00..153.01 rows=3701 width=52)
   ->  Hash  (cost=61.26..61.26 rows=3626 width=8)
         ->  Seq Scan on student s
               (cost=0.00..61.26 rows=3626 width=8)
❖ EXPLAIN Examples (cont) |
Example: Join on a non-indexed attribute
db=# explain
db-# select s1.code, s2.code
db-# from Subjects s1 join Subjects s2 on s1.offerer=s2.offerer;
                            QUERY PLAN
----------------------------------------------------------------
 Merge Join  (cost=2744.12..18397.14 rows=1100342 width=18)
   Merge Cond: (s1.offerer = s2.offerer)
   ->  Sort  (cost=1372.06..1398.33 rows=10509 width=13)
         Sort Key: s1.offerer
         ->  Seq Scan on subjects s1
               (cost=0.00..670.09 rows=10509 width=13)
   ->  Sort  (cost=1372.06..1398.33 rows=10509 width=13)
         Sort Key: s2.offerer
         ->  Seq Scan on subjects s2
               (cost=0.00..670.09 rows=10509 width=13)
❖ Transactions, Concurrency |
DBMSs provide valuable information resources in an environment that is shared (accessed by many concurrent users) and unstable (subject to hardware and software failure).
❖ Transactions, Concurrency (cont) |
Transaction processing: techniques for managing concurrent access to the data and for ensuring that the database survives failures.
❖ Transactions |
A transaction is an atomic unit of database work: a sequence of database operations treated as a single logical action.
To maintain integrity of data, transactions must be ACID:
Atomic, Consistent, Isolated, Durable.
❖ Example Transaction |
Bank funds transfer, over a schema such as:
   Accounts(id,name,balance,heldAt, ...)
   Branches(id,name,address,assets, ...)
where Branches.assets should equal the total balance of all accounts held at that branch.
❖ Example Transaction (cont) |
Example function to implement bank transfer ...
create or replace function
    transfer(N integer, Src text, Dest text) returns integer
declare
    sID   integer;
    dID   integer;
    avail integer;
begin
    select id,balance into sID,avail
    from   Accounts where name=Src;
    if (sID is null) then
        raise exception 'Invalid source account %',Src;
    end if;
    select id into dID
    from   Accounts where name=Dest;
    if (dID is null) then
        raise exception 'Invalid dest account %',Dest;
    end if;
    ...
❖ Example Transaction (cont) |
Example function to implement bank transfer
    ...
    if (avail < N) then
        raise exception 'Insufficient funds in %',Src;
    end if;
    -- total funds in system = NNNN
    update Accounts set balance = balance-N where id = sID;
    -- funds temporarily "lost" from system
    update Accounts set balance = balance+N where id = dID;
    -- funds restored to system; total funds = NNNN
    return nextval('tx_id_seq');
end;
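The transfer's all-or-nothing behaviour can be sketched with SQLite from Python (an assumption: the function above is PL/pgSQL, but the commit/rollback pattern is the same; account data is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Accounts (id integer, name text, balance integer)")
db.executemany("insert into Accounts values (?,?,?)",
               [(1, "John", 500), (2, "Jane", 300)])

def transfer(n, src, dest):
    try:
        (avail,) = db.execute(
            "select balance from Accounts where name=?", (src,)).fetchone()
        if avail < n:
            raise ValueError(f"Insufficient funds in {src}")
        db.execute("update Accounts set balance=balance-? where name=?", (n, src))
        # at this point total funds are temporarily inconsistent
        db.execute("update Accounts set balance=balance+? where name=?", (n, dest))
        db.commit()                 # both updates become permanent together
    except Exception:
        db.rollback()               # any partial updates are undone
        raise

transfer(100, "John", "Jane")
print(list(db.execute("select name,balance from Accounts")))
```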
❖ Transaction Concepts |
A transaction must always terminate, either:
   successfully (COMMIT), with all of its changes preserved, or
   unsuccessfully (ABORT), with all of its changes undone.
❖ Transaction Concepts (cont) |
To describe transaction effects, we consider the operations:
   READ   - transfer data from disk to memory
   WRITE  - transfer data from memory to disk
   ABORT  - terminate the transaction, all changes undone
   COMMIT - terminate the transaction, all changes preserved
Notation: R(X) reads item X, W(X) writes item X, A aborts, C commits.
In SQL terms:
   SELECT produces READ operations
   INSERT produces WRITE operations
   UPDATE and DELETE produce both READ and WRITE operations
❖ Transaction Consistency |
Transactions typically have intermediate states that are invalid.
However, states before and after transaction must be valid.
Valid = consistent = satisfying all specified constraints on the data
❖ Transaction Consistency (cont) |
Transaction descriptions can be abstracted to sequences of read and write operations on shared data items.
❖ Serial Schedules |
Serial execution: T1 then T2, or T2 then T1, e.g.
T1: R(X) W(X) R(Y) W(Y)
T2:                     R(X) W(X)
or
T1:           R(X) W(X) R(Y) W(Y)
T2: R(X) W(X)
Serial execution guarantees a consistent final state,
provided that T1 and T2 are individually consistent.
❖ Concurrent Schedules |
Concurrent schedules interleave the operations of T1 and T2.
Some concurrent schedules are ok, e.g.
T1: R(X) W(X)           R(Y) W(Y)
T2:           R(X) W(X)
Other concurrent schedules cause anomalies, e.g.
T1: R(X)           W(X) R(Y) W(Y)
T2:      R(X) W(X)
Want the system to ensure that only valid schedules occur.
❖ Serializability |
Serializable schedule: a concurrent schedule whose effect on the database
is the same as that of some serial schedule.
Abstracting this needs a notion of schedule equivalence.
Two common formulations of serializability:
   conflict serializability  (based on conflicting pairs of operations)
   view serializability      (based on reads-from relationships)
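Conflict serializability has a simple test: build a precedence graph and check for cycles. A minimal sketch (the schedule encoding and the two example schedules are illustrative):

```python
def conflict_serializable(schedule):
    """schedule = list of (tx, op, item) triples, e.g. ("T1", "R", "X")."""
    # Edge Ti -> Tj whenever an op of Ti conflicts with a LATER op of Tj
    # on the same item (conflict = at least one of the two is a write).
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for (tj, oj, xj) in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (oi, oj):
                edges.add((ti, tj))
    # Serializable iff the graph is acyclic: repeatedly remove nodes
    # with no incoming edges; any leftovers form a cycle.
    nodes = {t for (t, _, _) in schedule}
    removed = True
    while removed:
        removed = False
        for n in list(nodes):
            if not any(b == n and a in nodes for (a, b) in edges):
                nodes.discard(n)
                removed = True
    return not nodes

# T2 runs entirely between T1's work on X and on Y: serializable
good = [("T1","R","X"), ("T1","W","X"), ("T2","R","X"),
        ("T2","W","X"), ("T1","R","Y"), ("T1","W","Y")]
# Lost update: T2 reads and writes X between T1's R(X) and W(X)
bad  = [("T1","R","X"), ("T2","R","X"), ("T2","W","X"), ("T1","W","X")]

print(conflict_serializable(good))  # True
print(conflict_serializable(bad))   # False
```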
Produced: 12 Apr 2023