COMP9315 24T1 |
Exercises 08 Query Optimisation and Execution |
DBMS Implementation |
Consider the following tables relating to trips on a suburban bus network
Trip(fromPlace:integer, toPlace:integer, tripDate:date) Place(placeId:integer, address:string, suburb:string)
Write an SQL query that returns all of the addresses in Randwick that are the destination of a trip on March 4, 2005.
Answer:
select address from Trip, Place where tripDate='04-03-2005' and suburb='Randwick' and toPlace=placeId;
Give a naive translation of the SQL query into a relational algebra expression.
Answer:
Temp1 = Trip Join[toPlace=placeId] Place Temp2 = Sel[tripDate='04-03-2005' and suburb='Randwick'](Temp1) Result = Proj[address](Temp2)
Translate the naive relational algebra expression into an equivalent expression using pushing of selections and projections.
Answer:
TT = Proj[toPlace](Sel[tripDate='04-03-2005'](Trip)) PP = Proj[placeId,address](Sel[suburb='Randwick'](Place)) Result = Proj[address](TT Join[toPlace=placeId] PP)
Translate the optimized relational algebra expression into the most directly corresponding SQL query.
Answer:
select address from (select toPlace from Trip where [tripDate='04-03-2005') TT, (select placeId,address from Place where suburb='Randwick') PP where toPlace = placeId;
What are the possible join trees (without cross-products) for each of the following queries:
select * from R,S,T where R.a=S.b and S.c=T.d
Answer:
((R join S) join T) or (R join (S join T))
select * from R,S,T where R.a=S.b and T.c=R.d
Answer:
((R join S) join T) or ((R join T) join S)
select * from R,S,T where R.a=S.b and S.c=T.d and T.e=R.f
Answer:
any of the above
select * from R,S,T,U
where R.a=S.b and S.c=T.d and T.e=R.f and T.g=U.h and S.i=U.j
Answer:
((R join S) join (T join U)) or (((R join S) join T) join U) or (R join (S join (T join U)) or ...
(pretty much any order that does not involve a direct join of R and U)
Do not include trees/sub-trees that are reflections of other tree/subtrees.
Consider a table R(a,b,c) and assume that
Calculate the expected number of tuples in the result of each of the following queries:
where j, k, l, m, n are constants.
Answer:
select * from R where not a=k
Since the set of values for each attribute is uniformly distributed, and since there are 10 possible values for the a attribute, we would expect that 1/10 of the tuples would satisfy a=k. This means that 90% of the tuples would not satisfy the condition. Since there are 1000 tuples, we would expect 900 of them to be in the result of the query.
An alternative formulation of the above
select * from R where a=k and b=j
If attributes are independent, then we'd expect that the chance of attribute R.a having a specific value was 1/10 and the chance of attribute R.b then have a particular value was 1/100 of that (i.e. the likelihoods multiply).
Using the probability-based formulation:
select * from R where a in (k,l,m,n)
If we have alternatives for possible values for attribute R.a then we simply sum up the chances for any one of the values to occur.
Using the probability-based formulation:
Consider the following tables relating to retail sales:
create table Item ( iname text, category text, primary key (name) ); create table Store ( sname text, city text, street text, primary key (city,street) ); create table Transaction ( item text references Item(iname), store text references Store(sname), tdate date, primary key (item,store,tdate) );
Consider the following query (expressed as SQL and relational algebra):
select category,city from Item, Store, Transaction where iname=item and store=sname and tdate='20-12-2004' and city='Sydney'; JoinResult = Item Join[iname=item] Transaction Join[store=sname] Store SelectResult = Sel[tdate='20-12-2004' and city='Sydney'](JoinResult) FinalResult = Proj[category,city](SelectResult)
Show the three most promising
relational algebra expressions
that the query optimizer is likely to consider; then find the most
efficient query plan and estimate its cost.
Assume 50 buffer pages and the following statistics and indices:
Answer:
The three most promising relation algebra expressions:
Exp1: Temp1 = Sel[tdate='20-12-2004'](Transaction) Temp2 = Sel[city='Sydney'](Store) Temp3 = Temp1 Join[store=sname] Temp2 Temp4 = Item Join[iname=item] Temp3 Result = Proj[category,city](Temp4) Exp2: Temp1 = Sel[city='Sydney'](Store) Temp2 = Transaction Join[store=sname] Temp1 Temp3 = Sel[tdate='20-12-2004'](Temp2) Temp4 = Item Join[iname=item] Temp3 Result = Proj[category,city](Temp4) Exp2: Temp1 = Sel[tdate='20-12-2004'](Transaction) Temp2 = Temp1 Join[store=sname] Store Temp3 = Sel[city='Sydney'](Temp2) Temp4 = Item Join[iname=item] Temp3 Result = Proj[category,city](Temp4)
The most efficient relational algebra expression is a variation on Exp2:
Best: Temp1 = Sel[city='Sydney'](Store) Temp2 = Temp1 Join[sname=store] Transaction Temp3 = Sel[tdate='20-12-2004'](Temp2) Temp4 = Temp3 Join[item=iname] Item Result = Proj[category,city](Temp4)
Costs for the various operations (assuming pipelining, so no writes):
Cost of Temp1: 2 + 2 = 4
Assume 10 tuples/city (uniform distribution).
Since there are 5 tuples/page and the file is sorted on city,
these 10 Sydney tuples should appear in 2 pages (worst-case 3 pages).
Cost of traversing B+tree to find the first matching tuple is 2
(since tree has depth 2), plus cost of reading 2 data pages, gives
a total of 4 page reads.
Cost of Temp2+Temp3: 10 * 3 = 30
Use indexed nested loop join and combine it with the selection on
tdate as follows: the index on Transaction is on
the pair of attributes (store,tdate); the join condition
involves just store; however, the selection also gives
us the tdate value; so, for each store, we can
form an index key (store,tdate) and do an index lookup.
Since there are 10 tuples in the result of Temp1, we will
perform 10 index lookups on Transaction.
Each index lookup will yield 10 results (we are given that 10 items
are sold in each store on each day).
Since the Transaction data file is sorted (clustered)
by store (then tdate), and there are 25 tuples per page,
all 10 tuples will likely be on a single page (worst case 2 pages).
The B+tree index is depth 2, so there are 2 index page reads and 1
data page read for each index lookup (i.e. 3 page reads).
Since we perform 10 index lookups, this gives a total of 30 reads,
which produces 100 tuples.
Cost of Temp4: 100 * 1 = 100
The join between Temp3 and Item can also use
an index nested loop join, based on the fact that Item
is hashed on the iname value, and each tuple from
Temp3 gives us an item value to use as a hash key.
For each of the 100 tuples from Temp3, we do a hash
access to the page containing the matching tuple (there will be
only 1 matching tuple, since iname is a key).
This requires exactly 100 page reads.
Total cost is the sum of these: 4 + 30 + 100 = 134 page reads.
Note that we ignore the cost of the projection in the above calculation. We can do this because we generally ignore the cost of writing results (especially in the context of comparing relational algebra expressions for a given query, since all expressions produce the same result). Also, we don't need to worry about filtering duplicates in this case, since the query didn't ask for it (i.e. not select distinct).