

Database Trends (overview)


Future of Database2/95

Core "database" goals:

At the moment (and for the last 30 years) RDBMSs dominate ... RDBMSs work well in domains with uniform, structured data.


... Future of Database3/95

Limitations/pitfalls of classical RDBMSs:


... Future of Database4/95

How to overcome (some) RDBMS limitations?

Extend the relational model ...

Replace the relational model ...


... Future of Database5/95

How to overcome (some) RDBMS limitations?

Performance ...

Scalability ...


... Future of Database6/95

An overview of the possibilities:


... Future of Database7/95

Historical perspective

[Diagram:Pics/future/dbms-history-small.png]


Large Data8/95

Some modern applications have massive data sets (e.g. Google)

Approach to dealing with such data ...

Often this data does not need full relational selection.


... Large Data9/95

Popular computational approach to such data: map/reduce

Some large data proponents see no future need for SQL/relational ...

Humour: Parody of noSQL fans   (strong language warning)


Information Retrieval10/95

DBMSs generally do precise matching (although LIKE and regular expressions give limited approximate matching)

Information retrieval systems do approximate matching.

E.g. documents containing a set of keywords (Google, etc.)

Also introduces notion of "quality" of matching
(e.g. tuple T1 is a better match than tuple T2)

Quality also implies ranking of results.
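
To make "quality of matching" concrete, here is a minimal sketch (not from the lecture) that scores documents by how often the query keywords appear and ranks results by that score; the scoring scheme is an assumption for illustration only.

# Minimal sketch of approximate matching with a crude "quality" score:
# score = number of keyword occurrences (an assumption for illustration).
def score(document, keywords):
    words = document.lower().split()
    return sum(words.count(k.lower()) for k in keywords)

def search(documents, keywords):
    # keep only documents that match at all, best match first
    scored = [(score(d, keywords), d) for d in documents]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]

docs = ["red getaway car seen near the bank",
        "the getaway car was red or maybe orange",
        "witness saw nothing unusual"]
print(search(docs, ["red", "car"]))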


Ongoing research in incorporating IR ideas into DBMS context.

Goal: support database exploration better.


Multimedia Data11/95

Data which does not fit the "tabular model":

Research problems: ...

Solutions to the first problem typically: ...

Sample query: find other songs like this one?


Uncertainty12/95

Multimedia/IR introduces approximate matching.

In some contexts, we have approximate/uncertain data.

E.g. witness statements in a crime-fighting database

"I think the getaway car was red ... or maybe orange ..."

"I am 75% sure that John carried out the crime"

Work by Jennifer Widom at Stanford on the Trio system


Stream Data Management Systems13/95

Makes one addition to the relational model: the stream (an unbounded, continuously arriving sequence of tuples)

Applications: news feeds, telecomms, monitoring web usage, ...

RDBMSs: run a variety of queries on (relatively) fixed data

StreamDBs: run fixed queries on changing data (stream)

One approach: window = "relation" formed from a stream via a rule

E.g. StreamSQL


select avg(price)
from examplestream [size 10 advance 1 tuples]
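
As a rough illustration of what the window clause means, here is a Python sketch of a sliding window (the 10 most recent tuples, advancing by 1 tuple) that recomputes the average each time the window moves:

# Sketch of a sliding window over a stream: keep the 10 most recent
# prices and emit their average each time a new tuple arrives.
from collections import deque

def windowed_avg(stream, size=10, advance=1):
    window = deque(maxlen=size)              # [size 10 ...]
    for i, price in enumerate(stream):
        window.append(price)
        if (i + 1) % advance == 0 and len(window) == size:
            yield sum(window) / size         # avg(price) over the current window

for avg in windowed_avg(range(100)):
    print(avg)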


Graph Data14/95

Uses graphs rather than tables as basic data structure tool.

Applications: social networks, ecommerce purchases, interests, ...

Many real-world problems are modelled naturally by graphs

Graph data models:  flexible,  "schema-free",  inter-linked

Typical modeling formalisms:  XML,  JSON,  RDF

More details later ...


Dispersed Databases15/95

Characteristics of dispersed databases:

Applications: environmental monitoring devices, "intelligent dust", ...

Research issues:

Less extreme versions of this already exist:


Parallel and Distributed Databases


Parallel and Distributed Systems17/95

RDBMS discussion so far has revolved around systems with a single processor and a single disk

[Diagram:Pics/parallel/one-processor-small.png]


... Parallel and Distributed Systems18/95

Why parallelism? ... Throughput!

[Diagram:Pics/parallel/para-thru-small.png]


Parallel Architectures19/95

Types:   shared memory,   shared disk,   shared nothing

Example shared-nothing architecture:

[Diagram:Pics/parallel/shared-nothing-small.png]

Typically in the same room   (data transfer cost ~ 100's of μsecs)


... Parallel Architectures20/95

Hierarchical architectures are hybrid parallel ones

[Diagram:Pics/parallel/hybrid-arch-small.png]

Typically on a local-area network   (data transfer cost ~ msecs)


Distributed Architectures21/95

Distributed architectures are ...

[Diagram:Pics/parallel/distrib-arch-small.png]

Typically on the Internet   (data transfer cost ~ secs)


Parallel Databases (PDBs)22/95

Parallel databases provide various forms of parallelism ...

PDBs typically run on closely-connected parallel architectures,
so we focus on hybrid architectures on a LAN.


Data Storage in PDBs 23/95

Consider each table as a collection of pages ...

Page addressing: (Table, File, PageNum)

If all data for one table resides on one node, access to that table cannot be parallelised ...


... Data Storage in PDBs 24/95

However, with multiple nodes, we could partition a table's data across several nodes, or replicate it on several nodes ...

Could also have a combination of partitioning and replication


... Data Storage in PDBs 25/95

Data-partitioning example:

[Diagram:Pics/parallel/data-partition-small.png]


... Data Storage in PDBs 26/95

Data-partitioning strategies for one table:


Assume:   R(a,b,c,...),   disks D[0] .. D[n-1],   tuples tup[0] .. tup[r-1]

Storing data on many disks maximises chance for parallel data access


... Data Storage in PDBs 27/95

Round-robin partitioning:


... Data Storage in PDBs 28/95

Hash partitioning


... Data Storage in PDBs 29/95

Range partitioning
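
A minimal sketch of the three strategies, mapping each tuple of R to one of n disks numbered 0..n-1; the partitioning attribute (a) and the range boundaries are assumptions for illustration:

# Sketch: assign each tuple of R(a,b,c,...) to one of n disks D[0]..D[n-1].
n = 4                                    # number of disks

def round_robin(i, tup):
    # the i'th tuple loaded goes to disk i mod n
    return i % n

def hash_partition(tup):
    # hash the partitioning attribute (assumed here to be a) onto a disk
    return hash(tup["a"]) % n

RANGES = [100, 200, 300]                 # assumed range boundaries for attribute a

def range_partition(tup):
    # disk j holds tuples whose a-value falls in the j'th range
    for j, upper in enumerate(RANGES):
        if tup["a"] < upper:
            return j
    return len(RANGES)

print(hash_partition({"a": 42}), range_partition({"a": 250}))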


PostgreSQL and Parallelism30/95

PostgreSQL assumes

PostgreSQL allows ...

So could run on ...


... PostgreSQL and Parallelism31/95

PostgreSQL can provide

Both need data synchronisation between servers

PostgreSQL uses notion of master and slave servers.


... PostgreSQL and Parallelism32/95

High availability ...

  • updates occur on master, recorded in tx log
  • tx logs shipped/streamed from master to slave(s)
  • slave uses tx logs to maintain current state
  • configuration controls frequency of log shipping
  • bringing slave up-to-date is fast (~1-2secs)

    Note: small window for data loss (committed tx log records not sent)
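
Purely as an illustration of the idea (not PostgreSQL's actual mechanism), a sketch where the master appends committed records to a log and the slave periodically applies the records it has not yet seen; the last commit shows the window for data loss:

# Illustrative-only sketch of log shipping (not PostgreSQL internals).
master_log = []          # tx log records on the master
slave_state = {}         # current state on the slave
slave_applied = 0        # how many log records the slave has applied

def commit_on_master(key, value):
    master_log.append((key, value))      # update recorded in tx log

def ship_and_apply():
    global slave_applied
    # ship any records the slave has not yet seen, then apply them
    for key, value in master_log[slave_applied:]:
        slave_state[key] = value
    slave_applied = len(master_log)

commit_on_master("acct1", 100)
commit_on_master("acct2", 200)
ship_and_apply()                         # slave now up-to-date
commit_on_master("acct3", 300)           # if master fails now, this record is lost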


    Distributed Databases33/95

    Two kinds of distributed databases

    The latter are also called federated databases

    Distribution of data complicates tx processing ...


    ... Distributed Databases34/95

    Distributed tx processing handled by two-phase commit
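
A bare-bones sketch of the two-phase commit idea: the coordinator asks every participating node to prepare (vote), and commits only if all vote yes. The participant objects and their prepare/commit/abort methods are assumptions for illustration; real protocols also handle failures, timeouts and recovery logging.

# Simplified two-phase commit; ignores failures, timeouts and logging.
def two_phase_commit(tx, participants):
    # Phase 1: ask every participant to prepare and collect votes
    votes = [p.prepare(tx) for p in participants]
    if all(votes):
        # Phase 2: everyone voted yes -> tell all participants to commit
        for p in participants:
            p.commit(tx)
        return "committed"
    else:
        # some participant voted no -> tell all participants to abort
        for p in participants:
            p.abort(tx)
        return "aborted"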


    ... Distributed Databases35/95

    Distributed query processing

    Query optimisation in such contexts is difficult.


    Non-classical DBMSs


    Classical DBMSs37/95

    Assumptions made in conventional DBMSs:


    Modern DBMSs38/95

    Demands from modern applications

    Clearly, not all of these are relevant for every modern application.


    ... Modern DBMSs39/95

    Some conclusions:

    Some "modernists" claim that


    ... Modern DBMSs40/95

    Some approaches:


    Scale, Distribution, Replication41/95

    Data for modern applications is very large (TB, PB, XB)

    Many systems opt for massive networks of simple nodes

    Benefits:


    Schema-free Data Models42/95

    Many new DBMSs provide (key,value) stores

    Tables can be simulated by a collection of "similar" objects.
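
As a sketch of how a table can be simulated, each row becomes a value stored under a constructed key; the key format and attribute names here are invented for illustration:

# Sketch: simulating an "Enrolment" table in a (key,value) store.
# Keys and attribute names are invented for illustration only.
store = {}

store["enrolment:1"] = {"course": "COMP9315", "student": 5012345, "grade": "HD"}
store["enrolment:2"] = {"course": "COMP9315", "student": 5054321, "grade": "DN"}
store["enrolment:3"] = {"course": "COMP9024", "student": 5012345}   # "schema-free": no grade

# a "table scan" is a scan over all objects sharing the same key prefix
comp9315 = [v for k, v in store.items()
            if k.startswith("enrolment:") and v.get("course") == "COMP9315"]
print(comp9315)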


    Eventual Consistency43/95

    RDBMSs use a strong transactional/consistency model

    Many new DBMS applications do not need strong consistency

    Because of distribution/replication ...


    ... Eventual Consistency44/95

    If different nodes have different versions of data

    Levels of consistency (from Cassandra system)
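
Cassandra-style levels (e.g. ONE, QUORUM, ALL) control how many replicas must acknowledge a read or write. A common rule of thumb, sketched below, is that reads and writes are guaranteed to overlap (strong consistency) when R + W > N replicas:

# Sketch: with N replicas, reading R replicas and writing W replicas
# guarantees the read sees the latest write when R + W > N.
def is_strongly_consistent(n_replicas, read_level, write_level):
    return read_level + write_level > n_replicas

N = 3
print(is_strongly_consistent(N, read_level=1, write_level=1))  # ONE/ONE       -> False (eventual)
print(is_strongly_consistent(N, read_level=2, write_level=2))  # QUORUM/QUORUM -> True
print(is_strongly_consistent(N, read_level=N, write_level=1))  # ALL/ONE       -> True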


    MapReduce45/95

    MapReduce is a programming model

    Computation is structured in two phases:


    ... MapReduce46/95

    MapReduce makes use of (key,value) pairs

    Map(key1,val1) → list(key2,val2)

    Reduce(key2,list(val2)) → val3


    ... MapReduce47/95

    "Classic" MapReduce example (word frequency in set of docs):

    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)
     
    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: list of aggregated partial counts
      sum = 0
      for each c in partialCounts:
        sum += c
      emit (word, sum)
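
Between map and reduce, the framework groups all emitted values by key (the "shuffle"). A tiny single-machine Python simulation of the whole pipeline (illustration only; a real framework distributes this work across many nodes):

# Tiny in-memory simulation of map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_fn(name, document):
    return [(w, 1) for w in document.split()]

def reduce_fn(word, partial_counts):
    return (word, sum(partial_counts))

docs = {"d1": "the cat sat", "d2": "the cat ran"}

# map phase
pairs = [kv for name, text in docs.items() for kv in map_fn(name, text)]
# shuffle: group emitted values by key
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)
# reduce phase
print([reduce_fn(w, counts) for w, counts in groups.items()])
# e.g. [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]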
    


    ... MapReduce48/95

    MapReduce as a "database language"


    Modern vs Classical49/95

    Some criticisms of the NoSQL approach:

    [Diagram:Pics/newdb/xtra-small.png]


    Hadoop DFS50/95

    Apache Hadoop Distributed File System

    Provides support for Hadoop map/reduce implementation.

    Optimised for write-once-read-many apps


    ... Hadoop DFS51/95

    Architecture of one HDFS cluster:

    [Diagram:Pics/future/hadoop-dfs-small.png]


    ... Hadoop DFS52/95

    Datanodes ...

    A Hadoop file ...

    Datanode → Namenode reports ...


    ... Hadoop DFS53/95

    Namenodes ...

    Namenode knows a file is OK if all relevant Datanodes have sent a Blockreport


    Two Case Studies54/95

    Consider two variations on the DBMS theme ...

    Column Stores

    Graph Databases


    Column Stores


    (Based on material by Daniel Abadi et al.)


    Column Stores56/95

    Column-oriented Databases (CoDbs):

    Ideas for CoDbs have been around since the 1970's

    Rose to prominence via Daniel Abadi's PhD thesis (MIT, 2008)

    Commercial systems have now been developed (e.g. Vertica)


    ... Column Stores57/95

    File structures for row-store vs column-store:

    [Diagram:Pics/future/row-vs-col-small.png]


    Values in individual columns are related by extra tuple id (cf. oid)


    ... Column Stores58/95

    Stored representation of logical (relational) tables

    Example:  Enrolment(course,student,term,mark,grade)
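
As a sketch of the stored representation, each attribute becomes its own slice (array), and position i across the slices reconstructs logical tuple i; the values below are invented for illustration:

# Sketch: column-store representation of
#   Enrolment(course, student, term, mark, grade)
# Each attribute is stored as its own slice; position i across the
# slices reconstructs logical tuple i.  (Values are invented.)
course  = ["COMP9315", "COMP9315", "COMP3311"]
student = [5012345,    5054321,    5012345   ]
term    = ["19T2",     "19T2",     "19T1"    ]
mark    = [83,         67,         74        ]
grade   = ["DN",       "CR",       "CR"      ]

# logical tuple with tuple id 1:
tid = 1
print((course[tid], student[tid], term[tid], mark[tid], grade[tid]))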


    Rows vs Columns59/95

    Workload for different operations


    ... Rows vs Columns60/95

    Which is more efficient depends on mix of queries/updates


    ... Rows vs Columns61/95

    Storing sorted columns leads to

    Only one column in each projection will be sorted


    Query Evaluation in CoDbs62/95

    Projection is easy if one slice contains all required attributes.

    If not ...

    Example: select a,b,c from R(a,b,c,d,e)

    Assume: each column contains N values
    for i in 0 .. N-1 {
       x = a[i]   // i'th value in slice containing a
       y = b[i]   // i'th value in slice containing b
       z = c[i]   // i'th value in slice containing c
       add (x,y,z) to Results
    }
    


    ... Query Evaluation in CoDbs63/95

    If slices are sorted differently, more complicated

    Example: select a,b,c from R(a,b,c,d,e)

    Assume: each column contains N values
    for tid in 0 .. N-1 {
       x = fetch(a,tid)  // entry with tid in slice containing a
       y = fetch(b,tid)  // entry with tid in slice containing b
       z = fetch(c,tid)  // entry with tid in slice containing c
       add (x,y,z) to Results
    }
    

    Potentially slow, depending on how fetch() works.


    ... Query Evaluation in CoDbs64/95

    For remaining discussion, assume


    [Diagram:Pics/future/row-vs-col1-small.png]


    ... Query Evaluation in CoDbs65/95

    Consider typical multi-attribute SQL query

    select a,b,c from R where b > 10 and d < 7
    

    Query operation on individual column is done in one slice

    Mark index of each matching entry in a bit-vector

    Combine (AND) bit-vectors to get indexes for result entries

    For each index, merge result entry columns into result tuple

    Known as late materialization.


    ... Query Evaluation in CoDbs66/95

    Example: select a,b,c from R where b = 5

    // Assume: each column contains N values
    matches = all-zero bit-string of length N
    for i in 0 .. N-1 {
       x = b[i]  // i'th value in b column
       if (x == 5)
           matches[i] = 1  // set bit i in matches
    }
    for i in 0 .. N-1 {
       if (matches[i] == 0) continue
       add (a[i], b[i], c[i]) to Results
    }
    

    Fast sequential scanning of small (compressed?) data


    ... Query Evaluation in CoDbs67/95

    Example: select a,b,c from R where b>10 and d<7

    
    // Assume: each column contains N values
    matches1 = all-zero bit-string of length N
    matches2 = all-zero bit-string of length N
    for i in 0 .. N-1 {
       if (b[i] > 10) matches1[i] = 1
       if (d[i] < 7) matches2[i] = 1
    }
    matches = matches1 AND matches2
    for i in 0 .. N-1 {
       if (matches[i] == 0) continue
       add (a[i], b[i], c[i]) to Results
    }
    


    ... Query Evaluation in CoDbs68/95

    Join on columns, set up for late materialization

    [Diagram:Pics/future/join-small.png]


    Note: the left result column is always sorted


    ... Query Evaluation in CoDbs69/95

    Example: select R.a, S.b
                    from   R join S on R.a = S.x

    // Assume: N tuples in R, M tuples in S
    // Ra = slice holding R.a, Sx = slice holding S.x, Sb = slice holding S.b
    for i in 0 .. N-1 {
       for j in 0 .. M-1 {
          if (Ra[i] == Sx[j])
             append (i,j) to IndexList   // join index of matching positions
       }
    }
    for each (i,j) in IndexList {
       add (Ra[i], Sb[j]) to Results     // late materialization of result tuples
    }
    


    ... Query Evaluation in CoDbs70/95

    Aggregation generally involves a single column

    E.g.

    select avg(mark), count(student) from Enrolments
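
In a column store this query only needs to touch the mark and student slices; a small sketch (invented values):

# Sketch: aggregation touches only the relevant slices.
mark    = [83, 67, 74]
student = [5012345, 5054321, 5012345]

avg_mark      = sum(mark) / len(mark)
count_student = len([s for s in student if s is not None])   # count(student) ignores nulls
print(avg_mark, count_student)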
    


    Operations involving groups of columns


    Graph Databases


    (Based on material by Markus Krötzsch, Renzo Angles, Claudio Gutiérrez)


    Graph Databases72/95

    Graph Databases (GDbs):


    But what kind of "graphs"?
    Two major GDb data models:   RDF,   Property Graph


    ... Graph Databases73/95

    Typical graph modelled by a GDb

    [Diagram:Pics/future/people-graph-small.png]


    Graph Data Models74/95

    RDF  =  Resource Description Framework

    Data as triples, e.g. <Person1,given,"John">,   <Person1,parent,Person3>

    RDF is a W3C standard;  supported in many prog. languages
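
A minimal sketch of triples as Python tuples, with a simple pattern matcher in the spirit of the SPARQL queries shown later (None stands for a variable):

# Sketch: an RDF-style graph as a list of (subject, predicate, object) triples.
triples = [
    ("Person1", "given",  "John"),
    ("Person1", "parent", "Person3"),
    ("Person3", "given",  "James"),
]

def match(pattern):
    # pattern is an (s, p, o) triple where None means "any value"
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match((None, "given", "James")))   # who is named James?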


    ... Graph Data Models75/95

    RDF model of part of earlier graph:

    [Diagram:Pics/future/people-graph-rdf-small.png]


    ... Graph Data Models76/95

    Property Graph


    Not a standard like RDF, so variations exist
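
One plausible in-memory rendering (the exact node/edge structure is an assumption, since there is no single standard): nodes and edges both carry a label and a set of properties.

# Sketch of a property graph: nodes and edges carry labels and properties.
nodes = {
    1: {"label": "Person", "props": {"given": "James", "y-o-b": 1972}},
    2: {"label": "Person", "props": {"given": "Tom"}},
}
edges = [
    {"from": 1, "to": 2, "label": "sibling", "props": {"since": "birth"}},
]

# e.g. given names of nodes with a "sibling" edge to someone named Tom
tom_ids = {i for i, n in nodes.items() if n["props"].get("given") == "Tom"}
siblings = [nodes[e["from"]]["props"]["given"]
            for e in edges if e["label"] == "sibling" and e["to"] in tom_ids]
print(siblings)   # ['James']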


    ... Graph Data Models77/95

    Property Graph model of part of earlier graph:

    [Diagram:Pics/future/people-graph-prop-small.png]


    GDb Queries78/95

    Graph data models require a graph-oriented query framework

    Types of queries in GDbs


    ... GDb Queries79/95

    Graphs contain arbitrary-length paths

    Need an expression mechanism for describing such paths

    GDb query languages:


    Example Graph Queries80/95

    Example: Persons whose first name is James

    SPARQL:

    PREFIX p: <http://www.people.org/>
    SELECT ?X
    WHERE { ?X p:given "James" }
    

    Cypher:

    MATCH (person:Person)
    WHERE person.given="James"
    RETURN person
    


    ... Example Graph Queries81/95

    Example: Persons born between 1965 and 1975

    SPARQL:

    PREFIX p: <http://www.people.org/>
    SELECT ?X
    WHERE {
     ?X p:type p:Person . ?X p:y-o-b ?A .
     FILTER (?A >= 1965 && ?A <= 1975)
     }
    

    Cypher:

    MATCH (person:Person)
    WHERE person.`y-o-b` >= 1965 and person.`y-o-b` <= 1975
    RETURN person
    


    ... Example Graph Queries82/95

    Example: pairs of Persons related by the "parent" relationship

    SPARQL:

    PREFIX p: <http://www.people.org/>
    SELECT ?X ?Y
    WHERE { ?X p:parent ?Y }
    

    Cypher:

    MATCH (person1:Person)-[:parent]->(person2:Person)
    RETURN person1, person2
    


    ... Example Graph Queries83/95

    Example: Given names of people with a sibling called "Tom"

    SPARQL:

    PREFIX p: <http://www.people.org/>
    SELECT ?N
    WHERE { ?X p:type p:Person . ?X p:given ?N .
            ?X p:sibling ?Y . ?Y p:given "Tom"  }
    
    

    Cypher:

    MATCH (person:Person)-[:sibling]-(tom:Person)
    WHERE tom.given="Tom"
    RETURN person.given
    


    ... Example Graph Queries84/95

    Example: All of James' ancestors

    SPARQL:

    PREFIX p: <http://www.socialnetwork.org/>
    SELECT ?Y
    WHERE { ?X p:type p:Person . ?X p:given "James" .
            ?Y p:parent* ?X }
    

    Cypher:

    MATCH (ancestor:Person)-[:parent*]->(james:Person)
    WHERE james.given="James"
    RETURN DISTINCT ancestor
    


    Course Review + Exam


    Syllabus86/95

    View of DBMS internals from the bottom-up:


    Exam87/95

    Tuesday 27 August, 1.45pm - 5pm, (1.45 = reading time)

    Held in CSE Labs (allocations posted in Week 11).

    All answers are typed and submitted on-line.

    Environment is similar to Vlab.

    Learn to use the shell, a text editor and on-screen calculator.


    ... Exam88/95

    Resources available during exam:

    No access to any other material is allowed.

    No network access is available (e.g. no web, no email, ...)


    ... Exam89/95

    Tools available during the exam


    What's on the Exam?90/95

    Potential topics to be examined ...


    ... What's on the Exam?91/95

    Questions will have the following "flavours" ...

    There will be no SQL/PLpgSQL code writing.

    You will not have to modify PostgreSQL during the exam.


    Exam Structure92/95

    There will be 8 questions

    Reminder:


    Special Consideration93/95

    Reminder: this is a one-chance exam.

    If you're sick, get documentation and do not attend the exam.

    Special consideration requests must clearly show

    Other factors are not relevant   (e.g. "I can't afford to repeat")


    Revision94/95

    Things you can use for revision:

    Pre-exam consultations leading up to exam  (see course web site)

    Note: I'm away on August 18-20 inclusive (no email)


    And that's all folks ...95/95


    End of COMP9315 19T2 Lectures.


    Good luck with the exam ...


    And keep on using PostgreSQL ...


    Produced: 8 Aug 2019