COMP9315 23T1 ♢ Database Trends ♢ [0/14]
Core "database" goals:
- deal with very large amounts of data (petabytes, exabytes, ...)
- very-high-level languages (deal with data in uniform ways)
- fast query execution (evaluation too slow ⇒ useless)
At the moment (and for the last 30 years), RDBMSs dominate ...
- simple/clean data model, backed up by theory
- high-level language for accessing data
- 40 years development work on RDBMS engine technology
RDBMSs work well in domains with uniform, structured data.
COMP9315 23T1 ♢ Database Trends ♢ [1/14]
❖ Future of Database (cont)
Limitations/pitfalls of classical RDBMSs:
- NULL is ambiguous: unknown, not applicable, not supplied
- "limited" support for constraints/integrity and rules
- no support for uncertainty (data represents the state-of-the-world)
- data model too simple (e.g. no direct support for complex objects)
- query model too rigid (e.g. no approximate matching)
- continually changing data sources not well-handled
- data must be "molded" to fit a single rigid schema
- database systems must be manually "tuned"
- do not scale well to some data sets (e.g. Google, telcos)
COMP9315 23T1 ♢ Database Trends ♢ [2/14]
❖ Future of Database (cont)
How to overcome (some) RDBMS limitations?
Extend the relational model ...
- add new data types and query ops for new applications
- deal with uncertainty/inaccuracy/approximation in data
Replace the relational model ...
- object-oriented DBMS ... OO programming with persistent objects
- XML DBMS ... all data stored as XML documents, new query model
- NoSQL data stores (e.g. (key,value) pairs, JSON or RDF)
COMP9315 23T1 ♢ Database Trends ♢ [3/14]
❖ Future of Database (cont)
How to overcome (some) RDBMS limitations?
Performance ...
- new query algorithms/data-structures for new types of queries
- parallel processing
- DBMSs that "tune" themselves
Scalability ...
- distribute data across (more and more) nodes
- techniques for handling streams of incoming data
COMP9315 23T1 ♢ Database Trends ♢ [4/14]
❖ Future of Database (cont)
An overview of the possibilities:
- "classical" RDBMS (e.g. PostgreSQL, Oracle, SQLite)
- parallel DBMS (e.g. XPRS)
- distributed DBMS (e.g. Cohera)
- deductive databases (e.g. Datalog)
- temporal databases (e.g. MariaDB)
- column stores (e.g. Vertica, Druid)
- object-oriented DBMS (e.g. ObjectStore)
- key-value stores (e.g. Redis, DynamoDB)
- wide column stores (e.g. Cassandra, Scylla, HBase)
- graph databases (e.g. Neo4J, Datastax)
- document stores (e.g. MongoDB, Couchbase)
- search engines (e.g. Google, Solr)
COMP9315 23T1 ♢ Database Trends ♢ [5/14]
❖ Future of Database (cont)
Historical perspective
COMP9315 23T1 ♢ Database Trends ♢ [6/14]
❖ Big Data
Some modern applications have massive data sets (e.g. Google)
- far too large to store on a single machine/RDBMS
- query demands far too high even if the data could be stored in a DBMS
Approach to dealing with such data:
- distribute data over large collection of nodes (also, redundancy)
- provide computational mechanisms for distributing computation
Often this data does not need the full power of relational querying (see the sketch below)
- represent data via (key,value) pairs
- unique keys can be used for addressing data
- values can be large objects (e.g. web pages, images, ...)
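A minimal sketch of the (key,value) view of data, written in Python with illustrative names (not any particular product's API): a unique key addresses a value directly, and the value is an opaque object with no schema imposed.

# Minimal sketch of a (key,value) store interface (names are illustrative).
class KeyValueStore:
    def __init__(self):
        # in a real store, data is partitioned and replicated across many nodes
        self.data = {}

    def put(self, key, value):
        self.data[key] = value      # no schema check: the value is opaque

    def get(self, key):
        return self.data.get(key)   # the unique key addresses the data directly

store = KeyValueStore()
store.put("url:example.com/index.html", "<html>...</html>")
print(store.get("url:example.com/index.html"))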
COMP9315 23T1 ♢ Database Trends ♢ [7/14]
Popular computational approach to such data: map/reduce (see the sketch after this list)
- suitable for widely-distributed, very-large data
- allows parallel computation on such data to be easily specified
- distribute (map) parts of computation across network
- compute in parallel (possibly with further mapping)
- merge (reduce) multiple results for delivery to requestor
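A minimal word-count sketch of the map/reduce pattern in Python; it runs sequentially on one machine, so it only shows the shape of the computation (map, group by key, reduce), not the distribution of work across nodes.

from collections import defaultdict

def map_phase(doc):
    # map: emit (key, value) pairs from one input item
    return [(word, 1) for word in doc.split()]

def reduce_phase(key, values):
    # reduce: merge all the values emitted for one key
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# "shuffle": group intermediate pairs by key (done by the framework in practice)
groups = defaultdict(list)
for doc in docs:
    for key, value in map_phase(doc):
        groups[key].append(value)

results = [reduce_phase(k, vs) for k, vs in groups.items()]
print(results)    # includes ('the', 3) and ('fox', 2)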
Some big-data proponents see no future need for SQL/relational systems ...
- depends on application (e.g. hard integrity vs eventual consistency)
COMP9315 23T1 ♢ Database Trends ♢ [8/14]
❖ Information Retrieval
DBMSs generally do precise matching (although LIKE/regexps give limited pattern matching)
Information retrieval systems do approximate matching.
E.g. documents containing a set of keywords (Google, etc.)
Also introduces notion of "quality" of matching
(e.g. tuple T1 is a better match than tuple T2)
Quality also implies ranking of results.
Ongoing research in incorporating IR ideas into DBMS context.
Goal: support database exploration better.
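A rough Python sketch of approximate matching with ranking, assuming a toy in-memory document collection and a crude occurrence-count score (real IR engines use inverted indexes and measures such as tf-idf).

docs = {
    "d1": "database systems store structured data in tables",
    "d2": "search engines rank documents by relevance to keywords",
    "d3": "graph databases model highly inter-linked data",
}

def score(text, terms):
    words = text.split()
    # crude relevance: how many times do the query terms occur?
    return sum(words.count(t) for t in terms)

def search(query):
    terms = query.split()
    scored = [(doc_id, score(text, terms)) for doc_id, text in docs.items()]
    # rank matching documents by score, best first; drop non-matches
    return sorted([ds for ds in scored if ds[1] > 0], key=lambda ds: ds[1], reverse=True)

print(search("database data"))    # [('d1', 2), ('d3', 1)] -- d2 does not match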
COMP9315 23T1 ♢ Database Trends ♢ [9/14]
❖ Multimedia Data
Data which does not fit the "tabular model":
- image, video, music, text, ... (and combinations of these)
Research problems:
- how to specify queries on such data? (e.g. image1 ≅ image2)
- how to "display" results? (e.g. synchronize components)
Solutions to the first problem typically:
- extend notions of "matching"/indexes for querying
- require sophisticated methods for capturing data features
Sample query: find other songs like this one?
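A small Python sketch of the "find other songs like this one" idea, assuming feature vectors have already been extracted (the hard research problem); similarity here is plain cosine similarity between hypothetical vectors.

import math

features = {                       # hypothetical, pre-extracted feature vectors
    "songA": [0.9, 0.2, 0.5],
    "songB": [0.8, 0.3, 0.6],
    "songC": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similar_to(song, k=2):
    # rank all other songs by similarity to the given one
    others = [(s, cosine(features[song], v)) for s, v in features.items() if s != song]
    return sorted(others, key=lambda sv: sv[1], reverse=True)[:k]

print(similar_to("songA"))         # songB is ranked closest to songA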
COMP9315 23T1 ♢ Database Trends ♢ [10/14]
❖ Uncertain Data
Multimedia/IR introduces approximate matching.
In some contexts, we have approximate/uncertain data.
E.g. witness statements in a crime-fighting database
"I think the getaway car was red ... or maybe orange ..."
"I am 75% sure that John carried out the crime"
Work by Jennifer Widom at Stanford on the Trio system
- extends the relational model (ULDB)
- extends the query language (TriQL)
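A very simplified Python sketch of tuples carrying confidence values, with a join-like query that combines them by multiplying confidences (assuming independence). This only conveys the flavour of the idea, not Trio's actual ULDB/TriQL semantics.

# alternative observations, each with a confidence value
car_colour = [("red", 0.6), ("orange", 0.4)]
suspect    = [("John", 0.75), ("Paul", 0.25)]

# join-like combination: confidence of each result = product of input confidences
combined = [((colour, person), c1 * c2)
            for colour, c1 in car_colour
            for person, c2 in suspect]

for row, conf in sorted(combined, key=lambda rc: rc[1], reverse=True):
    print(row, round(conf, 2))     # ('red', 'John') 0.45 is the most likely combination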
COMP9315 23T1 ♢ Database Trends ♢ [11/14]
❖ Stream Data Management Systems
Makes one addition to the relational model
- stream = infinite sequence of tuples, arriving one-at-a-time
Applications:
news feeds, telecomms, monitoring web usage, ...
RDBMSs: run a variety of queries on (relatively) fixed data
StreamDBs: run fixed queries on changing data (stream)
One approach: window = "relation" formed from a stream via a rule
E.g. StreamSQL
select avg(price)
from examplestream [size 10 advance 1 tuples]
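A rough Python analogue of the StreamSQL query above, assuming price tuples arrive one at a time: keep the 10 most recent tuples and re-emit avg(price) as the window advances by one tuple.

from collections import deque

def windowed_avg(stream, size=10):
    window = deque(maxlen=size)          # the `size` most recent tuples
    for price in stream:                 # tuples arrive one at a time
        window.append(price)             # window advances by 1 tuple
        yield sum(window) / len(window)  # recompute the aggregate

prices = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 20, 21]
for avg in windowed_avg(prices, size=10):
    print(round(avg, 2))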
COMP9315 23T1 ♢ Database Trends ♢ [12/14]
❖ Graph Data
Uses graphs rather than tables as the basic data structuring tool.
Applications: social networks, ecommerce purchases, interests, ...
Many real-world problems are modelled naturally by graphs
- can be represented in RDBMSs, but not processed efficiently
- e.g. recursive queries on Nodes, Properties, Edges tables (see the sketch below)
Graph data models: flexible, "schema-free", inter-linked
Typical modeling formalisms: XML, JSON, RDF
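A small Python sketch of the kind of query that is awkward as a recursive query over Nodes/Edges tables but natural on a graph: reachability ("who can Alice reach?") by breadth-first search over hypothetical adjacency lists.

from collections import deque

edges = {                          # adjacency lists: person -> people they follow
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Dave"],
    "Carol": ["Dave", "Eve"],
    "Dave":  [],
    "Eve":   ["Alice"],
}

def reachable(start):
    seen, queue = {start}, deque([start])
    while queue:                   # breadth-first traversal of the graph
        node = queue.popleft()
        for nbr in edges.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen - {start}

print(reachable("Alice"))          # {'Bob', 'Carol', 'Dave', 'Eve'}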
More details later ...
COMP9315 23T1 ♢ Database Trends ♢ [13/14]
❖ Dispersed Databases
Characteristics of dispersed databases:
- very large numbers of small processing nodes
- data is distributed/shared among nodes
Applications:
environmental monitoring devices, "intelligent dust", ...
Research issues:
- query/search strategies (how to organise query processing)
- distribution of data (trade-off between centralised and diffused)
Less extreme versions of this already exist:
- grid and cloud computing
- database management for mobile devices
COMP9315 23T1 ♢ Database Trends ♢ [14/14]
Produced: 18 Apr 2023