IN5800 – Data Structure

Leif Harald Karlsen

Data Structure and Complexity

Data can be organized according to how structured it is:

  • Unstructured
    • E.g. images, text, sound
    • No metadata about the meaning of the content
    • Typically use machine learning/AI/statistics to get meaning
  • Semi-structured
    • E.g. (free) graphs, trees
    • Some metadata, typically contained within the data itself
    • Typically assume some structure or use semantics to get meaning
  • Structured
    • E.g. tables, relations
    • Strict separation between metadata and data
    • Meaning (mostly) contained in metadata
  • This course focuses on semi-structured and structured data.

Relational data

Relational structure

Problems with relational structure

NoSQL

Central idea: Loosen up the rigid structure (semi-structured)

  • Easier to create and maintain
  • Easier integration
  • Easier to scale horizontally (split data over multiple machines)
  • Limits object-relational impedance mismatch
  • Allows different forms of data in same DB
  • Note: Often read as “Not Only SQL” (Many NoSQL-DBs support SQL)

RDF/Triples

  • All statements are triples: s p o
    • Subject
    • Predicate
    • Object
  • E.g. Leif teaches IN5800
  • Data represented as sets of triples
  • More general than a graph
    • Edges between edges, e.g. teaches type Relationship
    • Also: type type Relationship
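
The idea of data as a set of triples can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain strings stand in for IRIs, and the helper name match is hypothetical, not part of any RDF library. It shows that a predicate such as teaches can itself appear as the subject of another triple, which an ordinary graph of nodes and edges does not allow.

```python
# Simplified, hypothetical names: plain strings stand in for IRIs.
triples = {
    ("Leif", "teaches", "IN5800"),
    ("teaches", "type", "Relationship"),  # a predicate used as a subject
    ("type", "type", "Relationship"),     # "type" also describes itself
}


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    }


# Everything declared to be a Relationship, which here includes predicates.
relationships = {s for (s, _, _) in match(triples, p="type", o="Relationship")}

# The predicates actually used in the data.
predicates_used = {p for (_, p, _) in triples}
```

In this small data set every predicate in use is also described by a triple of its own, so relationships and predicates_used contain the same names.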

Resources and data

Example graph (in Turtle)

@prefix ifi: <https://www.uio.no/studier/emner/matnat/ifi/> .
@prefix folk: <http://folk.uio.no/> .
@prefix ex: <http://example.com/> .

folk:leifhka ex:hasName "Leif Harald Karlsen" .
folk:leifhka ex:teaches ifi:IN5800 .
folk:leifhka ex:teaches ifi:IN2090 .
folk:leifhka ex:knows folk:martingi .
folk:martingi ex:teaches ifi:IN3060 .
ifi:IN5800 ex:hasCredits 10 .
ifi:IN3060 ex:alias ifi:IN4060 .

More compact syntax:

@prefix ifi: <https://www.uio.no/studier/emner/matnat/ifi/> .
@prefix folk: <http://folk.uio.no/> .
@prefix ex: <http://example.com/> .

folk:leifhka ex:hasName "Leif Harald Karlsen" ;
             ex:teaches ifi:IN5800, ifi:IN2090 ;
             ex:knows folk:martingi .
folk:martingi ex:teaches ifi:IN3060 .
ifi:IN5800 ex:hasCredits 10 .
ifi:IN3060 ex:alias ifi:IN4060 .

Triple’s metadata

Relational vs. triples

Translating between structures

Other structures (CSV, JSON, XML)

Definitions: Format and Structure

  • Format: The concrete layout of data in memory/disk with procedures for manipulation
    • SQL Tables
    • CSV
    • RDF
    • XML, JSON, Excel, etc.
  • Data structure: Mathematical description of the layout and manipulation of data + data representation approach + minimal semantics
    • Relational (tables/relations)
    • Triple-based (RDF, RDFS, OWL)
    • Hierarchical (trees)
    • Graph-based (property graphs/networks)
    • Key-value-based
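
The distinction between format and structure can be illustrated with a small Python sketch that writes the same relation in two formats. The function names are hypothetical, and the convention that a missing value becomes an empty CSV field is an assumption made here for illustration. The rows are the Person relation used in the example that follows.

```python
import csv
import io
import json

# One relation (structure: relational), given as column names and rows.
columns = ["pid", "name", "worksfor"]
rows = [(1, "Peter", 2), (2, "Kari", 1), (3, "Mary", 1), (4, "Nils", None)]


def to_csv(columns, rows):
    """Serialize the relation as CSV text; a missing value becomes an empty field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow(["" if v is None else v for v in row])
    return buf.getvalue()


def to_json(columns, rows):
    """Serialize the relation as a JSON array with one object per row."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


def from_csv(text):
    """Parse the CSV text back into (columns, rows)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    parsed = [
        (int(pid), name, int(worksfor) if worksfor else None)
        for pid, name, worksfor in reader
    ]
    return header, parsed


def from_json(text):
    """Parse the JSON text back into a list of row tuples."""
    return [tuple(record[c] for c in columns) for record in json.loads(text)]
```

Both round trips return the original rows: the format changes, while the relational structure stays the same.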

Data Formats and Structures

Data Format vs. Data Structure: Example

Format: SQL tables – Data structure: Relational

         Company
      
 cid | name |  founded   
-----+------+------------
   1 | UiO  | 1811-09-02
   2 | DNB  | 2003-12-03


         Person         
       
 pid | name  | worksfor 
-----+-------+----------
   1 | Peter |        2
   2 | Kari  |        1
   3 | Mary  |        1
   4 | Nils  |         

Format: SQL tables – Data structure: Triples

                                Nodes                                 
                                                        
 id  | ntype  |                     svalue                      | lang 
-----+--------+-------------------------------------------------+------
   1 | uri    | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | 
   2 | uri    | http://www.w3.org/2001/XMLSchema#date           | 
 101 | uri    | http://example.org/comp                         | 
 102 | uri    | http://example.org/comp/cid                     | 
 103 | uri    | http://example.org/comp/cid/1                   | 
 104 | uri    | http://example.org/comp/cid/2                   | 
 105 | uri    | http://example.org/comp/founded                 | 
 106 | uri    | http://example.org/comp/name                    | 
 107 | uri    | http://example.org/pers                         | 
 108 | uri    | http://example.org/pers/name                    | 
 109 | uri    | http://example.org/pers/pid                     | 
 110 | uri    | http://example.org/pers/pid/1                   | 
 111 | uri    | http://example.org/pers/pid/2                   | 
 112 | uri    | http://example.org/pers/pid/3                   | 
 113 | uri    | http://example.org/pers/pid/4                   | 
 114 | uri    | http://example.org/pers/worksfor                | 
 115 | string | DNB                                             | 
 116 | string | Kari                                            | 
 117 | string | Mary                                            | 
 118 | string | Nils                                            | 
 119 | string | Peter                                           | 
 120 | string | UiO                                             | 
 121 | date   | 1811-09-02                                      | 
 122 | date   | 2003-12-03                                      | 


          Triples
             
 subject | predicate | object 
---------+-----------+--------
     103 |         1 |    101
     103 |       105 |    121
     103 |       106 |    120
     104 |         1 |    101
     104 |       105 |    122
     104 |       106 |    115
     110 |         1 |    107
     110 |       108 |    119
     110 |       114 |    104
     111 |         1 |    107
     111 |       108 |    116
     111 |       114 |    103
     112 |         1 |    107
     112 |       108 |    117
     112 |       114 |    103
     113 |         1 |    107
     113 |       108 |    118
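
A translation from the relational tables to triples like the ones above could be sketched as follows. This is a minimal sketch, not the procedure used to produce the tables: the function name rows_to_triples and its parameters are hypothetical, the IRI patterns are read off the Nodes table, and full IRIs are used directly instead of the numeric node ids.

```python
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(table, key, rows, foreign_keys=None):
    """Translate the rows (dicts) of one table into a set of (s, p, o) triples.

    foreign_keys maps a column name to (referenced table, referenced key column).
    """
    foreign_keys = foreign_keys or {}
    triples = set()
    for row in rows:
        subject = f"{BASE}{table}/{key}/{row[key]}"
        triples.add((subject, RDF_TYPE, f"{BASE}{table}"))
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key is part of the subject IRI; NULL yields no triple
            if column in foreign_keys:
                ref_table, ref_key = foreign_keys[column]
                value = f"{BASE}{ref_table}/{ref_key}/{value}"  # link to the other row
            triples.add((subject, f"{BASE}{table}/{column}", value))
    return triples


companies = [
    {"cid": 1, "name": "UiO", "founded": "1811-09-02"},
    {"cid": 2, "name": "DNB", "founded": "2003-12-03"},
]
persons = [
    {"pid": 1, "name": "Peter", "worksfor": 2},
    {"pid": 2, "name": "Kari", "worksfor": 1},
    {"pid": 3, "name": "Mary", "worksfor": 1},
    {"pid": 4, "name": "Nils", "worksfor": None},
]

triples = rows_to_triples("comp", "cid", companies) | rows_to_triples(
    "pers", "pid", persons, foreign_keys={"worksfor": ("comp", "cid")}
)
```

Each row yields one type triple plus one triple per non-key, non-NULL column, so Nils contributes no worksfor triple, just as in the Triples table.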

Format: RDF – Data structure: Triples

@prefix ex:   <http://example.org/public/> .
@prefix ex-c: <http://example.org/public/Company/> .
@prefix ex-p: <http://example.org/public/pers/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex-c:cid1 a ex:Company ;
  ex-c:founded "1811-09-02"^^xsd:date ;
  ex-c:name "UiO" .

ex-c:cid2 a ex:Company ;
  ex-c:founded "2003-12-03"^^xsd:date ;
  ex-c:name "DNB" .

ex-p:pid1 a ex:Person ;
  ex-p:name "Peter" ;
  ex-p:worksfor ex-c:cid2 .

ex-p:pid2 a ex:Person ;
  ex-p:name "Kari" ;
  ex-p:worksfor ex-c:cid1 .

ex-p:pid3 a ex:Person ;
  ex-p:name "Mary" ;
  ex-p:worksfor ex-c:cid1 .

ex-p:pid4 a ex:Person ;
  ex-p:name "Nils" .

Format: RDF – Data structure: Relational

@prefix ex:  <http://example.org/public/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Company a ex:Table ;
  ex:columns ("cid" "name" "founded") ;
  ex:row (1 "UiO" "1811-09-02"^^xsd:date) ,
         (2 "DNB" "2003-12-03"^^xsd:date) .

ex:Person a ex:Table ;
  ex:columns ("pid" "name" "worksfor") ;
  ex:row (1 "Peter" 2) ,
         (2 "Kari"  1) ,
         (3 "Mary"  1) ,
         (4 "Nils"  ex:null) .

Definitions: Data Schema

Conceptual vs. Logical vs. Physical Data Schema: Relational Example

CREATE TABLE company(
    cid int PRIMARY KEY,
    name text,
    founded date
);

CREATE TABLE person(
    pid int PRIMARY KEY,
    name text,
    worksFor int REFERENCES Company(cid)
);

CREATE INDEX person_name ON person(name);

-- Second variant of the same schema (bigint keys, founded stored as int, extra index)
CREATE TABLE company(
    cid bigint PRIMARY KEY,
    name text,
    founded int
);

CREATE TABLE person(
    pid bigint PRIMARY KEY,
    name text,
    worksFor bigint REFERENCES Company(cid)
);

CREATE INDEX person_name ON person(name);
CREATE INDEX company_name ON company(name);
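
To see the first variant of the schema in use, the following sketch loads it into an in-memory SQLite database through Python's sqlite3 module and fills it with the Company and Person rows from the earlier example. SQLite is an assumption made here so the example needs no server; the slides do not say which DBMS is intended, and the person named Ola is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE company(
    cid int PRIMARY KEY,
    name text,
    founded date
);
CREATE TABLE person(
    pid int PRIMARY KEY,
    name text,
    worksFor int REFERENCES company(cid)
);
CREATE INDEX person_name ON person(name);
""")

conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?)",
    [(1, "UiO", "1811-09-02"), (2, "DNB", "2003-12-03")],
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Peter", 2), (2, "Kari", 1), (3, "Mary", 1), (4, "Nils", None)],
)

# Each person with the name of the employer, if any.
employees = conn.execute("""
    SELECT p.name, c.name
    FROM person AS p LEFT JOIN company AS c ON p.worksFor = c.cid
    ORDER BY p.pid
""").fetchall()

# A hypothetical person referring to a company that does not exist is rejected.
try:
    conn.execute("INSERT INTO person VALUES (5, 'Ola', 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

The REFERENCES clause is what makes the last insert fail; the index on person(name) changes only how lookups by name are executed, not which rows they return.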

Example: Different schemas for different use

Example cont.: Different physical schemas

Explicit vs. Implicit

Virtual vs. Stored

Qualitative vs. Quantitative

Data structure and use

  • Different structures are good for different things, e.g.
    • Triples good for integration and maintenance, relational for security and efficiency
    • Virtual data saves space, but stored data is computationally more efficient to query
    • Implicit data might be useful, but it is impossible to extract/define everything
    • Qualitative data might be more convenient for humans, but expensive to compute/store
  • Need to think about these trade-offs when engineering data
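
The virtual vs. stored trade-off can be made concrete with a small sketch, again assuming SQLite through Python's sqlite3 module. The view and table names, and the extra person Ola, are hypothetical. A view keeps only its definition and is recomputed on each query, while a table created from the same query keeps a copy of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person(pid int PRIMARY KEY, name text, worksFor int);
INSERT INTO person VALUES
    (1, 'Peter', 2), (2, 'Kari', 1), (3, 'Mary', 1), (4, 'Nils', NULL);

-- Virtual: only the definition is kept; rows are recomputed per query.
CREATE VIEW staff_count_virtual AS
    SELECT worksFor AS cid, count(*) AS n
    FROM person
    WHERE worksFor IS NOT NULL
    GROUP BY worksFor;

-- Stored: the result is computed once and kept as ordinary rows.
CREATE TABLE staff_count_stored AS
    SELECT * FROM staff_count_virtual;
""")

# New base data arrives after the stored copy was made.
conn.execute("INSERT INTO person VALUES (5, 'Ola', 2)")

virtual = dict(conn.execute("SELECT cid, n FROM staff_count_virtual").fetchall())
stored = dict(conn.execute("SELECT cid, n FROM staff_count_stored").fetchall())
```

After the insert, the view reports two employees for company 2 while the stored copy still reports one. The stored table avoids recomputing the aggregate, at the cost of extra space and the need to refresh it.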