Lecture 2: Data Structure

This lecture gives an overview of the different structures data can take, the different qualities data can have, and how these are related.

Slides

Relevant Wiki pages

PostgreSQL (relational DB and backend for triplestore)
RDF (Triple-based data representation)
Lore (Declarative extension of PostgreSQL)
Triplelore (Triplestore: RDF DB based on Lore)

Exercises

1. Different structures

a)

You work in a data center with lots of computers. You need to keep track of which computers you have, and their parts, for management and maintenance. You get the following statements from the technicians that installed the computers:

A computer consists of parts
Every concrete computer has a unique MAC address that identifies it, and a label
Every concrete part in a computer has a unique registration number, and a label
Computer with MAC address 12.34.56.78 and label Thinkpad X123 has two parts:
- A part with registration number 987.654.321 and label Motherboard XYZ
- A part with registration number 111.222.333 and label RAM 8GB

Unfortunately, you were not able to decide on a type of database, and decided to make both a triplestore and a relational database for storing the information, with the aim of dropping one of them once it becomes clear which system fits the task best.

So, make a set of triples and a relational database (e.g. with SQL-statements) containing the information above. For the triples, you do not need to use any ontology language (such as OWL), simply make resources as you see fit, with intuitive names.
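As a starting point, the two representations can be sketched side by side. The sketch below is only one possible modelling, not the reference solution. It uses Python with the built-in sqlite3 module rather than PostgreSQL, and the table names, column names, and ex: resource names are all assumptions chosen for illustration.

```python
import sqlite3

# Relational version: one table per kind of thing (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE computer(
  mac   text PRIMARY KEY,
  label text NOT NULL
);
CREATE TABLE part(
  regnr    text PRIMARY KEY,
  label    text NOT NULL,
  computer text REFERENCES computer(mac)
);
INSERT INTO computer VALUES ('12.34.56.78', 'Thinkpad X123');
INSERT INTO part VALUES ('987.654.321', 'Motherboard XYZ', '12.34.56.78');
INSERT INTO part VALUES ('111.222.333', 'RAM 8GB', '12.34.56.78');
""")

# Triple version: a set of (subject, predicate, object) tuples.
triples = {
    ("ex:comp1", "ex:hasMAC", "12.34.56.78"),
    ("ex:comp1", "ex:hasLabel", "Thinkpad X123"),
    ("ex:part1", "ex:hasRegNr", "987.654.321"),
    ("ex:part1", "ex:hasLabel", "Motherboard XYZ"),
    ("ex:part1", "ex:partOf", "ex:comp1"),
    ("ex:part2", "ex:hasRegNr", "111.222.333"),
    ("ex:part2", "ex:hasLabel", "RAM 8GB"),
    ("ex:part2", "ex:partOf", "ex:comp1"),
}

# The same question asked of both representations:
# which parts does the computer have?
sql_parts = sorted(row[0] for row in conn.execute(
    "SELECT label FROM part WHERE computer = '12.34.56.78'"))
labels = {s: o for s, p, o in triples if p == "ex:hasLabel"}
rdf_parts = sorted(labels[s] for s, p, o in triples
                   if p == "ex:partOf" and o == "ex:comp1")
```

Note that the relational version fixes the structure up front (a part belongs to at most one computer), while the triple version only states individual facts.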

b)

The technicians did not give you the full picture in the beginning, and sent you the following updated statements via email:

Parts may be part of other parts (instead of computers directly)
Part with registration number 987.654.321 has two parts:
- A part with registration number 192.837.465 and label CPU 5.2 GHz
- A part with registration number 999.888.777 and label GPU 4 GHz

Also, as the data center grows, you now see the need to divide the computers into separate clusters, both for resource management and for maintenance.

These are the updates you get from the technicians once they are done with the clustering:

Computers are parts of clusters
Clusters are uniquely identified by a cluster-ID, and also have a label
Cluster with cluster-ID 1A, and label main, has as part the computer with MAC address 12.34.56.78

Update your databases to also contain the information given above. For SQL, write SQL-commands that act on the previously made schema (e.g. ALTER TABLE-commands), and for RDF simply write “Add triples” or “Remove triples”.
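The difference between the two kinds of update can be illustrated with a small sketch. It assumes a hypothetical earlier schema with tables computer and part, uses Python's built-in sqlite3 module rather than PostgreSQL, and covers only one of the new statements. The column parent_part and the ex: names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A minimal version of an earlier schema (hypothetical names).
conn.executescript("""
CREATE TABLE computer(mac text PRIMARY KEY, label text);
CREATE TABLE part(
  regnr    text PRIMARY KEY,
  label    text,
  computer text REFERENCES computer(mac)
);
INSERT INTO computer VALUES ('12.34.56.78', 'Thinkpad X123');
INSERT INTO part VALUES ('987.654.321', 'Motherboard XYZ', '12.34.56.78');
""")

# Relational update: the schema has to change before the new facts fit.
conn.executescript("""
ALTER TABLE part ADD COLUMN parent_part text REFERENCES part(regnr);
INSERT INTO part VALUES ('192.837.465', 'CPU 5.2 GHz', NULL, '987.654.321');
""")

# Triple update: no schema change, only new triples are added.
triples = {
    ("ex:part1", "ex:hasRegNr", "987.654.321"),
    ("ex:part1", "ex:partOf", "ex:comp1"),
}
triples |= {
    ("ex:part3", "ex:hasRegNr", "192.837.465"),
    ("ex:part3", "ex:hasLabel", "CPU 5.2 GHz"),
    ("ex:part3", "ex:partOf", "ex:part1"),
}

# Parts that sit directly inside the motherboard.
sub_parts = conn.execute(
    "SELECT label FROM part WHERE parent_part = '987.654.321'").fetchall()
```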

c)

After some time of operation, you realize that in certain situations, it would be useful to view the data center as simply a collection of components, arranged in a part-of-hierarchy (for example: if a part in a part is broken, the whole part is broken; if a part in a computer is broken, the whole computer is broken; and if a computer in a cluster is broken, the whole cluster is broken). You therefore see the need for having a single notion of component, and a single part-of-relation that contains the full part-of-hierarchy. In such cases, it would also be useful to be able to handle all the different types of objects as a single thing. However, you have already implemented lots of programs using the original databases, so no breaking changes can be made to the original schemas (but no data should be duplicated either).

Thus, your database needs to adhere to the following statements as well:

All objects (i.e. clusters, computers and parts) should be gathered into a single notion called “component”, each with its unique ID and label
The “part of” relation between computers and clusters, parts and computers, and between parts should all be the same relation

Now further extend/change your data and meta-data made above to also include this information. You can assume that cluster-IDs, MAC-addresses, and registration numbers are all just text. Also, they all have a distinct form, so that e.g. a MAC-address will never equal a cluster-ID or a registration number.
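On the relational side, one way to add a unified notion without touching the existing tables or copying data is to define views over them. The following is a minimal sketch under stated assumptions, not the reference solution: the schema, the view names component and part_of, and all column names are hypothetical, and SQLite (via Python's sqlite3 module) stands in for PostgreSQL. It relies on the statement above that the three kinds of identifiers never collide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical original schema after the earlier updates.
CREATE TABLE cluster(cid text PRIMARY KEY, label text);
CREATE TABLE computer(mac text PRIMARY KEY, label text,
                      cluster text REFERENCES cluster(cid));
CREATE TABLE part(regnr text PRIMARY KEY, label text,
                  computer text REFERENCES computer(mac),
                  parent_part text REFERENCES part(regnr));

INSERT INTO cluster VALUES ('1A', 'main');
INSERT INTO computer VALUES ('12.34.56.78', 'Thinkpad X123', '1A');
INSERT INTO part VALUES ('987.654.321', 'Motherboard XYZ', '12.34.56.78', NULL);
INSERT INTO part VALUES ('192.837.465', 'CPU 5.2 GHz', NULL, '987.654.321');

-- Views add the unified notions without copying any rows.
CREATE VIEW component(id, label) AS
  SELECT cid, label FROM cluster
  UNION ALL SELECT mac, label FROM computer
  UNION ALL SELECT regnr, label FROM part;

CREATE VIEW part_of(child, parent) AS
  SELECT mac, cluster FROM computer WHERE cluster IS NOT NULL
  UNION ALL SELECT regnr, computer FROM part WHERE computer IS NOT NULL
  UNION ALL SELECT regnr, parent_part FROM part WHERE parent_part IS NOT NULL;
""")

# Everything the CPU is (transitively) part of, via a recursive query.
ancestors = [row[0] for row in conn.execute("""
WITH RECURSIVE up(id) AS (
  SELECT parent FROM part_of WHERE child = '192.837.465'
  UNION
  SELECT parent FROM part_of JOIN up ON part_of.child = up.id
)
SELECT id FROM up
""")]

component_count = conn.execute("SELECT count(*) FROM component").fetchone()[0]
```

Because the views are computed from the original tables, existing programs keep working unchanged, and a query such as "what is broken if this part is broken" only needs the single part_of relation.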

2. Implicit data

Given the following triples:

ex:carl ex:hasName "Carl Smith";
        ex:livesAt "Streetroad 1, 1234, Oslo" .
ex:mary ex:hasName "Mary Smith";
        ex:livesAt "Streetalley 2, 2345, Oslo" .

where ex:hasName relates people to their names and ex:livesAt relates people to the address of the house they live in.

Write down at least 10 triples implicit in this data. You can invent new resources, such as ex:hasSameAddress or ex:Person, but give them a natural language definition.
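To make the idea of implicit triples concrete, here is a small sketch that derives a few of them mechanically. It is only an illustration in Python: the resources ex:Person, ex:hasFamilyName, ex:livesInCity and ex:hasPostcode are invented here, and the rules rest on assumptions about the data (the last word of a name is the family name, and addresses have the form "street, postcode, city").

```python
triples = {
    ("ex:carl", "ex:hasName", "Carl Smith"),
    ("ex:carl", "ex:livesAt", "Streetroad 1, 1234, Oslo"),
    ("ex:mary", "ex:hasName", "Mary Smith"),
    ("ex:mary", "ex:livesAt", "Streetalley 2, 2345, Oslo"),
}

implicit = set()
for s, p, o in triples:
    if p == "ex:hasName":
        # ex:hasName relates people to their names, so the subject is a person.
        implicit.add((s, "rdf:type", "ex:Person"))
        # Assumption: the last word of the name is the family name.
        implicit.add((s, "ex:hasFamilyName", o.split()[-1]))
    if p == "ex:livesAt":
        # Assumption: addresses have the form "street, postcode, city".
        street, postcode, city = [x.strip() for x in o.split(",")]
        implicit.add((s, "ex:livesInCity", city))
        implicit.add((s, "ex:hasPostcode", postcode))
```

None of the derived triples are stated in the data. Each one follows only from background knowledge about what names and addresses mean, which is exactly what makes them implicit.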

3. Meaning from structure in the relational model

Given the following relational database schema (defined in SQL):

-- pong is created first, since ping references it
CREATE TABLE pong(
  bop int PRIMARY KEY,
  pop int
);

CREATE TABLE ping(
  bip int PRIMARY KEY,
  pip int,
  kip int REFERENCES pong(bop)
);

CREATE TABLE bang(
  bap int REFERENCES ping(bip),
  pap int REFERENCES pong(bop),
  kap text,
  PRIMARY KEY (bap, pap)
);

What can you say about ping, bip, pip, kip, pong, bop, pop, bang, bap, pap, and kap? What are they?
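The purely structural part of the answer can be read off mechanically. As a rough sketch, the following Python code loads the schema into an in-memory SQLite database (an assumption; the course uses PostgreSQL) and lists each table's key columns and references using SQLite's PRAGMA table_info and PRAGMA foreign_key_list. The helper describe is a name made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pong(bop int PRIMARY KEY, pop int);
CREATE TABLE ping(bip int PRIMARY KEY, pip int, kip int REFERENCES pong(bop));
CREATE TABLE bang(
  bap int REFERENCES ping(bip),
  pap int REFERENCES pong(bop),
  kap text,
  PRIMARY KEY (bap, pap)
);
""")

def describe(table):
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {
        "key": sorted(c[1] for c in cols if c[5] > 0),
        "references": sorted((fk[3], fk[2], fk[4]) for fk in fks),
    }

structure = {t: describe(t) for t in ("ping", "pong", "bang")}
```

The result says which columns identify rows and which columns point to other tables, but nothing about what the rows stand for. That remaining gap is what the exercise asks you to reason about.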

4. Meaning from structure in RDF

Given the following triples:

ex:ping ex:pang ex:peng .
ex:pong ex:pong ex:pang .
ex:peng ex:pong ex:peng .

What can you say about ex:ping, ex:pang, ex:peng, and ex:pong?
If you are given the information that ex:pong is a property that only relates properties to other properties, what can you then say about the other resources?
If you are given the information that ex:pong denotes equality, what can you now say about the resources in the above graph?
Now assume you are given the information that ex:pong denotes (reflexive) superproperty (i.e. the inverse of subproperty) (e.g. ex:knows is a superproperty of ex:hasFriend, and ex:inRelationshipWith is a superproperty of ex:isMarriedTo). What can you now say about the resources in the above graph?
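A useful first step for all four questions is to note in which positions each resource occurs. The sketch below does this in Python for the three triples above. It only records structure and makes no assumption about what the resources mean.

```python
triples = [
    ("ex:ping", "ex:pang", "ex:peng"),
    ("ex:pong", "ex:pong", "ex:pang"),
    ("ex:peng", "ex:pong", "ex:peng"),
]

# Map each resource to the set of positions it appears in.
positions = {}
for s, p, o in triples:
    positions.setdefault(s, set()).add("subject")
    positions.setdefault(p, set()).add("predicate")
    positions.setdefault(o, set()).add("object")

# Resources that are actually used as a property in some triple.
used_as_property = {r for r, pos in positions.items() if "predicate" in pos}
```

Here ex:pang and ex:pong are used as properties, while ex:pang and ex:peng also occur as objects. Any further conclusion depends on the extra information given about ex:pong.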

Solution

The solution to these exercises can be found here.