IN5800 – Data Transformation

Leif Harald Karlsen

Definitions: Data transformation

Data transformation: Any function/procedure that takes data as input and produces data as output
Input called source, output called target
Can compose multiple data transformations into larger data transformations
Today we will look at
- different ways of combining transformations
- different types of data transformations
- what one needs to think about when making and composing data transformations

Data pipelines

Can view data transformations as graphs: Sources and targets are nodes, tranformations are edges
Data pipeline: A collection of data transformation that forms a connected graph
Will have a separate lecture on data pipelines

Recap: Conceptual vs. logical vs. physical data schema

CREATE TABLE company(
    cid int PRIMARY KEY,
    name text,
    founded date
);

CREATE TABLE person(
    pid int PRIMARY KEY,
    name text,
    worksFor int REFERENCES Company(cid)
);

CREATE INDEX person_name ON person(name);

CREATE TABLE company(
    cid bigint PRIMARY KEY,
    name text,
    founded date
);

CREATE TABLE person(
    pid bigint PRIMARY KEY,
    name text,
    worksFor bigint REFERENCES Company(cid)
);

CREATE INDEX person_name ON person(name);
CREATE INDEX person_name ON company(name);

Recap: Format and structure

Format: The concrete layout of data in memory/disk with procedures for manipulation
- SQL Tables
- CSV
- RDF
- XML, JSON, Excel, etc.
Data structure: Mathematical description of the layout and manipulation of data + data representation approach + minimal semantics
- Relational (tables/relations)
- Triple-based (RDF, RDFS, OWL)
- Hierarchical (trees)
- Graph-based (property graphs/networks)
- Key-value-based

Types of data transformations

Format transformation: Change data from one format to another
Structural transformation: Change data from one data structure to another
Conceptual transformation: Change the conceptual schema of the data
Logical/physical transformation: Change the logical/physical schema of the data within the same structure/format
Changes to any format, structure or schema changes the possible use and meaning of data
A complex data transformation can typically be split into simpler transformations doing one thing

Transform data or schema

Direct data transformation: A data transformation that directly changes the data
Indirect data transformation: A data transformation that implicitly changes the data by changing the schema
Data transformation via indirect data transformations is often called data mediation

Physical transformation

Change the physical schema of the data
Typically done for:
- Improving efficiency/performance of data management/manipulation/querying
- Fit data for efficient use by a particular program/algorithm

Examples: Physical transformation

Add an index to a database, e.g.:

CREATE INDEX name_index ON person(name);

Change the type of a column in a database, e.g.:

ALTER TABLE person ALTER COLUMN age TYPE bigint;

Add rule to redirect INSERTs

CREATE RULE ancestor_insert AS
ON INSERT TO ancestors DO INSTEAD 
INSERT INTO ancestors_raw VALUES (NEW.id, NEW.ancestor);

Adding a prefix and refactoring IRIs in RDF
Changing the order of triples in an RDF-file

Changing from row-based to column-based storage

-- Using https://github.com/citusdata/cstore_fdw
CREATE EXTENSION citus;

ALTER TABLE person RENAME TO person_row;

CREATE TABLE person (
  LIKE person_row INCLUDING ALL
) USING columnar; -- use columnar storage

INSERT INTO person SELECT * FROM person_row;

DROP TABLE person_row;

Format transformations

Move data from one format to another
Typical for data migration
Reasons for only transforming format (not structure, logical or conceptual schema) could be
- Move away from obselete format
- Change to format supported by particular software
- Easier to manipulate data in a particular format

Examples: Format transformation

CSVREAD in H2 or COPY FROM in PostgreSQL (CSV to SQL-table)
Triplelore’s rdftolore-command (RDF to Lore-relations)
R2RML/ontop-mappings (may also change structure)
Foreign data wrappers (may also change structure)

Logical transformations

A transformation that changes the logical schema (within a structure)
Useful for restructuring data to e.g.:
- better fit a particular use-case
- to add layers of abstraction on top of data
- to ease maintenance or extraction

Examples: Logical transformation

Database normalization: Reduces reduncancy in the data

             Person
                                     
 id | name | companyid | companyname 
----+------+-----------+-------------
  1 | Per  |         2 | UiO
  2 | Kari |         1 | DNB
  3 | Mary |         1 | DNB
  3 | Mary |         2 | UiO
  
           
           ||
           || Normalization
           \/
  
  
    Person             Company
     
   id | name          id | name 
  ----+------        ----+------
    1 | Per            1 | UiO
    2 | Kari           2 | DNB
    3 | Mary                         
  
  
             WorksFor
             
         person | company  
        --------+---------
              1 |       2 
              2 |       1
              3 |       1 
              3 |       2

Views for derivable columns: Add views computing derviable values

        Person                             Person_age
                                                     
 id | name |    born              id | name |    born    | age
----+------+------------         ----+------+------------+-----                   
  1 | Per  | 2001-01-03    ==>     1 | Per  | 2001-01-03 |  21
  2 | Kari | 1989-06-09            2 | Kari | 1989-06-09 |  31                   
  3 | Mary | 1995-09-08            3 | Mary | 1995-09-08 |  25

Structural transformation

Move data from one structure to another, e.g.
- to a DB/structure with different properties
- for different types of use (sales vs. decision making)
Integrate different data sources in a single database/structure
Destructuring: From structurally rich to poor (relational to triples)
Structuring: From structurally poor to rich (triples to relational)
Lifting: From semantically poor to rich (relational to OWL)
Lowering: From semantically rich to poor (OWL to relational)

Example: Structural transformation

Direct mappings: Preserves as much of the conceptual schema as possible in the target structure
- Useful for moving data from one structure to another without changing the meaning
OTTR Templates: Tabular to triples (may also change conceptual schema)
SPARQL SELECT-queries: RDF to Tabular

Conceptual transformation

A transformation that changes the conceptual data schema
Often necessary when doing full/semantic data integration
Abstraction can also change the conceptual schema

Example: Conceptual transformation

Change an attribute to an entity and a relation

Change an attribute to a collection of classes

ex:peter ex:hasDegreee "Bachelor" .
ex:mary ex:hasDegreee "Master" .
ex:carl ex:hasDegreee "Master" .
ex:ingrid ex:hasDegreee "PhD" .

         ||
         ||
         \/

ex:peter a ex:Bachelor .
ex:mary a ex:Master .
ex:carl a ex:Master .
ex:ingrid a ex:PhD .

Example: Complex transformations for integration

Given the following CSV-file describing authours and the books they have written:

aid, name, isbn, title
1, Peter, 12345, The Troll
1, Peter, 67890, The Troll Returns
2, Mary, 13579, A Planet Adventure
3, Nils, 24680, Caves of Treasure
3, Nils, 13246, Dungeons of Gold

Problem: Transform into RDF using the following DC-Terms vocabulary:

Use @prefix dc: <http://purl.org/dc/terms/>
dc:creator: Relates a creator to its work
dc:title: Relates a thing to its title
dc:identifier: Relates a thing to its unique identifier
dc:BibliographicResource: Class of bibliographic resources, such as books
dc:Agent: Things that can act, e.g. a person

Erroneous Solution:

Map the CSV directly to RDF as ex:person{aid} ex:<column> <value>
Make an integrating ontology, e.g. ex:name rdfs:subPropertyOf dc:title

Result after first step:

ex:person1 ex:name "Peter" ;
  ex:isbn "12345", "67890" ;
  ex:title "The Troll", "The Troll Returns" .

ex:person2 ex:name "Mary" ;
  ex:isbn "13579" ;
  ex:title "The Planet Adventure" .

ex:person3 ex:name "Nils" ;
  ex:isbn "24680", "13246" ;
  ex:title "Caves of Treasure", "Dungeons of Gold" .

Supposed to only be a structural transformation but substantially changed the conceptual schema!

Solution 1:

Format transformation: CSV to SQL table
- using e.g. COPY
Logical transformation: Normalization
- split into Person(id, name), Book(isbn, title) and Wrote(person, book)
Structural transformation: Direct mapping
- E.g. with Triplelore
Conceptual transformation: Add integrating ontology
- E.g. ex-p:name rdfs:subPropertyOf dc:title

Solution 2:

Templates:

ex-t:Agent[ ottr:IRI ?person, xsd:string ?name ] :: {
    o-rdf:Type(?person, dc:Agent),
    ottr:Triple(?person, dc:title, ?name)
}.

ex-t:Book[ ottr:IRI ?book, xsd:string ?isbn, xsd:string ?title ] :: {
    o-rdf:Type(?book, dc:BibliographicResource),
    ottr:Triple(?book, dc:identifier, ?isbn),
    ottr:Triple(?book, dc:title, ?title)
} .

ex-t:Wrote[ ottr:IRI ?person, ottr:IRI ?book ] :: {
    ottr:Triple(?person, dc:creator, ?book)
} .

bOTTR-mappings:

ex-m:Person a :InstanceMap ;
  :template ex-t:Agent ;
  :query """SELECT DISTINCT CONCAT('http://example.org/person/', aid), name
            FROM CSVREAD('authors.csv', null, 'charset=UTF-8 fieldSeparator=,');""" ;
  :argumentMaps (
    [ :type :IRI ]
    [ :type xsd:string ]
    ) ;
  :source
    [ a :H2Source ] .

ex-m:Book a :InstanceMap ;
  :template ex-t:Book ;
  :query """SELECT DISTINCT CONCAT('http://example.org/book/', isbn), isbn, title
            FROM CSVREAD('authors.csv', null, 'charset=UTF-8 fieldSeparator=,');""" ;
  :argumentMaps (
    [ :type :IRI ]
    [ :type xsd:string ]
    [ :type xsd:string ]
    ) ;
  :source
    [ a :H2Source ] .
    
ex-m:Agents a :InstanceMap ;
  :template ex-t:Wrote ;
  :query """SELECT DISTINCT CONCAT('http://example.org/person/', aid), CONCAT('http://example.org/book/', isbn)
            FROM CSVREAD('authors.csv', null, 'charset=UTF-8 fieldSeparator=,');""" ;
  :argumentMaps (
    [ :type :IRI ]
    [ :type :IRI ]
    ) ;
  :source
    [ a :H2Source ] .

Result:

ex-p:1  rdf:type  dc:Agent ;
        dc:title  "Peter" ;
        dc:creator  ex-b:12345, ex-b:67890 .

ex-p:2  rdf:type  dc:Agent ;
        dc:title  "Mary" ;
        dc:creator  ex-b:13579 .

ex-p:3  rdf:type  dc:Agent ;
        dc:title  "Nils" ;
        dc:creator  ex-b:13246, ex-b:24680 .

ex-b:12345  rdf:type   dc:BibliographicResource ;
        dc:title       "The Troll" ;
        dc:identifier  "12345" .

ex-b:13246  rdf:type   dc:BibliographicResource ;
        dc:title       "Dungeons of Gold" ;
        dc:identifier  "13246" .

ex-b:13579  rdf:type   dc:BibliographicResource ;
        dc:title       "A Planet Adventure" ;
        dc:identifier  "13579" .

ex-b:24680  rdf:type   dc:BibliographicResource ;
        dc:title       "Caves of Treasure" ;
        dc:identifier  "24680" .

ex-b:67890  rdf:type   dc:BibliographicResource ;
        dc:title       "The Troll Returns" ;
        dc:identifier  "67890" .

Interesting transformations

Identity transform: No change to data, same out as in
- Useful for moving data from one concrete DB instance to another of same type/format
Saturation: A transformation that only adds data (within the same structure/format), and can possibly extend the physical, logical and conceptual schemas
- Useful for adding implicit data
- Sometimes called constructive transformation or enrichment
Filter: A transformation that only removes data
- Useful for removing irrelevant or erroneous data (e.g. data cleaning)
- Sometimes called destructive transformation

Interesting transformations: Transposition

Want to do some analysis on statistics on open positions by domain:

CREATE SCHEMA stillinger;

CREATE TABLE stillinger.original(
    domain text,
    y2010 int,
    y2011 int,
    y2012 int,
    y2013 int,
    y2014 int,
    y2015 int,
    y2016 int,
    y2017 int,
    y2018 int,
    y2019 int,
    y2020 int,
    y2021 int,
    y2022 int
);

-- data from https://www.ssb.no/statbank/table/08836/tableViewLayout2/ 
\copy stillinger.original FROM 'ledige_stillinger.csv' WITH DELIMITER ';' CSV HEADER;

Unfortunately, data is not fit for queries comparing data for arbitrary years, e.g.

Find the domain and pair of consecutive years with highest difference in number of open positions

Need to transpose the data, so that years are part of data and not meta data! I.e.:

CREATE TABLE stillinger.transposed(
    domain text,
    year int,
    open_positions int
);

INSERT INTO stillinger.transposed
SELECT domain, 2010, y2010 FROM stillinger.original
UNION ALL
SELECT domain, 2011, y2011 FROM stillinger.original
UNION ALL
SELECT domain, 2012, y2012 FROM stillinger.original
UNION ALL
SELECT domain, 2013, y2013 FROM stillinger.original
UNION ALL
SELECT domain, 2014, y2014 FROM stillinger.original
UNION ALL
SELECT domain, 2015, y2015 FROM stillinger.original
UNION ALL
SELECT domain, 2016, y2016 FROM stillinger.original
UNION ALL
SELECT domain, 2017, y2017 FROM stillinger.original
UNION ALL
SELECT domain, 2018, y2018 FROM stillinger.original
UNION ALL
SELECT domain, 2019, y2019 FROM stillinger.original
UNION ALL
SELECT domain, 2020, y2020 FROM stillinger.original
UNION ALL
SELECT domain, 2021, y2021 FROM stillinger.original
UNION ALL
SELECT domain, 2022, y2022 FROM stillinger.original;

Can now make above query:

SELECT domain, t1.year, t2.year, t1.open_positions - t2.open_positions AS diff
FROM stillinger.transposed AS t1
     JOIN stillinger.transposed AS t2 USING (domain)
WHERE t1.year + 1 = t2.year
ORDER BY abs(t1.open_positions - t2.open_positions) DESC
LIMIT 1;

Interesting transformations: Reification

Sometimes we want to make statements about statements, e.g.
- by whom and when a statement was made
- who knows/believes which statements
- which statement is related to which other statements

Reify statement:

ex:per ex:hasName "Per" .

        ||
        ||
        \/
  
_:s a rdf:Statement ;
  rdf:subject ex:per ;
  rdf:predicate ex:hasName ;
  rdf:object "Per" .

Can now describe the statement-object, e.g.:

_:s ex:createdAt "2023-02-03" ;
    ex:madeBy ex:per ;
    ex:describes ex:per .

ex:kari ex:knows _:s .

Technologies for data transformation

Queries/Views/Rules: structure, conceptual, logical, physical
Mappings: format, structure, conceptual, logical, physical
Reasoning/Semantics: conceptual, logical
Transformations can be done in many places
Pipelines can place transformations in different places
- E.g. ETL (Extract, Transform, Load) vs. ELT (Extract, Load, Transform)

Stored vs. virtual transformations

Stored transformation: Target data is computed and stored explicitly in a data format
Virtual transformation: Queries to the target are rewritten using the definition of the transformation to equivalent queries over the source data

Semantics and mappings

Different structures have different underlying assumptions
Relational DBs normally have closed world and unique name assumptions
Triplestores normally have open world and non-unique name assumptions
These affect the semantics of the data
Thus, Mapping from one data structure to another can change the semantics of the data
Also, mapping from RDBs to triplestores changes what is entailed/derivable from the data
Important to think about this when applying mappings

Open vs. closed world

Important assumption about your data:
- Is it complete or not?
- E.g. if your data describes people, does it describe ALL people?
Closed world assumption: The data is complete
- Typical for relational databases
- E.g. if we have a table Employee and 'Kari' is not listed, we can conclude that 'Kari' is NOT an employee
- If a statement S is not entailed by the data, we can conclude NOT S
Open world assumption: The data might be incomplete
- E.g. if we have a table Employee and 'Kari' is not listed, 'Kari' may or may not be an employee
- Can only conclude NOT S if NOT S is entailed by the data

Open vs. closed world: Example

Relational data

person:

 pid |  name  | works_for 
-----+--------+-----------
   1 | Mary   |         2
   2 | Peter  |         3
   3 | Carl   |          
   4 | Albert |         1
   5 | Mina   |         2
   6 | Ida    |

company:

 cid |          name          
-----+------------------------
   1 | Albert Inc
   2 | North Analytics
   3 | Computers and Programs

Define:

CREATE VIEW unemployed AS
SELECT cid, name
FROM person
WHERE works_for IS NULL;

Now unemployed will contain:

 pid | name 
-----+------
   3 | Carl
   6 | Ida

Open vs. closed world: Example

Mapped to RDF:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix exd: <http://example.org/data/> .
@prefix exo: <http://example.org/ont/> .

exd:person1 a exo:Person ;
  rdfs:label "Mary" ;
  exo:works_for exd:company2 .
  
exd:person2 a exo:Person ;
  rdfs:label "Peter" ;
  exo:works_for exd:company3 .
  
exd:person3 a exo:Person ;
  rdfs:label "Carl" .
  
exd:person4 a exo:Person ;
  rdfs:label "Albert" ;
  exo:works_for exd:company1 .
  
exd:person5 a exo:Person ;
  rdfs:label "Mina" ;
  exo:works_for exd:company2 .
  
exd:person6 a exo:Person ;
  rdfs:label "Ida" .
  
  
exd:comany1 a exo:Comany ;
  rdfs:label "Albert Inc" .
  
exd:comany2 a exo:Comany ;
  rdfs:label "North Analytics" .
  
exd:comany3 a exo:Comany ;
  rdfs:label "Computers and Programs" .

Adding e.g. the following OWL-ontology:

exo:works_for rdfs:domain exo:Employed

exo:Person SubClassOf exo:Employed or exo:Unemployed

exo:Employed and exo:Unemployed SubClassOf owl:Nothing

and querying (with reasoning) for all instances of exo:Unemployed would give no results.