Central idea and problem

Want to answer queries over multiple sources (databases/files/streams/etc.)
Queries should be formulated using a (single) global vocabulary
- I.e. appear to be a single source
- Otherwise, the complexity from differences spill into all queries/systems/services
Thus, either need to
1. rewrite queries over a global vocabulary to fit the sources
2. rewrite the data in the sources to fit a global vocabulary
3. formulate query over a single source, which then somehow uses the other sources to answer the query
Why is it hard? Different sources may have:
- different servers, OSes, firewalls, etc. (solved by technology)
- different structures, formats and query languages (solved by technology)
- different vocabularies/semantics (solved by people)

Different approaches: 1. Query rewriting

Rewrite queries over a global vocabulary to fit the sources
Often called mediator based integration
Typically, the mediator is a global schema describing the target (query) vocabulary
The global schema should be unified, i.e. contain single term per thing
- E.g. single table per entity and relationship,
- or single resource per class and property
E.g. R2RML and Ontop mappings can be used for this

Different approaches: 2. Source rewriting

Rewrite the data in the sources to fit the vocabulary
Often called data exchange
Query then never needs to use sources
Often part of an ETL-pipeline
E.g. OTTR templates (with bOTTR) can be used for this

Data Exchange from Mediator

===>

Given a global schema G and a set of mappings M over a set of local sources S
Can always write general queries over G extracting all data from the local sources into global schema and store it
- E.g. if G consists of the relations R_1, R_2, …, R_n, we could for each i simply execute
```
CREATE TABLE stored.R_i(...) AS SELECT * FROM R_i;
```
- E.g. if G is an RDF-vocabulary/ontology, we can simply execute and store result of
```
CONSTRUCT { ?s ?p ?o . } WHERE { ?s ?p ?o . }
```
Thus, can always use a mediator to do data exchange

Different approaches: 3. P2P

Formulate query over a single source, which then somehow uses the other sources to answer the query
Called Peer-to-peer (P2P) integration
Sometimes used in Linked Open Data where resources are linked via properties (commonly owl:sameAs)
- See e.g. the LOD-cloud
Can also be databases or other systems linked
Note: A complete P2P-integration requires O(sn²) mappnings (s is size of biggest schema and n is number of DBs)
- Using a global schema is O(sn)

Query rewriting/Mediator based data integration: Idea

We will first focus on mediator-based data integration
The mediator is a global schema (or vocabulary)
Mappings relate the local schemas (or vocabularies) to the global schema

Query rewriting/Mediator based data integration: Mappings

Mappings are transformations that typically consists of:

Format transformation: Get all data on the same format
Structural transformation: Get all data into same structure
Logical transformation: Align vocabulary/structure
Conceptual transformation: Align meaning

Last two typically the focus of integration, as the former two are somewhat trivial

Query rewriting/Mediator based data integration: Queries

Queries formulated over global schema, and rewritten using mappings to queries over local schemas
In relational DBs three approaches to define mappings: GAV, LAV or GLAV
- Differences: Define global schema in terms of locals, or vice versa

GAV Mappings

GAV (Global-As-View) mappings define a global schema as functions of the sources
I.e. the global schema is defined in terms of the local (sources’) schemas
Global schema typically defined as a set of VIEWs over the local schemas
Query answering done by unfolding the terms in the global schema to terms from the local schemas

GAV Mappings: Example

Global Schema:

Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)

Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)

Local Schema 1: Norwegian suppliers

l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)

l1.Product(supplier) -> l1.Supplier(id)

Local Schema 2: Swedish suppliers

l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)

l2.Product(category) -> l2.Category(name)
l2.Supplier(category) -> l2.Category(name)

GAV Mappings:

Product(pid, name, category) ==> 
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

Supplier(sid, name, address, country) ==>
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT supplier_id, name, address, 'Sweden'
    FROM l2.Supplier;

Supply(supplier, product, price) ==>
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

Example query:

SELECT p.name, s.name
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
     JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;

==>

SELECT p.name, s.name
FROM 
    (
     SELECT id AS pid, name, NULL AS category
     FROM l1.Product
     UNION ALL
     SELECT product_id AS pid, name, category
     FROM l2.Product
    
    ) AS p JOIN (
     
     SELECT supplier, id AS product, price
     FROM l1.product
     UNION ALL
     SELECT s.supplier_id AS supplier, p.product_id AS product, p.price
     FROM l2.Product AS p
          JOIN l2.Supplier AS s USING (category)
    
    ) AS y ON (p.pid = y.product) JOIN (
     
     SELECT id AS sid, name, address, 'Norway' AS country
     FROM l1.Supplier
     UNION ALL
     SELECT product_id AS pid, name, address, 'Sweden' AS country
     FROM l2.Supplier
    
    ) AS s ON (y.supplier = s.sid)
WHERE y.price > 100;

GAV and Views/Rules

GAV mappings are within an RDBMSs typically implemted with VIEWs (assuming all sources are mapped into the same RDBMS, though), e.g.:

CREATE VIEW Product(pid, name, category) AS
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

CREATE VIEW Supplier(sid, name, address, country) AS
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT supplier_id, name, address, 'Sweden'
    FROM l2.Supplier;

CREATE VIEW Supply(supplier, product, price) AS
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

GAV-mappings can also be written as backward-chaining (Lore-)rules:

CREATE RELATION Product(pid int, name text, category text);
Product(id, name, NULL) <- l1.Product(id, name, price, sup);
Product(id, name, cat) <- l2.Product(id, name, cat);

CREATE RELATION Supplier(sid int, name text, address text, country text);
Supplier(id, name, addr, 'Norway') <- l1.Supplier(id, name, addr);
Supplier(id, name, addr, 'Sweden') <- l2.Supplier(id, name, cat, addr);

CREATE RELATION Supply(supplier int, product int, price float);
Supply(sid, pid, price) <- l1.Product(pid, name, price, sid);
Supply(sid, pid, price) <- l2.Product(pid, pname, price, cat),
                           l2.Supplier(sid, sname, cat, addr);

LAV Mappings

LAV (Local-As-View) mappings define the local schemas as functions of the global schema
I.e. the local (sources’) schemas are defined in terms of the global schema
Each element (table) in each local schema is typically represented as a query over the global schema
Query answering done by trying to find a set of definitions that cover the answers to the query
- Thus, needs to check query containment => NP-complete for conjunctive queries
Note: Not expressible via views or rules!

LAV Mappings: Example

Global Schema:

Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)

Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)

Local Schema 1: Norwegian suppliers

l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)

l1.Product(supplier) -> l1.Supplier(id)

Local Schema 2: Swedish suppliers

l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)

l2.Product(category) -> l2.Category(name)
l2.CategorySupplier(category) -> l2.Category(name)

LAV Mappings for l1:

l1.Product(id, product_name, price, supplier) ==>
    SELECT p.pid, p.name, y.price, y.supplier
    FROM Product AS p
         JOIN Supply AS y ON (p.pid = y.product)
         JOIN Supplier AS s ON (y.supplier = s.sid)
    WHERE s.country = 'Norway' AND p.category IS NULL;

l1.Supplier(id, company_name, address) ==>
    SELECT sid, name, address
    FROM Supplier
    WHERE s.country = 'Norway';

LAV Mappings for l2:

l2.Product(product_id, name, price, category) ==>
    SELECT p.pid, p.name, y.price, p.category
    FROM Product AS p
         JOIN Supply AS y ON (p.pid = y.product)
         JOIN Supplier AS s ON (y.supplier = s.sid)
    WHERE s.country = 'Sweden';

l2.Category(name) ==>
    SELECT DISTINCT category
    FROM Product;

l2.Supplier(supplier_id, name, category, address) ==>
    SELECT s.sid, s.name,
        (SELECT p.category
         FROM Product AS p JOIN Supply AS y USING (product)
         WHERE y.supplier = s.supplier
         LIMI 1) AS category,
        s.address
    FROM Supplier AS s
    WHERE s.country = 'Sweden';

Example query:

SELECT p.name, s.name
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
     JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;

==>

SELECT pname, sname
FROM l1.Product AS l1p(pid, pname, price, supplier
    JOIN l1.Supplier AS l1s(sid, sname, address) ON (l1p.supplier = l1s.sid)
WHERE price > 100
UNION ALL
SELECT pname, sname
FROM l2.Product AS l2p(pid, pname, price, category
     JOIN l2.Supplier AS l2s(sid, sname, category, address) USING (category)
WHERE price > 100;

GAV vs. LAV

GAV has a much simpler query answering-procedure
Thus, GAV is easier to implement, and simpler query answering wrt. complexity
LAV requires computing query-containment which is undecidable in general
- NP-complete for conjunctive queries (only inner joins)
GAV is always complete, whereas LAV might be incomplete
However, harder to add new sources to GAV, trivial in LAV
- In GAV, every new source requires redefining possibly most mappings
- In LAV, one only needs to add mappings relevant for the new source

GLAV Mappings

GLAV (Global and Local As View) is a generalization of both GAV and LAV
GLAV mappings can also define general relationships between the global and local schemas
E.g.:

SELECT p.pid, y.price
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
==>
SELECT id AS pid, price FROM l1.Product
UNION ALL
SELECT product_id AS pid, price FROM l2.Product

GLAV and Ontologies

Ontology-based integration is actually (sort of) GLAV
If we have a global vocabulary and a set of local vocabularies, axioms can be used to integrate them
Ontology axioms define general relationships between terms (classes/properties)
- However, often restrict expressiveness of ontology language to improve complexity
- OWL 2 QL is tailored for integration
E.g.:

global:Product owl:equivalentClass (l1:Product or l2:Product) .                       # GAV-like

l1:Supplier owl:equivalentClass global:Supplier and (global:country value 'Norway') . # LAV-like

l1:Supplier or l2:Supplier owl:equivalentClass global:supplies some global:Product .  # GLAV-like

l1:Supplier owl:equivalentClass global:Supplier and not l2:Supplier .                 # GLAV-like

(Mixing Manchester and Turtle syntax)

Data Integration and Pipelines

In complex data pipelines, integration can be done in many places
- In views
- In ontology mappings
- In ontologies/semantics
Can also partially integrate (e.g. only a subset of sources) in one place

Ontology-Based Data Access/Integration (OBDA/OBDI)

Sources integrated under a single ontology/vocabulary served via a SPARQL-endpoint
Mappings (virtually) map from different sources into RDF (e.g. RML/R2RML, D2RQ)
Mappings used to rewrite query
Simple ontology languages (e.g. OWL2 QL) can be used for parts of the integration
Ontology axioms also used to rewrite query

Ontop

Ontop is an OBDA-system (introduced in Mappings-lecture)
Supports:
- R2RML mappings and its own simplified mapping syntax (Ontop mappings)
- OWL2 QL ontologies that is used as part of query answering
- SPARQL 1.1 (and a subset of GeoSPARQL)
- Can also do data exchange (materialize RDF graph to file)
- Data integration via Dremio or Teiid

OBDA/Ontop Example (Ontop mappings, basic idea)

Global vocabulary/ontology:

global:NorwegianSupplier rdfs:subClassOf global:Supplier .
global:SwedishSupplier rdfs:subClassOf global:Supplier .

(GLAV) Mapping for local schema 1:

target:     l1:supplier{id} a global:NorwegianSupplier .
source:     SELECT id FROM l1.Supplier

(GLAV) Mapping for local schema 2:

target:     l2:supplier{supplier_id} a global:SwedishSupplier .
source:     SELECT supplier_id FROM l2.Supplier

Example query:

SELECT ?sup
WHERE { 
    ?sup a global:Supplier .
}

'  ||                            '
'  || Rewriting (using ontology) '
'  \/                            '

SELECT ?sup
WHERE { 
    { ?sup a global:Supplier . }
    UNION
    { ?sup a global:NorwegianSupplier . }
    UNION
    { ?sup a global:SwedishSupplier . }
}

'  ||                            '
'  || Unfolding (using mappings) '
'  \/                            '

SELECT 'l1:supplier' || id
FROM l1.Supplier
UNION ALL
SELECT 'l2:supplier' || supplier_id
FROM l2.Supplier

Ontology Based Data Integration

Ontology’s vocabulary can have both specific terms and general terms
Integration can either be done in the mappings, or in the ontology itself
- Simpler mappings (e.g. direct mappings) and more complex ontology; or
- more complex mappings and simpler ontology
Most extreme cases for mapping l2.product-info into global OWL-schema:

Complex mappings, no (axioms in) ontology:

target: global:product{p.product_id} a global:Product ;
            global:name {p.name} ;
            global:price {p.price} ;
            global:supplier global:supplier{s.supplier_id} .
source: SELECT p.product_id, p.name, p.price, s.supplier_id
        FROM l2.Product AS p
             JOIN l2.Supplier AS s USING (category);

Direct mappings, complex ontology:

l2:product rdfs:subClassOf global:Product .
l2:product_name rdfs:subPropertyOf global:name .
l2:product_price rdfs:subPropertyOf global:price .

l2:supplier_of_category a owl:ObjectProperty ; # Needed for below axiom
    owl:inverseOf l2:supplier_ref_category .
    
# State that global:supplier is the join of l2.product_ref_category and l2.supplier_of_category
global:supplier owl:propertyChainAxiom ( l2.product_ref_category l2.supplier_of_category ) .

Layers in Data Integration

In general, more layers gives greater flexibility
Traditional LAV/GAV: Mappings (+ Views)
OBDA/OBDI: Mappings + ontology
bOTTR: Mappings + templates + ontology
Full integration can be split amomg the layers

Foreign Data Wrappers

Introduced in the Mappings-lecture
Virtual mappings from different external data formats into SQL
PostgreSQL’s Foreign Data Wrappers can access other Postgres DBs, different files, other DB implementations, etc.
- see e.g. this page for a list of wrappers supported
Can integrate a collection of data sources by making foreign data wrappers into the same DB, and then hamonizing them with views (or Lore-rules)
Can thus be used both for mediator-based and P2P integration

Foreign Data Wrappers: PostgreSQL Example

-- Assume this is the database containing the l1-schema
-- (from example above) called "norwegian_db" and we are
-- logged in as user "norvegian", and that
-- and we want to include l2-schema from
-- a different database called "swedish_db" into this db.

-- Create the foreign data wrapper extension:
-- (this is already done for you in your IFI-databases)
CREATE EXTENSION postgres_fwd;

-- Then create a reference to the other database.
-- Lets assume "swedish_db" is on host "123.4.5.6"
-- and port 5432:
CREATE SERVER swedish_db_server
FOREGISN DATA WRAPPER postgres_fdw
OPTIONS (host '123.4.5.6', port '5432', dbname 'swedish_db');

-- Now need to create a user-mapping to gain access
-- to the swedish_db:
-- Lets assume we can use user with username 'swede' and password 'pwd123':
CREATE USER MAPPING FOR norwegian
SERVER swedish_db_server
OPTIONS (user 'swede', password 'pwd123');

-- Create schema to map foreign source into
CREATE SCHEMA l2; -- Could also be named something different

-- Map the foreign source into the created schema
IMPORT FOREIGN SCHEMA l2
FROM SERVER swedish_db_server
INTO l2;

-- Can now run queries over l2 as if it were
-- a normal schame in this database.
-- Can now integrate under a global schema (e.g. public) as above,
-- e.g. with VIEW-based GAV-mappings from above
CREATE VIEW Product(pid, name, category) AS
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

CREATE VIEW Supplier(sid, name, address, country) AS
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT product_id, name, address, 'Sweden'
    FROM l2.Supplier;

CREATE VIEW Supply(supplier, product, price) AS
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

Examples of approaches

For each example, what approach to integration makes the most sense?

Example 1: A big company A with a very large DB buys a much smaller company B with a small DB. All of B’s programs/services will be merged or replaced with A’s programs/services.

Suggestion: Data exchange, where B’s data is moved into A’s DB, where A’s DB is used as global schema with GAV-mappings.

Example 2: A reasearch organization has lots of biologists making small datasets about different species. Different biologiests may use different terms for the same thing (due to scientific conventions), however the differences are simple. New datasets are made continously. The organization wants to be able to combine all of the data for analysis.

Suggestion: Define a global schema that matches the vocabulary used in the analysis, which works as a data mediator. Use LAV-mappings from the biologists’ (local) schemas.

Example 3: Norway’s public transport companies wants to make a single service that allows people to buy all tickets from one place. However, the companies have not been able to aggree on any global schema that fits with all companies data, however, companies that are geographically close have schemas that are eaier to align. The companies have lots of data, but relatively small metadata.

Suggestion: Use P2P-integration, with mappings between companies that are geographically close. Use GLAV-mappings to define simple relationships between terms in the different vocabularies. Use a weak/efficient mapping/ontology language (e.g. OWL 2 QL).

IN5800 – Integration

Definition: Data integration

Reasons for data integration

Central idea and problem

Different approaches: 1. Query rewriting

Different approaches: 2. Source rewriting

Data Exchange from Mediator

Different approaches: 3. P2P

Query rewriting/Mediator based data integration: Idea

Query rewriting/Mediator based data integration: Mappings

Query rewriting/Mediator based data integration: Queries

GAV Mappings

GAV Mappings: Example

GAV and Views/Rules

LAV Mappings

LAV Mappings: Example

GAV vs. LAV

GLAV Mappings

GLAV and Ontologies

Data Integration and Pipelines

Ontology-Based Data Access/Integration (OBDA/OBDI)

Ontop

OBDA/Ontop Example (Ontop mappings, basic idea)

Ontology Based Data Integration

Layers in Data Integration

Foreign Data Wrappers

Foreign Data Wrappers: PostgreSQL Example

Examples of approaches