IN5800 – Integration

Leif Harald Karlsen

Definition: Data integration

Reasons for data integration

Central idea and problem

Different approaches: 1. Query rewriting

Different approaches: 2. Source rewriting

Data Exchange from Mediator

                     

===>

Different approaches: 3. P2P

Query rewriting/Mediator based data integration: Idea

Query rewriting/Mediator based data integration: Mappings

Mappings are transformations that typically consists of:

Last two typically the focus of integration, as the former two are somewhat trivial

Query rewriting/Mediator based data integration: Queries

GAV Mappings

  • GAV (Global-As-View) mappings define a global schema as functions of the sources
  • I.e. the global schema is defined in terms of the local (sources’) schemas
  • Global schema typically defined as a set of VIEWs over the local schemas
  • Query answering done by unfolding the terms in the global schema to terms from the local schemas

GAV Mappings: Example

Global Schema:

Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)

Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)

Local Schema 1: Norwegian suppliers

l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)

l1.Product(supplier) -> l1.Supplier(id)

Local Schema 2: Swedish suppliers

l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)

l2.Product(category) -> l2.Category(name)
l2.Supplier(category) -> l2.Category(name)

GAV Mappings:

Product(pid, name, category) ==> 
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

Supplier(sid, name, address, country) ==>
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT supplier_id, name, address, 'Sweden'
    FROM l2.Supplier;

Supply(supplier, product, price) ==>
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

Example query:

SELECT p.name, s.name
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
     JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;

==>

SELECT p.name, s.name
FROM 
    (
     SELECT id AS pid, name, NULL AS category
     FROM l1.Product
     UNION ALL
     SELECT product_id AS pid, name, category
     FROM l2.Product
    
    ) AS p JOIN (
     
     SELECT supplier, id AS product, price
     FROM l1.product
     UNION ALL
     SELECT s.supplier_id AS supplier, p.product_id AS product, p.price
     FROM l2.Product AS p
          JOIN l2.Supplier AS s USING (category)
    
    ) AS y ON (p.pid = y.product) JOIN (
     
     SELECT id AS sid, name, address, 'Norway' AS country
     FROM l1.Supplier
     UNION ALL
     SELECT product_id AS pid, name, address, 'Sweden' AS country
     FROM l2.Supplier
    
    ) AS s ON (y.supplier = s.sid)
WHERE y.price > 100;

GAV and Views/Rules

GAV mappings are within an RDBMSs typically implemted with VIEWs (assuming all sources are mapped into the same RDBMS, though), e.g.:

CREATE VIEW Product(pid, name, category) AS
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

CREATE VIEW Supplier(sid, name, address, country) AS
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT supplier_id, name, address, 'Sweden'
    FROM l2.Supplier;

CREATE VIEW Supply(supplier, product, price) AS
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

GAV-mappings can also be written as backward-chaining (Lore-)rules:

CREATE RELATION Product(pid int, name text, category text);
Product(id, name, NULL) <- l1.Product(id, name, price, sup);
Product(id, name, cat) <- l2.Product(id, name, cat);

CREATE RELATION Supplier(sid int, name text, address text, country text);
Supplier(id, name, addr, 'Norway') <- l1.Supplier(id, name, addr);
Supplier(id, name, addr, 'Sweden') <- l2.Supplier(id, name, cat, addr);

CREATE RELATION Supply(supplier int, product int, price float);
Supply(sid, pid, price) <- l1.Product(pid, name, price, sid);
Supply(sid, pid, price) <- l2.Product(pid, pname, price, cat),
                           l2.Supplier(sid, sname, cat, addr);

LAV Mappings

  • LAV (Local-As-View) mappings define the local schemas as functions of the global schema
  • I.e. the local (sources’) schemas are defined in terms of the global schema
  • Each element (table) in each local schema is typically represented as a query over the global schema
  • Query answering done by trying to find a set of definitions that cover the answers to the query
    • Thus, needs to check query containment => NP-complete for conjunctive queries
  • Note: Not expressible via views or rules!

LAV Mappings: Example

Global Schema:

Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)

Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)

Local Schema 1: Norwegian suppliers

l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)

l1.Product(supplier) -> l1.Supplier(id)

Local Schema 2: Swedish suppliers

l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)

l2.Product(category) -> l2.Category(name)
l2.CategorySupplier(category) -> l2.Category(name)

LAV Mappings for l1:

l1.Product(id, product_name, price, supplier) ==>
    SELECT p.pid, p.name, y.price, y.supplier
    FROM Product AS p
         JOIN Supply AS y ON (p.pid = y.product)
         JOIN Supplier AS s ON (y.supplier = s.sid)
    WHERE s.country = 'Norway' AND p.category IS NULL;

l1.Supplier(id, company_name, address) ==>
    SELECT sid, name, address
    FROM Supplier
    WHERE s.country = 'Norway';

LAV Mappings for l2:

l2.Product(product_id, name, price, category) ==>
    SELECT p.pid, p.name, y.price, p.category
    FROM Product AS p
         JOIN Supply AS y ON (p.pid = y.product)
         JOIN Supplier AS s ON (y.supplier = s.sid)
    WHERE s.country = 'Sweden';

l2.Category(name) ==>
    SELECT DISTINCT category
    FROM Product;

l2.Supplier(supplier_id, name, category, address) ==>
    SELECT s.sid, s.name,
        (SELECT p.category
         FROM Product AS p JOIN Supply AS y USING (product)
         WHERE y.supplier = s.supplier
         LIMI 1) AS category,
        s.address
    FROM Supplier AS s
    WHERE s.country = 'Sweden';

Example query:

SELECT p.name, s.name
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
     JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;

==>

SELECT pname, sname
FROM l1.Product AS l1p(pid, pname, price, supplier
    JOIN l1.Supplier AS l1s(sid, sname, address) ON (l1p.supplier = l1s.sid)
WHERE price > 100
UNION ALL
SELECT pname, sname
FROM l2.Product AS l2p(pid, pname, price, category
     JOIN l2.Supplier AS l2s(sid, sname, category, address) USING (category)
WHERE price > 100;

GAV vs. LAV

GLAV Mappings

SELECT p.pid, y.price
FROM Product AS p
     JOIN Supply AS y ON (p.pid = y.product)
==>
SELECT id AS pid, price FROM l1.Product
UNION ALL
SELECT product_id AS pid, price FROM l2.Product

GLAV and Ontologies

global:Product owl:equivalentClass (l1:Product or l2:Product) .                       # GAV-like

l1:Supplier owl:equivalentClass global:Supplier and (global:country value 'Norway') . # LAV-like

l1:Supplier or l2:Supplier owl:equivalentClass global:supplies some global:Product .  # GLAV-like

l1:Supplier owl:equivalentClass global:Supplier and not l2:Supplier .                 # GLAV-like

(Mixing Manchester and Turtle syntax)

Data Integration and Pipelines

Ontology-Based Data Access/Integration (OBDA/OBDI)

Ontop

OBDA/Ontop Example (Ontop mappings, basic idea)

Global vocabulary/ontology:

global:NorwegianSupplier rdfs:subClassOf global:Supplier .
global:SwedishSupplier rdfs:subClassOf global:Supplier .

(GLAV) Mapping for local schema 1:

target:     l1:supplier{id} a global:NorwegianSupplier .
source:     SELECT id FROM l1.Supplier

(GLAV) Mapping for local schema 2:

target:     l2:supplier{supplier_id} a global:SwedishSupplier .
source:     SELECT supplier_id FROM l2.Supplier

Example query:

SELECT ?sup
WHERE { 
    ?sup a global:Supplier .
}

'  ||                            '
'  || Rewriting (using ontology) '
'  \/                            '

SELECT ?sup
WHERE { 
    { ?sup a global:Supplier . }
    UNION
    { ?sup a global:NorwegianSupplier . }
    UNION
    { ?sup a global:SwedishSupplier . }
}

'  ||                            '
'  || Unfolding (using mappings) '
'  \/                            '

SELECT 'l1:supplier' || id
FROM l1.Supplier
UNION ALL
SELECT 'l2:supplier' || supplier_id
FROM l2.Supplier

Ontology Based Data Integration

  1. Complex mappings, no (axioms in) ontology:
target: global:product{p.product_id} a global:Product ;
            global:name {p.name} ;
            global:price {p.price} ;
            global:supplier global:supplier{s.supplier_id} .
source: SELECT p.product_id, p.name, p.price, s.supplier_id
        FROM l2.Product AS p
             JOIN l2.Supplier AS s USING (category);
  1. Direct mappings, complex ontology:
l2:product rdfs:subClassOf global:Product .
l2:product_name rdfs:subPropertyOf global:name .
l2:product_price rdfs:subPropertyOf global:price .

l2:supplier_of_category a owl:ObjectProperty ; # Needed for below axiom
    owl:inverseOf l2:supplier_ref_category .
    
# State that global:supplier is the join of l2.product_ref_category and l2.supplier_of_category
global:supplier owl:propertyChainAxiom ( l2.product_ref_category l2.supplier_of_category ) .

Layers in Data Integration

Foreign Data Wrappers

Foreign Data Wrappers: PostgreSQL Example

-- Assume this is the database containing the l1-schema
-- (from example above) called "norwegian_db" and we are
-- logged in as user "norvegian", and that
-- and we want to include l2-schema from
-- a different database called "swedish_db" into this db.

-- Create the foreign data wrapper extension:
-- (this is already done for you in your IFI-databases)
CREATE EXTENSION postgres_fwd;

-- Then create a reference to the other database.
-- Lets assume "swedish_db" is on host "123.4.5.6"
-- and port 5432:
CREATE SERVER swedish_db_server
FOREGISN DATA WRAPPER postgres_fdw
OPTIONS (host '123.4.5.6', port '5432', dbname 'swedish_db');

-- Now need to create a user-mapping to gain access
-- to the swedish_db:
-- Lets assume we can use user with username 'swede' and password 'pwd123':
CREATE USER MAPPING FOR norwegian
SERVER swedish_db_server
OPTIONS (user 'swede', password 'pwd123');

-- Create schema to map foreign source into
CREATE SCHEMA l2; -- Could also be named something different

-- Map the foreign source into the created schema
IMPORT FOREIGN SCHEMA l2
FROM SERVER swedish_db_server
INTO l2;

-- Can now run queries over l2 as if it were
-- a normal schame in this database.
-- Can now integrate under a global schema (e.g. public) as above,
-- e.g. with VIEW-based GAV-mappings from above
CREATE VIEW Product(pid, name, category) AS
    SELECT id, name, NULL
    FROM l1.Product
    UNION ALL
    SELECT product_id, name, category
    FROM l2.Product;

CREATE VIEW Supplier(sid, name, address, country) AS
    SELECT id, name, address, 'Norway'
    FROM l1.Supplier
    UNION ALL
    SELECT product_id, name, address, 'Sweden'
    FROM l2.Supplier;

CREATE VIEW Supply(supplier, product, price) AS
    SELECT supplier, id, price
    FROM l1.product
    UNION ALL
    SELECT s.supplier_id, p.product_id, p.price
    FROM l2.Product AS p
         JOIN l2.Supplier AS s USING (category);

Examples of approaches

For each example, what approach to integration makes the most sense?

Example 1: A big company A with a very large DB buys a much smaller company B with a small DB. All of B’s programs/services will be merged or replaced with A’s programs/services.

Suggestion: Data exchange, where B’s data is moved into A’s DB, where A’s DB is used as global schema with GAV-mappings.

Example 2: A reasearch organization has lots of biologists making small datasets about different species. Different biologiests may use different terms for the same thing (due to scientific conventions), however the differences are simple. New datasets are made continously. The organization wants to be able to combine all of the data for analysis.

Suggestion: Define a global schema that matches the vocabulary used in the analysis, which works as a data mediator. Use LAV-mappings from the biologists’ (local) schemas.

Example 3: Norway’s public transport companies wants to make a single service that allows people to buy all tickets from one place. However, the companies have not been able to aggree on any global schema that fits with all companies data, however, companies that are geographically close have schemas that are eaier to align. The companies have lots of data, but relatively small metadata.

Suggestion: Use P2P-integration, with mappings between companies that are geographically close. Use GLAV-mappings to define simple relationships between terms in the different vocabularies. Use a weak/efficient mapping/ontology language (e.g. OWL 2 QL).