Leif Harald Karlsen
===>
G
and a set of mappings
M
over a set of local sources S
G
extracting all
data from the local sources into global schema and store it
G
consists of the relations R_1
,
R_2
, …, R_n
, we could for each i
simply executeG
is an RDF-vocabulary/ontology, we can simply
execute and store result ofowl:sameAs
)
O(sn²)
mappnings (s
is size of biggest schema
and n
is number of DBs)
O(sn)
Mappings are transformations that typically consists of:
Last two typically the focus of integration, as the former two are somewhat trivial
VIEW
s over
the local schemasGlobal Schema:
Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)
Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)
Local Schema 1: Norwegian suppliers
l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)
l1.Product(supplier) -> l1.Supplier(id)
Local Schema 2: Swedish suppliers
l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)
l2.Product(category) -> l2.Category(name)
l2.Supplier(category) -> l2.Category(name)
GAV Mappings:
Product(pid, name, category) ==>
SELECT id, name, NULL
FROM l1.Product
UNION ALL
SELECT product_id, name, category
FROM l2.Product;
Supplier(sid, name, address, country) ==>
SELECT id, name, address, 'Norway'
FROM l1.Supplier
UNION ALL
SELECT supplier_id, name, address, 'Sweden'
FROM l2.Supplier;
Supply(supplier, product, price) ==>
SELECT supplier, id, price
FROM l1.product
UNION ALL
SELECT s.supplier_id, p.product_id, p.price
FROM l2.Product AS p
JOIN l2.Supplier AS s USING (category);
Example query:
SELECT p.name, s.name
FROM Product AS p
JOIN Supply AS y ON (p.pid = y.product)
JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;
==>
SELECT p.name, s.name
FROM
(
SELECT id AS pid, name, NULL AS category
FROM l1.Product
UNION ALL
SELECT product_id AS pid, name, category
FROM l2.Product
) AS p JOIN (
SELECT supplier, id AS product, price
FROM l1.product
UNION ALL
SELECT s.supplier_id AS supplier, p.product_id AS product, p.price
FROM l2.Product AS p
JOIN l2.Supplier AS s USING (category)
) AS y ON (p.pid = y.product) JOIN (
SELECT id AS sid, name, address, 'Norway' AS country
FROM l1.Supplier
UNION ALL
SELECT product_id AS pid, name, address, 'Sweden' AS country
FROM l2.Supplier
) AS s ON (y.supplier = s.sid)
WHERE y.price > 100;
GAV mappings are within an RDBMSs typically implemted with
VIEW
s (assuming all sources are mapped into the same RDBMS,
though), e.g.:
CREATE VIEW Product(pid, name, category) AS
SELECT id, name, NULL
FROM l1.Product
UNION ALL
SELECT product_id, name, category
FROM l2.Product;
CREATE VIEW Supplier(sid, name, address, country) AS
SELECT id, name, address, 'Norway'
FROM l1.Supplier
UNION ALL
SELECT supplier_id, name, address, 'Sweden'
FROM l2.Supplier;
CREATE VIEW Supply(supplier, product, price) AS
SELECT supplier, id, price
FROM l1.product
UNION ALL
SELECT s.supplier_id, p.product_id, p.price
FROM l2.Product AS p
JOIN l2.Supplier AS s USING (category);
GAV-mappings can also be written as backward-chaining (Lore-)rules:
CREATE RELATION Product(pid int, name text, category text);
Product(id, name, NULL) <- l1.Product(id, name, price, sup);
Product(id, name, cat) <- l2.Product(id, name, cat);
CREATE RELATION Supplier(sid int, name text, address text, country text);
Supplier(id, name, addr, 'Norway') <- l1.Supplier(id, name, addr);
Supplier(id, name, addr, 'Sweden') <- l2.Supplier(id, name, cat, addr);
CREATE RELATION Supply(supplier int, product int, price float);
Supply(sid, pid, price) <- l1.Product(pid, name, price, sid);
Supply(sid, pid, price) <- l2.Product(pid, pname, price, cat),
l2.Supplier(sid, sname, cat, addr);
Global Schema:
Product(pid, name, category)
Supplier(sid, name, address, country)
Supply(supplier, product, price)
Supply(supplier) -> Supplier(sid)
Supply(product) -> Product(pid)
Local Schema 1: Norwegian suppliers
l1.Product(id, product_name, price, supplier)
l1.Supplier(id, company_name, address)
l1.Product(supplier) -> l1.Supplier(id)
Local Schema 2: Swedish suppliers
l2.Product(product_id, name, price, category)
l2.Category(name)
l2.Supplier(supplier_id, name, category, address)
l2.Product(category) -> l2.Category(name)
l2.CategorySupplier(category) -> l2.Category(name)
LAV Mappings for l1:
l1.Product(id, product_name, price, supplier) ==>
SELECT p.pid, p.name, y.price, y.supplier
FROM Product AS p
JOIN Supply AS y ON (p.pid = y.product)
JOIN Supplier AS s ON (y.supplier = s.sid)
WHERE s.country = 'Norway' AND p.category IS NULL;
l1.Supplier(id, company_name, address) ==>
SELECT sid, name, address
FROM Supplier
WHERE s.country = 'Norway';
LAV Mappings for l2:
l2.Product(product_id, name, price, category) ==>
SELECT p.pid, p.name, y.price, p.category
FROM Product AS p
JOIN Supply AS y ON (p.pid = y.product)
JOIN Supplier AS s ON (y.supplier = s.sid)
WHERE s.country = 'Sweden';
l2.Category(name) ==>
SELECT DISTINCT category
FROM Product;
l2.Supplier(supplier_id, name, category, address) ==>
SELECT s.sid, s.name,
(SELECT p.category
FROM Product AS p JOIN Supply AS y USING (product)
WHERE y.supplier = s.supplier
LIMI 1) AS category,
s.address
FROM Supplier AS s
WHERE s.country = 'Sweden';
Example query:
SELECT p.name, s.name
FROM Product AS p
JOIN Supply AS y ON (p.pid = y.product)
JOIN Supplier AS s ON (y.supplier = s.sid);
WHERE y.price > 100;
==>
SELECT pname, sname
FROM l1.Product AS l1p(pid, pname, price, supplier
JOIN l1.Supplier AS l1s(sid, sname, address) ON (l1p.supplier = l1s.sid)
WHERE price > 100
UNION ALL
SELECT pname, sname
FROM l2.Product AS l2p(pid, pname, price, category
JOIN l2.Supplier AS l2s(sid, sname, category, address) USING (category)
WHERE price > 100;
global:Product owl:equivalentClass (l1:Product or l2:Product) . # GAV-like
l1:Supplier owl:equivalentClass global:Supplier and (global:country value 'Norway') . # LAV-like
l1:Supplier or l2:Supplier owl:equivalentClass global:supplies some global:Product . # GLAV-like
l1:Supplier owl:equivalentClass global:Supplier and not l2:Supplier . # GLAV-like
(Mixing Manchester and Turtle syntax)
Global vocabulary/ontology:
global:NorwegianSupplier rdfs:subClassOf global:Supplier .
global:SwedishSupplier rdfs:subClassOf global:Supplier .
(GLAV) Mapping for local schema 1:
(GLAV) Mapping for local schema 2:
target: l2:supplier{supplier_id} a global:SwedishSupplier .
source: SELECT supplier_id FROM l2.Supplier
Example query:
SELECT ?sup
WHERE {
?sup a global:Supplier .
}
' || '
' || Rewriting (using ontology) '
' \/ '
SELECT ?sup
WHERE {
{ ?sup a global:Supplier . }
UNION
{ ?sup a global:NorwegianSupplier . }
UNION
{ ?sup a global:SwedishSupplier . }
}
' || '
' || Unfolding (using mappings) '
' \/ '
SELECT 'l1:supplier' || id
FROM l1.Supplier
UNION ALL
SELECT 'l2:supplier' || supplier_id
FROM l2.Supplier
l2.product
-info into
global OWL-schema:target: global:product{p.product_id} a global:Product ;
global:name {p.name} ;
global:price {p.price} ;
global:supplier global:supplier{s.supplier_id} .
source: SELECT p.product_id, p.name, p.price, s.supplier_id
FROM l2.Product AS p
JOIN l2.Supplier AS s USING (category);
l2:product rdfs:subClassOf global:Product .
l2:product_name rdfs:subPropertyOf global:name .
l2:product_price rdfs:subPropertyOf global:price .
l2:supplier_of_category a owl:ObjectProperty ; # Needed for below axiom
owl:inverseOf l2:supplier_ref_category .
# State that global:supplier is the join of l2.product_ref_category and l2.supplier_of_category
global:supplier owl:propertyChainAxiom ( l2.product_ref_category l2.supplier_of_category ) .
-- Assume this is the database containing the l1-schema
-- (from example above) called "norwegian_db" and we are
-- logged in as user "norvegian", and that
-- and we want to include l2-schema from
-- a different database called "swedish_db" into this db.
-- Create the foreign data wrapper extension:
-- (this is already done for you in your IFI-databases)
CREATE EXTENSION postgres_fwd;
-- Then create a reference to the other database.
-- Lets assume "swedish_db" is on host "123.4.5.6"
-- and port 5432:
CREATE SERVER swedish_db_server
FOREGISN DATA WRAPPER postgres_fdw
OPTIONS (host '123.4.5.6', port '5432', dbname 'swedish_db');
-- Now need to create a user-mapping to gain access
-- to the swedish_db:
-- Lets assume we can use user with username 'swede' and password 'pwd123':
CREATE USER MAPPING FOR norwegian
SERVER swedish_db_server
OPTIONS (user 'swede', password 'pwd123');
-- Create schema to map foreign source into
CREATE SCHEMA l2; -- Could also be named something different
-- Map the foreign source into the created schema
IMPORT FOREIGN SCHEMA l2
FROM SERVER swedish_db_server
INTO l2;
-- Can now run queries over l2 as if it were
-- a normal schame in this database.
-- Can now integrate under a global schema (e.g. public) as above,
-- e.g. with VIEW-based GAV-mappings from above
CREATE VIEW Product(pid, name, category) AS
SELECT id, name, NULL
FROM l1.Product
UNION ALL
SELECT product_id, name, category
FROM l2.Product;
CREATE VIEW Supplier(sid, name, address, country) AS
SELECT id, name, address, 'Norway'
FROM l1.Supplier
UNION ALL
SELECT product_id, name, address, 'Sweden'
FROM l2.Supplier;
CREATE VIEW Supply(supplier, product, price) AS
SELECT supplier, id, price
FROM l1.product
UNION ALL
SELECT s.supplier_id, p.product_id, p.price
FROM l2.Product AS p
JOIN l2.Supplier AS s USING (category);
For each example, what approach to integration makes the most sense?
Example 1: A big company A
with a very
large DB buys a much smaller company B
with a small DB. All
of B
’s programs/services will be merged or replaced with
A
’s programs/services.
Suggestion: Data exchange, where B
’s
data is moved into A
’s DB, where A
’s DB is
used as global schema with GAV-mappings.
Example 2: A reasearch organization has lots of biologists making small datasets about different species. Different biologiests may use different terms for the same thing (due to scientific conventions), however the differences are simple. New datasets are made continously. The organization wants to be able to combine all of the data for analysis.
Suggestion: Define a global schema that matches the vocabulary used in the analysis, which works as a data mediator. Use LAV-mappings from the biologists’ (local) schemas.
Example 3: Norway’s public transport companies wants to make a single service that allows people to buy all tickets from one place. However, the companies have not been able to aggree on any global schema that fits with all companies data, however, companies that are geographically close have schemas that are eaier to align. The companies have lots of data, but relatively small metadata.
Suggestion: Use P2P-integration, with mappings between companies that are geographically close. Use GLAV-mappings to define simple relationships between terms in the different vocabularies. Use a weak/efficient mapping/ontology language (e.g. OWL 2 QL).