IN5800 – Mandatory Assignment
The city of Urb has gathered quite a lot of data about its citizens and organizations. However, the city has two city parts, Suburbia and Urbania. Urbania is the city’s center, with tall buildings, busy streets and lots of people. Suburbia is the suburban part of the city, with farms, parks and villas. The two parts are quite different, so they have rather different data needs, and therefore also different structure and content in their databases.
For city planning and other forms of city management, the city council of Urb now realizes that they need to integrate their data into a common database. They have managed to get most of their data into the same PostgreSQL database. Suburbia’s data is in the schema named suburbia, and Urbania’s data is in the schema named urbania. However, as noted above, the structure and content of these two schemas are too different to be merged directly. Thus, a more complex data integration job is needed, and you are hired to do this task.
The City of Urb’s Data
Suburbia
Suburbia has a database with the following tables (described via the SQL-commands creating the schema):
CREATE TABLE suburbia.home(
    hid int PRIMARY KEY,
    home_address text NOT NULL,
    zip_code text NOT NULL
);
CREATE TABLE suburbia.citizen(
    cid int PRIMARY KEY,
    urb_id text UNIQUE,
    name text NOT NULL,
    home int REFERENCES suburbia.home(hid),
    occupation text -- farmer, fisher, student, pupil
);
So suburbia.home contains addresses and zip codes for all houses in the city part, and suburbia.citizen contains each person’s Urb ID (an identification string given uniquely to each citizen of the entire city of Urb), each person’s name, a reference to the house that is each person’s home, and a string containing each person’s occupation (only farmers, fishers, students and pupils live in Suburbia). If the occupation column is empty (i.e. NULL), the person is unemployed.
Urbania
Urbania has a somewhat more complex schema, with the following tables (again, described via the SQL-commands creating the schema):
CREATE TABLE urbania.building(
    bid int PRIMARY KEY,
    address text NOT NULL,
    zipcode text NOT NULL
);
CREATE TABLE urbania.organization(
    oid int PRIMARY KEY,
    name text NOT NULL,
    offices_in int REFERENCES urbania.building(bid)
);
CREATE TABLE urbania.citizen(
    cid int PRIMARY KEY,
    urb_id text UNIQUE,
    name text NOT NULL,
    apartment_number int NOT NULL,
    apartment_building int REFERENCES urbania.building(bid),
    works_at int REFERENCES urbania.organization(oid)
);
So urbania.building contains addresses and zip codes for all buildings in the city part; urbania.organization contains the names of all organizations (companies, the school, the university, etc.) and a reference to the building that the organization has offices in; and finally urbania.citizen contains each person’s Urb ID (an identification string given uniquely to each citizen of the entire city of Urb), each person’s name, a reference to the apartment building that each person lives in together with the apartment number, and a reference to the organization that the person works at. If works_at is empty (i.e. NULL), the person is unemployed. Note that in this schema, a student works_at the university, and a pupil works_at the school.
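To get a feel for how the columns fit together, here is a sketch of a query over the schema above (not part of the assignment) that lists each Urbania citizen with their home address and employer; the LEFT JOIN keeps the unemployed, whose works_at is NULL:

```sql
-- Each Urbania citizen with home address and (possibly missing) employer.
SELECT c.name,
       b.address,
       c.apartment_number,
       o.name AS employer        -- NULL for the unemployed
FROM urbania.citizen c
JOIN urbania.building b ON c.apartment_building = b.bid
LEFT JOIN urbania.organization o ON c.works_at = o.oid;
```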
Setup
Make a new private Git-repo at UiO’s Github, and check it out to either your local computer (note: you need a working PostgreSQL server and Java 11 on your local machine to solve the oblig) or an IFI Linux-machine (you can also access the IFI-machines remotely, see this page (Norwegian) or this page (English) for remote access).
Download this Makefile, and store it in the repo you just made. If you are logged in to a remote IFI-machine, you can download the file by executing
wget http://leifhka.org/in5800/oblig/Makefile
Open the Makefile in your favorite text editor and update the variables db_name, db_user, db_password and db_host with the name of your PostgreSQL database, and the username, password and host for this database. Use
git add Makefile
to add the file to the repo, and finally
git commit -m "Init repo."
git push origin master
to commit and push the changes (give your UiO username and password when prompted). Now make a new branch called feature/oblig and switch to it by executing:
git checkout -b feature/oblig
Add all your changes for the following exercises to this branch.
Now, execute
make setup
This will create a folder structure you should use for the following exercises, download the programs you need, and download the source data from the city council of Urb. Note: Never commit anything from the data or program folders to your Git repo, as these are downloaded or made by the Makefile.
Note that you can always execute
make clean_all
make setup
to get a completely fresh start if you need it. This will not delete any files that you make, only files downloaded as part of the setup, and clean your database and derived data.
The Makefile contains all commands needed for the mandatory exercise; however, each command uses files that you need to make, described in each exercise below. Apart from the variables for the database connection you edited above, you do not need to add anything to the Makefile. However, it is wise to study it and get familiar with what each command does, as you will need to make a similar Makefile for your project work later in the semester.
You are now set to solve all of Urb’s problems!
Exercise 1: Direct Mappings
Make a file mappings/direct_mappings.lore
that contains
the Lore-commands for adding direct mappings for all the tables in the
suburbia
and urbania
schemas. You should use
the base-URI http://leifhka.org/in5800/urb/raw_data/
, and
prefix raws
for the mappings in
suburbia
-schema and rawu
for the mappings in
urbania
-schema. Make your mappings forward-chaining. (Note:
You do not need to map the sequence home_hid_seq
, this is
only used by PostgreSQL to generate new keys for
home(hid)
.)
Now apply your mappings with:
make direct_mappings
You can test that your mappings work by executing:
make query_test
This will count the number of triples in the triplestore and should return 8850. (Note: You can change the query in queries/test.sparql to have the above command execute that query instead.)
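For reference, a query that counts all triples, like the one presumably shipped in queries/test.sparql, is just:

```sparql
SELECT (COUNT(*) AS ?count)
WHERE { ?s ?p ?o }
```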
If you make mistakes, you can clean your database and reapply your mappings by executing
make clean_dbs
make direct_mappings
Exercise 2: Integration via OWL Semantics
The city council is very happy with having all of the data in the same triplestore. However, they realize that despite the data now having the same format (i.e. RDF), it is still difficult to make use of it, as different terms and structures are used to represent the same information.
Thus, you realize that adding an ontology harmonizing the data under a common vocabulary would be beneficial. So make an ontology using the Lore-OWL fragment of OWL (click the link for info on this fragment) in the file semantics/urb_ontology.ttl with the following prefixes:
@prefix raws: <http://leifhka.org/in5800/urb/raw_data/suburbia/> .
@prefix rawu: <http://leifhka.org/in5800/urb/raw_data/urbania/> .
@prefix urb: <http://leifhka.org/in5800/urb/> .
so that the following holds:
- urb:Person is the class of all Urb citizens
- urb:Building is the class of all buildings in Urb (both homes and other buildings)
- urb:Organization is the class of all organizations in Urb, i.e. the places people work
- urb:name relates people and organizations to their names
- urb:locatedAt relates people to the building they live in, and organizations (from Urbania) to the building they have offices in (Note: we will fix work locations for Suburbian citizens in the next exercise)
- urb:contains should be the inverse of urb:locatedAt
- urb:address should relate buildings (individuals of urb:Building) to their addresses
- urb:zipcode should relate buildings (individuals of urb:Building) to their zip codes
- urb:worksAt relates the people of Urbania to the organizations they work at (note that students and pupils are said to “work at” their universities and schools) (Note: we will fix where Suburbian citizens work in the next exercise)
- urb:hasEmployee should be the inverse of urb:worksAt
- Finally, urb:urbId should relate every person (individual of urb:Person) to their Urb ID.
Note: You only need to specify the domain of a property if it corresponds to a single class (e.g. the range of urb:worksAt should be urb:Organization, but the domain of urb:name is the union of urb:Person and urb:Organization, which we cannot express directly in Lore-OWL). Note that you need to give all classes the type owl:Class, all object properties the type owl:ObjectProperty, and all datatype properties the type owl:DatatypeProperty and a range that is an xsd-datatype (e.g. xsd:string, xsd:int, etc.).
You can add your semantics to the triplestore with
make urb_ontology
If you make mistakes, as always, you can clean your database and reapply all of the mappings and the ontology by executing
make clean_dbs
make direct_mappings
make urb_ontology
Exercise 3: bOTTR Mappings
The database from Suburbia does not contain any data about buildings other than homes, nor any data about organizations. However, the local administration of Suburbia has kept a list of important organizations and the addresses of the buildings they have offices in, in a separate CSV file located at data/organizations.csv. This data will be important for making the information about who works where explicit.
Thus, you must make a file mappings/organizations-map.ttl containing a bOTTR mapping, and a corresponding template in a file templates/organizations.stottr in stOTTR format that is used in the mapping. The mapping should map all the data in the file data/organizations.csv into RDF using the vocabulary from the previous exercise, such that e.g. the following lines in the CSV-file:
Abc, Street road 1, 1234
Def, Road street 2, 5678
would be mapped to the following triples:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix urb: <http://leifhka.org/in5800/urb/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
urb:Abc rdf:type urb:Organization ;
urb:name "Abc" ;
urb:locatedAt [ rdf:type urb:Building ;
urb:address "Street road 1" ;
urb:zipcode "1234" ] .
urb:Def rdf:type urb:Organization ;
urb:name "Def" ;
urb:locatedAt [ rdf:type urb:Building ;
urb:address "Road street 2" ;
urb:zipcode "5678" ] .
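As a sketch of what the stOTTR template could look like (the template IRI, the parameter list, and the use of a blank node for the building are my assumptions, and the bOTTR mapping file itself, which reads the CSV and instantiates the template, is omitted; see the bOTTR documentation for its exact form):

```stottr
@prefix ottr: <http://ns.ottr.xyz/0.4/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix urb:  <http://leifhka.org/in5800/urb/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/templates/> .  # hypothetical template namespace

ex:Organization [ ottr:IRI ?org, xsd:string ?name, xsd:string ?address, xsd:string ?zip ] :: {
    ottr:Triple(?org, rdf:type, urb:Organization) ,
    ottr:Triple(?org, urb:name, ?name) ,
    ottr:Triple(?org, urb:locatedAt, _:building) ,
    ottr:Triple(_:building, rdf:type, urb:Building) ,
    ottr:Triple(_:building, urb:address, ?address) ,
    ottr:Triple(_:building, urb:zipcode, ?zip)
} .
```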
You can apply the bOTTR-mapping and load the resulting triples into the triplestore with make organizations.
Again, if you make mistakes, you can clean your database and reapply all of the mappings by executing
make clean_dbs
make direct_mappings urb_ontology organizations
Exercise 4: Integration via Rules
The city council is impressed with your work, and now sees that much of their data is integrated. However, one thing still remains: the citizens of Suburbia only have occupations in the form of a string (farmer, fisher, pupil or student).
After talking to the people in the council for a while, you get the following information:
- Farmers are self-employed, and thus should also be organizations and then urb:worksAt-related to themselves (this also means that the place they work is their own home, i.e. the farm they live on)
- Fishers work at the Docks
- Students go to the university in Urbania (i.e. the organization with name “University”)
- Pupils go to school at the Suburbian School
Note that the information about the Docks, the Suburbian School, etc. was contained in the file data/organizations.csv that you mapped into the triplestore as part of Exercise 3.
What remains is for you to add semantics stating how the occupations relate to the different organizations people work at. You quickly realize that this is not possible with Lore-OWL. Thus, you should make Lore-rules to do this. Put the rules in semantics/urb_rules.lore. Apply your reasoning with make urb_rules.
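Whatever the concrete Lore rule syntax, the rules should derive triples of roughly the following shape, shown here as plain Turtle for a hypothetical farmer ex:Ann and fisher ex:Bob (the ex: individuals are purely illustrative; urb:Docks assumes the IRI pattern from the Exercise 3 mapping example):

```turtle
@prefix urb: <http://leifhka.org/in5800/urb/> .
@prefix ex:  <http://example.org/> .   # hypothetical individuals

# A farmer is their own organization and works for themselves:
ex:Ann a urb:Organization ;
       urb:worksAt ex:Ann .

# A fisher works at the Docks (loaded from organizations.csv in Exercise 3):
ex:Bob urb:worksAt urb:Docks .
```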
If you make mistakes, as always, you can clean your database and reapply all of the mappings, the semantics and rules by executing
make clean_dbs
make triplestore
Exercise 5: People and Organizations via SPARQL
Now all of the data is integrated and fully usable. The city council of Urb now wants you to create lists of people and organizations within the city.
So, write a SPARQL-query in the file queries/people.sparql that for each person outputs the name, the Urb ID, the address and zip code of the building they live in, the apartment number (if they have one), and the IRI of the organization they work for (if not unemployed). Order the result by the name of the person.
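A skeleton for such a query might look like the following, where OPTIONAL handles missing apartment numbers and employers. Note that the shared ontology from Exercise 2 does not define an apartment-number property, so urb:apartmentNumber below is a placeholder; bind it via whatever property your direct mappings produced for urbania.citizen instead:

```sparql
PREFIX urb: <http://leifhka.org/in5800/urb/>

SELECT ?name ?urbId ?address ?zipcode ?aptNumber ?org WHERE {
  ?p a urb:Person ;
     urb:name ?name ;
     urb:urbId ?urbId ;
     urb:locatedAt ?building .
  ?building urb:address ?address ;
            urb:zipcode ?zipcode .
  # Placeholder property -- replace with the one from your direct mappings:
  OPTIONAL { ?p urb:apartmentNumber ?aptNumber }
  OPTIONAL { ?p urb:worksAt ?org }
}
ORDER BY ?name
```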
Also write a query in the file queries/organizations.sparql that for each organization in the city outputs the IRI, the name, and the address and zip code of their offices, as well as the number of people employed by the organization. Order the result by the number of employees.
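The employee count is a GROUP BY aggregation; a sketch (assuming your ontology gives every organization urb:name and urb:locatedAt triples, and that the urb:hasEmployee inverse has been materialized — COUNT over the OPTIONAL pattern yields 0 for organizations with no employees):

```sparql
PREFIX urb: <http://leifhka.org/in5800/urb/>

SELECT ?org ?name ?address ?zipcode (COUNT(?emp) AS ?employees) WHERE {
  ?org a urb:Organization ;
       urb:name ?name ;
       urb:locatedAt ?building .
  ?building urb:address ?address ;
            urb:zipcode ?zipcode .
  OPTIONAL { ?org urb:hasEmployee ?emp }
}
GROUP BY ?org ?name ?address ?zipcode
ORDER BY DESC(?employees)
```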
Execute your queries with
make query_people
and
make query_organizations
Exercise 6: Sectors
The city council thinks your query for finding information about organizations is useful, but see that it is difficult to get an overview for analytical purposes. They therefore want you to introduce sectors (as classes) into your data that groups people into the sector they work in:
- urb:UnemployedSector is the class of all people that do not work.
- urb:PublicSector is the class of people working in public organizations, i.e. the Hospital, Library, Docks, Theater and University.
- urb:SelfEmployedSector is the class of all people that work for themselves (i.e. ?p urb:worksAt ?p).
- urb:PrivateSector is the class of all other working people.
Note that you need to define these IRIs as owl:Classes in a separate ontology semantics/sectors.ttl, and then make Lore rules for the actual semantics in semantics/sectors.lore. Make the definitions virtual, i.e. use backward rules.
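The class declarations in semantics/sectors.ttl are plain Turtle, for example:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix urb: <http://leifhka.org/in5800/urb/> .

urb:UnemployedSector   a owl:Class .
urb:PublicSector       a owl:Class .
urb:SelfEmployedSector a owl:Class .
urb:PrivateSector      a owl:Class .
```

The actual membership conditions then go into the backward Lore rules in semantics/sectors.lore.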
You can load your sector-semantics with
make sectors
Hint: It is possible to use NOT IN in the WHERE-part of Lore-rules, e.g.
R(x, y) <- P(x, y) : y NOT IN (SELECT z FROM T)
Finally, make a SPARQL query in the file queries/unemployed.sparql that finds the number of unemployed people living in each zip code. Order the result by the number of unemployed people.
You can execute this query with
make query_unemployed
If you make mistakes, as always, you can clean your database and reapply all of the mappings, the semantics and rules by executing
make clean_dbs
make pipeline
Exercise 7: Documentation
Before you finish everything and hand over all of your code to the city council of Urb, you realize that the Makefile in the project is completely undocumented.
So make a README.md file (using Markdown syntax) and write a small guide on how to use the Makefile for the city council of Urb. It should contain instructions on how to do the setup (like you did in the beginning), build the triplestore with all the data and semantics, execute your two queries, and clean both the files and the databases. Explain the effect of each command in non-technical language, e.g. “This command loads all data about people into the triplestore…”.
Delivery
Before the deadline, add all your changes to the branch feature/oblig and make a pull request into the master-branch (however, do not merge the pull request!). Finally, add me (username leifhka) to your repo and upload a file to Devilry containing only the URL of your repo. I will correct it and make comments inline in your code on the pull request.
If you have any questions, please feel free to ask on any of the
available channels for the course, or email me at
leifhka [at] ifi.uio.no
.