IN5800 – Mandatory Assignment

The city of Urb has gathered quite a lot of data about its citizens and organizations. However, the city has two city parts, Suburbia and Urbania. Urbania is the city’s center with tall buildings, busy streets and lots of people. Suburbia is the suburban part of the city, with farms, parks and villa houses. The two parts are quite different, leading to them having rather different data needs and therefore also different structure and content in their databases.

For city planning and other forms of city management, the city council of Urb now realizes that they need to integrate their data into a common database. They have managed to get most of their data into the same PostgreSQL database. Suburbia’s data is in the schema with name suburbia, and Urbania’s data is in the schema with name urbania. However, as noted above, the structure and content of these two schemas are too different to be merged directly. Thus, a more complex data integration job is needed and you are hired to do this task.

The City of Urb’s Data

Suburbia

Suburbia has a database with the following tables (described via the SQL-commands creating the schema):

CREATE TABLE suburbia.home(
    hid int PRIMARY KEY,
    home_address text NOT NULL,
    zip_code text NOT NULL
);

CREATE TABLE suburbia.citizen(
    cid int PRIMARY KEY,
    urb_id text UNIQUE,
    name text NOT NULL,
    home int REFERENCES suburbia.home(hid),
    occupation text -- farmer, fisher, student, pupil
);

So suburbia.home contains addresses and zip codes for all houses in the city part, and suburbia.citizen contains each person’s Urb ID (an identification string given uniquely to each citizen of the entire city of Urb) each person’s name, a reference to the house that is each person’s home, a string containing the each person’s occupation (only farmers, fishers, students and pupils live in Suburbia). If the occupation column is empty (i.e. NULL), this means that the person is unemployed.

Urbania

Urbania has a somewhat more complex schema, with the following tables (again, described via the SQL-commands creating the schema):

CREATE TABLE urbania.building(
    bid int PRIMARY KEY,
    address text NOT NULL,
    zipcode text NOT NULL
);

CREATE TABLE urbania.organization(
    oid int PRIMARY KEY,
    name text NOT NULL,
    offices_in int REFERENCES urbania.building(bid)
);

CREATE TABLE urbania.citizen(
    cid int PRIMARY KEY,
    urb_id text UNIQUE,
    name text NOT NULL,
    apartment_number int NOT NULL,
    apartment_building int REFERENCES urbania.building(bid),
    works_at int REFERENCES urbania.organization(oid)
);

So urbania.building contains addresses and zip codes for all buildings in the city part; urbania.organization contains the name of all organizations (companies, school, university, etc.) and a reference to the building that the organization has offices in; and finally urbania.citizen contains each person’s Urb ID (an identification string given uniquely to each citizen of the entire city of Urb) each person’s name, a reference to the apartment building that each person lives in together with the apartment number, and a reference to the organization that the person is working at. If works_at is empty (i.e. NULL), then the person is unemployed. Note that in this schema, a student works_at the university, and a pupil works_at the school.

Setup

Make a new private Git-repo at UiO’s Github, and check it out to either your local computer (note: you need a working PostgreSQL server running and Java 11 on you local machine to solve the oblig) or at an IFI Linux-machine (you can also access the IFI-machines remotely, see this page (Norwegian) or this page (English) for remote access).

Download this Makefile, and store it in the repo you just made. If you are logged inn to a remote IFI-machine, you can download the file by executing

wget http://leifhka.org/in5800/oblig/Makefile

Open the Makefile in your favorite text editor and update the variables db_name, db_user, db_password and db_host with the name of your PostgreSQL database, and the username, password and host for this database. Use

git add Makefile

to add the file to the repo, and finally

git commit -m "Init repo."
git push origin master

to commit and push the changes (give your UiO-username and password when prompted). Now make a new branch called feature/oblig and switch to it by executing:

git checkout -b feature/oblig

Add all you changes done on the following exercises to this branch.

Now, execute

make setup

This will create a folder structure you should use for the following exercises, download the programs you need, as well as download the source data from the city council of Urb. Note: Never commit anything from the data or program folders to you Git repo, as these are downloaded or made by the Makefile.

Note that you can always execute

make clean_all
make setup

to get a completely fresh start if you need it. This will not delete any files that you make, only files downloaded as part of the setup, and clean your database and derived data.

The Makefile contains all commands needed for the mandatory exercise, however each command uses files that you need to make, described in each exercise below. Apart from the veriables for the database connection you edited above, you do not need to add anything to the Makefile. However, it is wise to study it and get familiar with what the each command does, as you will need to make a similar Makefile for your project work later in the semester.

You are now set to solve all of Urbania’s problems!

Exercise 1: Direct Mappings

Make a file mappings/direct_mappings.lore that contains the Lore-commands for adding direct mappings for all the tables in the suburbia and urbania schemas. You should use the base-URI http://leifhka.org/in5800/urb/raw_data/, and prefix raws for the mappings in suburbia-schema and rawu for the mappings in urbania-schema. Make your mappings forward-chaining. (Note: You do not need to map the sequence home_hid_seq, this is only used by PostgreSQL to generate new keys for home(hid).)

Now apply your mappings with:

make direct_mappings

You can test that your mappings work by executing:

make query_test

This will count the number of triples in the triplestore and should return 8850. (Note: You can change the query in queries/test.sparql to have the above command execute that query instead.)

If you make mistakes, you can clean you database and reapply your mappings by executing

make clean_dbs
make direct_mappings

Exercise 2: Integration via OWL Semantics

The city council is very happy with having all of the data in the same triplestore. However, they realize that despite the data now having the same format (i.e. RDF), it is still difficult to make use of it, as different terms and structures are used to represent the same information.

Thus, you realize that adding an ontology harmonizing the data under a common vocabulary would be beneficial. So make an ontology using the Lore-OWL fragment of OWL (click the link for info on this fragment) in the file semantics/urb_ontology.ttl with the following prefixes:

@prefix raws: <http://leifhka.org/in5800/urb/raw_data/suburbia/> .
@prefix rawu: <http://leifhka.org/in5800/urb/raw_data/urbania/> .
@prefix urb:  <http://leifhka.org/in5800/urb/> . 

so that the following holds:

Note: You only need to specify domain of properties if this corresponds to a single class (e.g. the range of urb:worksAt should be urb:Organization, but the domain of urb:name is the union of urb:Person and urb:Organization which we cannot express directly in Lore-OWL). Note that you need to give all classes the type owl:Class, all object properties the type owl:ObjectProperty and all datatype properties the type owl:DatatypeProperty and a range that is an xsd-datatype (e.g. xsd:string, xsd:int, etc.).

You can add your semantics to the triplestore with make urb_ontology.

If you make mistakes, as always, you can clean your database and reapply all of the mappings and the ontology by executing

make clean_dbs
make direct_mappings
make urb_ontology

Exercise 3: bOTTR Mappings

The database from Suburbia does not contain any data about buildings other than homes nor any data about organizations. However, the local administration of Suburbia has kept a list of important organizations and the addresses of the buildings they have offices in in a separate CSV file, located at data/organizations.csv. This data will be important for making the information about who works where explicit.

Thus, you must make a file mappings/organizations-map.ttl containing a bOTTR mapping and a corresponding template in a file templates/organizations.stottr in stOTTR format that is used in the mapping. The mapping should map all the data in the file data/organizations.csv into RDF using the vocabulary from the previous exercise such that e.g. the following lines in the CSV-file:

Abc, Street road 1, 1234
Def, Road street 2, 5678

would be mapped to the following triples:

@prefix urb: <http://leifhka.org/in5800/urb/> . 
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

urb:Abc rdf:type urb:Organization ;
    urb:name "Abc" ;
    urb:locatedAt [ rdf:type urb:Building ;
                    urb:address "Street road 1" ;
                    urb:zipcode "1234" ] .

urb:Def rdf:type urb:Organization ;
    urb:name "Def" ;
    urb:locatedAt [ rdf:type urb:Building ;
                    urb:address "Road street 2" ;
                    urb:zipcode "5678" ] .

You can add apply the bOTTR-mapping and load the resulting triples into the triplestore with make organizations.

Again, if you make mistakes, you can clean your database and reapply all of the mappings by executing

make clean_dbs
make direct_mappings urb_ontology organizations

Exercise 4: Integration via rules

The city council is impressed with your work, and see now that much of their data is integrated. However, one thing still remains: The citizens of Suburbia only have occupations in the form of a string (farmer, fisher, pupil or student).

After talking to the people in the council for a while, you get the following information:

Note that information about the Docks, Suburbia School, etc. was contained in the file data/organizations.csv that you mapped into the triplestore as part of exercise 2.

What remains for you to do is add semantics stating how the occupations should relate to the different organizations they work at. You quickly realize that this is not possible with Lore-OWL. Thus, you should make Lore-rules to do this. Put the rules in semantics/urb_rules.lore. Apply you reasoning with make urb_rules.

If you make mistakes, as always, you can clean your database and reapply all of the mappings, the semantics and rules by executing

make clean_dbs
make triplestore

Exercise 5: People and Organizations via SPARQL

Now all of the data is integrated and fully usable. The city council of Urb now wants you to create lists of people and organizations within the city.

So, write a SPARQL-query in the file queries/people.sparql that for each person outputs the name, the Urb ID, the address and zipcode to the building they live in, the apartment number (if they have one), and the IRI of the organization they work for (if not unemployed). Order the result by the name of the person.

Also write a query in the file queries/organizations.sparql that for each organization in the city outputs the IRI, the name and the address and zipcode of their offices, as well as the number of people employed by the organization. Order the result by number of employees.

Execute your queries with

make query_people

and

make query_organizations

Exercise 6: Sectors

The city council thinks your query for finding information about organizations is useful, but see that it is difficult to get an overview for analytical purposes. They therefore want you to introduce sectors (as classes) into your data that groups people into the sector they work in:

Note that you need to define these IRIs as owl:Classes in a separate ontology semantics/sectors.ttl, and then make Lore rules for the actual semantics in semantics/sectors.lore. Make the definitions virtual, i.e. use backward rules.

You can load your sector-semantics with

make sectors

Hint: It is possible to use NOT IN in the WHERE-part of Lore-rules, e.g. R(x, y) <- P(x, y) : y NOT IN (SELECT z FROM T).

Finally, make a SPARQL query that finds the number of unemployed people living in each zipcode queries/unemployed.sparql. Order the result by number of unemployed people.

You can execute this query with

make query_unemployed

If you make mistakes, as always, you can clean your database and reapply all of the mappings, the semantics and rules by executing

make clean_dbs
make pipeline

Exercise 7: Documentation

Before you finish everything and hand over all of your code to the city council of Urb, you realize that the Makefile in the project is completely undocumented.

So make a README.md file (using Markdown syntax) and write a small guide on how to use the Makefile for the city council of Urb. It should contain instructions on how to do setup (like you did in the beginning), build the triplestore with all the data and semantics, execute your two queries, and how to clean both the files and the databases. Explain what the effect of each command is in non-techincal language, e.g. “This command loads all data about people into the triplestore…”.

Delivery

Before the deadline, add all your changes to the branch feature/oblig, make a pull request into the master-branch (however, do not merge the pull request!). Finally, add me (username leifhka) to you repo and upload a file to Devilry containing only the URL of your repo. I will correct it and make comments inline in you code on the pull request.

If you have any questions, please feel free to ask on any of the available channels for the course, or email me at leifhka [at] ifi.uio.no.

Good luck! :)