Lecture 9: Data Transformation
In this lecture, we look more closely at data transformations, such as mappings, querying, and reasoning, covering both mappings between formats and structures and changes affecting the physical, logical, and conceptual schema/model of the data, using the technologies we have seen so far in the course.
Relevant Wiki-pages
Exercises
To mix things up a little, we are this week going to work with JSON data. In this exercise we will work with weather data from the Norwegian Meteorological Institute (Yr), which is published as JSON. JSON is a key-value format heavily used for data interchange. The exercises will use this Makefile.
To start, download the Makefile into an empty folder, update the database connection variables (username, database name, host and password), and execute
make setup
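The database connection variables are set near the top of the Makefile. Purely as an illustration (the actual variable names in the provided Makefile may differ, so edit the ones that are actually there), such a block typically looks something like:
DB_USER = your_username
DB_NAME = your_database
DB_HOST = localhost
DB_PASSWORD = your_password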
1. Different Pipelines
The Makefile will create a folder structure similar to the one in the mandatory exercise and download the weather forecast for a couple of points in Norway. This is done in the Makefile with the curl command and Yr's Weather API as follows:
curl "https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=60.10&lon=9.58" > data/met60_10-9_58.json
This downloads the JSON file with the weather forecast for the next few hours at location (60.10, 9.58) into the file named data/met60_10-9_58.json. To see what the file looks like you can of course simply open it in a text editor. However, the JSON file is formatted more readably by your browser, so you can open it in e.g. Firefox by running
firefox data/met60_10-9_58.json
The Makefile will also download the forecast for another point, (59.2, 10.10), done with:
curl "https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=59.20&lon=10.10" > data/met59_20-10_10.json
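For orientation, the downloaded files use MET's GeoJSON-based format and look roughly like the sketch below (heavily abbreviated, values made up, and only the fields we will need shown; check your own download for the exact structure). Note that the coordinates array is ordered longitude, latitude, altitude:
{
  "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [9.58, 60.1, 496] },
  "properties": {
    "timeseries": [
      {
        "time": "2022-03-18T12:00:00Z",
        "data": {
          "instant": {
            "details": {
              "air_temperature": 1.2,
              "relative_humidity": 80.5,
              "wind_speed": 3.4
            }
          },
          "next_1_hours": {
            "details": { "precipitation_amount": 0.1 }
          }
        }
      }
    ]
  }
}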
We will now look at different pipelines for transforming this data into RDF, under the following vocabulary:
exo:Forecast - class of all weather forecasts (for a given location and timepoint)
exo:time - relates a weather forecast to the time of the forecast
exo:lat - relates a weather forecast to the latitude of the location of the forecast
exo:lon - relates a weather forecast to the longitude of the location of the forecast
exo:windSpeed - relates a weather forecast to the wind speed of the forecast
exo:humidity - relates a weather forecast to the relative humidity of the forecast
exo:temperature - relates a weather forecast to the temperature of the forecast
exo:precipitation - relates a weather forecast to the amount of precipitation of the forecast
where exo is the prefix for http://example.org/ont/. The vocabulary is described in the weather_ont.ttl file.
Below are two pipelines translating the JSON files into RDF described with the above vocabulary. These are called pipeline1 and pipeline2 in the Makefile.
The first pipeline starts by copying the JSON data directly into an SQL table, and uses PostgreSQL's built-in support for handling JSON via SQL to create a view containing the desired data. Details on the functions used can be found in the PostgreSQL Manual, but a quick outline of the functionality used is as follows (a sketch combining them comes after the list):
object->"key" returns the JSON-object object's value for "key" as a JSON-object
object->>"key" returns the JSON-object object's value for "key" as text (which can then be cast into a proper PostgreSQL type, such as integer or timestamp)
json_array_elements(json_array) returns a table of JSON-values with one value per element in json_array
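As a rough sketch of how these can be combined, the view might be defined along the following lines. The table and column names here (weather_forecast_json, doc, weather) and the JSON paths are assumptions based on the Yr format and the Makefile's target names; check the Makefile for the actual definition:
CREATE VIEW weather AS
SELECT
  (entry->>'time')::timestamptz AS time,
  (doc->'geometry'->'coordinates'->>1)::float AS lat,   -- GeoJSON order: [lon, lat, alt]
  (doc->'geometry'->'coordinates'->>0)::float AS lon,
  (entry->'data'->'instant'->'details'->>'wind_speed')::float AS wind_speed,
  (entry->'data'->'instant'->'details'->>'relative_humidity')::float AS humidity,
  (entry->'data'->'instant'->'details'->>'air_temperature')::float AS temperature,
  (entry->'data'->'next_1_hours'->'details'->>'precipitation_amount')::float AS precipitation
FROM weather_forecast_json,
     json_array_elements(doc->'properties'->'timeseries') AS entry;  -- one row per forecast timepoint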
The pipeline then uses direct mappings to translate the view into RDF and applies an ontology to map it into the target vocabulary.
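Purely as an illustration of that last step (the IRIs produced by the direct mapping depend on the tool and base IRI actually used, so the dm: names below are made up), the ontology can relate the direct-mapped terms to the target vocabulary with axioms such as:
@prefix exo:  <http://example.org/ont/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dm:   <http://example.org/dm/weather#> .  # made-up base IRI for the direct-mapped view

# A reasoner can then derive triples in the exo vocabulary:
dm:Weather     rdfs:subClassOf    exo:Forecast .
dm:temperature rdfs:subPropertyOf exo:temperature .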
The second pipeline uses a program called json2rdf which maps the JSON file directly into RDF. It then uses rules to map the data into the desired vocabulary. Note that json2rdf does not preserve the ordering of arrays, so we need a small hack that maps the coordinates array into separate lat and lon values within the JSON file, done with the somewhat cryptic sed command in the Makefile.
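The idea behind that hack is roughly the following (an illustrative command, not the one from the Makefile; it assumes the coordinates array is written without extra whitespace and simply drops the altitude):
sed -E 's/"coordinates":\[([^,]+),([^,]+)[^]]*]/"lon":\1,"lat":\2/' \
    data/met60_10-9_58.json > data/met60_10-9_58_fixed.json
which rewrites e.g. "coordinates":[9.58,60.1,496] into "lon":9.58,"lat":60.1 (the output file name is made up here; the Makefile decides where the result actually goes).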
For each of the following pipelines, state what type of data transformation each step is, i.e. which of the following (one or more may apply to each step):
- format transformation
- structural transformation
- physical transformation
- logical transformation
- conceptual transformation
Note that it is not always clear cut which of these each step is, as it depends on where one sets the boundary for the different schemas, formats and structures.
You are advised to execute the commands listed yourself, to see what each step does and how it transforms the data. In each step only the make command is shown, and you need to look it up in the Makefile to see what it does.
1. PostgreSQL + VIEWS + direct mappings + ontology
make table_weather_forecast_json
make table_weather
make direct_map_weather
make load_weather_ontology
or simply make pipeline1.
2. json2rdf + rules
make data/met_weather.ttl
make apply_weather_rules
or simply make pipeline2.
2. Extending the Pipelines
In this exercise we will extend the above pipelines to add more data, and in the process see how easy they are to extend.
The data we will extend our pipelines with is also weather forecasts, but from the German equivalent of the Norwegian Meteorological Institute, namely Deutscher Wetterdienst (DWD). We will use a service called Bright Sky which provides the DWD data as JSON.
Thus, start by adding the following rule to the Makefile:
data/dwd52_0-7_60.json:
	curl "https://api.brightsky.dev/weather?lat=52&lon=7.6&date=2022-03-18" > data/dwd52_0-7_60.json
and add this as a dependency to the setup rule, as sketched below.
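If the setup rule lists the files to download as prerequisites, the extension amounts to appending the new target, along the lines of the following (the existing prerequisites shown here are guesses; keep whatever the provided Makefile already lists):
setup: data/met60_10-9_58.json data/met59_20-10_10.json data/dwd52_0-7_60.json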
This will download a JSON dataset containing the weather forecast for the location (52.0, 7.60). Note that the date is part of the URL in the Makefile, so this might need to be updated to the current date depending on when you do this exercise.
The exercise is as follows: Extend each of the above pipelines to also include this new dataset from DWD, so that the data is integrated under the same vocabulary as listed in the previous exercise.
Note how much effort one needs to put into the extension and how much can be reused from the previous pipeline.
Solution
The solution can be found here.