Lecture 9: Data Transformation
In this lecture, we look more closely at data transformations, such as mappings, querying, and reasoning, covering both mappings between formats and structures and changes affecting the physical, logical, and conceptual schema/model of the data, using the technologies we have seen so far in the course.
Relevant Wiki-pages
Exercises
To mix things up a little, we are this week going to work with JSON data. In this exercise we will work with weather data from the Norwegian Meteorological Institute (Yr), which is published as JSON. JSON is a key-value format heavily used for data interchange. The exercises will use this Makefile.
To start, download the Makefile into an empty folder, update the database connection variables (username, database name, host and password), and execute
make setup
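The database connection variables are set near the top of the Makefile. Purely as an illustration (the actual variable names in the provided Makefile may differ, so edit the ones that are actually there), such a block typically looks something like:
DB_USER = your_username
DB_NAME = your_database
DB_HOST = localhost
DB_PASSWORD = your_password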
1. Different Pipelines
The Makefile will create a folder structure similar to the one in the mandatory exercise and download the weather forecast for a couple of points in Norway. This is done in the Makefile with the curl command and Yr's Weather API as follows:
curl "https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=60.10&lon=9.58" > data/met60_10-9_58.json
This downloads the JSON file with the weather forecast for the next few hours at location (60.10, 9.58) into the file named data/met60_10-9_58.json. To see what the file looks like you can of course simply open it in a text editor. However, the JSON file is formatted more readably by your browser, so you can open it in e.g. Firefox by running
firefox data/met60_10-9_58.json
The Makefile will also download the forecast for another point, (59.2, 10.10), done with:
curl "https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=59.20&lon=10.10" > data/met59_20-10_10.json
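For orientation, the downloaded files use MET's GeoJSON-based format and look roughly like the sketch below (heavily abbreviated, values made up, and only the fields we will need shown; check your own download for the exact structure). Note that the coordinates array is ordered longitude, latitude, altitude:
{
  "type": "Feature",
  "geometry": { "type": "Point", "coordinates": [9.58, 60.1, 496] },
  "properties": {
    "timeseries": [
      {
        "time": "2022-03-18T12:00:00Z",
        "data": {
          "instant": {
            "details": {
              "air_temperature": 1.2,
              "relative_humidity": 80.5,
              "wind_speed": 3.4
            }
          },
          "next_1_hours": {
            "details": { "precipitation_amount": 0.1 }
          }
        }
      }
    ]
  }
}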
We will now look at different pipelines for transforming this data into RDF, under the following vocabulary:
exo:Forecast - class of all weather forecasts (for a given location and timepoint)
exo:time - relates a weather forecast to the time of the forecast
exo:lat - relates a weather forecast to the latitude of the location of the forecast
exo:lon - relates a weather forecast to the longitude of the location of the forecast
exo:windSpeed - relates a weather forecast to the wind speed of the forecast
exo:humidity - relates a weather forecast to the relative humidity of the forecast
exo:temperature - relates a weather forecast to the temperature of the forecast
exo:precipitation - relates a weather forecast to the amount of precipitation of the forecast
where exo is the prefix for http://example.org/ont/. The vocabulary is described in the weather_ont.ttl file.
Below are two pipelines translating the JSON files into RDF described with the above vocabulary. These are called pipeline1 and pipeline2 in the Makefile.
The first pipeline starts by copying the JSON data directly into an SQL table, and uses PostgreSQL's built-in support for handling JSON via SQL to create a view containing the desired data. Details on the functions used can be found in the PostgreSQL Manual, but a quick outline of the functionality used is as follows (a sketch combining them comes after the list):
object->"key" returns the JSON-object object's value for "key" as a JSON-object
object->>"key" returns the JSON-object object's value for "key" as text (which can then be cast into a proper PostgreSQL type, such as integer or timestamp)
json_array_elements(json_array) returns a table of JSON-values with one value per element in json_array
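As a rough sketch of how these can be combined, the view might be defined along the following lines. The table and column names here (weather_forecast_json, doc, weather) and the JSON paths are assumptions based on the Yr format and the Makefile's target names; check the Makefile for the actual definition:
CREATE VIEW weather AS
SELECT
  (entry->>'time')::timestamptz AS time,
  (doc->'geometry'->'coordinates'->>1)::float AS lat,   -- GeoJSON order: [lon, lat, alt]
  (doc->'geometry'->'coordinates'->>0)::float AS lon,
  (entry->'data'->'instant'->'details'->>'wind_speed')::float AS wind_speed,
  (entry->'data'->'instant'->'details'->>'relative_humidity')::float AS humidity,
  (entry->'data'->'instant'->'details'->>'air_temperature')::float AS temperature,
  (entry->'data'->'next_1_hours'->'details'->>'precipitation_amount')::float AS precipitation
FROM weather_forecast_json,
     json_array_elements(doc->'properties'->'timeseries') AS entry;  -- one row per forecast timepoint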
The pipeline then uses direct mappings to translate the view into RDF and applies an ontology to map it into the target vocabulary.
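Purely as an illustration of that last step (the IRIs produced by the direct mapping depend on the tool and base IRI actually used, so the dm: names below are made up), the ontology can relate the direct-mapped terms to the target vocabulary with axioms such as:
@prefix exo:  <http://example.org/ont/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dm:   <http://example.org/dm/weather#> .  # made-up base IRI for the direct-mapped view

# A reasoner can then derive triples in the exo vocabulary:
dm:Weather     rdfs:subClassOf    exo:Forecast .
dm:temperature rdfs:subPropertyOf exo:temperature .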
The second pipeline uses a program called json2rdf which maps the JSON file directly into RDF. It then uses rules to map the data into the desired vocabulary. Note that json2rdf does not preserve the ordering of arrays, so we need a small hack that maps the coordinates array into separate lat and lon values within the JSON file, done with the somewhat cryptic sed command in the Makefile.
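The idea behind that hack is roughly the following (an illustrative command, not the one from the Makefile; it assumes the coordinates array is written without extra whitespace and simply drops the altitude):
sed -E 's/"coordinates":\[([^,]+),([^,]+)[^]]*]/"lon":\1,"lat":\2/' \
    data/met60_10-9_58.json > data/met60_10-9_58_fixed.json
which rewrites e.g. "coordinates":[9.58,60.1,496] into "lon":9.58,"lat":60.1 (the output file name is made up here; the Makefile decides where the result actually goes).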
For each of the following pipelines, state what type of data transformation each step is, i.e. which of the following (one or more may apply to each step):
- format transformation
- structural transformation
- physical transformation
- logical transformation
- conceptual transformation
Note that it is not always clear cut which of these each step is, as it depends on where one sets the boundary for the different schemas, formats and structures.
You are advised to execute the commands listed yourself, to see what each step does and how it transforms the data. In each step only the make command is shown, and you need to look it up in the Makefile to see what it does.
1. PostgreSQL + VIEWS + direct mappings + ontology
make table_weather_forecast_json
make table_weather
make direct_map_weather
make load_weather_ontology
or simply make pipeline1.
2. json2rdf + rules
make data/met_weather.ttl
make apply_weather_rules
or simply make pipeline2.
2. Extending the Pipelines
In this exercise we will extend the above pipelines to add more data, and in the process see how easy they are to extend.
The data we will extend our pipelines with is also weather forecasts, but from the German equivalent of the Norwegian Meteorological Institute, namely Deutscher Wetterdienst (DWD). We will use a service called Bright Sky which provides the DWD data as JSON.
Thus, start by adding the following rule to the Makefile:
data/dwd52_0-7_60.json:
	curl "https://api.brightsky.dev/weather?lat=52&lon=7.6&date=2022-03-18" > data/dwd52_0-7_60.json
and add this as a dependency to the setup rule, as sketched below.
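If the setup rule lists the files to download as prerequisites, the extension amounts to appending the new target, along the lines of the following (the existing prerequisites shown here are guesses; keep whatever the provided Makefile already lists):
setup: data/met60_10-9_58.json data/met59_20-10_10.json data/dwd52_0-7_60.json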
This will download a JSON dataset containing the weather forecast for the location (52.0, 7.60). Note that the date is part of the URL in the Makefile, so this might need to be updated to the current date depending on when you do this exercise.
The exercise is as follows: Extend each of the above pipelines to also include this new dataset from DWD, so that the data is integrated under the same vocabulary as listed in the previous exercise.
Note how much effort one needs to put into the extension and how much can be reused from the previous pipeline.
Solution
The solution can be found here.