Solution to exercises on data transformation

1. Different Pipelines

1. PostgreSQL + VIEWS + direct mappings + ontology

  1. make table_weather_forecast_json This is a simple format transformation from a JSON file to an SQL table.
  2. make table_weather This step extracts the data from the JSON strings and creates a view (tabular structure) of the data. Thus, this is a structural transformation, as it transforms the data from one structure to another. Furthermore, the transformation is virtual, structuring and arguably lifting. One can also argue that this transformation changes the conceptual schema or meaning of the data: what in the JSON file is merely additional data, listed under data->instant->details, is turned into weather forecast entities. Thus, one can argue that this is also a conceptual transformation.
  3. make direct_map_weather This step directly maps the data in the view into RDF, and is thus another structural transformation, from relational to triples (illustrated with a small SQL sketch below). Furthermore, this transformation is stored, destructuring and lifting.
  4. make load_weather_ontology This step loads the ontology and applies the reasoning rules. It does not change the format nor the structure, but it does change the vocabulary used. One can argue that this is part of the logical schema, and therefore a logical transformation. However, one can also say that adding the ontology and applying the reasoner changes the conceptual schema of the original data, and that it is therefore a conceptual transformation as well.

2. json2rdf + rules

  1. make data/met_weather.ttl This step uses the json2rdf tool, which directly maps the data from JSON into RDF, and is therefore both a format transformation (from a JSON file to an RDF file) and a structural transformation (from a key-value structure to triples). The structural transformation is stored and lifting.
  2. make apply_weather_rules This step applies the reasoning rules, making the data conform to the desired vocabulary. This changes the encoding of the data in terms of the underlying structure, and is therefore a structural transformation. However, this transformation can also be said to change the conceptual schema, by reasoning similar to that for the second and last steps of the pipeline above. It is also stored, as the data is persisted into a new file.
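
To make the direct mapping in the first pipeline more concrete, the following SQL query mimics what a direct mapping does: every cell of the view becomes one triple, with a subject IRI per row and a predicate IRI per column. This is only an illustration, not the actual tool invoked by make direct_map_weather, and the IRI pattern is hypothetical; the view and column names are those of the extended view defined further below.

-- Illustration of a direct mapping: each cell becomes a triple
-- (subject, predicate, object). Shown for the wind_speed column only;
-- the real direct mapping does this generically for every column.
SELECT 'http://example.org/data#weather_forecast-' || wfid::text AS subject,
       'http://example.org/data#wind_speed'                      AS predicate,
       wind_speed::text                                           AS object
FROM weather.weather_forecast;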

2. Extending the Pipelines

Pipeline 1

There are two main approaches to do this: we can either integrate the data within the relational view, or we can make a separate view for the DWD data that is also directly mapped, and then extend the original ontology to do the integration. However, the second approach involves more work, so we will stick with the former.

So in this pipeline, we first need to update the two Makefile rules in the first step of the original pipeline to also load the DWD data, as follows (note the updated table names):

table_weather_forecast_json_create:
    $(psql) -c "CREATE TABLE weather.weather_forecast_json_met(rownr SERIAL PRIMARY KEY, data json);"
    $(psql) -c "CREATE TABLE weather.weather_forecast_json_dwd(rownr SERIAL PRIMARY KEY, data json);"

table_weather_forecast_json: table_weather_forecast_json_create
    cat data/met*.json | $(psql) -c "COPY weather.weather_forecast_json_met(data) FROM stdin"
    cat data/dwd*.json | $(psql) -c "COPY weather.weather_forecast_json_dwd(data) FROM stdin"
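
After loading, a quick sanity check (a hypothetical query, not part of the pipeline) is to count the rows in each of the two tables:

-- Hypothetical sanity check: both tables should now contain rows.
SELECT 'met' AS source, count(*) AS n FROM weather.weather_forecast_json_met
UNION ALL
SELECT 'dwd' AS source, count(*) AS n FROM weather.weather_forecast_json_dwd;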

Then we need to update the definition of the view with a translation of the new data, as follows:

CREATE VIEW weather.weather_forecast AS
WITH
  -- MET (Same as before, but moved final query into WITH-clause as forecasts_met)
  coordinates_met AS (
    SELECT rownr,
        data->'geometry'->'coordinates'->>0 AS lon,
        data->'geometry'->'coordinates'->>1 AS lat
    FROM weather.weather_forecast_json_met
  ),
  timeseries_met AS (
    SELECT rownr, json_array_elements(data->'properties'->'timeseries') AS elem
    FROM weather.weather_forecast_json_met
  ),
  forecasts_met AS (
    SELECT lat::float,
        lon::float,
        (elem->>'time')::timestamp AS time,
        (elem->'data'->'instant'->'details'->>'wind_speed')::float AS wind_speed,
        (elem->'data'->'instant'->'details'->>'relative_humidity')::float AS relative_humidity,
        (elem->'data'->'instant'->'details'->>'air_temperature')::float AS air_temperature,
        (elem->'data'->'instant'->'details'->>'precipitation_amount')::float AS precipitation_amount
    FROM coordinates_met INNER JOIN timeseries_met USING (rownr)
  ),
  -- DWD (Similar, but need to join on source_id between weather and sources to get coordinates)
  sources_dwd AS (
    SELECT rownr, 
        elem->>'id' AS source_id,
        elem->>'lat' AS lat,
        elem->>'lon' AS lon
    FROM (SELECT rownr, json_array_elements(data->'sources')
          FROM weather.weather_forecast_json_dwd) AS t(rownr, elem)
  ),
  forecasts_dwd AS (
    SELECT s.lat::float AS lat,
        s.lon::float AS lon,
        (elem->>'timestamp')::timestamp AS time,
        (elem->>'wind_speed')::float AS wind_speed,
        (elem->>'relative_humidity')::float AS relative_humidity,
        (elem->>'temperature')::float AS air_temperature,
        (elem->>'precipitation')::float AS precipitation_amount
    FROM (SELECT rownr, json_array_elements(data->'weather')
          FROM weather.weather_forecast_json_dwd
         ) AS t(rownr, elem)
         INNER JOIN sources_dwd AS s
         ON (s.rownr = t.rownr AND s.source_id = elem->>'source_id')
  ),
  -- Both, take union to combine them into one table
  forecasts AS (
    SELECT * FROM forecasts_met UNION ALL SELECT * FROM forecasts_dwd
  )
SELECT
    row_number() OVER () AS wfid, -- Add the row number as wfid to each row
    fc.*
FROM forecasts AS fc
ORDER BY fc.lat, fc.lon, fc.time;

We are now done: both datasets are mapped into the same view, which is then transformed to RDF in the same way as before.

Note: It is also possible to split the big view above into two views, one for each source. This would leave the old view untouched. However, since we use row numbers as IDs (wfid), we would have to generate these IDs ourselves, and they cannot be generated by the views themselves (as the row number within one view is not the same as the row number in the final relation). It would, however, be possible to drop wfid as the key and instead use (lat, lon, time) as the key for each forecast, as in the sketch below.
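
A minimal sketch of this alternative for the DWD data (the view name weather.weather_forecast_dwd is hypothetical; the WITH-clause and SELECT are the DWD parts from above, just without wfid):

CREATE VIEW weather.weather_forecast_dwd AS
WITH
  sources_dwd AS (
    SELECT rownr,
        elem->>'id' AS source_id,
        elem->>'lat' AS lat,
        elem->>'lon' AS lon
    FROM (SELECT rownr, json_array_elements(data->'sources')
          FROM weather.weather_forecast_json_dwd) AS t(rownr, elem)
  )
-- No wfid here: (lat, lon, time) serves as the key for each forecast.
SELECT s.lat::float AS lat,
    s.lon::float AS lon,
    (elem->>'timestamp')::timestamp AS time,
    (elem->>'wind_speed')::float AS wind_speed,
    (elem->>'relative_humidity')::float AS relative_humidity,
    (elem->>'temperature')::float AS air_temperature,
    (elem->>'precipitation')::float AS precipitation_amount
FROM (SELECT rownr, json_array_elements(data->'weather')
      FROM weather.weather_forecast_json_dwd) AS t(rownr, elem)
     INNER JOIN sources_dwd AS s
     ON (s.rownr = t.rownr AND s.source_id = elem->>'source_id');

The MET data would be handled analogously.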

Pipeline 2

In this pipeline, we need to update the rule for mapping the data to RDF as follows. The sed command rewrites the MET "coordinates" arrays into separate "lon" and "lat" fields before the JSON is passed to json2rdf, whereas the DWD data already has lat and lon as separate fields and can be passed through directly:

data/met_weather.ttl:
    cat data/met60_10-9_58.json | sed -e "s/\"coordinates\":\[\([^,]*\),\([^,]*\),\([^,]*\)\]/\"lon\":\1,\"lat\":\2/g" | $(json2rdf) "http://example.org/data" >> data/met_weather.ttl
    cat data/met59_20-10_10.json | sed -e "s/\"coordinates\":\[\([^,]*\),\([^,]*\),\([^,]*\)\]/\"lon\":\1,\"lat\":\2/g" | $(json2rdf) "http://example.org/data" >> data/met_weather.ttl
    cat data/dwd52_0-7_60.json | $(json2rdf) "http://example.org/data" >> data/met_weather.ttl

We then need to make a complete ruleset similar to the original, but adapted to the new data. The new rules to add to semantics/weather_rules.rules are as follows. Note that the first two rules join the weather entries with their sources on source_id, mirroring the SQL join in the first pipeline:

[dwdlat: (?o ex-d:source_id ?id) (?src ex-d:id ?id) (?src ex-d:lat ?lat) -> (?o ex-o:lat ?lat) ]
[dwdlon: (?o ex-d:source_id ?id) (?src ex-d:id ?id) (?src ex-d:lon ?lon) -> (?o ex-o:lon ?lon) ]

[dwdfc: (?w ex-d:weather ?o) -> (?o rdf:type ex-o:Forecast) ]
[dwdtime: (?o ex-d:timestamp ?t) -> (?o ex-o:time ?t) ]
[dwdtemp: (?d ex-d:temperature ?t) -> (?d ex-o:temperature ?t) ]
[dwdprec: (?d ex-d:precipitation ?pa) -> (?d ex-o:precipitation ?pa) ]

Discussion of the two Pipelines

The first pipeline consists of more steps and more code, and also requires more lines of code to extend to new datasets. Thus, this approach is arguably more error-prone, as there are simply more places for errors to occur. The first pipeline also involves two structural transformations and two format transformations, which seems rather redundant.

So for this use-case, the second pipeline is clearly the better choice. However, note that since the first pipeline actually performs two structural and two format transformations, we can exploit features and functionality of both formats/structures when doing the integration. For example, if we wanted to convert wind speed from meters per second to kilometers per hour, this is easily done with SQL code in the definition of the view (see the sketch below). Such a conversion would be difficult (or sometimes even impossible) in the second pipeline, as it can neither be done in the direct mapping from JSON to RDF, nor with the rules (Jena rules do have some built-in functions, but more complex computations quickly become difficult or impossible).
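
A minimal sketch of such a conversion, reusing the column names of the view above (1 m/s = 3.6 km/h); in the actual pipeline the expression would simply go into the SELECT-list of the view definition:

-- Sketch: convert wind speed from m/s to km/h in plain SQL.
SELECT wfid,
    lat,
    lon,
    time,
    wind_speed * 3.6 AS wind_speed_kmh,
    relative_humidity,
    air_temperature,
    precipitation_amount
FROM weather.weather_forecast;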

Furthermore, the first pipeline can still benefit from a more complex semantics, as we still have an ontology as part of the transformation. So if we wanted more complex axioms applied to the resulting RDF, this is still possible with the first pipeline.

Redundancy in a pipeline can lead to greater flexibility, at the cost of performance and maintainability. However, note that one can easily discard pipelines mapping back and forth multiple times between the same two structures, as this clearly gives no benefit.

Thus, shorter pipelines are not always better, but neither are longer ones. A pipeline needs to be adapted to the use-case, and for any use-case there are typically many ways of constructing a data transformation with the correct input and output, so careful consideration of the desired properties of the pipeline is necessary.

We will talk more about this in the Pipelines-lecture later.