Lecture 3: Query Languages
This lecture gives an overview of different query languages, in particular SQL and SPARQL, and how they differ. We will also look at the relationship between a query language and the underlying data structure it queries.
Relevant Wiki-pages
- PostgeSQL (relational DB and
backend for triplestore)
- PostGIS (Spatial extension for PostgreSQL)
- RDF (Triple-based data representation)
- Jena
- SQL
- SPARQL
Exercises
Set-up
- For SQL, use your personal database, either the one set up by USiT for you, or a database on your personal computer. You should have gotten a mail with details on your personal database at IFI.
- Download the example dataset on species used in the lecture here, and use
psql
to execute the script. - For SPARQL, use e.g. Jena (we will use a different triplestore later in the course)
- Download the example dataset on species used in the lecture here
- Download the example RDF-dataset on people used in the lecture here and
- To execute a SPARQL-query, simply run the following command:
<path-to-jena>/bin/arq --data <RDF-file> --query <query-file>
where <path-to-jena>
is the path to the unzipped
Jena-folder, <RDF-file>
is the file containing the
RDF-data you want to query, and <query-file>
is a
file containing the SPARQL-query you want to execute.
SQL and SPARQL
Use the dataset on species, and for each of the exercises below, write both an SQL-query and a SPARQL-query:
- Find the genus of all species locally extinct (i.e. having
local_conservation
equal to0
. - Find the genus and common name (if it exists) for all species having
a
local_conservation
smaller thanglobal_conservation
. Order the result on genus alphabetically. - Find out how many observations there are for species in each of the
global conservation statuses (e.g. how many observations are there of
species with conservation status
0
, how many for those with conservation status1
, and so on). Hint: Use grouping.
Use the dataset on people, and for each of the exercises below, write a SPARQL-query:
- Find the name (i.e. the
rdfs:label
) of all mothers. - Find find the name of all parents (i.e. both mothers and fathers)
- Find the names of all ancestors of
ex:bob
(You might get aClassCastException
when trying to execute your solutio on this, see the solution for more info on this)
Issues with SQL
An important point with the relational model is that most (all?) data can be represented as relations. However, this does not mean that it is convenient to manipulate by the query languages working on tables, such as SQL.
So, now imagine that we had represented the data in
people.ttl
in the following relational table:
CREATE TABLE people (
id int PRIMARY KEY,
name text,
hasFather int REFERENCES people(id),
hasMother int REFERENCES people(id)
);
where we replace IRIs of resources with integer id
s. The
people
-table would then look as follows:
id | name | hasmother | hasfather
----+-------+-----------+-----------
1 | Blank | |
2 | Peter | |
3 | Mina | |
4 | Ole | |
5 | Maria | |
6 | Karl | 3 | 2
7 | Mary | 5 | 4
8 | Bob | 7 | 6
The SQL-script making this table can be downloaded here.
How would an SQL-query that finds the name of all ancestors of the
person with name Ole
look? (Note: difficult, need recursive
queries!)
Issues with SPARQL
An important feature of RDF, similarly as with the relational model,
is that most data (any?) can be represented as a collection of triples.
This is equally true for complex/nested structures, such as trees,
lists, and other complex data types. For example, to state that
ex:peter
is ex:knowsTheNumbers
-related to a
(linked) list with the elements 1
, 2
and
3
can be represented in RDF with the following triples:
ex:peter ex:knowsTheNumbers
[rdf:first 1 ;
rdf:rest [ rdf:first 2 ;
rdf:rest [ rdf:first 3 ;
rdf:rest rdf:nil ] ] .
where rdf:first
denotes the first element of the list,
rdf:rest
denotes the list of elements comming after the
first (recursively), and rdf:nil
is the empty list. Note
that in the Turtle serialization there is a short hand syntax (syntactic
sugar) for writing such lists (occuring within a triple), so one could
rather write
ex:peter ex:knowsTheNumbers (1 2 3) .
in place of the triples above.
Lists are not that much used in RDF in practise, as the above statement could just as easily be represented as
ex:peter ex:knowsTheNumber 1, 2, 3 .
That is, with three triples, one for each number. However, there are
some benefits to using lists (or similar structures), e.g. lists are
closed, so one cannot just add statements to exted a list (one would
need to delete the rdf:rest rdf:nil
-triple, and replace it
with the desired extension, one knows when one has read all triples
describing a list (if one have triples connecting ex:peter
to rdf:nil
, we know we have read all triples describing the
list), and finally, a list imposes an ordering of the elements (e.g. it
makes sense that Peter learned 1
before 2
).
Similar arguments can be made for other (more domain specific) complex
structures. Thus, the problems discussed in this exercise are still
relevant.
So, imagine we had a graph that keeps track of the numbers different people know (i.e. can count to), but when a person learns even more numbers, new statements are simply added to the graph. So, we might have data that looks like this (this file can be downloaded here):
@prefix ex: <http://example.org/> .
ex:peter ex:knowsTheNumbers (1 2 3) .
ex:kari ex:knowsTheNumbers (1 2 3 4 5) .
ex:peter ex:knowsTheNumbers (4 5 6 7) .
ex:ole ex:knowsTheNumbers (1 2) .
ex:kari ex:knowsTheNumbers (6 7 8 9 10) .
ex:peter ex:knowsTheNumbers (8 9) .
- Write a SPARQL-query that finds all the numbers that
ex:peter
knows. (Hint: Use property paths) - Write a SPARQL-query that creates a new graph (i.e. a
CONSTRUCT
-query) that contains only one list per person that contains all the numbers that person knows (i.e. concatenate all the lists). (Hint: Impossible! But try to find out why)
Solution
The solution to the exercises can be found here.