IN5800 – Constraints

Leif Harald Karlsen

What are Constraints?

Constraints vs. Semantics

Constraints: What must be true about the data

Semantics: What is true in the domain (of the data)

Forms of Constraints

Constraints in Relational Databases

  • Shape of data:
    • Forced structure: Tables and columns
    • PRIMARY KEY, UNIQUE and FOREIGN KEY
  • Shape of values:
    • Types
    • NOT NULL
    • CHECK
  • Triggers
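
The constraint kinds listed above can be tried out in a small sketch. It uses Python's built-in sqlite3 module rather than a full database server, and the table layout is an assumption made for illustration only. Each rejected insert trips one of PRIMARY KEY, NOT NULL, CHECK or FOREIGN KEY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE company(
    cid   INTEGER PRIMARY KEY,
    cname TEXT NOT NULL UNIQUE
);
CREATE TABLE employee(
    eid      INTEGER PRIMARY KEY,
    ename    TEXT NOT NULL,
    age      INTEGER CHECK (age >= 16),
    worksFor INTEGER NOT NULL REFERENCES company(cid)
);
INSERT INTO company VALUES (1, 'UiO');
INSERT INTO employee VALUES (1, 'Per', 32, 1);
""")

def try_insert(sql):
    """Run one INSERT; return None on success, else the constraint error message."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.IntegrityError as err:
        return str(err)

results = {
    "duplicate key": try_insert("INSERT INTO employee VALUES (1, 'Kari', 34, 1)"),
    "missing name": try_insert("INSERT INTO employee VALUES (2, NULL, 34, 1)"),
    "too young": try_insert("INSERT INTO employee VALUES (3, 'Ole', 12, 1)"),
    "unknown company": try_insert("INSERT INTO employee VALUES (4, 'Mari', 40, 99)"),
    "valid row": try_insert("INSERT INTO employee VALUES (5, 'Nils', 40, 1)"),
}
for label, error in results.items():
    print(label, "->", error or "accepted")
```

Only the last row is accepted; the database itself refuses the other four, so no application code has to remember to check them.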

Constraints and RDF

  • Unlike relational databases, RDF does not include constraints as part of the language
  • Can be difficult to know whether your data actually looks the way you think it does
  • RDF’s metadata (i.e. its semantics) is not sufficient as a constraint language
  • Will now look at ways of constraining RDF graphs

Constraints in Mappings: Direct Mappings

Constraints in Mappings: General Mappings

Constraints in Mappings: OTTR

  • However, OTTR templates do to some degree force the shape of the resulting RDF
  • A template instance must have the correct number of arguments
  • Each instance of the same template will result in the same graph shape
  • The type system ensures that values have the correct type
  • But no checks across templates
    • E.g. cannot require that all ex:Employee-instances are ex:worksFor-related to an ex:Company-instance
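
The per-instance checks described above (argument count, argument types, non-blank parameters) can be mimicked in a few lines of Python. This is only an illustrative sketch, not how an OTTR engine is implemented: the classes IRI and BlankNode, the list EMPLOYEE_PARAMS and the function check_instance are hypothetical names, and the sketch assumes blank nodes are accepted wherever the non-blank flag is absent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

# Hypothetical encoding of a template signature:
# (parameter name, accepted Python types, non-blank flag)
EMPLOYEE_PARAMS = [
    ("person", (IRI, BlankNode), False),
    ("name", (str,), False),
    ("phone", (str,), False),
    ("age", (int,), False),
    ("worksFor", (IRI, BlankNode), True),  # mirrors a parameter marked with "!"
]

def check_instance(params, args):
    """Return a list of problems found for one template instance."""
    if len(args) != len(params):
        return [f"expected {len(params)} arguments, got {len(args)}"]
    problems = []
    for (name, types, non_blank), arg in zip(params, args):
        if not isinstance(arg, types):
            problems.append(f"?{name}: wrong type {type(arg).__name__}")
        elif non_blank and isinstance(arg, BlankNode):
            problems.append(f"?{name}: blank node not allowed")
    return problems

per = check_instance(EMPLOYEE_PARAMS, [IRI("ex:per"), "Per", "98765432", 32, IRI("ex:uio")])
ole = check_instance(EMPLOYEE_PARAMS, [IRI("ex:ole"), "Ole", 12345678, "34", IRI("ex:uio")])
nils = check_instance(EMPLOYEE_PARAMS, [IRI("ex:nils"), "Nils", "23456789", 34])
mari = check_instance(EMPLOYEE_PARAMS, [IRI("ex:mari"), "Mari", "34567890", 34, BlankNode("b")])
print(per, ole, nils, mari, sep="\n")
```

Note that every check looks at a single instance in isolation; nothing here can tell whether ex:uio was ever declared as a company, which is exactly the cross-template limitation.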

Example: OTTR as Constraints

ex:Employee[ 
  ottr:IRI ?person,
  xsd:string ?name,
  xsd:string ?phone,
  xsd:int ?age,
  ! ottr:IRI ?worksFor
] :: {
  o-rdf:Type(?person, ex:Employee),
  ottr:Triple(?person, ex:hasName, ?name),
  ottr:Triple(?person, ex:hasPhoneNumber, ?phone),
  ottr:Triple(?person, ex:hasAge, ?age),
  ottr:Triple(?person, ex:worksFor, ?worksFor)
} .

ex:Company[ 
  ottr:IRI ?company,
  xsd:string ?name
] :: {
  o-rdf:Type(?company, ex:Company),
  ottr:Triple(?company, ex:hasName, ?name)
} .

# OK

ex:Employee(ex:per, "Per", "98765432", 32, ex:uio) .
ex:Employee(ex:kari, "Kari", "123456", 34, ex:dnb) .

ex:Company(ex:uio, "Universitetet i Oslo") .
ex:Company(ex:ntnu, "NTNU") .
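
# NOT OK

ex:Employee(ex:ole, "Ole", 12345678, "34", ex:uio) .    # Wrong types: phone must be a string, age an int
ex:Employee(ex:nils, "Nils", "23456789", 34) .          # Missing argument for ?worksFor
ex:Employee(ex:mari, "Mari", "34567890", 34, _:b) .     # Blank node given for the non-blank (!) parameter

ex:Company(ex:ruter, "Ruter", <http://ruter.no>) .      # Too many arguments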

Constraints in RDF: SHACL – Example

Vocabulary:

ex:Employee a owl:Class .
ex:Company a owl:Class .

ex:hasName a owl:DatatypeProperty;
    rdfs:range xsd:string .

ex:hasAge a owl:DatatypeProperty;
    rdfs:domain ex:Employee; 
    rdfs:range xsd:integer .

ex:hasPhoneNumber a owl:DatatypeProperty;
    rdfs:domain ex:Employee; 
    rdfs:range xsd:string .

ex:worksFor a owl:ObjectProperty;
    rdfs:domain ex:Employee; 
    rdfs:range ex:Company .

ex:hasContactPerson a owl:ObjectProperty;
    rdfs:domain ex:Company; 
    rdfs:range ex:Person .

SHACL Shapes:

ex:EmployeeShape
    a sh:NodeShape ;
    sh:targetClass ex:Employee ;        # Applies to all individuals of ex:Employee
    sh:property [                 
        sh:path ex:hasPhoneNumber ;           
        sh:datatype xsd:string ;
        sh:pattern "^\\d{8}$" ;         # Phone numbers are strings of 8 digits
    ] ;
    sh:property [                 
        sh:path ex:hasName ;           
        sh:minCount 1 ;                 # All employees must have at least one name
        sh:maxCount 1 ;                 # All employees must have at most one name
        sh:datatype xsd:string ;
        sh:pattern "^[A-Z][a-z]+$" ;    # Names start with a capital letter, followed by lower-case letters
    ] ;
    sh:property [
        sh:path ex:hasAge ;
        sh:minInclusive 16 ;            # Age is >= 16
        sh:maxInclusive 150 ;           # Age is <= 150
    ] ;
    sh:property [                 
        sh:path ex:worksFor ;
        sh:minCount 1 ;                 # All employees must work for a company
        sh:node ex:CompanyShape ;       # The object must conform to the ex:CompanyShape
    ] .

ex:CompanyShape
    a sh:NodeShape ;
    sh:targetClass ex:Company ;         # Applies to all individuals of ex:Company
    sh:property [                 
        sh:path ex:hasName ;
        sh:minCount 1 ;                 # All companies must have at least one name
        sh:datatype xsd:string ;        # But can be any string value
    ] .

Valid data:

ex:per rdf:type ex:Employee ;
    ex:hasPhoneNumber "12345678" ;
    ex:hasName "Per" ;
    ex:hasAge 32 ;
    ex:worksFor ex:UiO .

ex:kari rdf:type ex:Person ;       # Not checked (not ex:Employee)
    ex:hasPhoneNumber "98765432" ;
    ex:hasName "Kari" ;
    ex:worksFor ex:UiO .
    
ex:UiO rdf:type ex:Company ;
    ex:hasName "Universitetet i Oslo" ;
    ex:hasName "University of Oslo" .

Invalid data:

ex:peter rdf:type ex:Employee ;
    ex:hasPhoneNumber "12345678" ;
    ex:hasAge 32 ;
    ex:hasName "Per" .             # Missing ex:worksFor-relationship

ex:kari rdf:type ex:Employee ;
    ex:hasPhoneNumber 12345678 ;   # Wrong type
    ex:hasName "Kari" ;
    ex:hasName "Karry" ;           # Two names
    ex:worksFor ex:UiO .           # ex:UiO does not conform to the ex:CompanyShape
    
ex:UiO rdf:type ex:Company .       # No name

ex:NTNU rdf:type ex:Company ;
    ex:name "NTNU" .               # Wrong property (ex:name instead of ex:hasName)

Constraints in RDF: SHACL

  • The best way to ensure an RDF-graph is correct is by adding constraints directly to it
  • SHACL (Shapes Constraint Language) is a constraint language for RDF
  • Constraints are written in RDF and are called shapes
  • Specify the shape of the data by specifying the properties of paths through the graph
  • Specify target nodes for the constraints based on classes or properties
  • Can e.g. specify constraints on all members of a particular class
  • Shapes can reuse other shapes
  • Note: SHACL also has support for defining inference rules based on shapes
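
To make the idea of target nodes and property constraints concrete, here is a hand-rolled sketch of what a validator does for a tiny subset of SHACL (target class, minCount, maxCount, pattern). It is not a SHACL implementation: the dictionary encoding of the shape and the function validate are hypothetical, and the graph is a plain set of Python tuples.

```python
import re

RDF_TYPE = "rdf:type"

# A toy graph as (subject, predicate, object) tuples
graph = {
    ("ex:per", RDF_TYPE, "ex:Employee"),
    ("ex:per", "ex:hasName", "Per"),
    ("ex:per", "ex:hasPhoneNumber", "12345678"),
    ("ex:kari", RDF_TYPE, "ex:Employee"),
    ("ex:kari", "ex:hasName", "Kari"),
    ("ex:kari", "ex:hasName", "Karry"),
    ("ex:kari", "ex:hasPhoneNumber", "1234"),
}

# Hypothetical dictionary encoding of a node shape
employee_shape = {
    "targetClass": "ex:Employee",
    "properties": [
        {"path": "ex:hasName", "minCount": 1, "maxCount": 1},
        {"path": "ex:hasPhoneNumber", "pattern": r"^\d{8}$"},
    ],
}

def validate(graph, shape):
    """Return sorted (focus node, path, message) violations for one shape."""
    targets = {s for (s, p, o) in graph if p == RDF_TYPE and o == shape["targetClass"]}
    violations = []
    for node in targets:
        for prop in shape["properties"]:
            values = [o for (s, p, o) in graph if s == node and p == prop["path"]]
            if len(values) < prop.get("minCount", 0):
                violations.append((node, prop["path"], "too few values"))
            if "maxCount" in prop and len(values) > prop["maxCount"]:
                violations.append((node, prop["path"], "too many values"))
            if "pattern" in prop:
                for value in values:
                    if not re.search(prop["pattern"], value):
                        violations.append((node, prop["path"], f"bad value {value!r}"))
    return sorted(violations)

report = validate(graph, employee_shape)
for violation in report:
    print(violation)
```

The node ex:per passes, while ex:kari is reported twice: once for having two names and once for a phone number that does not match the pattern.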

SHACL vs. relational constraints: Limitations of relational constraints

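CREATE TABLE company(
    cid int PRIMARY KEY,
    cname text NOT NULL,
    contactPerson int    -- foreign key added below, once employee exists
);
CREATE TABLE employee(
    eid int PRIMARY KEY,
    ename text NOT NULL,
    age int,
    phone text,
    worksFor int NOT NULL REFERENCES company(cid)
);
-- company and employee reference each other, so this key is added last
ALTER TABLE company
    ADD FOREIGN KEY (contactPerson) REFERENCES employee(eid);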

SHACL vs. relational constraints: Limitations of relational constraints

ex:CompanyShape
    # [...]
    sh:property [
      sh:path ex:hasContactPerson ; 
      sh:minCount 1 ;               
      sh:and (                          # sh:and is logical conjunction of shapes
        ex:EmployeeShape                # Must be employee (Note: Recursive/cyclic)
        [                               # Must have a phone number
          sh:property [
              sh:path ex:hasPhoneNumber ; 
              sh:minCount 1 ;               
          ]
        ] 
      ) 
    ] .

SHACL vs. relational constraints: Triggers for relational constraints

CREATE FUNCTION contactperson_trig_fn() RETURNS TRIGGER AS
$$
BEGIN
  -- Reject the row if its contact person has no phone number
  IF (SELECT phone IS NULL
      FROM employee
      WHERE eid = NEW.contactPerson
     )
  THEN
    RAISE EXCEPTION 'Company % has contact person without phone.', NEW.cname;
  END IF;

  RETURN NEW;
END;
$$ language plpgsql;

CREATE TRIGGER contactperson_trig
BEFORE INSERT ON company
FOR EACH ROW EXECUTE PROCEDURE contactperson_trig_fn();
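
For experimenting without a PostgreSQL server, the same idea can be sketched with Python's built-in sqlite3 module. This is an approximation under stated assumptions: SQLite has no PL/pgSQL, so the check is written as a SELECT RAISE(ABORT, ...) inside the trigger body, the error message is a fixed string, and the two-table schema is cut down to the columns the trigger needs. The helper add_company is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee(eid INTEGER PRIMARY KEY, ename TEXT NOT NULL, phone TEXT);
CREATE TABLE company(cid INTEGER PRIMARY KEY, cname TEXT NOT NULL, contactPerson INTEGER);

-- Abort the INSERT when the referenced contact person has no phone number
CREATE TRIGGER contactperson_trig
BEFORE INSERT ON company
BEGIN
    SELECT RAISE(ABORT, 'Company has contact person without phone.')
    WHERE EXISTS (SELECT 1 FROM employee
                  WHERE eid = NEW.contactPerson AND phone IS NULL);
END;

INSERT INTO employee VALUES (1, 'Per', '98765432'), (2, 'Kari', NULL);
""")

def add_company(cid, cname, contact):
    """Try to insert one company row and report what the database decided."""
    try:
        conn.execute("INSERT INTO company VALUES (?, ?, ?)", (cid, cname, contact))
        return "inserted"
    except sqlite3.DatabaseError as err:
        return f"rejected: {err}"

ok = add_company(1, "UiO", 1)    # contact person Per has a phone number
bad = add_company(2, "NTNU", 2)  # contact person Kari has none
stored = conn.execute("SELECT COUNT(*) FROM company").fetchone()[0]
print(ok, bad, stored, sep="\n")
```

The trigger only guards inserts into company; a later update that removes Per's phone number would go unnoticed, which hints at how much trigger code a single cross-table constraint can require.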

Constraints via Queries

-- Checks that all contact persons have a phone number
-- Every answer is a violation of the constraint: "Every contact person must have a phone number."

SELECT 'Contact person for company ' || c.cname
       || ' does not have a phone number!' AS violation
FROM company AS c JOIN employee AS e ON (c.contactPerson = e.eid)
WHERE e.phone IS NULL;
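
A violation query of this kind is easy to run as a check after loading data. The sketch below uses Python's built-in sqlite3 module with made-up rows; it assumes that company has a contactPerson column pointing at employee.eid and joins on that column, so that only contact persons are inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee(eid INTEGER PRIMARY KEY, ename TEXT, phone TEXT);
CREATE TABLE company(cid INTEGER PRIMARY KEY, cname TEXT, contactPerson INTEGER);
INSERT INTO employee VALUES (1, 'Per', '98765432'), (2, 'Kari', NULL);
INSERT INTO company VALUES (10, 'UiO', 1), (20, 'NTNU', 2);
""")

# Every returned row is one violation; an empty result means the data is fine
VIOLATION_QUERY = """
SELECT 'Contact person for company ' || c.cname || ' does not have a phone number!'
FROM company AS c JOIN employee AS e ON c.contactPerson = e.eid
WHERE e.phone IS NULL
"""

violations = [row[0] for row in conn.execute(VIOLATION_QUERY)]
for message in violations:
    print(message)
```

Unlike a trigger, the query does not prevent bad data from being stored; it only reports it, so it has to be run deliberately, for instance as a step in the data pipeline.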

Constraints in RDF: SPARQL/SPIN

SELECT (CONCAT("ERROR: Missing name for ", STR(?p)) AS ?error)
WHERE {
    ?p rdf:type ex:Employee .
    FILTER NOT EXISTS { ?p ex:hasName ?n . }
}

Constraints vs. Semantics Revisited

Case 1:

# Mixing Turtle and OWL Manchester syntax here

:Company subClassOf
    :hasContactPerson some (:Employee and :hasPhoneNumber some :PhoneNr ).

:abc a :Company ;
    :hasContactPerson :mary .
:mary a :Employee ; :hasPhoneNumber [ rdf:type :PhoneNr ] .    # Inferred

Case 2:

:Company subClassOf
    :hasContactPerson some (:Employee and :hasPhoneNumber some :PhoneNr ).

:id rdf:type owl:InverseFunctionalProperty .

:abc :hasContactPerson :mary .
:mary a :Person ;
    :id "123" .
    
_:p :id "123" ;
    :hasPhoneNumber "98765432" .
:mary :hasPhoneNumber "98765432" .                 # Inferred

Note: Can specify that values should be non-blank in SHACL

Semantics as Constraints

  • Semantics gives meaning to data
  • Thus, certain combinations of statements can be considered contradictory
  • Contradictions are impossible in the real world
  • So the data (or the semantics) must be incorrect
  • Thus, a form of constraint on correctness
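
As a small illustration of the argument above, the sketch below treats a disjointness statement as semantics and reports every node whose types contradict it. The axiom that ex:Person and ex:Cat are disjoint, the example nodes and the function contradictions are all assumptions made for this sketch.

```python
from itertools import combinations

# Assumed semantics: pairs of classes that cannot share a member
DISJOINT = {frozenset({"ex:Person", "ex:Cat"})}

# Asserted types per node (toy data)
types = {
    "ex:mary": {"ex:Person", "ex:Employee"},
    "ex:tom": {"ex:Person", "ex:Cat"},
}

def contradictions(types, disjoint):
    """Return (node, class, class) for every pair of types declared disjoint."""
    found = []
    for node, classes in sorted(types.items()):
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in disjoint:
                found.append((node, a, b))
    return found

found = contradictions(types, DISJOINT)
print(found)
```

The report does not say which statement is wrong: either ex:tom has a wrong type, or the disjointness axiom is too strong for the domain.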

Semantics as Constraints

CREATE RELATION inconsistency(description text);

inconsistency('Company ' || cname || ' has contact person without phone number present!')
    <- person(pid, pname, phone), company(cid, cname, pid) : phone IS NULL;

inconsistency('IRI ' || p || ' is both an ex:Person and an ex:Cat!')
    <- rdf.type(p, qn('ex', 'Person')), rdf.type(p, qn('ex', 'Cat'));


CREATE FUNCTION inconsistency_fn() RETURNS trigger AS
$body$
BEGIN
  -- Any row derived into the inconsistency relation aborts the transaction
  RAISE EXCEPTION 'Inconsistency detected: %', NEW.description;
  RETURN NEW;
END;
$body$ language plpgsql;

CREATE TRIGGER inconsistency_trigger
BEFORE INSERT ON inconsistency
FOR EACH ROW EXECUTE PROCEDURE inconsistency_fn();

Semantics as Constraints

# (Mixing Turtle and OWL Manchester syntax here)

# Try to state "all companies MUST have a contact person that has a phone number"

:Company subClassOf
    :hasContactPerson some (:Employee and :hasPhoneNumber some :PhoneNr ).

:abc a :Company ;
    :hasContactPerson :mary .
:mary a :Employee ; :hasPhoneNumber [ rdf:type :PhoneNr ] .      # Inferred

Constraints and Open/Closed World

What do Constraints Really Check?

  • Obviously, correctness of (shape of) data
  • However, data is produced by a (possibly complex) pipeline
  • Thus, also checks correctness of
    • mappings (transformations and cleaning)
    • semantics
    • integration
    • assumptions about data
  • Fails at insert/processing step instead of at runtime
  • Points directly to what went wrong

When to Define Constraints

Types

Temporal data and types

  • Data about time, i.e. when something happened
  • Can be points or intervals on the time axis
  • Always have the special point now
    • Partitioning the axis into past, present and future
  • Examples:
    • Dates
    • Timestamps, with or without timezone (e.g. 2021-01-21 10:15:00+01)
    • Unix time and other relative measures
    • Epochs, eras, geological time, astronomical time, etc.
  • Valid time vs. transactional time

Complexity of Temporal data

  • Difficult to measure time
    • No absolute scale
    • Typically measured with respect to something (e.g. position of sun, moon)
    • Depends on location, so values must be translated (e.g. between time zones)
    • Relativity (time depends on gravity, speed, etc.)
    • Different scales: Astronomical, geological, historical, daily, nano-scale
  • Contains discontinuities, ambiguities, etc.
    • Daylight savings time
    • Leap years and seconds
    • Reforms of calendars (e.g. 10 days skipped in the Gregorian calendar: 04.10.1582 was followed by 15.10.1582; New Year’s Day moved, etc.)
    • Start of week (Sunday in US, Monday in Europe) and week numbers
  • The type system removes many of these pains
    • Timezone-aware types
    • Operations take edge cases into account
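
A short sketch of what a type system for time buys you, using only Python's standard datetime module (the concrete timestamps are chosen for illustration). Timezone-aware values compare by instant rather than by wall-clock text, date arithmetic respects leap years, and mixing aware and naive values is refused instead of silently giving a wrong answer.

```python
from datetime import date, datetime, timedelta, timezone

# The same instant written with two different UTC offsets
oslo = datetime(2021, 1, 21, 10, 15, tzinfo=timezone(timedelta(hours=1)))
utc = datetime(2021, 1, 21, 9, 15, tzinfo=timezone.utc)
same_instant = oslo == utc  # aware datetimes compare by instant

# Date arithmetic knows about leap years
after_feb28_2024 = date(2024, 2, 28) + timedelta(days=1)  # 2024 is a leap year
after_feb28_2023 = date(2023, 2, 28) + timedelta(days=1)  # 2023 is not

# Ordering a timezone-aware value against a naive one is rejected
try:
    oslo < datetime(2021, 1, 21, 10, 15)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(same_instant, after_feb28_2024, after_feb28_2023, mixed_ok)
```

Storing the same values as plain strings or integers would leave all three of these checks to the application.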

Spatial data

  • Data about location and extent
  • Typically points, lines, polygons, polyhedra, etc.
    • E.g. POINT(1.0 2.0), LINESTRING(1.0 2.0, 2.1 3.2, 4.1 7.3)
  • Also multi-point, multi-lines, etc.
  • Special constant: here (the current location)
  • Examples:
    • Geographic and map data
    • Models of objects (cars, organs, etc.)
    • Geological, astronomical, archaeological, etc.

Complexity of Spatial data

  • Lots of functions, operations and relations
  • Complicated algorithms
  • Use of floats complicates computations
    • E.g. might be impossible to construct the exact intersection of two intersecting objects
  • We live on a globe, which complicates maps (e.g. multiple projections)
  • Contains lots of implicit data
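
The float problem can be seen with a single point-on-line test. In the sketch below (function and variable names are illustrative), the point (0.1 + 0.2, 0.3) lies on the line y = x mathematically, but an exact test on floats rejects it; a tolerance or exact rational arithmetic is needed.

```python
from fractions import Fraction

def on_line(a, b, p, tolerance=0):
    """True if p lies on the line through a and b, using a cross-product test."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) <= tolerance

a, b = (0, 0), (1, 1)  # the line y = x

# With floats, 0.1 + 0.2 is not exactly 0.3, so the exact test fails
p_float = (0.1 + 0.2, 0.3)
exact_float = on_line(a, b, p_float)
tolerant_float = on_line(a, b, p_float, tolerance=1e-9)

# With exact rationals the same point passes the exact test
p_exact = (Fraction(1, 10) + Fraction(2, 10), Fraction(3, 10))
exact_rational = on_line(a, b, p_exact)

print(exact_float, tolerant_float, exact_rational)
```

Tolerances fix the symptom but introduce their own edge cases (how close is close enough?), which is one reason spatial libraries are complicated.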

Spatial data in Query Languages

  • Spatial data is complex, need extensions
  • PostGIS is a state-of-the-art geospatial extension for PostgreSQL
  • GeoSPARQL is a similar extension for SPARQL
  • Often unsupported or only partially supported by triplestores/SPARQL implementations
  • Otherwise, need to translate quantitative geospatial data into qualitative
    • A useful exercise
  • Also plays better with semantics
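
Translating quantitative data into qualitative relations can be sketched as follows. The example is deliberately simplified: geometries are reduced to axis-aligned bounding boxes, only four relations are distinguished (touching boxes count as overlapping), and the coordinates, the ex: names and the function relation are invented for illustration.

```python
from typing import NamedTuple

class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float

def relation(a: Box, b: Box) -> str:
    """Name the qualitative relation from box a to box b."""
    if a.max_x < b.min_x or b.max_x < a.min_x or a.max_y < b.min_y or b.max_y < a.min_y:
        return "disjoint"
    if b.min_x <= a.min_x and a.max_x <= b.max_x and b.min_y <= a.min_y and a.max_y <= b.max_y:
        return "within"
    if a.min_x <= b.min_x and b.max_x <= a.max_x and a.min_y <= b.min_y and b.max_y <= a.max_y:
        return "contains"
    return "overlaps"

boxes = {
    "ex:city": Box(2, 2, 3, 3),
    "ex:country": Box(0, 0, 10, 10),
    "ex:neighbour": Box(8, 5, 15, 12),
    "ex:island": Box(20, 20, 21, 21),
}

# Emit one qualitative triple per ordered pair of distinct regions
triples = [
    (s, "ex:" + relation(boxes[s], boxes[o]), o)
    for s in boxes for o in boxes if s != o
]
for triple in triples:
    print(triple)
```

The resulting triples contain no coordinates at all, so they can be queried with plain SPARQL and combined with ordinary ontology axioms, for example stating that ex:within is transitive.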