Evolution of the Data Landscape

From 2000 onward we have had a sharp increase in:

Digitization (analog to digital content)
Digitalization (processes centered around digital technologies)
User-contributed content
Sensors and sensing devices (mobiles, cameras, satellites, etc.)

Evolution of the Data Landscape

Managing data size and complexity has become difficult in many areas, such as:

Big companies
Government
Healthcare
Science
Media

“Big Data”

Big Data is characterized by:

Volume: Lots of data
Velocity: Generated at high speed
Variety: Different formats, structures and types of content
Veracity: Varying greatly in quality

“Big Data”

Introduces the need for data-centric roles

Data analyst
Database administrator
Data scientist
Data engineer
…

What is Data Engineering?

From Quanthub:

“Data engineers design and build pipelines that transform and transport data into a format wherein, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect them into a single warehouse that represents the data uniformly as a single source of truth.”

By me:

Data engineering is the art of creating data pipelines that turn disparate (possibly unstructured and messy) data sets into a coherent data source that is usable by data scientists, computer programs, and other data consumers.

Data Engineering vs. Data Science

A data scientist produces insights from data, typically by applying statistics, machine learning or other methods of analysis
A data engineer produces the data consumed by data scientists
Data engineering is more technology-focused
Data science is more model-focused

Declarative?

A sentence is declarative if it makes something known
- E.g. “It is snowing outside now.”
A sentence is imperative if it commands someone/something to do something
- E.g. “Do your homework!”

Imperative languages in computer science

Imperatively ask for a glass of water:

Could you please go 2 meters to the left, stretch your arm out, pull the door handle down and towards you. Then go through the door, turn left, go 4 meters forwards, turn right, …, and let go of the glass.

Focus on stating what the computer should do
Which steps to take to arrive at a solution to a problem
Examples:
- Scripting languages (Python, Lua, Bash, etc.)
- Most programming languages (C, Java, etc.)
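
As a small illustration of the imperative style, the sketch below spells out every step needed to compute an average per group. The data set and the names `readings` and `average_per_city` are made up for this example and are not part of the course material.

```python
# Hypothetical sample data: (city, temperature) readings.
readings = [("Oslo", 3.0), ("Bergen", 6.0), ("Oslo", 5.0), ("Bergen", 8.0)]


def average_per_city(rows):
    """Imperative style: state each step the computer should take."""
    totals = {}
    counts = {}
    # Step 1: walk through the rows and accumulate sums and counts.
    for city, temp in rows:
        if city not in totals:
            totals[city] = 0.0
            counts[city] = 0
        totals[city] += temp
        counts[city] += 1
    # Step 2: divide each sum by its count.
    averages = {}
    for city in totals:
        averages[city] = totals[city] / counts[city]
    return averages


result = average_per_city(readings)
print(result)
```

Note that the code says how to reach the answer (loop, accumulate, divide), not only which answer is wanted.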

Declarative languages in Computer Science

Declaratively ask for a glass of water:

Water is liquid H₂O and a glass is melted sand shaped so that its content doesn’t pour out. Could you get me a glass of water, please?

Focus on stating what is true/should be true
Describing a problem, a domain or the properties of a desired solution
Examples:
- Query languages (SQL, SPARQL, Cypher, Datalog, etc.)
- Build languages (Maven, Ant, Make, etc.)
- Markup languages (LaTeX, HTML, etc.)
- Modeling languages (ER, ORM, OWL, etc.)
- Some programming languages (Prolog, Haskell, etc.)
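
As a small illustration of the declarative style, the sketch below uses SQL (run through Python's built-in `sqlite3` module) to state which result is wanted: the average temperature per city. The table `reading` and its contents are made up for this example. How the grouping and averaging is carried out is left to the database engine.

```python
import sqlite3

# Hypothetical sample data: (city, temperature) readings.
readings = [("Oslo", 3.0), ("Bergen", 6.0), ("Oslo", 5.0), ("Bergen", 8.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (city TEXT, temp REAL)")
conn.executemany("INSERT INTO reading VALUES (?, ?)", readings)

# Declarative style: describe the desired result, not the steps to compute it.
query = """
    SELECT city, AVG(temp) AS avg_temp
    FROM reading
    GROUP BY city
    ORDER BY city
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

The query contains no loops or intermediate variables; it only describes the properties of the answer.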

Declarative Data Engineering

Data engineering focusing on declarative techniques
Pipelines with declaratively defined components
Components state what should hold/become true
Why?
- Easier to involve domain experts
- Easier to change encoding of knowledge
- Solutions often more general
- Can be more efficient
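
One possible reading of "declaratively defined components" is sketched below, under assumptions: two SQL views, one stating what a city reading is and one stating which readings violate a constraint. The schema (`station`, `reading`) and the view names are hypothetical and chosen only for illustration. Python's built-in `sqlite3` module is used just to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source data.
    CREATE TABLE station (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE reading (station_id INTEGER, temp REAL);
    INSERT INTO station VALUES (1, 'Oslo'), (2, 'Bergen');
    INSERT INTO reading VALUES (1, 3.0), (2, 6.0), (3, 5.0);

    -- Component 1: states what a "city reading" is.
    CREATE VIEW city_reading AS
        SELECT s.city, r.temp
        FROM reading r JOIN station s ON r.station_id = s.id;

    -- Component 2: states what should hold (every reading has a known
    -- station) by describing the rows that violate it.
    CREATE VIEW orphan_reading AS
        SELECT r.station_id, r.temp
        FROM reading r
        WHERE NOT EXISTS (SELECT 1 FROM station s WHERE s.id = r.station_id);
""")

city_readings = conn.execute(
    "SELECT city, temp FROM city_reading ORDER BY city"
).fetchall()
violations = conn.execute(
    "SELECT station_id, temp FROM orphan_reading"
).fetchall()
conn.close()

print(city_readings)
print(violations)
```

Changing the encoded knowledge here means editing a view definition, not rewriting procedural code, which is one way the "easier to change" point above can play out.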

Technologies for Data Engineering

A data engineer typically uses:

Database management systems
Modeling languages
Query languages
Programming/scripting languages
Cloud platforms and high-performance computing

Technologies of focus in IN5800

In this course, we will focus on:

Databases (Relational, triple/graph)
Semantic/Data modeling (rules, OWL, RDFS, OTTR)
Query languages (SQL, SPARQL, Datalog)
Validation (constraints, SHACL)
Programming/scripting languages (Bash, Make)

…no Python and Pandas?

Python/Pandas commonly used for (imperative) data engineering
We focus on declarative techniques
Semantic technologies, business intelligence, etc. often use Java
Sometimes still need imperative code (can use Python/Pandas)
Techniques/concepts from IN5800 still transferable to Python/Pandas-based projects
Will use Bash and Make as “glue”
- Mostly just invoking commands in a chain
- Make handles dependencies/structure
- Cross-platform, easily available

Techniques for Data Engineering

Clean/Transform/Structure [IN5800]
Saturation [IN5800]
Integration [IN5800]
Aggregation/Abstraction [IN5800]
Cloud and scaling [IN3020, IN3200, IN5040, IN5050]
Data analysis/science [IN3050, IN3310, IN-STK5000, IN-STK5100]
Data security [IN3210]

Data Pipelines

These techniques are applied as components in a pipeline
Pipeline can be applied whenever source data is updated
- discretely (e.g. daily, weekly)
- continuously (real-time data)
Makes the resulting data reproducible, updateable, shareable, etc.
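
A minimal sketch of the idea, assuming a toy pipeline with three made-up components (`clean`, `deduplicate`, `count`): the dependencies between components are declared as data, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) derives the execution order, loosely similar to how Make orders targets. The step functions are simple stand-ins, not real cleaning or integration techniques.

```python
from graphlib import TopologicalSorter


# Hypothetical components; each reads earlier results from `data`.
def clean(data):
    # Trim whitespace, lowercase, and drop empty rows.
    return [row.strip().lower() for row in data["source"] if row.strip()]


def deduplicate(data):
    # Remove duplicates and return the values in sorted order.
    return sorted(set(data["clean"]))


def count(data):
    # Aggregate: number of distinct values.
    return len(data["deduplicate"])


STEPS = {"clean": clean, "deduplicate": deduplicate, "count": count}

# Declared dependencies: each component maps to what it needs as input.
DEPENDS_ON = {
    "clean": {"source"},
    "deduplicate": {"clean"},
    "count": {"deduplicate"},
}


def run_pipeline(source):
    """Run all components in dependency order on the given source data."""
    data = {"source": source}
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        if name in STEPS:  # "source" is an input, not a component
            data[name] = STEPS[name](data)
    return data


result = run_pipeline([" Oslo", "bergen ", "", "OSLO"])
print(result["deduplicate"], result["count"])
```

Whenever the source data is updated, calling `run_pipeline` again on the new source reproduces the whole result from scratch.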

Course Overview

Date	Lecture	Work
26.01	Intro
02.02	Data structure
09.02	Query languages
16.02	Views and rules
23.02	Semantics and reasoning
02.03	Templates
09.03	Mapping languages	[O]
16.03	Constraints	[O]
23.03	Transforming/Structuring	[O]
30.03	Oblig solution
13.04	Saturation
20.04	Integration
27.04	Ontology engineering	[P]
04.05	Cleaning and Validation	[P]
11.05	Pipelines	[P/Pr]
25.05	Project presentations	[P/Pr]
01.06	(no lecture)	[P]

[O] – Oblig

[P] – Project work

[Pr] – Presentation work

(See semester plan for details)

Mandatory Assignment

Focused on technologies
2-3 weeks
Two attempts
Needs to be passed
Similar structure to the project work

Project work

Pass/fail
Make a declarative data pipeline on real-world data
- Given two or more data sets
- Answer questions that require combining data
- Documented and made as a proper project
Will have (at least) one feedback cycle
Group work possible (1-2 students per group)
Real-world data implies complexity
- Might not be able to accomplish a proper integration/saturation/etc.
- Failure is also useful, and can be accepted
- Document the reason for failure

Presentation

Each student/group will make a presentation
Do not present everything you have done
Should pick one or two interesting insights, surprises, results, etc.
10-15 mins.
Pass/fail

Disclaimer!

I am not a professional data engineer
But I know the theory, the technologies and the techniques
Experience from research projects with e.g. Aibel, Grundfos, Artsdatabanken
Data engineering is an evolving field
I learn by doing, teaching and research
Correct me if I am wrong!
If any of you have experience, I would love to hear it!
Want this to be discussion-based

Lectures

Practical, example-based
Interaction and discussion
May also have guests or guest lectures

Lecture slides

Made with Slidy
Keybindings
- A – read mode
- B – bigger text
- S – smaller text
See footer for table of contents and help

Course Wiki

The course wiki will

contain info on technologies and techniques relevant for the course
contain a page for each lecture
- slides
- exercises
- relevant pointers
contain pointers to guides/tutorials/useful resources
be updated continuously

Mattermost

We will use Mattermost to:

Ask questions
Create discussion
Chat

General expectations

I expect you to

continuously work with the course
learn the technologies and techniques as they are presented
actively participate in discussions

IN5800 – Introduction and Overview

IN5800 – Declarative Data Engineering

Homework for next week