IN5800 – Introduction and Overview

Leif Harald Karlsen

IN5800 – Declarative Data Engineering

Welcome to a relatively new course!

Evolution of the Data Landscape

From 2000 onward we have had a sharp increase in:

  • Digitization (analog to digital content)
  • Digitalization (processes centered around digital technologies)
  • User contributed content
  • Sensors and sensing devices (mobiles, cameras, satellites, etc.)

Evolution of the Data Landscape

Managing data size and complexity is difficult in many areas, such as:

  • Big companies
  • Government
  • Healthcare
  • Science
  • Media

“Big Data”

Big Data is characterized by:

  • Volume: Lots of data
  • Velocity: Generated at high speed
  • Variety: Different formats, structures and types of content
  • Veracity: Varying greatly in quality

“Big Data”

Introduces the need for data-centric roles:

  • Data analyst
  • Database administrator
  • Data scientist
  • Data engineer

What is Data Engineering?

From Quanthub:

“Data engineers design and build pipelines that transform and transport data into a format wherein, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect them into a single warehouse that represents the data uniformly as a single source of truth.”

By me:

Data engineering is the art of creating data pipelines that produce, from disparate (possibly unstructured and messy) data sets, a coherent data source usable by data scientists, computer programs, and other data consumers.

Data Engineering vs. Data Science

Declarative?

Imperative languages in computer science

Imperatively ask for a glass of water:

Could you please go 2 meters to the left, stretch your arm out, pull the door handle down and towards you. Then go through the door, turn left, go 4 meters forwards, turn right, …, and let go of the glass.

Declarative languages in Computer Science

Declaratively ask for a glass of water:

Water is liquid H2O and a glass is melted sand shaped so that its content doesn’t pour out. Could you get me a glass of water, please?
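The same contrast can be sketched in code. The example below is only an illustration, not material from the course: the sample data, the table name `drink`, and both function names are hypothetical. The imperative version spells out every step of the computation, while the declarative version states which result is wanted (as an SQL query, run through Python's built-in `sqlite3` module) and leaves the "how" to the database engine.

```python
import sqlite3

# Hypothetical sample data: (name, volume in milliliters)
drinks = [("water", 300), ("juice", 200), ("water", 500)]


def total_water_imperative(rows):
    """Imperative: describe HOW to compute the result, step by step."""
    total = 0
    for name, milliliters in rows:
        if name == "water":
            total += milliliters
    return total


def total_water_declarative(rows):
    """Declarative: describe WHAT is wanted; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE drink (name TEXT, milliliters INTEGER)")
    conn.executemany("INSERT INTO drink VALUES (?, ?)", rows)
    (total,) = conn.execute(
        "SELECT SUM(milliliters) FROM drink WHERE name = 'water'"
    ).fetchone()
    conn.close()
    return total


if __name__ == "__main__":
    print(total_water_imperative(drinks))   # 800
    print(total_water_declarative(drinks))  # 800
```

Both functions return the same answer; only the declarative one leaves room for the engine to choose its own evaluation strategy.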

Declarative Data Engineering

Technologies for Data Engineering

A data engineer typically uses:

Technologies of focus in IN5800

In this course, we will focus on:

…no Python and Pandas?

Techniques for Data Engineering

Data Pipelines

  • These techniques are applied as components in a pipeline
  • The pipeline can be rerun whenever the source data is updated
    • discretely (e.g. daily, weekly)
    • continuously (real-time data)
  • Makes the resulting data reproducible, updateable, shareable, etc.
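A minimal sketch of this idea, assuming a made-up source format of `name;value` lines: each technique is a small function, and the pipeline is just the list of those functions applied in order. The step names (`parse`, `clean`, `structure`) and the record layout are hypothetical and chosen only for illustration. Because every step is deterministic, rerunning the pipeline on the same source reproduces the same result, and rerunning it on updated source data yields the updated result.

```python
from functools import reduce


def parse(lines):
    """Split raw 'name;value' lines into (name, value) string pairs."""
    records = []
    for line in lines:
        name, _, value = line.partition(";")
        records.append((name.strip(), value.strip()))
    return records


def clean(records):
    """Drop records with a missing name or a non-numeric value."""
    return [(name, value) for name, value in records if name and value.isdigit()]


def structure(records):
    """Turn the cleaned pairs into uniform, typed records."""
    return [{"sensor": name, "reading": int(value)} for name, value in records]


# The pipeline is an ordered list of components (hypothetical steps).
PIPELINE = [parse, clean, structure]


def run_pipeline(source, steps=PIPELINE):
    """Feed the source through every step, in order."""
    return reduce(lambda data, step: step(data), steps, source)


if __name__ == "__main__":
    source = ["a; 1", "b;x", ";3", "c;20"]
    print(run_pipeline(source))
    # The source is updated (e.g. a daily delivery): simply rerun the pipeline.
    source.append("d;7")
    print(run_pipeline(source))
```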

Course Overview

Date   Lecture                    Work
26.01  Intro
02.02  Data structure
09.02  Query languages
16.02  Views and rules
23.02  Semantics and reasoning
02.03  Templates
09.03  Mapping languages          [O]
16.03  Constraints                [O]
23.03  Transforming/Structuring   [O]
30.03  Oblig solution
13.04  Saturation
20.04  Integration
27.04  Ontology engineering       [P]
04.05  Cleaning and Validation    [P]
11.05  Pipelines                  [P/Pr]
25.05  Project presentations      [P/Pr]
01.06  (no lecture)               [P]

[O] – Oblig

[P] – Project work

[Pr] – Presentation work

(See semester plan for details)

Mandatory Assignment

Project work

Presentation

Disclaimer!

  • I am not a professional data engineer
  • I know the theory, the technologies and the techniques
  • Experience from research projects with e.g. Aibel, Grundfos, Artsdatabanken
  • Data engineering is an evolving field
  • Learn by doing/teaching/research
  • Correct me if I am wrong!
  • If any of you have experience, I would love to hear it!
  • Want this to be discussion-based

Lectures

Lecture slides

Course Wiki

The course wiki will

Mattermost

We will use Mattermost

General expectations

I expect you to

Homework for next week

Read up on the fundamental technologies for this course: