IN5800 – Data Pipelines and Architectures

Leif Harald Karlsen

Data pipelines and architectures

Principles for data pipelines

Principles of pipelines: From software engineering

  • Separation of concerns/modularity
    • Break code into distinct parts
    • Split pipeline into multiple semi-independent parts
    • Each part should be executable on its own
    • For example:
      • Make-rules coupled via dependencies
      • Functions and function calls
  • Encapsulation
    • Hide complexity from caller/user
    • Only relevant details should be visible from outside a module/part
  • DRY (Don't repeat yourself)
    • Move repeated logic into a single callable entity (function, rule, etc.)
    • E.g. IRI construction, unit conversions, etc. (see the sketch after this list)
  • How well have our pipelines followed these principles?
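
A minimal Python sketch of these principles, assuming a hypothetical CSV source and IRI namespace: the IRI construction lives in one function (DRY), and extraction and transformation are separate steps that can be run independently (modularity).

    # Hypothetical names throughout; only the structure matters.
    import csv

    BASE = "http://example.org/resource/"   # assumed namespace

    def make_iri(entity_type: str, local_id: str) -> str:
        """Single place that decides how IRIs are constructed (DRY)."""
        return f"{BASE}{entity_type}/{local_id}"

    def extract(path: str) -> list[dict]:
        """Independent step: read raw rows from a CSV file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows: list[dict]) -> list[dict]:
        """Independent step: add IRIs, reusing the shared helper."""
        return [{**row, "iri": make_iri("person", row["id"])} for row in rows]

    if __name__ == "__main__":
        print(transform(extract("people.csv")))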

Replayability

  • Can execute a pipeline easily and repeatedly
  • Simple to execute (e.g. a single Make-rule)
  • Idempotent: executing the pipeline repeatedly gives the same result (sketched after this list)
  • Enables experimentation
  • Easier to reproduce data in case of corruption, systems going down, etc.
  • Easier to test (testing vs. production data)
  • How replayable have our pipelines been?
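
A minimal sketch of an idempotent load step, with assumed table and file names: because the target table is rebuilt from scratch on every run, the step can be replayed any number of times with the same result.

    import csv
    import sqlite3

    def load_people(db_path: str = "warehouse.db", src: str = "people.csv") -> None:
        """Idempotent: rerunning always leaves the database in the same state."""
        con = sqlite3.connect(db_path)
        try:
            con.execute("DROP TABLE IF EXISTS people")             # wipe previous run
            con.execute("CREATE TABLE people (id TEXT, name TEXT)")
            with open(src, newline="") as f:
                rows = [(r["id"], r["name"]) for r in csv.DictReader(f)]
            con.executemany("INSERT INTO people VALUES (?, ?)", rows)
            con.commit()
        finally:
            con.close()

    if __name__ == "__main__":
        load_people()   # safe to execute over and over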

Scalability

  • Scale with increasing amounts of data (or be easy to make scale)
  • Use buffered reading/writing (sketched after this list)
  • Use scalable components and platforms
    • DBs instead of files
    • Cloud vs. own server
    • Distributed vs. centralized
  • Make it easy to move (cross-platform, containerized, etc.)
  • How scalable have our pipelines been?
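
A minimal sketch of buffered reading, with an assumed file name and chunk size: rows are processed a bounded chunk at a time, so memory use stays constant as the input grows.

    import csv

    def chunks(path: str, size: int = 10_000):
        """Yield lists of at most `size` rows instead of loading the whole file."""
        with open(path, newline="") as f:
            chunk = []
            for row in csv.DictReader(f):
                chunk.append(row)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:
                yield chunk

    if __name__ == "__main__":
        for rows in chunks("measurements.csv"):
            print(len(rows), "rows processed")   # transform/load one chunk at a time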

Auditability

Reliability

Flexibility

Testing and development

  • Just like software, pipelines are typically developed in a “safe” environment
  • Testing environment/data vs. production environment/data (configuration sketch below)
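
A minimal configuration sketch, where the environment variable and connection strings are assumptions: the same pipeline code runs against either the testing or the production database, selected by a single switch, and defaults to the safe environment.

    import os

    DATABASES = {
        "test": "postgresql://localhost/pipeline_test",     # assumed URLs
        "production": "postgresql://dbserver/pipeline",
    }

    def db_url() -> str:
        """Pick the database from an (assumed) environment variable."""
        return DATABASES[os.environ.get("PIPELINE_ENV", "test")]

    if __name__ == "__main__":
        print("Running pipeline against", db_url())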

Updates

Canonical data architectures

Canonical architectures: Data warehouse

  • Store everything in one DB
  • Integrates/combines data from production DBs
  • Intended for analysis/decision making
  • Typically lives alongside other DBs
  • Highly structured
  • Data mart: A smaller, more domain-specific variant
  • Main approaches: Normalized or dimensional (Star schema)

Data warehouse: Normalized approach

Data warehouse: Dimensional approach

  • The most common is perhaps the Star schema
  • Tailored for particular use (reporting, analytics, etc.)
  • Split data into fact tables and dimension tables (see the sketch after this list)
  • Fact tables:
    • Record a concrete fact
    • Typically numerical values
    • Foreign keys to one or more dimension tables
    • Can also be aggregates
  • Dimension tables:
    • Describe values/entities of a particular type in more detail
    • Often denormalized
  • Advantages:
    • Efficient queries
    • Simpler schema
  • Disadvantages:
    • Difficult to construct (more complex pipeline)
    • Difficult to modify/add new types of information
    • Inflexible for analytical purposes beyond the intended use
  • Normalized dimensional variant: Snowflake schema
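
A minimal star-schema sketch with invented table and column names: one fact table holding numerical measures and foreign keys, and two denormalized dimension tables, queried with the typical fact-to-dimension joins.

    import sqlite3

    DDL = """
    CREATE TABLE dim_date (
        date_id    INTEGER PRIMARY KEY,
        day        INTEGER, month INTEGER, year INTEGER
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT, category TEXT      -- denormalized: category stored inline
    );
    CREATE TABLE fact_sales (
        date_id    INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity   INTEGER,                 -- numerical measures
        amount     REAL
    );
    """

    con = sqlite3.connect(":memory:")
    con.executescript(DDL)
    # Typical star-schema query: join the fact table with its dimensions.
    con.execute("""
        SELECT d.year, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d ON f.date_id = d.date_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY d.year, p.category
    """)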

Dimensional approach: Example

Canonical architectures: Data lake

  • Store everything in one central repository (not necessarily a DB)
  • Store everything in its raw form (documents, logs, etc.)
  • No need to know in advance how the data will be used
  • Transform/clean/integrate only when needed (see the sketch after this list)
  • Attempts to remove “data silos”
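
A minimal sketch of landing raw data in a lake, where the directory layout and names are assumptions: files are copied unchanged into a partitioned raw area, and any cleaning or integration is deferred until the data is actually needed.

    import shutil
    from datetime import date
    from pathlib import Path

    LAKE_ROOT = Path("datalake/raw")       # assumed lake location

    def ingest_raw(src: str, source_system: str) -> Path:
        """Store the file as-is, partitioned by source system and ingestion date."""
        target_dir = LAKE_ROOT / source_system / date.today().isoformat()
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / Path(src).name
        shutil.copy2(src, target)          # no parsing, cleaning or schema enforcement
        return target

    if __name__ == "__main__":
        print(ingest_raw("server.log", "webshop"))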

Canonical architectures: Data mesh