IN5800 – Data Pipelines and Architectures
Leif Harald Karlsen
Data pipelines and architectures
Principles for data pipelines
- Separation of concerns/modularity
- Encapsulation
- DRY (Don't Repeat Yourself)
- Replayability
- Scalability
- Auditability
- Reliability
Principles of pipelines: From software engineering
- Separation of concerns/modularity
- Break code into distinct parts
- Split pipeline into multiple semi-independent parts
- Should be able to execute each part independently
- For example:
- Make-rules coupled via dependencies
- Functions and function calls
- Encapsulation
- Hide complexity from caller/user
- Only relevant details should be visible from outside a module/part
- DRY (Don't Repeat Yourself)
- Move repeated logic into a single callable entity (function, rule, etc.)
- E.g. how to construct IRIs, translation of units, etc. (see the sketch after this list)
- How well have our pipelines followed these principles?
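A minimal sketch of these principles in a single pipeline step, in Python; the namespace, field names, and units are made up for illustration:

    # Hypothetical helpers illustrating DRY, modularity and encapsulation.
    BASE = "http://example.org/data/"  # assumed namespace

    def make_iri(entity_type, local_id):
        """Single place that decides how IRIs are constructed."""
        return f"{BASE}{entity_type}/{local_id}"

    def kmh_to_ms(speed_kmh):
        """Single place for the unit conversion (1 m/s = 3.6 km/h)."""
        return speed_kmh / 3.6

    def transform_row(row):
        """One independent pipeline step, reusing the helpers above."""
        return {
            "station": make_iri("station", row["station_id"]),
            "wind_speed_ms": kmh_to_ms(row["wind_speed_kmh"]),
        }

    print(transform_row({"station_id": "blindern", "wind_speed_kmh": 18.0}))

Each helper hides its details from the caller, and changing e.g. the IRI scheme only touches one function.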
Replayability
- Can execute a pipeline easily and repeatedly
- Simple to execute (e.g. a single Make-rule)
- Idempotent (executing the pipeline over and over gives the same result; see the sketch after this list)
- Enables experimentation
- Easier to reproduce data in case of corruption, systems going down, etc.
- Easier to test (testing vs. production data)
- How replayable have our pipelines been?
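A minimal sketch of an idempotent load step, assuming SQLite and a made-up observations table; because the table is dropped and rebuilt, running the step once or many times gives the same result:

    import sqlite3

    def load_observations(rows, db_path="pipeline.db"):
        """Idempotent load: re-running it always gives the same table."""
        con = sqlite3.connect(db_path)
        with con:  # one transaction for the whole step
            con.execute("DROP TABLE IF EXISTS observations")
            con.execute("CREATE TABLE observations (station TEXT, wind_speed_ms REAL)")
            con.executemany("INSERT INTO observations VALUES (?, ?)", rows)
        con.close()

    load_observations([("blindern", 5.0), ("lindeberg", 3.2)])
    load_observations([("blindern", 5.0), ("lindeberg", 3.2)])  # same result as running it once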
Scalability
- Scale with increase in data (or be easy to make scale)
- Use buffered reading/writing (see the sketch after this list)
- Use scalable components and platforms
- DBs instead of files
- Cloud vs. own server
- Distributed vs. centralized
- Make it easy to move (cross-platform, containerized, etc.)
- How scalable have our pipelines been?
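A minimal sketch of buffered reading/writing, assuming a large CSV file with made-up column names; rows are streamed through in fixed-size batches instead of being loaded into memory all at once:

    import csv

    def convert_file(in_path, out_path, batch_size=10_000):
        """Stream a large CSV through the pipeline in fixed-size batches."""
        with open(in_path, newline="") as inp, open(out_path, "w", newline="") as out:
            reader = csv.DictReader(inp)
            writer = csv.writer(out)
            writer.writerow(["station", "wind_speed_ms"])
            batch = []
            for row in reader:
                batch.append([row["station"], float(row["wind_speed_kmh"]) / 3.6])
                if len(batch) >= batch_size:
                    writer.writerows(batch)  # flush one batch at a time
                    batch = []
            writer.writerows(batch)          # remaining rows

    # Example usage: convert_file("raw_weather.csv", "clean_weather.csv")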
Auditability
- Clear what was done, when it was done, and what the result was
- Readable and clear code
- Make results easy to inspect
- Logging and provenance (see the sketch after this list)
- Where the data came from (the source)
- Who/what made it
- When it was made
- How auditable have the pipelines we have seen in this course been?
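A minimal sketch of logging and provenance, where every produced row records its source, the agent that made it, and when; the source URL and script name are placeholders:

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def with_provenance(row, source):
        """Attach provenance to every row the pipeline produces."""
        row["prov_source"] = source                                # where the data came from
        row["prov_agent"] = "load_weather.py"                      # who/what made it
        row["prov_time"] = datetime.now(timezone.utc).isoformat()  # when it was made
        return row

    row = with_provenance({"station": "blindern", "wind_speed_ms": 5.0},
                          source="https://example.org/weather-api")
    log.info("Loaded %s from %s", row["station"], row["prov_source"])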
Reliability
- Fault-tolerant, stop execution on error, etc.
- E.g. wrap DB calls in transactions (see the sketch after this list)
- Do not execute a step if it has a failed dependency
- How reliable have the pipelines we have seen in this course been?
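A minimal sketch of wrapping DB calls in a transaction, assuming SQLite; if any statement fails, nothing is committed and downstream steps can be skipped:

    import sqlite3

    con = sqlite3.connect("pipeline.db")
    try:
        with con:  # commits on success, rolls back if an exception is raised
            con.execute("CREATE TABLE IF NOT EXISTS measurements (station TEXT, value REAL)")
            con.execute("INSERT INTO measurements VALUES (?, ?)", ("blindern", 5.0))
            con.execute("INSERT INTO measurements VALUES (?, ?)", ("lindeberg", 3.2))
    except sqlite3.Error as err:
        print(f"Step failed, nothing was committed: {err}")  # dependent steps should not run
    finally:
        con.close()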
Flexibility
- Adapt to changes in data sources, types, needs, requirements, etc.
- These may change often
- Fewer steps/layers may be more efficient, simpler, etc.
- But may be less flexible
- Example: the pipelines from the data transformation exercises
- E.g. want to change wind speed from km/h to m/s (see the sketch after this list)
- In general: How flexible have our pipelines been?
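A minimal sketch of isolating such a change, assuming the conversion is defined in a single SQL view over the raw data; switching units is then a one-line change and no stored data needs to be rebuilt:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_weather (station TEXT, wind_speed_kmh REAL)")
    con.execute("INSERT INTO raw_weather VALUES ('blindern', 18.0)")

    # The unit conversion is defined in exactly one place; changing it later
    # only means redefining this view (1 m/s = 3.6 km/h).
    con.execute("""
        CREATE VIEW weather AS
        SELECT station, wind_speed_kmh / 3.6 AS wind_speed_ms
        FROM raw_weather
    """)
    print(con.execute("SELECT * FROM weather").fetchall())  # [('blindern', 5.0)]
    con.close()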
Testing and development
- Just like software, pipelines are typically developed in a “safe” environment
- Testing environment/data vs. production environment/data
Updates
- Data is not static
- New data, updates, deletes, etc.
- Not a problem for virtual pipelines
- Can always execute the pipeline over all the data
- Otherwise, the pipeline needs to compute the difference and apply it as an update (see the sketch below)
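A minimal sketch of applying such an update, assuming SQLite 3.24+ and a primary key to match rows on; changed rows from the source are updated in place and new rows inserted (deletes would need a separate comparison against the source):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE wind (station TEXT PRIMARY KEY, speed_ms REAL)")
    con.execute("INSERT INTO wind VALUES ('blindern', 5.0)")

    incoming = [("blindern", 6.1), ("lindeberg", 3.2)]  # one changed and one new row from the source
    con.executemany("""
        INSERT INTO wind (station, speed_ms) VALUES (?, ?)
        ON CONFLICT (station) DO UPDATE SET speed_ms = excluded.speed_ms
    """, incoming)
    con.commit()
    print(con.execute("SELECT * FROM wind ORDER BY station").fetchall())
    con.close()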
Canonical data architectures
- Architecture for whole data systems
- Combination of pipelines and structures/formats/systems
- We will see a few examples
- New additions come quite often
- More on this in IN5040 - Advanced Database Systems for Big Data
Canonical architectures: Data warehouse
- Store everything in one DB
- Integrates/combines data from production DBs
- Intended for analysis/decision making
- Typically lives alongside other DBs
- Highly structured
- Data mart: Smaller, more domain-specific variant
- Main approaches: Normalized or dimensional (Star schema)
Data warehouse: Normalized approach
- Standard normalized schema
- Advantages:
- Easy to add new types of information
- Flexible for analytics
- Disadvantages:
- Inefficient queries (many joins, aggregates)
- More complex schema
Data warehouse: Dimensional approach
- Most common is perhaps Star Schema
- Tailored for particular use (reporting, analytics, etc.)
- Split data into fact tables and dimension tables
- Fact tables:
- Record a concrete fact
- Typically numerical values
- Foreign keys to one or more dimensions
- Can also be aggregates
- Dimension tables:
- Describe values/entities of a particular type in more detail
- Often denormalized
- Advantages:
- Efficient queries
- Simpler schema
- Disadvantages:
- Difficult to construct (more complex pipeline)
- Difficult to modify/add new types of information
- Inflexible for analytical purposes
- Normalized dimensional: Snowflake schema
Dimensional approach: Example
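As an illustrative sketch (a made-up retail case, not necessarily the lecture's own example), a star schema with one fact table and two dimension tables could look like this:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimension tables: describe entities in (denormalized) detail
        CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

        -- Fact table: numerical measurements plus foreign keys to the dimensions
        CREATE TABLE fact_sales (
            date_id    INTEGER REFERENCES dim_date(date_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            units_sold INTEGER,
            revenue    REAL
        );

        INSERT INTO dim_date VALUES (1, '2024-01-15', '2024-01', 2024);
        INSERT INTO dim_product VALUES (1, 'Umbrella', 'Outdoor');
        INSERT INTO fact_sales VALUES (1, 1, 3, 450.0);
    """)

    # A typical analytical query: revenue per product category per year
    query = """
        SELECT d.year, p.category, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_date d    ON f.date_id = d.date_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY d.year, p.category
    """
    print(con.execute(query).fetchall())  # [(2024, 'Outdoor', 450.0)]
    con.close()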
Canonical architectures: Data lake
- Store everything in one central repository
- Store everything in its raw form (documents, logs, etc.)
- Need not know how it will be used
- Transform/clean/integrate only when needed
- Attempts to remove “data silos”
Canonical architectures: Data mesh
- Basic idea: Many small, distributed datasets maintained by domain teams
- In contrast to: Large, central database
- Domain specific
- Distributed
- Managed/owned by a particular team/group
- Data as a product
- Ready to use
- Reliable
- Interoperable
- Discoverable
- Trustworthy
- Self-describing (or documented well)
- …