Data Pipelines

A data pipeline is simply a sequence of data transformations. For example, a single pipeline might clean, integrate, and aggregate data from multiple sources into one consolidated dataset.
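A minimal sketch of this idea, treating a pipeline as an ordered list of transformation functions applied in sequence (all names and record layouts here are illustrative, not from any particular library):

```python
# A pipeline as an ordered sequence of transformations over a list of records.

def clean(records):
    # Drop records missing the "value" field.
    return [r for r in records if r.get("value") is not None]

def integrate(records, other):
    # Combine two sources into a single list of records.
    return records + other

def run_pipeline(data, steps):
    # Apply each transformation in order; the output of one step
    # becomes the input of the next.
    for step in steps:
        data = step(data)
    return data

source_a = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
source_b = [{"id": 3, "value": 7}]

result = run_pipeline(source_a, [
    lambda d: integrate(d, source_b),  # integrate the second source
    clean,                             # then remove incomplete records
])
print(result)  # → [{'id': 1, 'value': 10}, {'id': 3, 'value': 7}]
```

Because each step is just a function from data to data, reordering, adding, or removing transformations only requires editing the `steps` list.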

An important aspect of building a data pipeline is choosing its architecture: the pipeline should scale to the required data volume and velocity, be easy to understand and maintain, and be reliable, robust, and correct.

How the pipeline is constructed greatly affects these properties. For example, whether cleaning is placed at the beginning or at the end of the pipeline can make a big difference in the assumptions the other transformations in the pipeline can make.
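The ordering effect can be made concrete with a small sketch (the functions and records are hypothetical): a downstream step that assumes complete data only works if cleaning runs before it.

```python
# Illustrative example: why the position of cleaning matters.

def clean(records):
    # Remove records whose "value" field is missing.
    return [r for r in records if r.get("value") is not None]

def double_values(records):
    # Assumes every record has a non-null "value"; fails otherwise.
    return [{**r, "value": r["value"] * 2} for r in records]

data = [{"id": 1, "value": 5}, {"id": 2, "value": None}]

# Cleaning first: double_values' assumption holds.
ok = double_values(clean(data))
print(ok)  # → [{'id': 1, 'value': 10}]

# Cleaning last: double_values hits a None value and raises TypeError.
try:
    double_values(data)
    raised = False
except TypeError:
    raised = True
print(raised)  # → True
```

Placing cleaning early lets every later transformation assume well-formed input; placing it late means every earlier step must defensively handle dirty data itself.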

Relevant Articles and Tutorials