Lecture 1: Introduction and Course Overview

This lecture gives an introduction to declarative data engineering and an overview of the course, with practical information.

Slides

Relevant Wiki pages

Exercises

Start by reading up on the fundamental technologies by following the links provided on each of the wiki pages listed above. You do not need advanced knowledge of any of these technologies, but you should know the basics so that you are able to learn more advanced features when necessary.

The exercises below will test your basic knowledge of the fundamental technologies listed above.

Each exercise is prefixed with the technology (from the above list) you should use.

Setup

Before you start, make sure you have access to a terminal with Bash. Linux and Mac machines have this installed by default. If you are using Windows, you can either use Windows Subsystem for Linux or use PuTTY to log into a Linux machine at IFI remotely. Also make sure you have a text editor available on the machine you are working on (e.g. Vim, Nano, Emacs, Sublime, Atom).

Then, make a new Git repository to do the exercises in. Follow the steps below to do this:

  1. Go to UiO’s GitHub and log in with your UiO username and password

  2. Click on the green “New”-button on the left side of the page

  3. Fill in a name for your repo, e.g. IN5800-intro, and click on the green “Create repository”-button

  4. Open up a terminal (running Bash)

  5. Execute the following command to clone the Git repository to your computer:

    git clone <url>

    where <url> is the URL of your newly created repository (you can simply copy-paste the URL from your browser right after you have created the repository). For example, I would run:

    git clone https://github.uio.no/leifhka/IN5800-intro

    Type in your UiO username and password when prompted.

Congratulations! :D You have now made a Git repository and cloned it to your computer.

Exercise 1: Make a README-file

We will start by making a simple README-file in the repo and pushing it to the remote repo.

  1. [Bash] Change your working directory to the newly cloned repo.
  2. [Markdown] Use your favorite text editor to create a new Markdown file with the name README.md containing the header "Readme" and the text "This repo is used for the intro exercises in IN5800." You can see a stylized view of your README-file on the main page of your repo (the same URL as you used to clone it).
  3. [Git] Add, commit and push the changes to the repo with the commit message "Add README-file".
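
The add, commit and push cycle can be tried out safely before touching your real repo. Below is a minimal sketch that uses a throwaway bare repository on your own disk in place of UiO's GitHub. The directory names, the file NOTES.md, the commit message and the user name and email are all hypothetical and only there for illustration.

```shell
# Scratch directories: one acting as the "remote", one for the working copy.
remote=$(mktemp -d)
work=$(mktemp -d)

# A bare repository has no working files and can act as a local stand-in for a remote.
git init -q --bare "$remote"
git clone -q "$remote" "$work/repo"
cd "$work/repo"

# Identity used only for commits in this scratch repo (hypothetical values).
git config user.name "Example Student"
git config user.email "student@example.org"

# A small Markdown file with a header and one line of text.
printf '# Notes\n\nScratch notes for trying out Git.\n' > NOTES.md

# Stage the file, record it in a commit, and send the commit to the remote.
git add NOTES.md
git commit -q -m "Add notes file"
git push -q origin HEAD
```

Git may warn that you have cloned an empty repository, which is expected here. Running git log afterwards should list the new commit.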

Exercise 2: Download and manage files

We are now interested in the info about our course (IN5800) contained in a data file.

  1. [Bash] Make new directories called downloads and data
  2. [Bash] Download the Zip-file at https://leifhka.org/in5800/lectures/intro/data.zip into the newly created downloads-folder using wget
  3. [Bash] Unzip the downloaded file and move the unzipped file (data.csv) into the data-folder
  4. [Bash] Use cat, pipe (|) and grep to print out the line starting with in5800 (hint: the regular expression ^in5800.* will match lines starting with in5800) from data.csv
  5. [Bash] Remove the folders downloads and data and all files contained in them
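
If the combination of cat, pipe and grep is new to you, the following sketch shows the pattern on a small made-up file, so that it does not give away the contents of data.csv. The file name sample.csv and its three lines are assumptions made for this example.

```shell
# Create a scratch directory with a data-folder and a small sample CSV file.
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
printf '%s\n' 'in1000,Intro course' 'in5800,Sample course' 'xin5800,Not a match' > "$workdir/data/sample.csv"

# cat writes the file to standard output, the pipe feeds it to grep,
# and grep keeps only the lines matching the regular expression.
# The ^ anchors the match to the start of the line, so xin5800 is not matched.
matches=$(cat "$workdir/data/sample.csv" | grep '^in5800.*')
echo "$matches"

# Remove the scratch directory and everything in it.
rm -r "$workdir"
```

Note that rm needs the -r flag to remove a directory together with its contents.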

Exercise 3: Make a Makefile

As the data in the CSV-file from the previous exercise might change, we want to automate the steps done above, so we will make a Makefile for this.

  1. [Make] Open up a new file named Makefile in your favorite text editor and create one Make-rule per sub-exercise in the previous exercise
  2. [Make] Execute the in5800_data rule (note that when you execute a Makefile, it also outputs all the Bash-commands it executes)
  3. [Make] Execute the clean rule
  4. [Markdown] Add a new subheader "Use" to your README.md-file containing the text "Below is a list of useful commands:" (where "commands" is bold), followed by a list containing the two items in5800_data and clean
  5. [Git] Add, commit and push the changes done to the repo with the commit message "Add a Makefile to automate the information extraction".
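
To get a feel for how Make-rules are written and executed, here is a small sketch that writes a two-rule Makefile into a scratch directory and runs it. The rule name sample_data and the file sample.csv are made up for this example and are not the rules the exercise asks for. The Makefile is written with printf only so that the tabs are explicit; in your editor you would type the rules directly.

```shell
demo=$(mktemp -d)
cd "$demo"

# Each recipe line in a Makefile must start with a tab, written here as \t.
{
  printf '.PHONY: sample_data clean\n'
  printf '\n'
  printf 'sample_data:\n'
  printf '\tmkdir -p data\n'
  printf '\techo "in5800,Sample course" > data/sample.csv\n'
  printf '\tcat data/sample.csv | grep "^in5800.*"\n'
  printf '\n'
  printf 'clean:\n'
  printf '\trm -rf data\n'
} > Makefile

# Run the sample_data rule. The output contains both the commands
# that Make executes and whatever those commands print.
output=$(make sample_data)
echo "$output"

# Run the clean rule to remove the generated folder again.
make clean
```

The .PHONY line tells Make that sample_data and clean are names of tasks rather than files, so the rules run even if a file with one of those names happens to exist.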

Exercise 4: Branching and Make-variables

It is often nice to keep URLs out of the rules in Makefiles, and rather put them in separate variables. Thus, you will now fix your Makefile, but to be on the safe side, let's do the changes in a separate Git-branch and test it before merging it into your master-branch.

  1. [Git] Create a new branch (and switch to it) with the name feature/URLs-in-vars
  2. [Make] Make a new variable in your Makefile (e.g. data_url) that contains the URL of the Zip-file to download, and replace the URL in the Make-rule with a reference to the new variable
  3. [Make] Check that everything works by executing the in5800_data and clean rules
  4. [Git] Add, commit and push your changes to the feature/URLs-in-vars-branch
  5. [Git] Switch back to your master-branch and merge the feature/URLs-in-vars-branch into it
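
The branch, change, test and merge workflow above can be rehearsed in a scratch repository first. The sketch below is one way to do it, under a few assumptions: the rule name download, the URL https://example.org/data.zip and the user name and email are hypothetical, and there is no remote, so the push step is left out. It uses make -n, which prints the commands a rule would run without executing them, so nothing is actually downloaded.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.org"

# Initial Makefile with the URL written directly in the rule.
printf 'download:\n\twget -P downloads https://example.org/data.zip\n' > Makefile
git add Makefile
git commit -q -m "Add Makefile"

# Remember the current branch (master or main, depending on your Git configuration).
base=$(git symbolic-ref --short HEAD)

# Create the feature branch and switch to it.
git checkout -q -b feature/URLs-in-vars

# Move the URL into a variable and refer to it with $(data_url) in the rule.
printf 'data_url = https://example.org/data.zip\n\ndownload:\n\twget -P downloads $(data_url)\n' > Makefile

# Dry run: show the command with the variable expanded, without running wget.
expanded=$(make -n download)
echo "$expanded"

git add Makefile
git commit -q -m "Put the download URL in a variable"

# Switch back to the original branch and merge the feature branch into it.
git checkout -q "$base"
git merge -q feature/URLs-in-vars
```

After the merge, the Makefile on the original branch contains the data_url variable, and git log shows both commits.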

Solution

A solution to the exercises is provided here. It is wise to make an honest attempt at the exercises before consulting the solution ;)