This page gives general information about the project work and presentation. For concrete dates on deadlines etc. see The Semester Plan.
In the project work, you will work with several real datasets that have some overlap. Your task is to use a combination of data from the datasets to answer one or more questions.
You will receive a project description by email when the project work phase starts. It contains links to the datasets and information about the data. In addition, there will be a list of questions that should be answered (or attempted to be answered) over the datasets. Your goal is to transform and integrate the data, make the necessary abstractions and aggregations, etc., so that the answers to the questions can be found using SQL or SPARQL. You should mostly use technologies and techniques from the curriculum.
However, note that the questions are only there to guide your work with the datasets. You should not only answer the questions, but also try to engineer data that is generally useful for other (similar) questions over the datasets.
You are free to make (reasonable) assumptions. For example, if you have a dataset on historic weather, and a question to answer is “On which dates did it rain?”, you could assume that it “rained” on any date that has precipitation > 0.0. You can also make reasonable simplifying assumptions. For example, if you have two datasets on historic weather, one with only precipitation and another with two attributes, including snow, you could treat them all as precipitation.
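As a minimal sketch of how such assumptions could be written down explicitly in a pipeline, the following Python example uses the standard-library sqlite3 module. The table names station_a and station_b, the rain column, the view name precipitation_all and the sample rows are all hypothetical and made up for illustration. They are not taken from the project datasets.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two made-up weather tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical dataset A: a single precipitation attribute.
        CREATE TABLE station_a (date TEXT, precipitation REAL);
        -- Hypothetical dataset B: two attributes (rain is an assumed name).
        CREATE TABLE station_b (date TEXT, rain REAL, snow REAL);

        INSERT INTO station_a VALUES ('2020-01-01', 0.0), ('2020-01-02', 3.5);
        INSERT INTO station_b VALUES ('2020-01-03', 0.0, 1.2), ('2020-01-04', 0.0, 0.0);

        -- Simplifying assumption: rain and snow both count as precipitation.
        CREATE VIEW precipitation_all AS
            SELECT date, precipitation FROM station_a
            UNION ALL
            SELECT date, rain + snow AS precipitation FROM station_b;
        """
    )
    return conn


def rainy_dates(conn):
    """Return all dates that 'rained', assuming rain means precipitation > 0.0."""
    rows = conn.execute(
        "SELECT date FROM precipitation_all "
        "WHERE precipitation > 0.0 ORDER BY date"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(rainy_dates(build_example_db()))
```

Keeping the assumption in one view means that other, similar questions over the same data can reuse it instead of repeating the logic in every query.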
The main points with the project work are:
- To apply the technologies and techniques from the curriculum and deepen your understanding of these
- To see how the different technologies and techniques can work together and be combined into a full pipeline
- To get some intuition on the complexities of working with real data
- To evaluate what you have learned, and to use it as a basis for feedback on the course
Below is a list of requirements for the project work:
- The project should be hosted on UiO’s Github
- Use a private repository
- Try to follow good Git practices (e.g. write good commit messages, use branches, etc.)
- You should not add large datafiles to the repository (these should instead be downloaded by a Makefile)!
- The pipelines should be managed using Make and a Makefile
- The project should be properly documented
- Both within the different files and with a README-file describing the project and how to use it
- The project should be properly structured and easy to understand
- E.g. similar folder structure as the mandatory assignment
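One way to keep large data files out of the repository is to let a Makefile rule call a small download script. The sketch below is only an illustration in Python: the function name download_if_missing and its fetch parameter are hypothetical, and a different approach (for example calling a download tool directly from the Makefile) may fit your project just as well.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download was performed and False if the file
    was already present. The fetch parameter is a hypothetical hook
    that makes the function testable without network access.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    # Create the target folder (e.g. data/) if it is not there yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return True
```

A Makefile rule for a data file could then run a script that calls this function with the dataset URL from the project description and a path that is listed in .gitignore.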
A couple of weeks into the project work, you should share your project with the lecturer to get feedback on the work done so far. You should make a Merge Request (like you did for the mandatory exercise) and will then get feedback directly on the merge request.
Note that you are of course very welcome to ask questions and request help at any time during your project from me (Leif Harald).
Failure and Success
Note that answering the question(s) listed in the project description is not by itself sufficient for success. You also need to satisfy all of the requirements listed above, the project should be easy to use, etc.
However, for success it is not even required to answer the question(s) listed. You might fail to answer the question(s), as the project might be more difficult than intended, or there might be technical issues along the way. In these cases, success follows from documenting and understanding why you failed and how a solution might still be found given enough resources and/or time. The point of the project is not to answer the question(s), but, as always in education, to learn something!
For the last lecture of the course, you should prepare a 10 min. presentation of your project. The presentation should start with a short (2-4 min.) introduction to the datasets, the questions to answer and an overview of how you have chosen to solve the problems (e.g. technologies used, what your pipeline looks like, etc.). The rest of the time should be used to focus on something interesting you have encountered during your project work. This can for instance be a difficult problem with the datasets, something that is particularly challenging with your data, or an interesting use of a technique or technology in your work. You can also use the time to describe a problem that you want help with.
During the presentation, you can show slides, show code or data, etc. You are free to structure the presentation as you like.
We will have a few minutes for questions and discussion after your presentation.
The main points with these presentations are as follows:
- To develop your presentation skills on technical topics and your ability to focus a presentation
- To develop your abilities in speaking about data and data pipelines
- To get feedback on your work
- To get help with problems you might have
- To give other students an overview of other projects and how others are solving their problems