For your graded assignment, you will work on a semester-long data science project using R. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choice. You will acquire and preprocess the data, design your visualizations, run machine learning algorithms, and communicate your results.

Project Team

You will work in a team of 3 to 5 students. In general, all members of a team will receive the same grade. However, a team member will receive a lowered grade if he or she evidently (a) did not contribute a fair share of the team’s work, (b) delivered poor or incomplete work, (c) missed deadlines, (d) did not assist teammates and/or (e) threatened to quit if the work became difficult.

Project Milestones

Your final project has several milestones, listed in the table below. Please note that no extensions will be given for any of the project due dates for any reason. Projects submitted after the final due date will not be graded. Mandatory deliverables submitted after their due date will be treated as not submitted.
If you anticipate any issues, e.g. due to travel or health reasons, you need to send an email at least one week in advance. Several deliverables for your project will be graded individually and together make up your final project score:

Date Description
20.05. team formation & project proposal submission; registration
22.05. project proposal feedback
08.07. final project submission due
10.07. final presentations shown in class, location/exact time to be updated

Any changes that you make to your GitHub repositories and webpages after the due date will be ignored. Please have all your work submitted and tested (websites, screencasts, etc.) before the deadline.

You have to fill in this form to officially register for this course via the examination office. After 20.05., registration is no longer possible.

Team formation and project proposal

You start your project by forming your groups. Submit a project proposal as an R Markdown document in which you describe the topic you are interested in exploring. Within your proposal, you must provide the following information:

  • Project title
  • Name(s) of team member(s)
  • Background and motivation
  • Project objectives
  • Name(s) of dataset(s) you use
  • Design overview (algorithms and methods you plan to use)
  • Time plan including distribution of responsibilities and workload among team members written as weekly deadlines

Each team will only need to submit one proposal. You will schedule a project review meeting with Uli Niemann on 22.05. (exact times will be announced). Make sure all of your team members are present at the meeting.

Topic

In your project, you will work on a self-chosen dataset. Please consider that you have to use a dataset that hasn’t been studied extensively. For example, you shouldn’t use a dataset from the UCI ML repository or Kaggle.
Also, don’t use very small, trivial or toy datasets like iris or play golf. A list of websites with interesting datasets is provided on the FAQ site.

R Markdown process notebook

An important part of your project is your R Markdown process notebook. Your notebook details all the steps you took in developing your solution, including how you collected the data, the alternative solutions you tried, the machine learning algorithms/techniques you used, and the insights you gained. It is strongly recommended to include many visualizations. Your process notebook should cover the following topics:

  • Overview and motivation: overview of the project goals and the motivation for it
  • Related work: anything related, such as a paper, a website, a newspaper article or something else
  • Initial questions:
    • What questions are you trying to answer?
    • How did these questions evolve over the course of the project?
    • What new questions did you consider in the course of your analysis?
  • Data: source, scraping method, cleanup, storage, etc.
  • Exploratory data analysis:
    • What visualizations did you use to look at your data in different ways?
    • What are the different machine learning methods you considered?
    • Justify the decisions you made, and show any major changes to your ideas.
    • How did you reach these conclusions?
  • Final analysis:
    • What did you learn about the data?
    • How did you answer the questions?
    • How can you justify your answers?

Make sure that your process notebook is a standalone document that fully describes your process and results.
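
One part of a standalone, reproducible notebook is making random steps repeatable and recording the software environment. The following is a minimal sketch in base R, not a required template: the simulated data, the seed value, and the 80/20 split are assumptions chosen purely for illustration.

```r
# Fix the random seed so that random steps give the same result on every run
set.seed(2020)

# Simulated stand-in for a project dataset (assumption for illustration only)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

# Reproducible 80/20 train/test split
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit on the training data, evaluate on the held-out data
fit <- lm(y ~ x, data = train)
rmse <- sqrt(mean((test$y - predict(fit, newdata = test))^2))

# Record R and package versions at the end of the notebook
sessionInfo()
```

Placing the `sessionInfo()` call in the last chunk of the notebook documents the versions under which the results were produced.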

Code

You are expected to write high-quality, readable R code that takes aspects such as reusability, error handling and documentation into account.
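
As a rough illustration of what these aspects can look like in practice, here is a minimal sketch in base R. The function name `summarise_numeric`, its arguments, and the example data are hypothetical and not part of the course material; they only show one possible way to combine a reusable function, documentation comments, input validation, and error handling.

```r
#' Summarise a numeric column of a data frame.
#'
#' @param data A data frame.
#' @param column Name of a numeric column in `data` (character scalar).
#' @return A named numeric vector with n, mean, and sd (missing values removed).
summarise_numeric <- function(data, column) {
  # Validate inputs early and fail with informative messages
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame.", call. = FALSE)
  }
  if (!is.character(column) || length(column) != 1L) {
    stop("`column` must be a single column name.", call. = FALSE)
  }
  if (!column %in% names(data)) {
    stop("Column '", column, "' not found in `data`.", call. = FALSE)
  }
  values <- data[[column]]
  if (!is.numeric(values)) {
    stop("Column '", column, "' is not numeric.", call. = FALSE)
  }
  values <- values[!is.na(values)]
  c(n = length(values), mean = mean(values), sd = sd(values))
}

# Usage: handle an expected failure without aborting the whole notebook
example_data <- data.frame(
  height = c(1.6, 1.7, NA, 1.8),
  name = c("a", "b", "c", "d")
)
result <- tryCatch(
  summarise_numeric(example_data, "name"),
  error = function(e) {
    message("Skipping summary: ", conditionMessage(e))
    NULL
  }
)
```

Here the call fails on purpose because `name` is not numeric; `tryCatch()` turns the error into a message and `result` becomes `NULL`, so later chunks can check for that case instead of crashing.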

Project website

You will create a public website for your project using Google Sites, GitHub Pages, Netlify (using blogdown) or any other web hosting service of your choice. The website should effectively summarize the main results of your project and tell a story. Consider your audience (the site is public) and keep the discussion at an appropriate level. Your R Markdown process notebook and data should be linked from the website as well, either as a zip file or via GitHub, Bitbucket, or another code hosting site. Also embed your main visualizations and your screencast in your website.

Project screencast

Each team will create a two-minute screencast with narration showing a demo of your R Markdown notebook and/or some slides. There are various screencast software packages available, including Camtasia (30-day trial) for Windows & Mac and Bandicam (non-registered version with watermarks) for Windows. Please ensure sufficient sound quality.
Upload the video to an online video platform such as YouTube or Vimeo and embed it into your project web page. The video will be shown as a teaser before your final presentation in class. Focus the majority of your screencast on your main contributions rather than on technical details.

Final presentation

You will prepare a 20-minute presentation summarizing your project for your fellow students. The presentations will take place in the last two course weeks. Exact dates and times will be announced. You should distribute the speaking parts fairly among all team members, i.e., there should not be a presentation where one team member does most or all of the talking.

Grading

You register your project as a term paper / Hausarbeit at the examination office. The project is graded in three parts:

  1. (10%) Project proposal. Due date: 20.05.2020
  2. (70%) R Markdown and HTML files in the GitHub repository, website and screencast. Due date: 08.07.2020
    Grading criteria are:
    1. (40%) quality of R Markdown notebook including correctness, comprehensibility and reproducibility of data analysis
    2. (25%) complexity and level of difficulty of the project
    3. (15%) quality, robustness, reliability of R code and adequacy of documentation/code comments
    4. (10%) screencast
    5. (10%) completeness and overall functionality of the repository and website
  3. (20%) Final presentation. Due date: 10.07.2020

Submission Instructions

Each team must use a single shared GitHub repository (see footnote 1). If your work cannot be accessed because these directions were not followed correctly, the affected part will be considered as not submitted. You will need to specify your project GitHub URL in the project proposal form. Store the following in your GitHub repository:

  • R Markdown Notebook: your project process notebook
  • Data: include all the data that you used in your project. If the data is too large for GitHub, store it with an external cloud storage provider, such as Dropbox, Google Drive or OneDrive.
  • README: the README.md file must give an overview of what you are handing in: your project notebook, data, and URLs to your project websites and screencast videos.
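
For data that lives outside the repository, the notebook can fetch the file on first use so that it still runs on a fresh clone. The sketch below is only one possible approach under stated assumptions: the helper name `load_project_data`, the file path, and the URL are hypothetical, the data is assumed to be a CSV file, and the storage provider is assumed to offer a direct download link (ordinary share links often point to a preview page instead).

```r
# Hypothetical helper: read a CSV from a local path, downloading it first
# from an external URL if the file is not present yet.
load_project_data <- function(path, url = NULL) {
  if (!file.exists(path)) {
    if (is.null(url)) {
      stop("File '", path, "' not found and no download URL given.", call. = FALSE)
    }
    # Create the target folder if needed
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    # mode = "wb" avoids corrupting files on Windows
    download.file(url, destfile = path, mode = "wb")
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Usage with a placeholder URL (replace with your own direct download link):
# survey <- load_project_data("data/survey.csv", "https://example.org/survey.csv")
```

Because the download only happens when the local file is missing, repeated runs of the notebook reuse the cached copy.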

  1. If you are unfamiliar with Git, you may have a look at the ebook Happy Git with R.