Data Science with `R` (DataSciR)

06.07.2020: The schedule for the final presentations is now fixed; please see the table below.

UPDATE 09.07: project website URLs added.

No. | Date | Time | Title
---|---|---|---
1 | 10.07.20 | 13:00-13:30 | Weather Data Analysis: Using Machine Learning Models and Trend Analysis Methods
2 | 10.07.20 | 13:30-14:00 | Help in Yelp: Decision Support System For New Business Ventures
3 | 10.07.20 | 14:00-14:30 | Asteroid diameter prediction using NASA jet propulsion lab dataset
4 | 10.07.20 | 14:30-15:00 | COVID-19 Prediction using Explainable Machine Learning
5 | 14.07.20 | 10:00-10:30 | Market Research for Supermarket Customer Loyality Improvement (password-protected)
6 | 15.07.20 | 10:00-10:30 | Multi-label Classification of News Articles on Coronavirus
7 | 15.07.20 | 10:30-11:00 | Analysis and Predictions of the Impact of COVID-19

- Course day/time: Fri., 13:15-16:45 (building 22A, room 105)
- Instructor: Uli Niemann
- Course type: seminar
- ECTS credits: 6
- Audience: all FIN Master degree programs
- Course language: English
- Registration: see section Application & registration
- Prerequisites: see section Prerequisites
- Technical requirements: bring your own laptop with `R` and RStudio installed on it
- Grading: based on several deliverables in the context of a semester-long data science project → see Project page

The course is limited to max. 30 students. Please apply for DataSciR as follows:

- Register on the LSF by 03.~~10.~~04.2020.
- Complete the application form by 03.~~10.~~04.2020.

You will be notified about course admission on 05.~~12.~~04.2020.

After admission, please complete this registration form from the examination office and hand it to Uli by 20.05.2020.

*Ten years ago, who would have thought that R, the self-proclaimed “environment for statistical computing and graphics”, would become one of the most popular programming languages for data scientists?*

The impressive growth of `R` is no coincidence. As a free and open-source alternative to expensive, proprietary software like SPSS, Matlab and Excel, `R` has always been strong in statistical data analysis and in creating powerful, aesthetically appealing graphics and charts.

While `R` attracted an almost exclusively academic audience in the 1990s and 2000s, the `R` community has since grown not only in sheer numbers but also in diversity, as people from different industries and backgrounds discover `R`’s usefulness for a wide range of applications. As of February 2020, more than 15,000 (!) packages have been published to CRAN, about half of them since 2015.

Especially in the last decade, the functionality and versatility of `R` have gained momentum. The team around Hadley Wickham at RStudio, the public benefit corporation behind the eponymous IDE made specifically for `R`, has dedicated itself to developing `R` packages with a focus on increasing the productivity and reproducibility of data scientists’ workflows, including the highly popular

- “`tidyverse`” packages like `dplyr` and `tidyr` for data manipulation,
- `ggplot2` for data visualization,
- `rmarkdown` and `knitr` for reproducible & automated reporting,
- `shiny` for interactive web applications, and
- `tidymodels` for inferential and predictive modeling.
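To give a flavour of the data manipulation style these packages encourage, here is a minimal sketch that is not taken from the course material. It assumes `dplyr` is already installed and uses the `mtcars` dataset that ships with `R`; the names `mpg_by_cyl`, `mean_mpg` and `n_cars` are chosen purely for illustration.

```r
library(dplyr)

# Group the built-in mtcars data by number of cylinders and
# compute the average fuel economy and the group size per group
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

print(mpg_by_cyl)
```

The result has one row per distinct value of `cyl`, which is the typical "split, summarise, combine" pattern that `dplyr` verbs express as a pipeline.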

In **Data Science with R** (DataSciR), you will learn the fundamentals of `R` and how to use the above-mentioned packages.

Further, you will work on a semester-long graded data science project using `R`.

There are no mandatory prerequisites for DataSciR. However, you are expected to have solid knowledge of fundamental data mining techniques, such as classification, regression and clustering. Hence, it is recommended that you have attended at least one of the following lectures (or a comparable one):

Also, you should have basic programming and statistics knowledge. For example, you will learn the most important vector types and classes in `R`, but you will not learn what a vector or a class *is* in general. Accordingly, you should know what terms such as mean, standard deviation, probability, hypothesis test and p-value mean.
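As a rough self-check of the expected background, the following sketch uses only base `R`. It is an illustration rather than course material, and the numbers in `x` are made up. If the comments read naturally to you, you most likely meet the expectations described above.

```r
# Atomic vector types: typeof() reports how the values are stored
typeof(c(1.5, 2, 3))      # "double"
typeof(1:3)               # "integer"
typeof(c("a", "b"))       # "character"
typeof(c(TRUE, FALSE))    # "logical"

# A factor is stored as an integer vector but carries the class "factor"
f <- factor(c("low", "high", "low"))
typeof(f)                 # "integer"
class(f)                  # "factor"

# Descriptive statistics and a one-sample t-test on made-up measurements
x <- c(4.9, 5.1, 5.0, 5.3, 4.7, 5.2)
mean(x)                    # arithmetic mean
sd(x)                      # sample standard deviation
test <- t.test(x, mu = 5)  # null hypothesis: the true mean equals 5
test$p.value               # p-value of the test
```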

It is recommended to bring your laptop to each course meeting. Class meetings are a mix of lecture and short coding exercises. You will get the most out of the meetings if you have a laptop and can work on these exercises. Hence, you should set up your laptop by the end of the first week as described in the Software section.

Data Mining / Statistical Analysis:

- Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2017.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson, 2005.
- Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.

`R`-specific:

- Hadley Wickham, and Garrett Grolemund. R for Data Science. O’Reilly, 2017.
- Hadley Wickham. Mastering Shiny. O’Reilly, 2020. Draft version.
- Max Kuhn. The `caret` package. Online documentation.
- Max Kuhn, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.
- Yihui Xie, J. J. Allaire, and Garrett Grolemund. R Markdown: The Definitive Guide. Chapman & Hall/CRC, 2018.
- Hadley Wickham. Advanced R. 2nd edition, Chapman & Hall/CRC, 2019.
- Hadley Wickham. ggplot2 - Elegant Graphics for Data Analysis. 3rd edition. Draft version.
- Max Kuhn, and Kjell Johnson. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press, 2019.
- Bradley Boehmke. Hands-on Machine Learning with R. Chapman and Hall/CRC, 2019.

Other:

- Jeffrey Leek. Organizing Data Science Projects. Leanpub.
- Jenny Bryan, and others. Happy Git and GitHub for the useR. 2018.

- RStudio primers
- RStudio cheat sheets
- RStudio webinars
- Quick-R (short tutorials on various topics, e.g. data import, statistics and graph generation)

By the end of the first week, you should have installed the following software on your own laptop:

Also, please check whether you can successfully install packages. To do so, click on the *Packages* tab in the bottom-right pane in RStudio. Then, click on the *Install* button and specify an arbitrary package, e.g. `dplyr`. Finally, click on *Install*. Alternatively, you can install a package from the console with `install.packages("dplyr")`. If everything is set up correctly, no error messages should be displayed when you load the installed package with `library(dplyr)`.
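If you prefer a check that returns `TRUE` or `FALSE` rather than relying on the absence of error messages, one possible sketch is shown below. The helper `is_installed()` is hypothetical (it is not part of any package) and simply wraps the base `R` function `requireNamespace()`.

```r
# Hypothetical helper: TRUE if the package `pkg` can be loaded, FALSE otherwise.
# requireNamespace() loads the namespace without attaching the package.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")  # ships with R, so this should be TRUE
is_installed("dplyr")  # TRUE only after a successful install.packages("dplyr")
```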

We will use the following packages. Install them in one go using the console in RStudio with:

```
# Install packages from CRAN
cran_pkgs <- c(
  "remotes",
  "tidyverse",
  "tidymodels",
  "gapminder",
  "patchwork",
  "showtext",
  "ggthemes",
  "ggrepel",
  "socviz",
  "ggiraph",
  "ggforce",
  "janitor",
  "rpart",
  "rpart.plot",
  "kknn",
  "rmarkdown",
  "knitr",
  "rticles",
  "xaringan",
  "Hmisc",
  "kableExtra",
  "maps",
  "mapproj",
  "concaveman",
  "AmesHousing"
)
install.packages(cran_pkgs)

# Install development packages from GitHub
github_pkgs <- c("rstudio/gt", "gadenbuie/countdown")
remotes::install_github(github_pkgs)

# To generate PDF output from R Markdown documents, you need to install LaTeX.
# If you have not installed LaTeX before, consider installing the lightweight
# LaTeX distribution tinytex (https://yihui.name/tinytex/), which automatically
# installs missing LaTeX packages when rendering R Markdown documents.
install.packages("tinytex")  # the package name must be passed as a string
tinytex::install_tinytex()
```
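After the installation has finished, you may want to confirm that nothing was skipped. The following is a minimal sketch using base `R` only; the vector `wanted` is a shortened, illustrative stand-in for the full package list above, so replace it with the packages you actually want to check.

```r
# Shortened stand-in for the full list of CRAN packages installed above
wanted <- c("tidyverse", "tidymodels", "rmarkdown", "knitr")

# Names of all packages found in the currently active libraries
installed <- rownames(installed.packages())

# Packages from `wanted` that are still missing (character(0) if none)
missing_pkgs <- setdiff(wanted, installed)
missing_pkgs
```

If `missing_pkgs` is not empty, re-running `install.packages(missing_pkgs)` and reading the console output usually reveals why a package failed to install.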