Getting Started

This course will be taught in R, a free statistical software environment and programming language that makes it very handy to run data analysis. It is a fully functional programming language, and by that I mean two things:

    1. In my experience, any operation that you want to perform on data can be done in R. There are also many user-contributed libraries (called “packages”), which continually expand what you can achieve with R. In particular, I would like to highlight innovative tools like Jupyter notebooks and R Markdown documents, which allow incredible flexibility in setting up a data analysis pipeline.
    2. For programming language enthusiasts, R is also a functional programming language. There are some neat tricks that you can do by exploiting R’s more advanced functional capabilities.
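
To give a flavour of the second point, here is a minimal sketch of a few functional idioms using only base R. The helper make_multiplier is a made-up name for this illustration, not something from a package.

```r
# Functions are ordinary values: they can be passed to other functions.
squares <- sapply(1:5, function(x) x^2)
print(squares)

# Functions can also return functions (a "closure").
# make_multiplier is a hypothetical helper invented for this example.
make_multiplier <- function(k) {
  function(x) x * k
}
triple <- make_multiplier(3)
print(triple(1:3))

# Reduce folds a two-argument function over a vector.
total <- Reduce(`+`, 1:5)
print(total)
```

You do not need any of this to follow the course; it is only meant to show what "functional" refers to.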

In addition, the R community is very active, which is another plus: it makes it easy to find answers online (e.g., by searching and asking on Stack Overflow).

How to download R and RStudio

R is free, and can be downloaded from the R Project website. We will be using RStudio, which is an excellent development environment that makes it much friendlier to use R. You can download the free Desktop version at the RStudio website.

There are many online resources for learning both R and RStudio. For example, the RStudio website has several Primers covering many relevant basic data science topics. There are also freely available books on how to use R, like R for Data Science.

How to use R Markdown

R Markdown documents are an extremely useful tool that professional data scientists and business analysts use in their day-to-day work. They allow for R code (and the output of R code) to be neatly formatted (using Markdown) into various output types, such as PDFs, HTML, or Microsoft Word Documents (note, for MS Word, you’ll need MS Word installed on your system).

This process is called Knitting. If you open an R Markdown document (.rmd) in RStudio, RStudio should automatically detect it, and you should see a little command called Knit at the top of your source window. (If this is your first time, you may have to install a few packages like knitr; RStudio will helpfully offer to install them.)
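
If you would rather check for those packages yourself from the R console, the following is a minimal sketch. It assumes the two packages involved are rmarkdown and knitr, and the helper name missing_packages is made up for this example.

```r
# Return the subset of `pkgs` that is not installed yet.
# (missing_packages is a hypothetical helper, not part of any package.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("rmarkdown", "knitr"))
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch only reports what is missing and changes nothing on your system.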

R Markdown is nice because it allows you to embed code and writeup into the same document, and it produces presentable output, so you can use it to generate reports from your homework, and, when you eventually go out to work in a company, for your projects.

Here’s how you embed a “chunk” of R code. We use three backticks (`) to denote the start and end of a code chunk:

```
{r, example-chunk-1, echo=TRUE, eval=FALSE} 
1+1
```

which (once the {…} on the second line is moved up so that it sits directly after the opening backticks) produces:

1+1

After the three backticks, you’ll need r, then you can give the chunk a name. Please note that NAMES HAVE TO BE A SINGLE WORD, NO SPACES ALLOWED. Names also have to be unique; that is, every chunk needs a different name. You can give chunks names like:

  • chunk1
  • read-in-data
  • run-regression

or, what will help you with homework:

  • q1a-read-in-data
  • q1b-regression

These names are there to help you organize your code. (In practice this becomes very useful when you have files with thousands of lines of code…) Names are optional. If you do not give a name, it will default to unnamed-chunk-1, unnamed-chunk-2, etc.

After the name of the chunk, you can give it certain options, separated by commas. Here are some important options:

  • echo=T / echo=TRUE or echo=F / echo=FALSE: whether or not the code chunk will be copied into the output file.
  • eval=T / eval=TRUE or eval=F / eval=FALSE: whether or not the code chunk will be evaluated. If you set eval=F on a chunk, it will be as if you had commented out all the code in that chunk; it just won’t run. This can be super useful if, e.g., you have some analyses that take a long time and you don’t want to run them every single time you knit: you can switch them off with this one option.
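
To see chunk names and options together, here is a minimal sketch that writes a small R Markdown file from R and knits it if possible. The file name, title, and chunk names are made up for this example, and the knitting step assumes the rmarkdown package and pandoc are installed (it is skipped otherwise).

```r
# Build the lines of a small R Markdown file. Each chunk header has the
# form {r chunk-name, option=value, ...} right after the three backticks.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Chunk options demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r shown-and-run, echo=TRUE, eval=TRUE}"),
  "1 + 1",
  fence,
  "",
  paste0(fence, "{r shown-not-run, echo=TRUE, eval=FALSE}"),
  "Sys.sleep(60)  # slow step, skipped while eval=FALSE",
  fence,
  "",
  paste0(fence, "{r run-not-shown, echo=FALSE}"),
  "summary(cars)",
  fence
)

# Write the file to a temporary folder (hypothetical file name).
rmd_file <- file.path(tempdir(), "chunk-options-demo.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit from the console if rmarkdown and pandoc are available;
# this is roughly what the Knit button does for you in RStudio.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In the knitted output, the first chunk shows both code and result, the second shows only the code, and the third shows only the result.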

There is a lot of syntax to learn in R Markdown, and lots of cool stuff too (e.g., interactive HTML documents).

In real work environments, one very applicable use-case for the R Markdown document is to generate regular reports using different data. For example, let’s say every week you get some data (weekly sales) from another team. When you get a new data file, you could just edit one line of the R Markdown file (which data file it’s reading in), and you can generate the exact same report that you did last week, just that now it’s on this week’s data. Exact same analysis, exact same graphs, just with new data. And it’s professional enough to show your manager.
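
As a minimal sketch of that weekly-report idea, the code below keeps the analysis fixed and isolates the single line that changes. The file name and the columns region and sales are hypothetical, and the data is generated inside the example only so that it runs on its own.

```r
# Hypothetical weekly data, written to a temporary file so the example runs
# on its own. In practice this file would come from the other team.
data_file <- file.path(tempdir(), "sales-week-01.csv")
write.csv(
  data.frame(region = c("North", "South", "North", "South"),
             sales  = c(120, 80, 100, 90)),
  data_file, row.names = FALSE
)

# The one line to edit each week: which data file to read in.
weekly <- read.csv(data_file)

# Everything below stays the same from week to week.
by_region <- aggregate(sales ~ region, data = weekly, FUN = sum)
print(by_region)
```

Placed inside a chunk of an R Markdown file, only the read.csv() line would need to point at the new week's file before knitting again.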

A note about working directories in R Markdown: if you do not specify your working directory via setwd('...') and you hit Knit, the document will assume that the working directory is the directory that the .rmd file is in. Thus, if your .rmd file is XYZ/folder1/code.rmd and your dataset is XYZ/folder1/data.csv, then you can simply read in the data file using the local path, d0 <- read.csv('data.csv'), without running setwd().
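
The sketch below imitates this outside of knitting, so you can try it in the console. The folder is created in a temporary location and its contents are made up for the example; setwd() is used here only to mimic what happens when you knit, and the previous working directory is restored afterwards.

```r
# Simulate a folder that holds both the .rmd file and the data file.
# The folder and file contents here are hypothetical.
project_dir <- file.path(tempdir(), "folder1")
dir.create(project_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = c(2, 4, 6)),
          file.path(project_dir, "data.csv"), row.names = FALSE)

# When knitting, the working directory is the .rmd file's folder.
# Outside of knitting we mimic that with setwd(), restoring it afterwards.
old_wd <- setwd(project_dir)
d0 <- read.csv("data.csv")  # relative path, resolved against getwd()
setwd(old_wd)

print(d0)
```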