3 R Basics

Now that you can read your data into R, I will show you how to analyse it. I will start with basic subsetting and piping. After that, I will teach you the most important loops and apply them to solve the famous Monty Hall problem.

3.1 Analyse your first data set

R ships with freely available sample data that you can use to test your new skills. The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. You can get the data directly in R.
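A minimal sketch of how the data can be loaded. The iris data set is part of the datasets package that comes with R; the name df is only an illustrative choice and does not come from the original text.

```r
# iris ships with R (package "datasets"), so nothing has to be downloaded
data("iris")

# 'df' is a hypothetical short name used for illustration only
df <- iris

# number of rows and columns
dim(df)
```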

Let’s take a look at the first rows of your data frame (df). You can do this with the head command, which prints the first six rows by default. The tail command does the same for the last rows.
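The call that produces a preview like the one below is most likely a plain head(); the sketch also shows how the number of rows can be changed with the n argument.

```r
# first six rows (the default)
first_rows <- head(iris)
first_rows

# exactly the first five rows
head(iris, n = 5)

# last six rows
last_rows <- tail(iris)
last_rows
```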

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

3.2 Basic subsetting

In R, data can be subsetted using brackets “[ , ]”, where the first position selects rows and the second selects columns. In the following sections I will use the freely available iris flower data set. It contains 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), 150 observations in total. Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters. The basic notation for subsetting is as follows:
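As a sketch, the single value shown below is the first row of the second column (Sepal.Width). Addressing the column by name is an equivalent alternative.

```r
# data[row, column]: first row, second column
cell_by_index <- iris[1, 2]
cell_by_index

# the same cell, addressed by column name
cell_by_name <- iris[1, "Sepal.Width"]
cell_by_name
```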

## [1] 3.5

If you want multiple rows:
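A sketch of the row-range form, assuming the output below shows the first ten rows of the second column:

```r
# rows 1 to 10 of the second column (Sepal.Width)
first_ten <- iris[1:10, 2]
first_ten

# non-adjacent rows can be picked with c()
iris[c(1, 5, 10), 2]
```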

##  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1

You can also get a whole column. For this just leave the row argument empty!
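A sketch of the empty-row-argument form. The $ notation returns the same column and is shown only for comparison.

```r
# empty row argument: all rows of the second column
whole_column <- iris[, 2]
whole_column

# equivalent access by column name
same_column <- iris$Sepal.Width
```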

##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
##  [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
##  [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
##  [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
##  [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
##  [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

3.3 Piping

Piping is a powerful data handling tool that becomes available when you load the dplyr package. It allows you to chain several subsetting steps in a readable way. The method is called piping because you pipe your data through a chain of functions separated by the %>% operator.

So let's see an example. Imagine you want to filter the iris data set for all flowers that have a Sepal.Length smaller than 5, and you want to keep only the columns Sepal.Length and Species.

First, install and load the dplyr package:
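A minimal sketch of the whole step. Every Sepal.Length value in the output below is smaller than 5, so the sketch assumes the filter condition Sepal.Length < 5. The name small_sepals is hypothetical.

```r
# install.packages("dplyr")  # run once if dplyr is not installed yet
library(dplyr)

# keep rows with a sepal length below 5, then keep two columns
small_sepals <- iris %>%
  filter(Sepal.Length < 5) %>%
  select(Sepal.Length, Species)

small_sepals
```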

##    Sepal.Length    Species
## 1           4.9     setosa
## 2           4.7     setosa
## 3           4.6     setosa
## 4           4.6     setosa
## 5           4.4     setosa
## 6           4.9     setosa
## 7           4.8     setosa
## 8           4.8     setosa
## 9           4.3     setosa
## 10          4.6     setosa
## 11          4.8     setosa
## 12          4.7     setosa
## 13          4.8     setosa
## 14          4.9     setosa
## 15          4.9     setosa
## 16          4.4     setosa
## 17          4.5     setosa
## 18          4.4     setosa
## 19          4.8     setosa
## 20          4.6     setosa
## 21          4.9 versicolor
## 22          4.9  virginica

The filter function keeps only the rows that match your condition. The select function keeps only the columns you specify.

Another example: imagine you want to get the mean Sepal.Length for each species. When your data contains groups (e.g. some factor variables), you can use the group_by function in the dplyr package.
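One way to obtain a per-species mean like the one shown below is group_by followed by summarise. This is a sketch under that assumption; the original call may have used a different summarise variant.

```r
library(dplyr)

# group the rows by species, then compute one mean per group
iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length = mean(Sepal.Length))
```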

## # A tibble: 3 x 2
##   Species    Sepal.Length
##   <fct>             <dbl>
## 1 setosa             5.01
## 2 versicolor         5.94
## 3 virginica          6.59

Notice that when you want to get some information for each group, group_by should come first in your pipe, followed by the functions that do the work for each group. The most important piping functions are the following ones.

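  • group_by: Groups your data by one or more variables.
  • filter: Keeps the rows that match your conditions. You can specify multiple conditions here.
  • select: Keeps the columns you name.
  • summarise_each: There are multiple summarise functions, and this one applies a function to each column. Note that summarise_each is deprecated in recent dplyr versions, where summarise combined with across does the same job.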
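The functions in this list can be combined freely. The sketch below shows filter with two conditions and a summarise over all numeric columns. It assumes dplyr 1.0.0 or later for across and where, and the variable names are hypothetical.

```r
library(dplyr)

# two conditions in one filter call are combined with a logical AND
large_setosa <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5) %>%
  select(Species, Sepal.Length)

# mean of every numeric column, computed separately for each species
all_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

all_means
```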

Remember that you must always start with your dataframe and then start piping. The original dataframe should not appear later in your pipe. Notice also that you can assign your piping result to a new variable!
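A sketch of storing a pipe result in a new variable. The name mean_sepal_length is hypothetical; printing the variable gives the table shown below.

```r
library(dplyr)

# the pipe starts with the data frame; the result is stored in a new variable
mean_sepal_length <- iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length = mean(Sepal.Length))

# print the stored result
mean_sepal_length
```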

## # A tibble: 3 x 2
##   Species    Sepal.Length
##   <fct>             <dbl>
## 1 setosa             5.01
## 2 versicolor         5.94
## 3 virginica          6.59