2 Read data into R

Data comes in several formats, and you should know how to read each of them into R. We will start with the most basic formats:

  • Excel (.xls)
  • CSV (.csv)
  • Table (.txt)
  • Access database (.accdb)
  • Stata files (.dta)
  • XML (.xml)

2.1 .xls

First, I do not recommend reading .xls files directly into R, because it is often slow. Instead, save the Excel file as a comma-separated file (.csv). This makes your life easier and your code faster. If you still want to read Excel files directly, it is possible, but it requires an additional package.

The gdata package allows you to specify which sheet of the Excel file to read. However, as mentioned above, I do not recommend reading .xls files directly. Simply open your .xls file in Excel and choose File > Save As > CSV (.csv). Once the data is saved as a .csv file, you can import it into R much faster.
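If you do want to read the Excel file directly, the call could look roughly like the sketch below. The path data/example.xls is a hypothetical placeholder, and the code assumes that the gdata package and a Perl installation (which gdata needs to convert the file) are available. If they are not, the example only prints a message.

```r
# Hypothetical path, replace it with your own Excel file.
xls_path <- "data/example.xls"

if (requireNamespace("gdata", quietly = TRUE) && file.exists(xls_path)) {
  # sheet selects the worksheet, either by position or by name
  df_xls <- gdata::read.xls(xls_path, sheet = 1)
  print(head(df_xls))
} else {
  message("Install gdata (and Perl) and point xls_path to an existing file.")
}
```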

2.2 .csv

This is the file format you will encounter most often, so you should be comfortable reading it into R. Base R provides the read.csv function, which reads .csv files easily. You can download freely available data sets from Kaggle.

Start by setting your working directory again and then read the data with the command read.csv. First, notice that the file name must be in quotation marks. Second, the sep argument of the read function is usually either a semicolon (;) or a comma (,). It is easy to check whether you chose the right separator: look at the dimensions of the data frame in your global environment. If the separator is wrong, all values end up in a single column. Third, you must assign the result to a variable. Here I use df, short for data frame, but you can name it however you like.
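The three points above can be sketched as follows. To keep the example self-contained, it first writes a small semicolon-separated file to a temporary folder instead of relying on your working directory. The file name and its contents are invented for illustration.

```r
# Create a small semicolon-separated file (made-up example data).
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("id;name;score", "1;Anna;4.5", "2;Ben;5.0"), csv_path)

# Wrong separator: each line is read as one single value.
df_wrong <- read.csv(csv_path, sep = ",")
print(dim(df_wrong))

# Correct separator: the values are split into separate columns.
df <- read.csv(csv_path, sep = ";")
print(dim(df))
print(names(df))
```

With your own data you would call setwd() first and pass the file name in quotation marks, for example read.csv("myfile.csv", sep = ";"), where myfile.csv is a placeholder. Base R also offers read.csv2, which uses the semicolon as separator and the comma as decimal mark by default.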

2.3 .txt

I have not often been confronted with .txt files, but you need to know how to import them.
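A minimal sketch with base R's read.table could look like this. It assumes a tab-separated file with a header row. Your own .txt file may use a different separator, so adjust sep accordingly. The file and its values are invented for illustration.

```r
# Create a small tab-separated text file (made-up example data).
txt_path <- file.path(tempdir(), "example.txt")
writeLines(c("id\tcity\tvalue", "1\tZurich\t10", "2\tSt. Gallen\t20"), txt_path)

# header = TRUE uses the first line as column names,
# sep = "\t" tells R that the columns are separated by tabs.
df_txt <- read.table(txt_path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)
str(df_txt)
```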

2.4 .accdb

Larger data sets often come in a database such as Access, because Excel, and therefore any .csv file you create with it, is limited to about 1 million rows per sheet. When I was working as a data scientist for the FH in St. Gallen, I had data on over 14 million clients of Swiss banks, and the data came in an Access database. Connecting to the Access database took me quite a while, but I will show you that it can be very easy.
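One possible way to connect is sketched below. It rests on several assumptions that are not part of the text above: the RODBC package, an installed Microsoft Access ODBC driver (typically on Windows), a hypothetical database path, and a hypothetical table called clients. If the package or the file is missing, the example only prints a message.

```r
# Hypothetical path to the Access database, replace it with your own.
db_path <- "C:/data/clients.accdb"

if (requireNamespace("RODBC", quietly = TRUE) && file.exists(db_path)) {
  # Build an ODBC connection string for the Access driver (assumed installed).
  con <- RODBC::odbcDriverConnect(paste0(
    "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path
  ))

  # List the tables in the database, then read one of them.
  print(RODBC::sqlTables(con))
  df_db <- RODBC::sqlFetch(con, "clients")  # hypothetical table name

  # Always close the connection when you are done.
  RODBC::odbcClose(con)
} else {
  message("Install RODBC and point db_path to an existing Access file.")
}
```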

2.5 .dta

Files with the .dta extension are exported from Stata, and you can easily read them into R. Use the fromEncoding = "utf-8" argument when your data contains character strings with German characters such as ä, ö, ü.
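The text does not name the package, so the sketch below assumes readstata13, whose read.dta13 function accepts a fromEncoding argument. The path data/example.dta is a hypothetical placeholder. If the package or the file is missing, the example only prints a message.

```r
# Hypothetical path, replace it with your own Stata file.
dta_path <- "data/example.dta"

if (requireNamespace("readstata13", quietly = TRUE) && file.exists(dta_path)) {
  # fromEncoding states how the strings in the file are encoded,
  # which matters for characters such as ä, ö, ü.
  df_dta <- readstata13::read.dta13(dta_path, fromEncoding = "utf-8")
  str(df_dta)
} else {
  message("Install readstata13 and point dta_path to an existing file.")
}
```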