Categories
Business, Data Science, Finance

Advanced web scraping techniques in R – or how to obtain data from the web

Web scraping has become one of a data scientist's essential skills. In this short tutorial I will show you some web scraping techniques that you can use for your academic research.

My R code can be accessed here.
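
The linked code is not reproduced in this post. As a rough illustration of what scraping in R can look like, here is a minimal sketch that assumes the rvest package is installed (the post does not say which package the linked code uses). The inline HTML, the table contents and the CSS selectors are all made up for the example, so it runs without a network connection.

```r
library(rvest)

# Made-up HTML that stands in for a downloaded page, so the sketch runs offline.
html <- '<html><body>
  <h1>Prices</h1>
  <table id="quotes">
    <tr><th>ticker</th><th>price</th></tr>
    <tr><td>AAA</td><td>10.5</td></tr>
    <tr><td>BBB</td><td>20.0</td></tr>
  </table>
  <a class="more" href="/page/2">next</a>
</body></html>'

# read_html() also accepts a URL string when scraping a live site.
page <- read_html(html)

# Select nodes with CSS selectors, then extract text, tables or attributes.
title     <- html_text2(html_element(page, "h1"))
quotes    <- html_table(html_element(page, "#quotes"))
next_link <- html_attr(html_element(page, "a.more"), "href")
```

Here `quotes` comes back as a data frame with one row per table row, and `next_link` holds the relative address of the following page, which is the usual starting point for crawling several pages in a loop.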

Leave a like if this short tutorial helped you extract the desired data.

Categories
Data Science

R: Merge function

The merge function in R combines two data frames into a single data frame by matching rows on one or more key columns. Imagine you have information about students (name and average grade) in one data frame (A) and the students' ages in another data frame (B).
The goal is to access age and average grade side by side. To do that, you need to merge the two data frames on the student's name.

There are 4 options to merge your data:
Option 1: inner join
Returns only the rows whose keys appear in both the left and the right table.
Option 2: outer join
Returns all rows from both tables. Rows with matching keys are combined, and values missing on either side are filled with NA.
Option 3: left join
Returns all rows from the left table, plus the values from any rows with matching keys in the right table.
Option 4: right join
Returns all rows from the right table, plus the values from any rows with matching keys in the left table.

Setup:

# Data frame A: student names and average grades
name <- c("Tom","Jack","Johanna","Simon","Dario")
grade <- c(5,5.5,4,6,3.5)
# data.frame() keeps grade numeric; cbind() would coerce it to character
A <- data.frame(name = name, grade = grade)

# Data frame B: student names and ages
age <- c(21,22,23)
B <- data.frame(name = c("Tom","Johanna","Lukas"), age = age)

Inner join:
merge(x = A, y = B, by = "name")
Outer join:
merge(x = A, y = B, by = "name", all = TRUE)
Left join:
merge(x = A, y = B, by = "name", all.x = TRUE)
Right join:
merge(x = A, y = B, by = "name", all.y = TRUE)
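
To see how the four joins differ, it helps to compare the number of rows each one returns. The following sketch rebuilds the two example data frames so that it runs on its own; the variable names inner, outer, left and right are only chosen for this illustration.

```r
# Rebuild the example data so the block is self-contained.
A <- data.frame(name  = c("Tom", "Jack", "Johanna", "Simon", "Dario"),
                grade = c(5, 5.5, 4, 6, 3.5))
B <- data.frame(name = c("Tom", "Johanna", "Lukas"),
                age  = c(21, 22, 23))

inner <- merge(A, B, by = "name")                # keys present in both
outer <- merge(A, B, by = "name", all = TRUE)    # keys from either side
left  <- merge(A, B, by = "name", all.x = TRUE)  # every row of A
right <- merge(A, B, by = "name", all.y = TRUE)  # every row of B

# Row counts per join type
sapply(list(inner = inner, outer = outer, left = left, right = right), nrow)
# inner outer  left right
#     2     6     5     3
```

Only Tom and Johanna appear in both data frames, so the inner join has two rows. In the left join, Jack, Simon and Dario get NA as age, and in the right join Lukas gets NA as grade.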

Categories
Data Science

Connect RStudio to an Access database (64-bit)

When working with larger datasets you are frequently confronted with an Access database. In this post, I will show you how easy it is to connect RStudio to such a database.

First you need to load the required package, or install it if you don't have it yet.

install.packages("RODBC")
library(RODBC)
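
If you run the script repeatedly, you may not want to reinstall the package every time. One possible way to handle this is sketched below; needs_install and load_or_install are hypothetical helper names introduced here, not functions from RODBC or base R.

```r
# Hypothetical helper: TRUE when the package is not installed yet.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: install the package only if needed, then attach it.
load_or_install <- function(pkg) {
  if (needs_install(pkg)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# Usage (downloads from CRAN on first use):
# load_or_install("RODBC")
```

The call is left commented out because it needs an internet connection the first time it runs.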

In the next step you need to set up the connection to your database. This can be done in one line of code:

con<-odbcDriverConnect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:/Users/lliebi/Dropbox/Daten Luca/PhD/other Projects/Run python in R/Test.accdb")

Note that for your own purposes you only need to change the path to your file. Simply replace the string after DBQ= with your specific file path.
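
Since only the path changes, the connection string can be assembled from a variable. This is a minimal sketch: access_connection_string is a hypothetical helper name, and the path below is a placeholder that you have to replace with your own file.

```r
# Hypothetical helper: build the ODBC connection string for an Access file.
access_connection_string <- function(db_path) {
  paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)
}

db_path  <- "C:/path/to/your/Test.accdb"  # placeholder, replace with your own path
conn_str <- access_connection_string(db_path)

# Connect only when RODBC is installed and the file really exists.
if (requireNamespace("RODBC", quietly = TRUE) && file.exists(db_path)) {
  con <- RODBC::odbcDriverConnect(conn_str)
}
```

Checking file.exists() first gives a clearer failure than an ODBC error when the path contains a typo.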

Once you are connected to the database, the connection object (con) appears in RStudio's global environment.

If you want to read data from the database use sqlFetch:

data <- sqlFetch(con, "Tabelle1")

You can also write to the database (e.g. after having done some analysis). This is also very helpful when you are crawling data from the web: you can save the data in the database and read it back from there at the same time.

sqlSave(con,as.data.frame(test), tablename="Table2")
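
Writing and reading can be combined into one step, and it is good practice to close the connection afterwards with odbcClose. The sketch below wraps this in a hypothetical helper called save_and_reload; it assumes an open RODBC connection and that the target table does not exist yet (sqlSave needs append = TRUE to add rows to an existing table).

```r
# Hypothetical helper: write a data frame to a table, read it back, then close.
save_and_reload <- function(con, df, table) {
  on.exit(RODBC::odbcClose(con))  # always release the connection at the end
  RODBC::sqlSave(con, df, tablename = table, rownames = FALSE)
  RODBC::sqlFetch(con, table)     # the stored table, returned as a data frame
}

# Usage, assuming `con` is an open connection and `test` holds your results:
# result <- save_and_reload(con, as.data.frame(test), "Table2")
```

If you want to keep working with the connection, drop the on.exit() line and call odbcClose(con) yourself once you are done.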

I remember the first time I worked with Access databases: it took me quite a while to get at the data. I hope this small tutorial helps you when you are confronted with the same problem.