RSelenium: How to crawl the web?

In this tutorial, you will learn how to scrape most webpages!

Before we dive into the R code, you should know the main approaches to web crawling. The structure of the webpage determines which approach to use:

  • Static, well-structured webpage: static GET and POST requests can be used.
  • API available: connect R to the API provided by the service provider or organisation (e.g. Google).
  • Dynamic webpage (e.g. Facebook, LinkedIn): to scrape these webpages you must use an automated web browser.

With RSelenium you can scrape most webpages. Hence, this is the methodology I will introduce.
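For contrast, a static, well-structured page can often be handled without a browser at all. Below is a minimal sketch using the rvest package (my choice for illustration; it is not used elsewhere in this tutorial). To keep it runnable offline, it parses an HTML snippet instead of fetching a live URL:

```r
library(rvest)  # assumed installed; parses static HTML

# parse a small HTML snippet (a real request would use read_html("https://..."))
page <- read_html('<html><body><h1 class="title">Hello, web!</h1></body></html>')

# extract the heading text via an XPath expression
title <- html_text(html_element(page, xpath = "//h1[@class='title']"))
print(title)
```

For a truly static page, this is all you need; RSelenium only becomes necessary when the content is rendered by JavaScript.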

Step 1: Get the package

# you need this package
install.packages("RSelenium")
library(RSelenium)

Step 2: Start the browser

# to which URL do you want to go?
URL <- "https://blogabet.com/tipsters"

# start your Google Chrome browser (chromever must match your installed Chrome version)
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]
# go to the URL
remDr$navigate(URL)

Step 3: Get the information you want
This is probably the trickiest part. However, with only a few lines of code, you can access the required information. You must be able to understand some HTML and identify the XPath of the object you are after.

Step 3.1: XPath

  • Right-click on the webpage > Inspect (or press Ctrl + Shift + C).
    This opens a side window where you can inspect the elements. Identify the required object with your mouse.
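To get a feeling for how an XPath expression selects elements, here is a small offline sketch using the xml2 package (my choice for illustration; RSelenium accepts the same XPath syntax in findElements()):

```r
library(xml2)  # assumed installed; used only to illustrate XPath matching

html <- '<div>
  <p class="feed-pick">Pick 1</p>
  <p class="ad">Advertisement</p>
  <p class="feed-pick">Pick 2</p>
</div>'

doc <- read_html(html)
# the same ".//*[@class=...]" pattern is used with RSelenium below
picks <- xml_find_all(doc, ".//p[@class='feed-pick']")
xml_text(picks)
```

The `@class` predicate keeps only the elements whose class attribute matches exactly, which is why inspecting the element and copying its class is usually enough.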

Step 4: Important functions you will need
These are probably the most useful functions you will need to download data from Facebook or LinkedIn.

# scroll down the webpage
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# scroll up the webpage
webElem$sendKeysToElement(list(key = "home"))

# pause the system, e.g. to let content load
sleepTime <- 10 # in seconds
Sys.sleep(sleepTime)

# get information using XPath
picks_archive <- remDr$findElements(using = 'xpath', ".//*[@class='block media _feedPick feed-pick']")
sapply(picks_archive, function(x) x$highlightElement())
picks_archive <- as.matrix(sapply(picks_archive, function(x) x$getElementText()))

Add on: A full Facebook crawler:

# packages required in this analysis
require(RSelenium)  # automate the web browser
require(qdap)       # regular expressions
require(stringr)    # regular expressions
library(plyr)       # data handling
require(tikzDevice) # for LaTeX output
require(stargazer)  # for LaTeX output
require(magick)     # for image analysis
library(dplyr)      # data handling

# start RSelenium
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]

# Step 1: log in to Facebook
# initialise the base URL and go to the page
base.URL <- c("https://www.facebook.com")
remDr$navigate(base.URL)

# login credentials (replace with your own)
user <- "XXX"
pass <- "XXX"

# find the login fields and enter the credentials
username <- remDr$findElement(using = "name", value = "email")
password <- remDr$findElement(using = "name", value = "pass")
username$sendKeysToElement(list(user))
password$sendKeysToElement(list(pass))

# click the login button
login_button <- remDr$findElement(using = 'id', "u_0_8")
login_button$clickElement()

# Step 2: the group(s) you want to monitor
base.URL <- c("https://www.facebook.com/groups/sharingiscaringunisg/")

# run the following code every 15 minutes
repeat {
  startTime <- Sys.time()

  # loop through the webpages
  for (i in 1:length(base.URL)) {
    remDr$navigate(base.URL[i])

    # wait a moment so the page can load
    Sys.sleep(5)

    # scroll down the webpage
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))

    # extract the post bodies
    body <- remDr$findElements(using = 'xpath', ".//*[@class='text_exposed_root']/p")
    result <- as.matrix(sapply(body, function(x) x$getElementText()))
    result <- data.frame(result = unlist(result))

    # scroll down again and extract the post authors
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    post.by <- remDr$findElements(using = 'xpath', ".//*[@class='fwb fcg']/a")
    post.by <- as.matrix(sapply(post.by, function(x) x$getElementText()))
    post.by <- data.frame(post.by = unlist(post.by))

    # convert to character and combine body and author
    result$result <- as.character(result$result)
    post.by$post.by <- as.character(post.by$post.by)
    result <- cbind(result, post.by)
    colnames(result) <- c("body", "Contact")
    result$body <- as.character(result$body)

    # find whether specific strings occur in the posts
    string <- c(" R ", "R ", "Coding", "Nachhilfe")
    # an empty vector to flag the important posts
    important.post <- rep(NA, nrow(result))
    # check whether the strings occur in the webpage
    for (j in 1:length(string)) {
      important.post[grepl(string[j], result$body, fixed = TRUE)] <- string[j]
    }
    if (all(is.na(important.post))) {
      print("Nothing interesting")
    }
  }

  # sleep until 15 minutes have passed since startTime
  sleepTime <- as.numeric(startTime) + 15 * 60 - as.numeric(Sys.time())
  if (sleepTime > 0) Sys.sleep(sleepTime)
}

Visualize spatial data in R

This will be a short tutorial on how to do the following spatial plots.

Fig 1: House prices in Bern (Switzerland)

Working with spatial data means that for each data point you have information on its specific location (e.g. longitude, latitude, city name). If this type of data is available, it is easy to make nice plots. Spatial data is freely available online (cf. here).

Second, you will need a map of a country as a spatial object. Spatial maps are freely available here. However, you don’t need to download the data manually: there is an easy way to access all the information within R.

R example:

# you will need the following package
library(raster)

# download the map of Switzerland: level 3 means that you want to have all
# municipalities
CH <- getData("GADM", country = "Switzerland", level = 3)

# now you can plot the map easily:
plot(CH)

You can also visualize your data interactively (cf. here).
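As a sketch of the interactive route, the leaflet package (my choice here; any htmlwidgets mapping package would do) can render points on a zoomable map. The coordinates and labels below are hypothetical:

```r
library(leaflet)  # assumed installed

# hypothetical house-price points in Bern
prices <- data.frame(lng = c(7.44, 7.45), lat = c(46.95, 46.96),
                     label = c("CHF 1.2m", "CHF 0.9m"))

m <- leaflet(prices) %>%
  addTiles() %>%                          # background map tiles
  addMarkers(~lng, ~lat, popup = ~label)  # one clickable marker per row
m  # opens an interactive map in the RStudio viewer
```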

Data Science

R: Merge function

The merge function in R allows the user to combine multiple data frames into a single data frame. Imagine you have information about students (name and average grade) in one data frame (A) and the students’ age in another data frame (B).
However, the goal is to access age and average grade together. Therefore, you need to merge the data.

There are 4 options to merge your data:
Option 1: inner join
Returns only the rows in which the left table has matching keys in the right table.
Option 2: outer join
Returns all rows from both tables, joining records from the left that have matching keys in the right table.
Option 3: left join
Returns all rows from the left table, and any rows with matching keys from the right table.
Option 4: right join
Returns all rows from the right table, and any rows with matching keys from the left table.


# Data frame A
name <- c("Tom","Jack","Johanna","Simon","Dario")
grade <- c(5, 5.5, 4, 6, 3.5)
A <- data.frame(name, grade)

# Data frame B
name <- c("Tom","Johanna","Lukas")
age <- c(21, 22, 23)
B <- data.frame(name, age)
Inner join:
 merge(x = A, y = B, by = "name")
Outer join:
 merge(x = A, y = B, by = "name", all = TRUE)
Left outer:
 merge(x = A, y = B, by = "name", all.x = TRUE)
Right outer:
 merge(x = A, y = B, by = "name", all.y = TRUE)
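The same four joins can also be written with dplyr verbs (dplyr is already loaded in the crawler section above). A quick sketch using the A and B data frames from this example:

```r
library(dplyr)  # assumed installed

A <- data.frame(name = c("Tom","Jack","Johanna","Simon","Dario"),
                grade = c(5, 5.5, 4, 6, 3.5))
B <- data.frame(name = c("Tom","Johanna","Lukas"),
                age = c(21, 22, 23))

inner_join(A, B, by = "name")  # 2 rows: Tom and Johanna
full_join(A, B, by = "name")   # 6 rows: everyone from A and B
left_join(A, B, by = "name")   # 5 rows: all of A
right_join(A, B, by = "name")  # 3 rows: all of B
```

The dplyr verbs make the join direction explicit in the function name, which many find easier to read than merge’s all.x/all.y flags.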