
RSelenium: How to crawl the web?

In this tutorial, you will learn how to scrape most webpages!

Before we get deeper into the R code, you should know a few things about web crawling and the different methodologies available. The structure of the target webpage determines which methodology to use:

  • Static & well-structured webpage: Static GET and POST requests can be used (see the short sketch after this list).
  • API available: Connect R to the API provided by the service provider or organisation (e.g. Google).
  • Dynamic webpage (e.g. Facebook, LinkedIn): To scrape these webpages you must use an automated web browser.
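
For the static case you often do not need a browser at all. The following is a minimal sketch using rvest; the URL and the CSS selector are placeholders for illustration, not taken from this tutorial.

# minimal sketch of the static case (placeholder URL and selector)
library(rvest)

page <- read_html("https://example.com")   # a plain GET request is enough here
titles <- page %>%
  html_nodes("h2") %>%                     # select elements via a CSS selector
  html_text(trim = TRUE)
head(titles)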

With RSelenium you can scrape most webpages. Hence, I will introduce you to this methodology.

Step 1: Get the package

# you need this package
install.packages("RSelenium")
require(RSelenium)

Step 2: Start the browser

# to which URL do you want to go?
URL <- "https://blogabet.com/tipsters"

# start your Google Chrome browser
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]
# go to the URL
remDr$navigate(URL)
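
A common stumbling block: chromever must correspond to a chromedriver that matches your locally installed Chrome version. The version string above is simply the one used when this post was written. As a hedged sketch, you can check which driver versions are available on your machine like this:

# list the chromedriver versions available locally
# (binman is installed as a dependency of RSelenium)
binman::list_versions("chromedriver")

# if the default port (4567L) is already blocked, start the driver on another one:
# rD <- rsDriver(browser = "chrome", chromever = "73.0.3683.68", port = 4568L)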

Step 3: Get the information you want
This is probably the trickiest part. However, with only a few lines of code you can access the required information. You must be able to read some HTML and identify the XPath of the object you want.

Step 3.1: XPath

  • Right-click on the webpage > Inspect (or press Ctrl + Shift + C).
    This will open a side panel where you can inspect the elements. Identify the required object with your mouse; in Chrome you can then copy its XPath by right-clicking the highlighted HTML node > Copy > Copy XPath.
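
Once you have copied the XPath, you can pass it straight to findElement. The selector below is a made-up placeholder to illustrate the call, not one taken from the example page.

# hypothetical XPath copied from the inspector (placeholder only)
xpath <- "//*[@id='main']/div[1]/h2"

webElem <- remDr$findElement(using = "xpath", value = xpath)
webElem$highlightElement()   # flashes the element in the browser window
webElem$getElementText()     # returns the element's visible text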

Step 4: Important functions you will need
These are probably the most useful functions you will need to download data from Facebook or LinkedIn.

# scroll down the webpage
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# scroll up the webpage
webElem$sendKeysToElement(list(key = "home"))

# system sleep
sleepTime  <- 10 # in seconds
Sys.sleep(sleepTime)

# get information using XPath
picks_archive <- remDr$findElements(using = 'xpath', ".//*[@class='block media _feedPick feed-pick']")
sapply(picks_archive, function(x) x$highlightElement())
picks_archive <- as.matrix(sapply(picks_archive, function(x) x$getElementText()))
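
Beyond scrolling and reading text, two calls I find useful are clicking elements and reading attributes; you can also hand the rendered page source to rvest. The selectors below are placeholders, not taken from the example page.

# click an element, e.g. a "load more" button (placeholder CSS selector)
load_more <- remDr$findElement(using = "css", value = "a.load-more")
load_more$clickElement()

# read an attribute, e.g. the link target of the first link on the page
link <- remDr$findElement(using = "css", value = "a")
link$getElementAttribute("href")

# parse the fully rendered page source with rvest
library(rvest)
page <- read_html(remDr$getPageSource()[[1]])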

Add-on: A full Facebook crawler

rm(list = ls())

# packages required in this analysis
require(RSelenium)   # automate the web browser
require(dplyr)       # data handling
require(tidyr)
require(installr)
library(xml2)
library(rvest)
library(httr)
require(qdap)        # regular expressions
require(stringr)     # regular expressions
library(plyr)        # data handling
require(ggplot2)
require(tikzDevice)  # LaTeX output
require(ggcorrplot)
require(stargazer)   # LaTeX output
require(magick)      # image analysis
require(svDialogs)   # pop-up dialogs
library(beepr)       # acoustic notification

# start RSelenium
rD <- rsDriver(browser = c("chrome"), chromever = "73.0.3683.68")
remDr <- rD[["client"]]

# Step one: log in to Facebook
# Goal: access each group feed and retrieve the posts and their authors
# initialise the base URL
base.URL <- c("https://www.facebook.com")
# go to the page
remDr$navigate(base.URL)

# login credentials
user <- "XXX"
pass <- "XXX"

# locate the login fields
username <- remDr$findElement(using = "name", value = "email")
username$highlightElement()
password <- remDr$findElement(using = "name", value = "pass")
password$highlightElement()

# enter the login credentials
username$sendKeysToElement(list(user))
password$sendKeysToElement(list(pass))

# click the login button (note: this id is specific to the page layout at the time of writing)
login_button <- remDr$findElement(using = 'id', "u_0_8")
login_button$clickElement()

# the groups to monitor
base.URL <- c("https://www.facebook.com/groups/sharingiscaringunisg/",
              "https://www.facebook.com/groups/sharingiscaringuniszurich/")

# small helper: wait for x seconds
testit <- function(x) {
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1 # the CPU usage should be negligible
}

# repeat the following block at a fixed interval (here: 100*100 seconds)
repeat {
  startTime <- Sys.time()

  # loop over the group pages
  for (i in 1:length(base.URL)) {
    remDr$navigate(base.URL[i])
    # wait for a moment
    testit(5)

    # scroll down the webpage
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)

    # get the post bodies
    body <- remDr$findElements(using = 'xpath', ".//*[@class='text_exposed_root']/p")
    result <- as.matrix(sapply(body, function(x) x$getElementText()))
    result <- unlist(result)
    result <- data.frame(result)
    testit(5)

    # scroll down the webpage again
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)

    # get the post authors
    post.by <- remDr$findElements(using = 'xpath', ".//*[@class='fwb fcg']/a")
    post.by <- as.matrix(sapply(post.by, function(x) x$getElementText()))
    post.by <- unlist(post.by)
    post.by <- data.frame(post.by)

    # convert to character
    result$result <- as.character(result$result)
    post.by$post.by <- as.character(post.by$post.by)

    # combine posts and authors
    result <- cbind(result[1:nrow(post.by), ], post.by)
    colnames(result) <- c("body", "Contact")
    result$body <- as.character(result$body)

    # find whether a specific string occurs
    string <- c(" R ", "R ", "Coding", "Nachhilfe")
    # an empty vector
    important.post <- rep(NA, nrow(result))
    # check whether the strings occur in the posts
    for (j in 1:length(string)) {
      important.post <- grepl(string[j], result$body)
      if (any(important.post)) {
        beep(sound = 3)
        dlgMessage(result[important.post, ])
      } else {
        print("Nothing interesting")
      }
    }
  }

  # wait until the next run
  sleepTime <- as.numeric(startTime + 100*100 - Sys.time(), units = "secs")
  if (sleepTime > 0) Sys.sleep(sleepTime)
}
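
When you are done (or before restarting the script), it is good practice to close the browser and stop the Selenium server; otherwise the port stays blocked. A short sketch:

# close the browser session and stop the Selenium server
remDr$close()
rD$server$stop()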
