
Advanced web scraping techniques in R – or how to obtain data from the web

Web scraping has become one of a data scientist's essential skills. In this short tutorial I will teach you some awesome web scraping techniques that you can use for your academic research.

My R-Code can be accessed here.

Leave a like if this short tutorial helped you extract the desired data.


How to code your own email sender – or how to spam the world :)

This short blog will teach you how to code your own spam email sender in R. Nevertheless, I do not recommend using this code for spam emails! ;)

Yes, I know there are several online tools that allow you to send multiple emails (e.g. Mailchimp). However, coding is fun, and this code is faster and can easily be adapted to your needs.

So let’s get started: You will need the following packages. Notice that RDCOMClient must be installed directly from GitHub – use the correct path to the GitHub repository and install the package [line 2]. The "gender" package will help you to find the gender when your email database only includes first name, last name and email address. A professional email should include the correct salutation. Check out the package yourself – I am only sharing a basic email-sender code.

library("devtools")
install_github('omegahat/RDCOMClient')
library(RDCOMClient)
require(gender)
require(huxtable)
require(dplyr)
require(xlsx)
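
If your contact database does not already contain a gender column, the gender package can guess it from first names. A minimal sketch (the names below are just an illustration, not part of the email code):

# guess the gender from first names (method "ssa" uses U.S. Social Security data)
guess <- gender(c("Luca", "Patrick", "James"), method = "ssa")
guess[, c("name", "gender")]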

Step 1: Create some sample data

final <- data.frame("first.name" = c("Luca","Patrick","James"),
                    "last.name" = c("Liebi","Star","Bond"),
                    "email.address" = c("luca.liebi@helloworld.ch", "patrick.star@helloworld.ch", "james.bond@helloworld.ch"),
                    "gender" = c("male","male","male"),
                    stringsAsFactors = FALSE)

Step 2: Code your email sender

# start a loop that sends each row (each contact) an email
for(i in 1:nrow(final)){

# this function simply catches errors in your loop
tryCatch({

# you need outlook :)
OutApp <- COMCreate("Outlook.Application")

# create an email 
outMail = OutApp$CreateItem(0) 

# build the correct salutation (anrede)
anrede <- NA 
if(final$gender[i]=="male"){ 
anrede <- paste("Dear Mr.",final$last.name[i])
}else{ 
anrede <- paste("Dear Mrs.",final$last.name[i])
} 

# here enter your text for your email:
body1 <- "Hello - it's me Luca and this is an awesome code :)" 

body1.1 <-"Paragraph 2: Insert some other information"

# paste your email text together - note that you can use HTML code to insert # a new paragraph
body1 <- paste(body1,body1.1,sep="<br><br>")

# complete your email by adding the salutation (anrede)
text <- paste(anrede,body1,sep="<br><br>")

# configure important email parameter

    # who should receive the email?
    email.address <- final$email.address[i]
    outMail[["To"]] <- email.address
    # subject of the email
    outMail[["subject"]] = "I want to spam you"
    # from which email address do you want to send your email?
    outMail[["sentonbehalfofname"]] = "luca.liebi@smartman.ch"
    
    outMail[["HTMLBody"]] = text
    
    outMail$Send()
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

Now you are already done 🙂 – this code will send only 3 emails. However, depending on your database you can send many more emails!
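
Since the xlsx package is already loaded above, you could also read your contact database from an Excel file instead of using the sample data. A minimal sketch (the file name and sheet are assumptions):

# read your own contact list from Excel (hypothetical file and sheet)
final <- read.xlsx("contacts.xlsx", sheetIndex = 1)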

Step 3: Some advanced tips

Adding PDF attachments to your email is easy:

outMail[["attachments"]]$Add("Path\\to\\your\\pdf.pdf")

HTML formatting – simply use HTML in the body of your text:

body1 <- "Hello - it's me <b>Luca</b> and this is an awesome code :)" 

Inserting an image:

outMail[["attachments"]]$Add("Path\\to\\your\\PNG\\file\\logo.PNG")

# specify the size of the logo
logo <- paste0("<img src='cid:",
                basename("logo.PNG"),
               "' width = '340' height = '70'>") 

RSelenium: How to crawl the web?

In this tutorial, you will learn how to scrape most webpages!

Before we get deeper into the R code, you need to know about the different web-crawling methodologies. The structure of the webpage determines which one to use:

  • Static & well-structured webpage: static GET and POST requests can be used (see the short sketch after this list).
  • API available: connect R to the API provided by the service provider or organisation (e.g. Google).
  • Dynamic webpage (e.g. Facebook, LinkedIn): to scrape these webpages you must use an automated web browser.
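
For the static case, a few lines with the rvest package are usually enough. Here is a minimal sketch (the URL and the CSS selector are placeholders, not a real target):

# a minimal static-scraping sketch with rvest (placeholder URL and selector)
require(rvest)
page <- read_html("https://example.com")
titles <- html_text(html_nodes(page, "h2"))
head(titles)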

With RSelenium you can scrape most webpages. Hence, I will introduce you to this methodology.

Step 1: Get the package

# you need this package
install.packages("RSelenium")
require(RSelenium)

Step 2: Start the browser

# to which URL do you want to go?
URL <- "https://blogabet.com/tipsters"

# start your Google Chrome browser (chromever must match the Chrome version installed on your machine)
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]
# go to the URL
remDr$navigate(URL)

Step 3: Get the information you want
This is probably the trickiest part. However, with only a few lines of code, you can access the required information. You must be able to understand some HTML and identify the XPath of the object.

Step 3.1: XPath

  • Right-click on the webpage > Inspect (German: Untersuchen), or press Ctrl + Shift + C.
    This will open a side window where you can inspect the elements. Identify the required object with your mouse (see the sketch below).
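
Once you have identified the element, you can copy its XPath from the inspector and use it directly in R. A minimal sketch (the XPath below is a placeholder):

# locate a single element by its XPath and read its text (placeholder XPath)
webElem <- remDr$findElement(using = "xpath", "//*[@id='example-id']/div")
webElem$getElementText()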

Step 4: Important functions you will need
These are probably the most useful functions for downloading data from pages like Facebook or LinkedIn.

#scroll down webpage
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# scroll up the webpage
webElem$sendKeysToElement(list(key = "home"))

# system sleep
sleepTime  <- 10 # in seconds
Sys.sleep(sleepTime)

# get information using XPath
picks_archive <- remDr$findElements(using = 'xpath', ".//*[@class='block media _feedPick feed-pick']")
sapply(picks_archive, function(x) x$highlightElement())
picks_archive <- as.matrix(sapply(picks_archive, function(x) x$getElementText()))

Add-on: a full Facebook crawler:

rm(list=ls())
# packages required in this analysis
require(RSelenium)  # automate the web browser
require(dplyr)      # data handling
require(tidyr)
require(installr)
library(xml2)
library(rvest)
library(httr)
require(qdap)       # regular expressions
require(stringr)    # regular expressions
library(plyr)       # data handling
require(ggplot2)
require(tikzDevice) # LaTeX output
require(ggcorrplot)
require(stargazer)  # LaTeX output
require(magick)     # image analysis
require(svDialogs)  # pop-up dialogs
library(beepr)      # sound alerts
# start RSelenium
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]

# Step one: log in to Facebook
# goal: being able to access each group individually and retrieve the posts
# initialise the base URL
base.URL <- c("https://www.facebook.com")
# go to the page
remDr$navigate(base.URL)
# login credentials
user <- "XXX"
pass <- "XXX"
# locate the login fields
username <- remDr$findElement(using = "name", value = "email")
username$highlightElement()
password <- remDr$findElement(using = "name", value = "pass")
password$highlightElement()
# enter the login credentials
username$sendKeysToElement(list(user))
password$sendKeysToElement(list(pass))
# click the login button (the element id may differ)
login_button <- remDr$findElement(using = 'id',"u_0_8")
login_button$clickElement()
# the group pages to monitor
base.URL <- c("https://www.facebook.com/groups/sharingiscaringunisg/",
              "https://www.facebook.com/groups/sharingiscaringuniszurich/")
# run the following code every 15 minutes
repeat {
  startTime <- Sys.time()
  # loop over the group pages
  for(i in 1:length(base.URL)){
    remDr$navigate(base.URL[i])
    # helper function that waits for x seconds
    testit <- function(x)
    {
      p1 <- proc.time()
      Sys.sleep(x)
      proc.time() - p1 # the CPU usage should be negligible
    }
    # wait for a moment
    testit(5)
    # scroll down the webpage
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)
    # extract the post bodies
    body <- remDr$findElements(using = 'xpath', ".//*[@class='text_exposed_root']/p")
    result <- as.matrix(sapply(body, function(x) x$getElementText()))
    result <- unlist(result)
    result <- data.frame(result)
    testit(5)
    # scroll down the webpage again
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)
    # extract the authors of the posts
    post.by <- remDr$findElements(using = 'xpath', ".//*[@class='fwb fcg']/a")
    post.by <- as.matrix(sapply(post.by, function(x) x$getElementText()))
    post.by <- unlist(post.by)
    post.by <- data.frame(post.by)
    # convert to character
    result$result <- as.character(result$result)
    post.by$post.by <- as.character(post.by$post.by)
    result <- cbind(result[1:nrow(post.by),], post.by)
    colnames(result) <- c("body","Contact")
    result$body <- as.character(result$body)
    # strings we are looking for in the posts
    string <- c(" R ","R ","Coding","Nachhilfe")
    # an empty vector
    important.post <- rep(NA,nrow(result))
    # check whether the strings occur in the posts
    for(j in 1:length(string)){
      important.post <- grepl(string[j], result$body)
      if(any(important.post)==TRUE){
        beep(sound=3)
        dlgMessage(result[important.post==TRUE,])
      }else{
        print("Nothing interesting")
      }
    }
  }
  # wait until 15 minutes have passed since the last start
  sleepTime <- as.numeric(startTime + 15*60 - Sys.time(), units = "secs")
  if (sleepTime > 0)
    Sys.sleep(sleepTime)
}
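
The repeat loop runs until you interrupt it manually. Once you stop it, you may want to close the browser and shut down the Selenium server (this part is not in the original script):

# close the browser and stop the Selenium server
remDr$close()
rD$server$stop()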

Visualize spatial data in R

This will be a short tutorial on how to create spatial plots like the following.

Fig 1: House prices in Bern (Switzerland)

Working with spatial data means that for each data point you have information on its specific location (e.g. longitude, latitude, city name). If you have this type of data available, it is easy to make nice plots. Spatial data is freely available online (cf. here).

Second, you will need a map of a country as a spatial object. Spatial maps are freely available here. However, you don’t need to download the data manually – there is an easy way to access all the information within R.

R example:

# you will need the following package
require(raster)

# download the map of Switzerland: level 3 means that you want to have all
# municipalities
CH  <- getData("GADM",country="Switzerland",level=3)

# now you can just plot the map easily:
plot(CH,col="darkgrey")
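
The figure above (Fig 1) colors each municipality by house prices. As a minimal sketch of how such a choropleth can be built, here is the same map with simulated values (the prices are random numbers, purely for illustration):

# color each municipality by a (simulated) value - real house prices would go here
require(sp)
CH$price <- runif(nrow(CH), min = 5000, max = 15000)
spplot(CH, "price", main = "Simulated prices per municipality")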

You can also visualize your data interactively (cf. here).
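
One handy option for interactive maps is the leaflet package (my suggestion, not necessarily the tool behind the linked example). A minimal sketch:

# a minimal interactive map centred on Bern
require(leaflet)
leaflet() %>%
  addTiles() %>%
  addMarkers(lng = 7.4474, lat = 46.9480, popup = "Bern")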