
Advanced web scraping techniques in R – or how to obtain data from the web

Web scraping has become one of the data scientist's essential skills. In this short tutorial I will teach you some awesome web scraping techniques that you can use for your academic research.

My R-Code can be accessed here.
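
Since the linked script is not reproduced in this post, here is a minimal sketch of the general idea using the rvest package (the URL and the CSS selector are only placeholders, not the code behind the link):

# read a static page and pull out all hyperlinks as a starting point
library(rvest)

page  <- read_html("https://www.r-project.org/")   # placeholder URL
links <- html_attr(html_nodes(page, "a"), "href")  # extract all hyperlinks
head(links)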

Leave a like if this short tutorial helped you extract the desired data.


How to code your own email sender – or how to spam the world :)

This short blog will teach you how to code your own spam email sender in R. Nevertheless, I do not recommend using this code for spam emails! ;)

Yes, I know there are several online tools that allow you to send multiple emails (e.g. Mailchimp). However, coding is fun, and this code is faster and can easily be adapted to your needs.

So let's get started: You will need the following packages. Notice that RDCOMClient must be installed directly from GitHub – use the correct path to the GitHub directory and install the package [line 2]. The "gender" package will help you to find the gender when your email database only includes first name, last name and email address (see the short sketch after the package list). A professional email should include the correct salutation. Check out the package yourself – I am only sharing a basic email-sender code.

library("devtools")
install_github('omegahat/RDCOMClient')
library(RDCOMClient)
require(gender)
require(huxtable)
require(dplyr)
require(xlsx)
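
Not part of the basic sender, but here is a quick sketch of how the gender package can help: gender() guesses the most likely gender from a vector of first names (illustrative only – check ?gender for the available methods and data sources).

# illustrative: infer the most likely gender from first names
gender(c("Luca", "Patrick", "James"))
# the "gender" column of the result can be copied into your email database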

Step 1: Create some sample data

final <- data.frame("first.name"   = c("Luca","Patrick","James"),
                    "last.name"    = c("Liebi","Star","Bond"),
                    "email.adress" = c("luca.liebi@helloworld.ch",
                                       "patrick.star@helloworld.ch",
                                       "james.bond@helloworld.ch"),
                    "gender"       = c("male","male","male"),
                    stringsAsFactors = FALSE)

Step 2: Code your email sender

# start a loop that sends one email per row (name)
for(i in 1:nrow(final)){

# this function simply catches errors in your loop
tryCatch({

# you need outlook :)
OutApp <- COMCreate("Outlook.Application")

# create an email 
outMail = OutApp$CreateItem(0) 

# build the correct salutation (anrede)
anrede <- NA 
if(final$gender[i]=="male"){ 
anrede <- paste("Dear Mr.",final$last.name[i])
}else{ 
anrede <- paste("Dear Mrs.",final$last.name[i])
} 

# here enter your text for your email:
body1 <- "Hello - it's me Luca and this is an awesome code :)" 

body1.1 <-"Paragraph 2: Insert some other information"

# paste your email text together - notice that you can use HTML to start a new paragraph
body1 <- paste(body1,body1.1,sep="<br><br>")

# complete your email by putting the salutation (anrede) in front
text <- paste(anrede,body1,sep="<br><br>")

# configure the important email parameters

    # who should receive the email?
    email.adress <- final$email.adress[i]
    outMail[["To"]] <- email.adress
    # Subject of the email
    outMail[["subject"]] = "I want to spam you"
    # from which email address do you want to send your email
    outMail[["sentonbehalfofname"]] = "luca.liebi@smartman.ch"
    
    outMail[["HTMLBody"]] = text
    
    outMail$Send()
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

Now you are already done 🙂 – this code will send only 3 emails. However, depending on your database, you can send many more emails!

Step 3: Some advanced tips

Adding PDF attachments to your email is easy:

outMail[["attachments"]]$Add("Path\\to\\your\\pdf.pdf")

HTML formatting – simply use HTML in the body of your text:

body1 <- "Hello - it's me <b>Luca</b> and this is an awesome code :)" 

Inserting an image:

outMail[["attachments"]]$Add("Path\\to\\your\\PNG\\file\\logo.PNG")

# specify the size of the logo
logo <- paste0("<img src='cid:",
                basename("logo.PNG"),
               "' width = '340' height = '70'>") 

RSelenium: How to crawl the web?

In this tutorial, you will learn how to scrape most webpages!

Before we get deeper into the R code, you must know a few things about web crawling and its different methodologies. The webpage structure determines which methodology to use:

  • Static & well structured webpage: Static GET and POST requests can be used.
  • API available: Connect R to the API provided by the service provider or organisation (e.g. Google).
  • Dynamic webpage (e.g. Facebook, LinkedIn): To scrape these webpages you must use an automated Webbrowser
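
For the first case, a minimal sketch with httr and rvest (the URL and the selector are only placeholders, not part of this tutorial):

# static GET request against a simple, well-structured page
library(httr)
library(rvest)

resp <- GET("https://www.r-project.org/")       # placeholder URL
page <- read_html(content(resp, as = "text"))   # parse the returned HTML
html_text(html_nodes(page, "h1"))               # extract the headline(s)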

With RSelenium you can scrape most webpages. Hence, I will introduce you to this methodology.

Step 1: Get the package

# you need these packages
install.packages("RSelenium")
require(RSelenium)

Step 2: Start the browser

# to which URL do you want to go?
URL <- "https://blogabet.com/tipsters"

# start your Google Chrome browser
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]
# go to the URL
remDr$navigate(URL)

Step 3: Get the information you want
This is probably the trickiest part. However, with only a few lines of code, you can access the required information. You must be able to understand some HTML and identify the XPath of the object.

Step 3.1: XPath

  • Right click on the webpage > Inspect > Ctrl + Shift + C
    This will open a side window where you can inspect the elements. Identify the required object with your mouse.

Step 4: Important functions you will need
These are probably the most useful functions you will need to download data from Facebook or LinkedIn.

# scroll down the webpage
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# scroll up the webpage
webElem$sendKeysToElement(list(key = "home"))

# system sleep
sleepTime  <- 10 # in seconds
Sys.sleep(sleepTime)

# get information using xpath
 picks_archive<-remDr$findElements(using='xpath',".//*[@class='block media _feedPick feed-pick']")
 sapply(picks_archive, function(x) x$highlightElement())
 picks_archive<-as.matrix(sapply(picks_archive, function(x) x$getElementText()))
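
Two more calls worth knowing: when you are done scraping, close the browser and stop the Selenium server again.

# close the browser and shut down the Selenium server
remDr$close()
rD[["server"]]$stop()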

Add-on: a full Facebook crawler:

rm(list=ls())
# packages required in this analysis
require(RSelenium)   # automate the web browser
require(dplyr)       # data handling
require(tidyr)
require(installr)
library(xml2)
library(rvest)
library(httr)
require(qdap)        # for regular expressions
require(stringr)     # regular expressions
library(plyr)        # data handling
require(ggplot2)
require(tikzDevice)  # for LaTeX output
require(ggcorrplot)
require(stargazer)   # for LaTeX output
require(magick)      # for image analysis
require(svDialogs)   # pop-up dialogs
library(beepr)       # audio alerts

# start RSelenium
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]

# Step one: log in to Facebook
# Goal: monitor the group feeds and get alerted when interesting posts appear

# initialise the base URL
base.URL <- c("https://www.facebook.com")
# go to the page
remDr$navigate(base.URL)

# login credentials
user <- "XXX"
pass <- "XXX"

# locate the login fields
username <- remDr$findElement(using = "name", value = "email")
username$highlightElement()
password <- remDr$findElement(using = "name", value = "pass")
password$highlightElement()

# enter the login credentials
username$sendKeysToElement(list(user))
password$sendKeysToElement(list(pass))

# click the login button
login_button <- remDr$findElement(using = 'id',"u_0_8")
login_button$clickElement()

# the groups to monitor
base.URL <- c("https://www.facebook.com/groups/sharingiscaringunisg/",
              "https://www.facebook.com/groups/sharingiscaringuniszurich/")

# rerun the following code periodically
repeat {
  startTime <- Sys.time()

  # loop through the group pages
  for(i in 1:length(base.URL)){
    remDr$navigate(base.URL[i])

    # small helper that waits for x seconds
    testit <- function(x)
    {
      p1 <- proc.time()
      Sys.sleep(x)
      proc.time() - p1 # the CPU usage should be negligible
    }
    # wait for a moment
    testit(5)

    # scroll down the webpage
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)
    # extract the post bodies
    body <- remDr$findElements(using = 'xpath',".//*[@class='text_exposed_root']/p")
    result <- as.matrix(sapply(body, function(x) x$getElementText()))
    result <- unlist(result)
    result <- data.frame(result)
    testit(5)

    # scroll down the webpage again
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    testit(5)

    # extract the authors of the posts
    post.by <- remDr$findElements(using = 'xpath',".//*[@class='fwb fcg']/a")
    post.by <- as.matrix(sapply(post.by, function(x) x$getElementText()))
    post.by <- unlist(post.by)
    post.by <- data.frame(post.by)

    # convert to character
    result$result <- as.character(result$result)
    post.by$post.by <- as.character(post.by$post.by)

    # combine post bodies and authors
    result <- cbind(result[1:nrow(post.by),], post.by)
    colnames(result) <- c("body","Contact")
    result$body <- as.character(result$body)

    # find out if specific strings occur in the posts
    string <- c(" R ","R ","Coding","Nachhilfe")
    # an empty vector
    important.post <- rep(NA,nrow(result))
    # check whether the strings occur on the webpage
    for(j in 1:length(string)){
      important.post <- grepl(string[j], result$body)
      if(any(important.post)==TRUE){
        beep(sound=3)
        dlgMessage(result[important.post==TRUE,])
      }else{
        print("Nothing interesting")
      }
    }
  }

  # wait until the next run
  sleepTime <- as.numeric(startTime + 100*100 - Sys.time(), units = "secs")
  if (sleepTime > 0)
    Sys.sleep(sleepTime)
}

Visualize spatial data in R

This will be a short tutorial on how to do the following spatial plots.

Fig 1: House prices in Bern (Switzerland)

Working with spatial data means that for each data point you have information on its specific location (e.g. longitude, latitude, city name). If you have this type of data available, it will be easy for you to make nice plots. Spatial data is freely available online (cf. here).

Second, you will need a map of a country as a spatial object. Spatial maps are freely available here. However, you don't need to download the data manually – there is an easy way to access all the information from within R.

R example:

# you will need the following package
require(raster)

# download the map of Switzerland: level 3 means that you want all municipalities
CH  <- getData("GADM",country="Switzerland",level=3)

# now you can just plot the map easily:
plot(CH,col="darkgrey")

You can also visualize your data interactively (cf. here).
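
A minimal sketch of an interactive map with the leaflet package (not necessarily the tool behind the linked example; the coordinates are placeholders, not real house price data):

# plot one example point on an interactive OpenStreetMap map
require(leaflet)

leaflet() %>%
  addTiles() %>%                                  # OpenStreetMap background
  addCircleMarkers(lng = 7.4474, lat = 46.9480,   # a placeholder point in Bern
                   popup = "Example data point")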


R: Merge function

The merge function in R allows the user to combine multiple data frames into a single data frame. Imagine you have information about students (name and average grades) in one data frame (A) and the students' ages in another data frame (B).
The goal, however, is to access ages and average grades together. Therefore, you need to merge the data.

There are four options to merge your data:
Option 1: inner join
Returns only the rows in which the left table has matching keys in the right table.
Option 2: outer join
Returns all rows from both tables, joining records from the left table that have matching keys in the right table.
Option 3: left join
Returns all rows from the left table, and any rows with matching keys from the right table.
Option 4: right join
Returns all rows from the right table, and any rows with matching keys from the left table.

Setup:

# Data frame A
name <- c("Tom","Jack","Johanna","Simon","Dario")
grade <- c(5,5.5,4,6,3.5)
A <- data.frame(name, grade, stringsAsFactors = FALSE)

# Data frame B
name <- c("Tom","Johanna","Lukas")
age <- c(21,22,23)
B <- data.frame(name, age, stringsAsFactors = FALSE)
Inner join:

merge(x = A, y = B, by = "name")

Outer join:

merge(x = A, y = B, by = "name", all = TRUE)

Left outer join:

merge(x = A, y = B, by = "name", all.x = TRUE)

Right outer join:

merge(x = A, y = B, by = "name", all.y = TRUE)
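
For the sample data above, the inner join should return something like this (illustrative output; only the names present in both A and B survive):

#      name grade age
# 1 Johanna     4  22
# 2     Tom     5  21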

Connect RStudio to an Access database (64-bit)

When working with larger datasets, you are frequently confronted with an Access database. In this post, I will show you that it is very easy to connect RStudio to different databases.

First, you need to load the required package (or install it if you don't have it yet).

install.packages("RODC")
require(RODC)

In the next step, you need to set up the connection to your database. This can be done in one line of code:

con<-odbcDriverConnect("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:/Users/lliebi/Dropbox/Daten Luca/PhD/other Projects/Run python in R/Test.accdb")

Notice that for your own purposes you just need to change the path to your file: simply replace the string after DBQ= with your specific file path.

Having connected to the database you will see the connection in the global environment. 

If you want to read data from the database use sqlFetch:

data <- sqlFetch(con, "Tabelle1")

You can also write into the database (e.g. after having done some analysis). This can also be very helpful when you are crawling data from the web: you can save it to the database and read it back from there at the same time.

sqlSave(con,as.data.frame(test), tablename="Table2")
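
When you are done, it is good practice to close the connection again (odbcClose() is also part of the RODBC package):

odbcClose(con)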

I remember when I worked with Access databases for the first time, it took me quite a while to access the data. I hope that this small tutorial will help you when you are confronted with such a problem.


The risks of BASE jumping

BASE jumping has become a widely known extreme sport through media coverage. Influenced by social media, the public associates BASE jumping with extreme danger and a high death rate. This blog tries to shed some light on the dangers of BASE jumping.
As a reference, I recommend the book "The Great Book of BASE".

BASE stands for "Building, Antenna, Span and Earth", and as the name implies, BASE jumpers usually jump from one of these objects. Comparing BASE jumping to skydiving, there are several crucial differences from which additional risks arise:

In skydiving, I usually jump from 14'500ft, whereas in BASE you will be jumping from a much lower altitude (~2'000ft). As the freefall time decreases, you will have less time to deal with problems concerning your body position in freefall and canopy opening issues. As a BASE jumper you will be concerned about your canopy opening. Imagine you're jumping from a cliff and when your canopy opens, it turns towards the cliff. Such scenarios happen from time to time (cf. here). Such a risk does not arise in skydiving, as you are in empty space where you will not hit any buildings, cliffs or antennas. So what does a smooth canopy opening depend on? In skydiving, it depends mostly on how your canopy was packed (which accounts for roughly 70%). The remaining 30% depend on your body position. However, even if you have line twists during a skydive, you will have time to solve such issues as you will be 4000ft above the ground. Furthermore, you will have a reserve parachute that opens faster than your main canopy – so you will have a backup plan when everything goes wrong.

In contrast, in a BASE jump, you have no reserve canopy, open your parachute much lower and have less time to fix issues under your canopy. Because you have less freefall speed in a BASE jump than in a skydive, the opening of your canopy depends roughly 60% on your body position and only 10% on your pack job. Other factors that play a key role are the weather, your equipment, and random factors during your jump.


Financial markets in Switzerland: A network analysis

The term network is often associated with social networks, the Internet or the human brain. More recently, however, network analysis has become an interdisciplinary field of research and can also be applied to model the interdependence of financial markets. In this short blog, I show you how you can implement your own network in R.

As always, you first need some data to work with. I provide some data from Eikon on all listed stocks in Switzerland (you find the data here: SIX-Total Return Prices). The data contains total return prices of some listed stocks in Switzerland.

Step 1: Load the data and calculate returns – this can be easily done in R.

# clean global environment
rm(list=ls())
# packages
require(networkD3)
require(igraph)
library(visNetwork)
# set working directory
setwd("C:/Users/lliebi/Dropbox/Daten Luca/PhD/Research/Network Analysis/Data")
prices <- read.csv("SwitzerlandStocks.csv",sep=";")
# clean the data
prices$Name <- as.Date(as.character(prices$Name),format="%d.%m.%Y")
# calculate returns function
return.calculation <- function(price){
  returns <- rep(NA,length(price))
  for(i in 2:length(price)){
    returns[i] <- (price[i]-price[i-1])/price[i-1]
  }
  return(returns)
}
# create a new dataframe with all the returns
returns<-as.data.frame(apply(prices[,2:ncol(prices)],2,return.calculation))
returns <- cbind(prices$Name,returns)
colnames(returns)[1] <- "Date"

# remove the first row (no return can be calculated for the first observation)
returns <- returns[-1,]

# delete col that contain missing values
final<-returns[colSums(!is.na(returns))==nrow(returns)]

Step 2: Now you can already start with your Network analysis

There are several methodologies that can be used to model the interdependence of stocks (e.g. Granger causality, spillover tables, correlations, ...).
I use a very easy and intuitive measure introduced by Diebold and Yilmaz. Furthermore, a helpful R package is available so that you don't have to calculate the spillover table yourself.

require(frequencyConnectedness)
require(vars)   # provides VARselect() and VAR()
number.stocks <- 50
library(stringr)
colnames(final)[2:number.stocks] <- word(colnames(final)[2:number.stocks],1,sep = "\\.")
# Step 1: Find the correct var model
VARselect(final[,2:number.stocks], lag.max = 2, type = "const")
# you can see that the lowest AIC information criterion is found within lag 2
# therefore specify a VAR(2) model
# Step 2: Implement a VAR model for all the stocks in the sample
var.lag2 <- VAR(final[,2:number.stocks], p = 2, type = "const") # the first column (the date) is excluded
# With this model you can also predict 10 days ahead returns
var.f10 <- predict(var.lag2, n.ahead = 10, ci = 0.95)
# Step 3: Calculate the spillovers
# here use the function in the frequencyConnectedness package
spillover <- spilloverDY09(var.lag2, n.ahead = 10, no.corr = F)
# get the spillover table
solution <- as.data.frame(spillover$tables)*100

Step 3: get the Net Spillover and visualize the Network


# get Net spillovers
net.spillovers <- matrix(NA,nrow=number.stocks-1,ncol=number.stocks-1)
colnames(net.spillovers) <- colnames(solution)
rownames(net.spillovers) <- rownames(solution)

net.spillovers[lower.tri(net.spillovers)] <-solution[lower.tri(solution)]-solution[upper.tri(solution)]
net.spillovers[upper.tri(net.spillovers)] <-solution[upper.tri(solution)]-solution[lower.tri(solution)]
net.spillovers<-ifelse(net.spillovers>0,net.spillovers,0)
# Step 4: Create your own network
m <- t((net.spillovers))
net=graph.adjacency(m,mode="directed",weighted=TRUE,diag=F)
set.seed(1)
plot.igraph(net,
            vertex.label.color = "black",
            edge.color = "darkgrey",
            edge.arrow.size = 0.2,
            layout = layout.fruchterman.reingold,
            edge.curved = F,
            edge.lty = 1,
            frame = F,
            vertex.size = 5,
            vertex.color = rainbow(number.stocks),
            vertex.label.dist = 0.0)
degree(net,mode = "in")
degree(net,mode = "out")

 

If you wish to do another visualization, you can use the "network" package together with ggnet2() from the GGally package:

require(network)
require(GGally)   # provides ggnet2()

links <- as.data.frame(get.edgelist(net))
net <- network(links, directed = TRUE)

# network plot
ggnet2(net, alpha = 0.75, size = 4, edge.alpha = 0.5, color = "black",
       label = T, label.size = 1.5, label.color = "darkgrey")

 

A 3D visualization can be found here: Network Analysis.


Skydiving – Beginner guide

Skydiving has always been on my bucket list and in July 2017 I fulfilled my dream. I did my first solo jump in Alvor, a small village in the south of Portugal. It was an incredible experience and I would like to share it here with you and give you some tips on how you can start skydiving. Today I have over 220 skydives and there are many more to come!

I had just finished my bachelor's in Banking and Finance at the University of Zurich and spontaneously decided to do my skydiving license. After some research on the internet, I discovered the dropzone in Alvor. It is one of the largest and most professional dropzones in Europe. In January 2017 I contacted the dropzone and booked 4 weeks of holiday in Alvor and the Expert Package for 2,861 Euros (cf. here). In total, the package includes 25 jumps, and at the end you will receive the American skydiving licence (USPA).

To receive your skydiving licence you start with the Accelerated freefall (AFF) course. This course is subdivided into 8 levels you need to pass one by one. Each level focuses on single movements in the air with the main goal that you can stabilise yourself in the air, move forward & backwards and be confident in freefall.

Step 1: Accelerated freefall (AFF) – all 8 Levels explained

  • Level 1: Your first skydive!
    This will be the scariest one since it will be the first time you jump out of the plane by yourself – however, it will also be the most exciting one. During the jump, two instructors will hold onto you and you need to perform 3 practice pulls. In other words, you just need to touch the bridle 3 times. This might sound very easy, but under extreme conditions it can be hard, as we are not used to jumping out of planes and falling towards the ground at 200 km/h.
  • Level 2: Refine your body position
    The body position is key for a stable exit and freefall. You will have been practising arching on the ground for hours and now put it into practice in the air. Strong legs, relaxed arms and a tensed butt will result in a smooth arching position. Again, this sounds easy, but during the exit students often forget what to do. I had the feeling that during the first 3 seconds of the skydive my brain shut down and didn't remember that I must arch to get into a stable position.
  • Level 3: Instructors release you
    This will be the first time you will be flying totally by yourself! It is a great feeling but you will realise very fast that you are less stable when the instructors release you.
  • Level 4: Practice 90-degree turns
    On this skydive, you will be accompanied by just 1 instructor. You will practice 90 degree turns in both directions.
  • Level 5: Practice 360-degree turns
    Similar in content to level 4, only this time you will practice turning a full 360 degrees in both directions. You will gain the basic skills required to turn around in free fall.
  • Level 6: Gain confidence in your own stability
    You will perform a "front loop" – a little like a somersault in mid-air – and then regain your stability. You will also practice "tracking" – rapid forward movement designed to create distance between you and other skydivers.
  • Level 7: Putting it all together
    You are soon done with your AFF! In this jump you will perform everything you have learned during the first jumps: You will exit the aircraft, perform a front loop, turn 360 degrees to your left and to your right and then track away from the instructor at the end of the skydive.
  • Level 8: "Hop 'n' Pop"
    This is the least technical jump. You will be jumping from 5000ft (compared to the 14500ft jumped in Level 1-7) and open your parachute a few seconds after the exit.

Finally, you're done with your AFF! When the weather is good and you are a talented skydiver, you will pass all the levels on the first go and within a week! I passed all levels on the first go and needed about 5 days to finish my AFF.

Step 2: Pack your parachute

For the USPA licence, you are required to be able to pack your own parachute. You will spend a whole day learning how to pack it, and you will realise that it is a pain to pack it properly – each pack job will take you around 40 minutes at the beginning!

Step 3: 12 consolidation jumps

Now comes the fun part! You just need to jump out of a plane by yourself another 10 times. At this stage, you will get a better feeling for the time in the air. During the AFF I thought that the 60 seconds of freefall were really short and that there was no time to do anything other than what I was supposed to do. However, time is relative and you will realise that time passes more slowly once you are confident in the air, and you will enjoy the view for the first time.

Step 4: Group jumps

Now it gets serious again. To receive the USPA license you need to be able to jump with other skydivers and perform docks. After you have done 4 group jumps with a maximum of 4 other jumpers, you will realise that jumping in groups is much more fun than jumping solo.

Step 5: Some theory

As with every license you need to do a theory test and some small practical tests (e.g. spotting the airfield, understanding the winds, …) that can be done within a day.

After this, you are a fully fledged skydiver! Congratulations.


Option valuation using Black-Scholes

Financial options have an intrinsic and a time value. The intrinsic value of a call option is simply the spot price (S) minus the strike price (X), floored at zero. The price of the call option can be derived using the Black-Scholes formula, and the option price minus the intrinsic value is the time value of the option.
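
For reference, the Black-Scholes price of a European call (which the code below implements) is

C = S\,N(d_1) - X e^{-rT} N(d_2), \qquad d_1 = \frac{\ln(S/X) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T},

where N(\cdot) is the standard normal distribution function, r the risk-free rate, T the time to maturity and \sigma the volatility.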

The following graph illustrates the intrinsic value (red line), the price of the option (grey line) and the time value of the option (dark grey area).

Using the following code you can replicate the figure:

spot <- seq(1,100,by=1)
strike <- 50
riskfree <- 0
time <- 1
standarddev <- 0.2

d1 <- (log(spot/strike)+(riskfree+standarddev^2/2)*time)/(standarddev*sqrt(time))
d2 <- d1-standarddev*sqrt(time)

value.call <- pnorm(d1,0,1)*spot-pnorm(d2,0,1)*strike*exp(-riskfree*time)

inner.value <- spot-strike
inner.value <- pmax(inner.value,0)

require(ggplot2)

ggplot()+
  geom_line(aes(spot,value.call))+
  geom_line(aes(spot,inner.value),colour="red")+
  geom_ribbon(aes(spot,ymin=value.call,ymax=inner.value),fill="darkgrey")