Using rvest and dplyr to scrape data and look at aviation incidents
For a project I recently faced the issue of getting a database of all aviation incidents. As I really wanted to try Hadley’s new rvest
-package, I thought I will give it a try and share the code with you.
The data of aviation incidents starting in 1919 from the Aviation Safety Network can be found here: Aviation Safety Network
First, we needed to install and load the rvest
-package, as well as dplyr
, which I love for removing lots of messy code (if you are unfamiliar with the piping
-operator %>%
have a look at this description: Introduction to Piping in R
# install.packages("rvest")
# install.packages("dplyr")
library(rvest)
## Loading required package: xml2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let’s try out some functions of rvest
.
Say we want to read all incidents that happened in the year 1920: Aviation Incidents 1920. We need to find the right html table to download and the link to it, to be more precise, the XPath. This can be done by using “inspect element” (right-click on the table, inspect element, right click on the element in the code and “copy XPath”). In our case the XPath is
"//*[@id="contentcolumnfull"]/div/table"
.
To load the html data to R we can use:
url <- "http://aviation-safety.net/database/dblist.php?Year=1920"
# load the html code to R
incidents1920 <- url %>% read_html()
# filter for the right xpath node
incidents1920 <- incidents1920 %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table')
# convert to a data.frame
incidents1920 <- incidents1920 %>% html_table() %>% data.frame()
# or in one go
incidents1920 <- url %>% read_html() %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
html_table() %>% data.frame()
Which gives us a small data.frame of 4 accidents.
But what happens if we have more than one page of data per year? We certainly don’t want to paste everything by hand. Take 1962 for example Accidents in 1962, which has 3 pages. Luckily we can get the number of pages by using rvest
as well.
We follow the steps above to get the number of pages per year with the XPath “//*[@id="contentcolumnfull"]/div/div[2]
”, with some cleaning we get the maximum pagenumber as:
url <- "http://aviation-safety.net/database/dblist.php?Year=1962"
pages <- url %>% read_html() %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
html_text() %>% strsplit(" ") %>% unlist() %>%
as.numeric() %>% max()
pages
## [1] 3
Now we can write a small loop to get all incidents of 1962, as the link changes with the page number, ie from: http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=1 to http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2
The code for the loop looks like this:
# initiate empty data.frame, in which we will store the data
dat <- data.frame(
date = numeric(0), type = numeric(0), registration = numeric(0),
operator = numeric(0), fatalities = numeric(0),
location = numeric(0), category = numeric(0)
)
# loop through all page numbers
for (page in 1:pages){
# create the new URL for the current page
url <- paste0(
"http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=",
page
)
# get the html data and convert it to a data.frame
incidents <- url %>% read_html() %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
html_table() %>% data.frame()
# combine the data
dat <- rbind(dat, incidents)
}
# quick look at the dimens
dim(dat)
## [1] 236 9
head(dat)
## date type registration operator fat.
## 1 02-JAN-1962 Douglas C-47 (DC-3) EP-ABB Iran Air 0
## 2 02-JAN-1962 Convair C-131E (CV-440) 55-4750 USAF
## 3 06-JAN-1962 Curtiss C-46A HR-TNB TAN 1
## 4 08-JAN-1962 Lockheed RB-69A Neptune 54-4038 Republic of China AF 14
## 5 10-JAN-1962 Douglas C-47 (DC-3) Air National Guard 5
## 6 10-JAN-1962 Douglas C-47A (DC-3) 42-92078 USAF
## location Var.7 pic cat
## 1 Kabul-Khwaja... NA NA A1
## 2 NA NA U1
## 3 Belize City-... NA NA A1
## 4 Korea Bay NA NA A1
## 5 South Park T... NA NA A1
## 6 unknown NA NA A1
which gives us a data.frame consisting of 211 incidents of the year 1962.
Lastly, we can write a loop to gather the data over multiple years:
# set-up of initial values
startyear <- 1960
endyear <- 1965
url_init <- "http://aviation-safety.net/database/dblist.php?Year="
# initiate empty dataframe, in which we will store the data
dat <- data.frame(
date = numeric(0), type = numeric(0), registration = numeric(0),
operator = numeric(0), fatalities = numeric(0),
location = numeric(0), category = numeric(0)
)
for (year in startyear:endyear){
# get url for this year
url_year <- paste0(url_init, year)
# get pages
pages <- url_year %>% read_html() %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
html_text() %>% strsplit(" ") %>% unlist() %>%
as.numeric() %>% max()
# loop through the pages
for (page in 1:pages){
url <- paste0(url_year,"&lang=&page=", page)
# get the html data and convert it to a data.frame
incidents <- url %>% read_html() %>%
html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
html_table() %>% data.frame()
# combine the data
dat <- rbind(dat, incidents)
}
}
dim(dat)
## [1] 1363 9
head(dat)
## date type registration operator
## 1 03-JAN-1960 Douglas C-47A (DC-3) VT-CGG Indian Airlines
## 2 03-JAN-1960 Lockheed L-749A Constellation N110A Eastern Air Lines
## 3 04-JAN-1960 Curtiss C-46A PP-SLJ T.A. Salvador
## 4 04-JAN-1960 U-1A Otter (DHC-3) 55-2974 US Army
## 5 05-JAN-1960 Vickers 701 Viscount G-AMNY BEA
## 6 06-JAN-1960 Douglas DC-6B N8225H National Airlines
## fat. location Var.7 pic cat
## 1 9 near Taksing NA NA A1
## 2 0 Philadelphia... NA NA A1
## 3 0 Dianopolis A... NA NA A1
## 4 10 Gulf of Sirte NA NA A1
## 5 0 Malta-Luqa A... NA NA A1
## 6 34 near Bolivia, NC NA NA C1
In the years 1960-1965 there were 1.363 recorded aviation incidents, which we can now use in R
.