Using rvest and dplyr to scrape data and look at aviation incidents

For a project I recently faced the issue of getting a database of all aviation incidents. As I really wanted to try Hadley’s new rvest-package, I thought I will give it a try and share the code with you.

The data of aviation incidents starting in 1919 from the Aviation Safety Network can be found here: Aviation Safety Network

First, we needed to install and load the rvest-package, as well as dplyr, which I love for removing lots of messy code (if you are unfamiliar with the piping-operator %>% have a look at this description: Introduction to Piping in R

# install.packages("rvest")
# install.packages("dplyr")
library(rvest)
## Loading required package: xml2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Let’s try out some functions of rvest.

Say we want to read all incidents that happened in the year 1920: Aviation Incidents 1920. We need to find the right html table to download and the link to it, to be more precise, the XPath. This can be done by using “inspect element” (right-click on the table, inspect element, right click on the element in the code and “copy XPath”). In our case the XPath is "//*[@id="contentcolumnfull"]/div/table". To load the html data to R we can use:

url <- "http://aviation-safety.net/database/dblist.php?Year=1920"

# load the html code to R
incidents1920 <- url %>% read_html() 

# filter for the right xpath node
incidents1920 <- incidents1920 %>% 
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') 

# convert to a data.frame
incidents1920 <- incidents1920 %>% html_table() %>% data.frame()

# or in one go
incidents1920 <- url %>% read_html() %>% 
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>% 
  html_table() %>% data.frame()

Which gives us a small data.frame of 4 accidents.

But what happens if we have more than one page of data per year? We certainly don’t want to paste everything by hand. Take 1962 for example Accidents in 1962, which has 3 pages. Luckily we can get the number of pages by using rvest as well.

We follow the steps above to get the number of pages per year with the XPath “//*[@id="contentcolumnfull"]/div/div[2]”, with some cleaning we get the maximum pagenumber as:

url <- "http://aviation-safety.net/database/dblist.php?Year=1962"

pages <- url %>% read_html() %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
  html_text() %>% strsplit(" ") %>% unlist() %>%
  as.numeric() %>% max()

pages
## [1] 3

Now we can write a small loop to get all incidents of 1962, as the link changes with the page number, ie from: http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=1 to http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2

The code for the loop looks like this:

# initiate empty data.frame, in which we will store the data
dat <- data.frame(
  date = numeric(0), type = numeric(0), registration = numeric(0),
  operator = numeric(0), fatalities = numeric(0),
  location = numeric(0), category = numeric(0)
)

# loop through all page numbers
for (page in 1:pages){
  # create the new URL for the current page
  url <- paste0(
    "http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=", 
    page
  )
  
  # get the html data and convert it to a data.frame
  incidents <- url %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
    html_table() %>% data.frame()

  # combine the data
  dat <- rbind(dat, incidents)
}

# quick look at the dimens
dim(dat)
## [1] 236   9
head(dat)
##          date                    type registration             operator fat.
## 1 02-JAN-1962     Douglas C-47 (DC-3)       EP-ABB             Iran Air    0
## 2 02-JAN-1962 Convair C-131E (CV-440)      55-4750                 USAF     
## 3 06-JAN-1962           Curtiss C-46A       HR-TNB                  TAN    1
## 4 08-JAN-1962 Lockheed RB-69A Neptune      54-4038 Republic of China AF   14
## 5 10-JAN-1962     Douglas C-47 (DC-3)                Air National Guard    5
## 6 10-JAN-1962    Douglas C-47A (DC-3)     42-92078                 USAF     
##          location Var.7 pic cat
## 1 Kabul-Khwaja...    NA  NA  A1
## 2                    NA  NA  U1
## 3 Belize City-...    NA  NA  A1
## 4       Korea Bay    NA  NA  A1
## 5 South Park T...    NA  NA  A1
## 6         unknown    NA  NA  A1

which gives us a data.frame consisting of 211 incidents of the year 1962.

Lastly, we can write a loop to gather the data over multiple years:

# set-up of initial values
startyear <- 1960
endyear <- 1965
url_init <- "http://aviation-safety.net/database/dblist.php?Year="

# initiate empty dataframe, in which we will store the data
dat <- data.frame(
  date = numeric(0), type = numeric(0), registration = numeric(0),
  operator = numeric(0), fatalities = numeric(0),
  location = numeric(0), category = numeric(0)
  )

for (year in startyear:endyear){
  # get url for this year
  url_year <- paste0(url_init, year)

  # get pages
  pages <- url_year %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
    html_text() %>% strsplit(" ") %>% unlist() %>%
    as.numeric() %>% max()

  # loop through the pages
  for (page in 1:pages){
    url <- paste0(url_year,"&lang=&page=", page)

    # get the html data and convert it to a data.frame
    incidents <- url %>% read_html() %>%
      html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
      html_table() %>% data.frame()

    # combine the data
    dat <- rbind(dat, incidents)
  }
}

dim(dat)
## [1] 1363    9
head(dat)
##          date                          type registration          operator
## 1 03-JAN-1960          Douglas C-47A (DC-3)       VT-CGG   Indian Airlines
## 2 03-JAN-1960 Lockheed L-749A Constellation        N110A Eastern Air Lines
## 3 04-JAN-1960                 Curtiss C-46A       PP-SLJ     T.A. Salvador
## 4 04-JAN-1960            U-1A Otter (DHC-3)      55-2974           US Army
## 5 05-JAN-1960          Vickers 701 Viscount       G-AMNY               BEA
## 6 06-JAN-1960                 Douglas DC-6B       N8225H National Airlines
##   fat.         location Var.7 pic cat
## 1    9     near Taksing    NA  NA  A1
## 2    0  Philadelphia...    NA  NA  A1
## 3    0  Dianopolis A...    NA  NA  A1
## 4   10    Gulf of Sirte    NA  NA  A1
## 5    0  Malta-Luqa A...    NA  NA  A1
## 6   34 near Bolivia, NC    NA  NA  C1

In the years 1960-1965 there were 1.363 recorded aviation incidents, which we can now use in R.

David Zimmermann, PhD
David Zimmermann, PhD
Data Scientist

I am an economist by training, turned programmer/data scientist who loves to program with R, Python, and C++.