Scraping Austrian COVID19 Data with R

The Austrian government has made it very hard to obtain COVID-19 data that can be compared over time. Initially the health ministry started with a website where the total numbers of cases and deaths were hand-edited into an HTML file twice a day. Later, figures for some regions were added. Some regions also began to report the number of positive tests for their districts (Bezirke) independently, while others explicitly refused to do so. About two weeks ago I started writing e-mails (to the government, the ministry, and several institutions) asking for a commitment to the open (government) data policy, pointing out the advantages of open data and citing stories and examples from other countries. The Austrian authorities usually do follow this policy, for instance when it comes to election data. The ideal scenario would be a repository on GitHub with daily updates, as in Italy. That would allow journalists, scientists, and anyone else interested to analyse the data and create coherent pictures across space and time. The responses from the authorities were either negative or non-existent.

A few days ago the ministry of health launched a dashboard that reports COVID-19 cases by region and district (not tests or the number of deaths, though). It follows the same policy as before, i.e. the figures are simply updated in place and no downloads or comparisons across time are provided (except for the country-level data, which are publicly available from the WHO or the ECDC in any case). Still, we can now record the time dimension ourselves by scraping the data from the dashboard at regular intervals. Here is a quick tutorial on how to do exactly that.

This is one of the cases where it really pays off to first analyse the bits and pieces of the website instead of blindly reading in the whole site line by line and extracting from the resulting soup of data. The dashboard is subdivided into several unique JavaScript pieces. It turns out the data we want can be found in a simple *.js file. Behind the JavaScript object lies plain JSON which contains all the information we can also find in the dashboard: the name of each district and the number of COVID-19 cases. It is straightforward to read this with R:

json_file <- RCurl::getURL("https://info.gesundheitsministerium.at/data/Bezirke.js")
json_file <- stringr::str_split_fixed(json_file, "\\=", 2)[2] # strip the variable assignment at the beginning
json_file <- stringr::str_split_fixed(json_file, "\\;", 2)[1] # strip the trailing semicolon and garbage
json_file <- as.data.frame(jsonlite::fromJSON(json_file))     # parse the remaining JSON into a data frame
names(json_file) <- c("bezirk", "faelle")                     # set readable column names

Then we create a clean dataframe that also contains time information: a timestamp for the day and, just to be on the safe side since the page gets updated at irregular intervals, one for the exact time as well:

df.new <- data.frame(id     = 1:nrow(json_file),               # unique id
                     bezirk = json_file$bezirk,                # district name
                     datum  = rep(Sys.Date(), nrow(json_file)), # date stamp
                     time   = rep(Sys.time(), nrow(json_file)), # exact time stamp
                     faelle = json_file$faelle)                # number of cases

And we are done. We now have a clean dataframe in long format that we can run analyses on. It just has to be written out as a .csv file (or any other format of your choice). If you want to do the exercise for regions instead, just replace the URL with https://info.gesundheitsministerium.at/data/Bundesland.js.
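Writing the snapshot out is a one-liner; a minimal sketch (the path is just a placeholder and matches the one used for reading the file below):

readr::write_csv(df.new, "/path/to/file/AUT_bezirke_scraped.csv") # store today's snapshot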

Ideally we also want a time dimension, of course, and would thus grow the data over time with, say, one snapshot a day, taken at the same time each day so that slope and level changes remain comparable.

A simple and straightforward way to automate this is to first load the “old” data that we wrote out the day before:

df.old <- readr::read_csv("/path/to/file/AUT_bezirke_scraped.csv")
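Note that on the very first run there is no old file yet, so this read would fail. One minimal guard, assuming the script is run non-interactively (e.g. via Rscript), is to check for the file first and, if it is missing, store the snapshot and stop:

# very first run: no old file yet, so just store the snapshot and stop here
if (!file.exists("/path/to/file/AUT_bezirke_scraped.csv")) {
  readr::write_csv(df.new, "/path/to/file/AUT_bezirke_scraped.csv")
  quit(save = "no")
}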

Then we check with a simple condition whether there were any changes at all. If there were, we stitch the old and new frames together (the Python equivalent to this would be “append”):

if (sum(abs(tail(df.old$faelle, nrow(df.new)) - df.new$faelle)) == 0) { # compare the most recent old snapshot with the new one
  # if the difference is 0, nothing has changed
  cat("No changes compared to last period. Nothing stored.")
} else {
  df.new <- dplyr::bind_rows(df.old, df.new)
  readr::write_csv(df.new, "/path/to/file/AUT_bezirke_scraped.csv")
}

Now the old csv will be overwritten and we don’t clutter up our folder with too many files. Of course we do not want to do this by hand every day, so we want the full .R file to be executed by automation software so that we don’t have to worry about it anymore. My recommendation would be crontab. A good source that explains how its syntax works can be found here.
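As a rough sketch, a crontab entry that runs the scraper once a day at 6 p.m. could look like the line below (the script name and the time are placeholders, and Rscript is assumed to be on the PATH):

# run the scraping script every day at 18:00
0 18 * * * Rscript /path/to/file/scrape_covid_bezirke.R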

