Chapter 1 Introduction

This book is targeted at practitioners of veterinary epidemiology, and aims to provide pathways for building evidence-based tools for decision makers in the surveillance and control of animal diseases and pests.

The book has been written by veterinary epidemiologists at Ausvet (ausvet.com.au), who are engaged as consultants to government and business in delivering better animal health outcomes with regard to animal pests and diseases. It covers the practical issues faced by epidemiologists in dealing with messy and ‘big’ organisational data when delivering sophisticated analytical products to clients, all within an evidence-based framework for decision making. Ausvet has previously produced several publications addressing data management in a working context (Cameron, Sergeant, and Baldock 2004; Sergeant and Perkins 2015). However, the tools and strategies for sourcing data and dealing with data quality issues have changed dramatically. Principally, much of this work may now be undertaken in the R environment. For more advanced data management, R is but one tool among many within a stack customised to address specific epidemiological management questions. Other tools may include database management systems, Amazon Web Services, dashboards, Python, and containers from the DevOps ecosystem, all of which increase the speed, simplicity, and scalability of the information feedback required for effective decision making (as argued more generally in this Harvard Business Review article, for example).

1.1 Skill requirements

This book assumes you have intermediate R skills, even if some of the approaches presented may appear novel. If the R code presented here is difficult to comprehend, we suggest you try one of the many free tutorials available online. For example, Dataquest provides some suitable pathways to becoming proficient in R. Much current R usage is in the Tidyverse dialect, which emphasises the manipulation of data as a series of table joins and operations. Base R is somewhat more fluid but less well defined, and provides data typing, statistical analysis tools, matrix operations (R is first and foremost an engine for linear algebra), and loop and conditional constructs that are useful for building novel operations from first principles. The intent here is to provide Tidyverse code as far as practical, although our background in R predates the Tidyverse.
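
As a brief illustration of the two dialects (the herd data frame below is invented for the example):

```r
library(dplyr)  # part of the tidyverse

# A made-up herd-level data set, for illustration only
herds <- data.frame(
  herd_id = 1:6,
  region  = c("north", "north", "south", "south", "east", "east"),
  cases   = c(0, 3, 1, 5, 2, 0)
)

# Tidyverse dialect: a pipeline of table operations
herds %>%
  filter(cases > 0) %>%
  group_by(region) %>%
  summarise(total_cases = sum(cases))

# Base R equivalent, built from first principles
aggregate(cases ~ region, data = herds[herds$cases > 0, ], FUN = sum)
```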

There is of course plenty of material online on epidemiology with R. For example, the work in progress of Tomás Aragón on population health data science provides an easy introduction to setting up R and RStudio (an integrated development environment for R). The not-for-profit R Epidemics Consortium (REC) is also releasing, on a continuing basis, a suite of R packages directed at different epidemiological activities and data. One view of different epidemiological packages beyond the REC ecosystem, put together by the REC itself, may be located here. A view put together by Ausvet is located here.


You will be exposed to some of these packages in this book, as the case studies at hand allow.

1.2 Structure of this book

This book starts with the implementation of a reproducible workflow. Veterinary epidemiology is a high-stakes game, with the value of animal production for any one national industry tending towards billions of dollars. Any recommendation that a practitioner provides needs to be both defensible and reproducible, that is, supported by evidence of the correctness of your results. This evidence concerns the cradle-to-grave life cycle of the data you have utilised, and the code by which you have transformed the raw, untamed data into an analytical result supporting a scientific recommendation. Something as mundane as not setting a random seed (a fixed point from which a sequence of random numbers is generated) before randomly sampling data can come back to haunt you, whether in terms of one’s reputation among professional peers and clients or in the judgement of a court of law when one is called on as an expert witness. In the chapter on reproducible workflow we provide a method (one of many) to set up the R environment, a code repository, and a range of output documents so as to enable reproducibility of work. This method has served us well for at least 12 years, and is consistent with best software programming practices. Moreover, there are workflow efficiencies to be gained when one can automatically regenerate a report after a small update to the original data, and when code may be reused by others within and beyond an organisation.
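
As a minimal sketch of the random seed point (the sampling frame of herd identifiers below is hypothetical):

```r
# Without a seed, each run draws a different sample, so the
# analysis cannot be reproduced exactly.
herd_ids <- 1:1000          # hypothetical sampling frame of herd identifiers

set.seed(42)                # fix the random number generator state
audit_sample <- sample(herd_ids, size = 30)

# Re-running from the same seed reproduces the identical sample,
# which is the evidence trail a reviewer (or court) may ask for.
set.seed(42)
identical(audit_sample, sample(herd_ids, size = 30))  # TRUE
```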

The following section on data management includes three chapters, focussing on the manipulation of data (data munging), interactions with databases and other data stores, and the manipulation of spatial data. In the chapter on data munging we introduce you to the coal-face of epidemiological work: each ‘client’ will have different questions, and will possess novel data sources of differing quality and relevance. These data sources will need to be melded, along with functionality drawn from different R ‘packages’ of functions. Once the program logic of a project is defined, and project activities are planned to deliver a solution to a question, the first task is invariably federating the data, cleaning it (that is, engaging in data quality control), and shaping it so that it is amenable to analysis. The whole process of federating, cleaning and shaping may be termed data munging (or data wrangling, as it is often called). This is unless one is in the enviable position of controlling the workflow of data production from beginning to end, so as to assure data quality and maximise efficiency in generating a data set structured for statistical analysis. Developing a good nous for data quality issues, and for how to manipulate data to arrive at the target data structure, is essential for data munging, and requires some fluency and ease in coding R.
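
The following sketch gives a hypothetical taste of data munging; the tables, field names and quality problems are all invented:

```r
library(dplyr)
library(tidyr)

# Two invented source tables with typical quality problems
farms <- tibble::tibble(
  farm_id = c("A1", "A2", "A3"),
  region  = c("North", "north ", NA)    # inconsistent case, stray space, missing
)
tests <- tibble::tibble(
  farm_id = c("A1", "A1", "A2", "A4"),  # A4 has no matching farm record
  result  = c("pos", "neg", "neg", "pos")
)

# Federate, clean and shape into an analysis-ready table
farms %>%
  mutate(region = tolower(trimws(region))) %>%   # standardise the region field
  right_join(tests, by = "farm_id") %>%          # keep all test records
  replace_na(list(region = "unknown")) %>%
  count(region, result)
```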

Epidemiological data are more often than not ‘big’. Hence, Chapter 4 introduces databases and the SQL language, S3 data stores on Amazon Web Services (AWS), and other systems designed for managing big data. Invariably, these data stores will have a well-established API (application programming interface) for which an R wrapper has been written, allowing the user to access data from the store. We show how requests for data may be made from inside R so that data munging, and in particular data federation, may take place. As a starting point, users of this guide should be able to work through online texts such as that of Smith et al., and have a reasonable grasp of database principles.
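
As a minimal sketch of querying a data store from inside R, here using the DBI package with an in-memory SQLite database standing in for an organisational data store (the table and its contents are invented):

```r
library(DBI)

# An in-memory SQLite database stands in for a remote data store
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load an invented table, then query it back with SQL
dbWriteTable(con, "outbreaks",
             data.frame(region = c("north", "south", "south"),
                        cases  = c(12, 3, 8)))

dbGetQuery(con, "SELECT region, SUM(cases) AS total_cases
                 FROM outbreaks GROUP BY region")

dbDisconnect(con)
```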

Chapter 5 addresses the fact that epidemiological data are invariably spatial, relating to animals and people located in space and time. These data are increasingly being generated by an internet of things, as automated sensors are deployed to monitor those animals, people and their environments. Also, as consultant data is private, yet often contains many useful examples of what may go ‘wrong’ with data, in this chapter we emulate such data by generating a data ‘fable’. The reader may follow how a mythical archipelago of nation states called Atlantis emerges from the centre of the Indian Ocean, populated with both people and animals, and the diseases and pests unique to that archipelago. Atlantis then forms the context of all the case studies presented in this book.
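
As a small sketch of working with such spatial data using the sf package (the sensor sites and coordinates below are invented):

```r
library(sf)

# Invented sensor readings with longitude/latitude coordinates
sensors <- data.frame(
  site  = c("S1", "S2", "S3"),
  lon   = c(72.1, 72.4, 72.3),
  lat   = c(-5.2, -5.5, -5.1),
  cases = c(0, 4, 1)
)

# Promote to a spatial (sf) object with a WGS84 coordinate reference system
sensors_sf <- st_as_sf(sensors, coords = c("lon", "lat"), crs = 4326)

# Spatial operations then become available, e.g. pairwise distances in metres
st_distance(sensors_sf)
```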


The data management section is followed by the analysis tools section, where we introduce the reader to our stack of statistical analysis tools, built around the mixed effects regression model. This stack is primarily motivated by the need to return inference within short project timeframes, be flexible enough to address novel hypotheses, and ensure the defensibility of the approach to inference. As such, we briefly survey the typical issues found in epidemiological data, particularly observational data, where a data set may be subject to a combination of problems. Such data may:

  • be unbalanced
  • be ‘over-dispersed’ relative to common model assumptions
  • exhibit non-linear responses
  • contain missing data (which is dealt with in the chapter on data munging)
  • have several strata of uncertainty, or other forms of dependence among the observations, including spatial and temporal autocorrelation
  • not be orthogonal in their design, despite interactions between explanatory terms likely being present
  • exhibit collinearity among explanatory terms
  • span large model spaces (i.e., many possible terms may be included in the model)
  • carry uncertainty around the identifiability of the model ‘in truth’

Our regression stack therefore extends the multiple linear regression model to consider zero-inflation, random effects, temporal and spatial autocorrelation, non-linear responses (e.g., binary and count data), model averaging, non-linear regression terms within the generalised additive model (GAM) framework, a practical approach to model diagnostics, and inference based on model predictions (i.e., marginal means). Although for ease of computation we apply a frequentist solution, we also discuss the application of an approximate Bayesian approach in terms of the R-INLA package.
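
The following sketch gives a flavour of the stack, here using the glmmTMB package as one of several possible fitting engines; the simulated herd data and effect sizes are entirely invented:

```r
library(glmmTMB)

# Simulated herd-level count data (entirely invented, for illustration)
set.seed(1)
d <- data.frame(
  herd       = factor(rep(1:20, each = 10)),
  vaccinated = rbinom(200, 1, 0.5)
)
d$cases <- rnbinom(200, mu = exp(1 - 0.8 * d$vaccinated), size = 1.5)
d$cases[rbinom(200, 1, 0.2) == 1] <- 0   # inject excess zeros

# Zero-inflated negative binomial with a herd random intercept: one model
# that copes with over-dispersion, excess zeros, and dependence among
# observations from the same herd
fit <- glmmTMB(cases ~ vaccinated + (1 | herd),
               ziformula = ~ 1,
               family    = nbinom2(),
               data      = d)
summary(fit)
```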

The analysis tools section is followed by a case study section, in which these tools are applied to the diseases and pests of Atlantis.

1.2.1 Dashboards

Here we also introduce interactive reporting via R Markdown and the flexdashboard framework for generating epidemiological dashboards.
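
A minimal sketch of a flexdashboard, written and rendered from R (the file name, titles and placeholder chart are hypothetical; this assumes the flexdashboard package and pandoc are available):

```r
# Sketch: write a minimal flexdashboard source file and render it
writeLines(c(
  '---',
  'title: "Atlantis surveillance dashboard"',
  'output: flexdashboard::flex_dashboard',
  '---',
  '',
  'Column',
  '-----------------------------------------------------------------------',
  '',
  '### Reported cases',
  '',
  '```{r}',
  'plot(cars)  # placeholder chart',
  '```'
), "dashboard.Rmd")

rmarkdown::render("dashboard.Rmd")  # produces dashboard.html
```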

1.3 About Ausvet

Ausvet provides a multidisciplinary team of highly skilled and experienced veterinary epidemiologists. This multidisciplinary team enables us to develop world-leading national, regional and global health information systems that support the development and conduct of disease surveillance and control programs. Our team includes sociological expertise, to understand the needs and motivations of the hugely diverse people we seek to help (people being amongst the most complex part of any epidemiological response); statistical expertise; information technology expertise, as monitoring of and response to disease is increasingly automated, leading to big data and dashboards; and, of course, extensive veterinary and population health expertise. More details on Ausvet may be found here.

Ausvet has also produced a suite of epidemiological analysis tools that are freely available online at epitools. The underlying R code for these tools is available within the epiR package, produced in conjunction with the University of Melbourne. epiR also integrates functionality initially created in Ausvet’s previous R package, RSurveillance.
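
For example, a minimal use of epiR to evaluate a diagnostic test against a gold standard (the two-by-two counts below are invented):

```r
library(epiR)

# Invented 2 x 2 table: test result (rows) versus true status (columns),
# with test-positive/disease-positive counts in the top-left cell
dat <- as.table(matrix(c(670, 202,
                          74, 640),
                       nrow = 2, byrow = TRUE))

# Sensitivity, specificity, predictive values and their confidence limits
epi.tests(dat, conf.level = 0.95)
```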

1.4 TO-DO

  • Add extra material on the tidyverse dialect, and state all R computations within it.
  • Write a task view for epidemiology: https://cran.r-project.org/web/packages/ctv/vignettes/ctv-howto.pdf
  • Translate to other languages: https://cran.r-project.org/web/packages/translateR/translateR.pdf
  • Cross-check against: http://learnr.web.unc.edu

Notes

  • Blue highlighted text indicates a hyperlink linking a term to reference material online.
  • A list of the R libraries used here appears in the appendix, including: geosptdb, sf, rmarkdown, knitr, and tidyverse.

References

Cameron, Angus, Evan Sergeant, and Chris Baldock. 2004. Data Management for Animal Health. Vol. 1. Brisbane: AusVet Animal Health Services.

Sergeant, E. S. G., and N. R. Perkins. 2015. Epidemiology for Field Veterinarians. Wallingford, UK: CABI. https://www.cabi.org/bookshop/book/9781845936914/.