Abstract:
A cornerstone of data-driven empirical research is reproducibility. The credibility of an analysis or a forecasting system rests on the promise that the entire analysis process can be reproduced by an independent party, yielding similar results. Modern data scientists face the challenge of keeping their results reproducible while the software infrastructure required to compute and adequately present those results becomes increasingly complex. This tutorial is geared towards novice and intermediate data scientists who want to improve the reproducibility of their results. To this end, well-established tools and procedures from software development are applied to a data analysis workflow. In particular, the tutorial will cover the use of markdown and the ‘knitr’ package for combining code, results, and description (literate programming), ‘make’ for organizing and automating complex build processes in a data analysis, git for version control and collaboration, and finally container technology (Docker) for isolating an entire data analysis, including the underlying operating system. The tutorial will introduce each of these technologies in terms of its particular role in making a data analysis more reproducible, and is thus suitable for an audience with little or no experience in any or all of them. The example analyses will be written in R and Python.