Reproducible Research II: Practices and tools for managing computations and data

Inria (French Institute for Research in Computer Science and Automation) via France Université Numerique

Details

Provider: France Université Numerique
Pricing: Free Online Course
Languages: English
Effort: 35 hours
Sessions: Self-Paced
Level: Advanced

Following the success of the MOOC "Reproducible research: methodological principles for transparent science", the authors continue exploring reproducibility with a focus on massive data and complex calculations. These two MOOCs complement each other and offer a coherent training program on the subject.

In this second MOOC, you will learn how to manage large datasets and complex computations in controlled software environments, using formats such as JSON, FITS, and HDF5, platforms such as Zenodo and Software Heritage, and tools such as git-annex, Docker, Singularity, Guix, make, and Snakemake. Key concepts are introduced and applied through numerous hands-on exercises and a real-life use case on sunspot detection, demonstrating how to work in a reliable and reproducible way.

A new module for this session offers exercises illustrating how the tools and techniques we teach help in the daily practice of computational research. Interviews with experienced practitioners of reproducible research also discuss related tools, helping you decide whether to invest in more elaborate tools and which pitfalls you may encounter.

Syllabus

Preparing for the mountain hike to reproducibility

Interviews with astronomers about sunspot detection
Getting started with JupyterLab and the sunspot time series
Sunspot Time Series: Exercises
Reproducibility and research software communities

Module Managing data

Archiving
File formats
Project Organization
Git Annex

Module Managing software

On the Importance of Software Environment
Package Management Principles
Isolation and Containers
Using Containers
Building and Sharing Containers
Functional Package Managers (Guix, Docker, Singularity...)

Module Managing computations

Why do we need workflows?
From notebooks to shell scripts
Workflows with make
Workflows with Snakemake
Workflows and environments

Module Reproducibility in the large

Getting familiar with the Sunspot project
Checking the reproducibility of computations
Checking the robustness of the workflow to a variation on the software environment
Injecting new data
Investigating specific aspects of the data
Parameterizing our workflow to evaluate parameter sensitivity
Interviews with experts