Enhancing Reproducible Science with GitHub and Docker
Fred Hutchinson Cancer Center via Coursera Specialization
Overview
This Specialization is intended for scientific researchers who work with data and want their analyses to yield consistent results regardless of who conducts the analysis or when it is run. The four topic courses and the capstone course will teach you best practices, help you practice hands-on skills, and provide templates you can adapt to your own research needs. You will learn about code review, version control with Git and GitHub, using containers with tools like Docker to keep computing environments consistent, and using continuous integration/deployment tools like GitHub Actions to automatically run and test your code.
Syllabus
- Course 1: Introduction to Reproducibility in Cancer Informatics
- Course 2: Advanced Reproducibility in Cancer Informatics
- Course 3: Wrangling Computing Environments: Using Docker for Research
- Course 4: Smarter Scientific Software Development with GitHub Actions
- Course 5: Making Science Reproducible - A Capstone Course
Courses
-
The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods. This course is written for individuals who:
- Have some familiarity with R or Python and have written some scripts.
- Have not had formal training in computational methods.
- Have limited or no familiarity with GitHub, Docker, or package management tools.
Motivation
Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones et al., 2017). Reproducibility in cancer informatics (as in other fields) is still not monitored or incentivized, even though it is fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to achieve it effectively. Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This expedites the scientific process by helping researchers avoid false-positive dead ends. Clear, open-source, reproducible methods also save researchers time, so they don't have to reinvent the proverbial wheel for methods that everyone in the field is already performing.
Curriculum
This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses. The course also introduces tools relevant to reproducibility, including analysis notebooks, package managers, Git, and GitHub. The hands-on exercises show learners how to apply reproducible code concepts to their own code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses.
**Goal of this course:** Equip learners with reproducibility skills they can apply to their existing analysis scripts and projects. This course opts for an "ease into it" approach. We attempt to give learners doable, incremental steps to increase the reproducibility of their analyses.
**What is not the goal:** This course is meant to introduce learners to the reproducibility tools, but _it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools_. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally "better practices".
How to use the course
This course is designed with busy professional learners in mind, who may have to pick up and put down the course as their schedule allows. In each exercise, you can either continue with the example files you have been editing in previous chapters, or download fresh chapter files that have already been edited to match the relevant part of the course. This way, if you decide to skip a chapter or find that the files you have been working on no longer make sense, you have a fresh starting point at each exercise.
-
This course introduces tools that help enhance reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to get acquainted with these tools, but it is by no means meant to be a comprehensive dive into them. The course introduces tools and concepts such as Git and GitHub, code review, Docker, and GitHub Actions.
Target Audience
The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research. It is the follow-up to the Introduction to Reproducibility in Cancer Informatics course. Learners who take this course should:
- Have some familiarity with R or Python
- Have taken the Introduction to Reproducibility in Cancer Informatics course
- Have some familiarity with GitHub
Motivation
Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones, 2017). Reproducibility in cancer informatics (as in other fields) is still not monitored or incentivized, even though it is fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to achieve it effectively. Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This expedites the scientific process by helping researchers avoid false-positive dead ends. Clear, open-source, reproducible methods also save researchers time, so they don't have to reinvent the proverbial wheel for methods that everyone in the field is already performing.
Curriculum
The course includes hands-on exercises that show learners how to apply reproducible code concepts to their own code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses.
**Goal of this course:** To equip learners with a deeper knowledge of the capabilities of reproducibility tools and how they can apply them to their existing analysis scripts and projects.
**What is NOT the goal of this course:** To be a comprehensive dive into each of the tools discussed.
How to use the course
This course is designed with busy professional learners in mind, who may have to pick up and put down the course as their schedule allows. In general, you can skip to the chapters you find most useful (the one instance where a prior chapter is required is noted). Each chapter has associated exercises that you are encouraged to complete in order to get the full benefit of the course.
-
The course is intended for individuals in the biomedical sciences who wish to make their work more reproducible through the use of automation. It focuses on the basics of continuous integration and continuous deployment (CI/CD) techniques using GitHub Actions. This course is written for individuals who:
- Are comfortable with GitHub and know how to make a pull request
- Wish to save time and enhance their scientific projects using automation
- Have perhaps tried to learn about GitHub Actions before but felt overwhelmed about how to start
-
The course includes hands-on exercises on how to use, modify, share, and troubleshoot containers for scientific software development purposes.
Goal of this course: Equip learners with the basic skills and confidence to use containers within the context of scientific software analyses.
Expectations: This course is not meant to teach learners how to create complex containers, but instead to introduce learners to the fundamentals of continuous integration and continuous deployment (CI/CD). This course focuses on containers (Docker or Podman) and will not cover any other (perfectly fine) tools for CI/CD.
Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. By recognizing that biological data analysis code is a form of software development, we can try to adapt good development practices to scientific analyses and software contexts. Scientific software projects may include (but aren't limited to):
- Software built as tools to be used by others to analyze biologically derived data
- Code that is built primarily for analyzing one project's data
- Code that is built as a workflow for a series of steps and analyses that might be reused among collaborators or within a lab
- Any scripts and code that are built to handle data in a research setting
- Any scripts and code a researcher might interact with
Containers are one tool among many for creating reproducible analyses. A container is a lightweight, portable, and isolated environment that encapsulates an application and its dependencies, enabling it to run consistently across different computing environments. Many individuals performing analyses on cancer data may not have formal training in software development and may be unfamiliar with the idea of containers.
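To make the idea of a consistent computing environment more concrete, here is a minimal Python sketch. It is not taken from the course materials. It records the interpreter version and installed package versions, then reduces them to a single fingerprint that can be compared across machines or across time. The function names (`environment_snapshot`, `fingerprint`, `differences`) are hypothetical and chosen only for illustration. A container addresses the same problem more thoroughly by packaging the environment itself rather than only describing it.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect the Python version, OS details, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or malformed metadata
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def fingerprint(snapshot):
    """Hash a snapshot so two environments can be compared with one string."""
    canonical = json.dumps(snapshot, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def differences(old, new):
    """Map each package whose version differs to a (before, after) tuple."""
    changed = {}
    for name in sorted(set(old["packages"]) | set(new["packages"])):
        before = old["packages"].get(name)
        after = new["packages"].get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


if __name__ == "__main__":
    snapshot = environment_snapshot()
    print("Environment fingerprint:", fingerprint(snapshot))
```

If a snapshot saved alongside an analysis no longer matches the current fingerprint, `differences` shows which packages changed. This is one simple way to observe that a computing environment is a moving target.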
Unique Features of This Course
- Hands-on exercises exploring real uses of containers for scientific research and software
- Activities that demonstrate common pitfalls when using containers
- Information about how to use two of the most common container tools: Docker and Podman
Key Words
Reproducibility, Containers, Podman, Docker, Scientific Software Development, Biomedical Research
Intended Audience/Required Knowledge
- The course is intended for researchers and research staff who might be interested in learning about using containers to make their research or scientific software more reproducible.
- Some familiarity with biomedical or health-related research, as well as some familiarity with programming (including bash and the command line), is required.
Learning Objectives
- Understand that computing environments are moving targets
- Use containers to share a controlled computing environment
- Pull and use a Docker image hosted online
- Modify a Docker image
- Build a Docker image from scratch
- Troubleshoot the most common Docker-related errors
Accessibility
We are committed to making our content accessible and available to all. We welcome any feedback you might have at https://forms.gle/SzuZjct4ZQyt3Cos7. Questions related to accessibility accommodations should be directed to https://studentserviceportal.force.com/s/.
Taught by
Candace Savonen, MS, Carrie Wright, PhD and Kate Isaac