Overview

This post covers methods to help make your code reproducible. Nothing here is absolutely critical like the topics in the previous post, but each of these practices will increase the probability that others can reproduce your findings. Feel free to pick and choose which ones to implement; each will make your code better. However, as the title says, you should do all of them if you can.

Today’s post is the second in a trilogy of posts about reproducible science. The previous post can be found here. As always, if anything is missing or incorrect, DM me on Twitter and I’ll fix it.

Runnable Code

One key to reproducibility is making sure that others can run your code. This section is about how to make your code runnable by people who are not you on computers that are not yours.

Have a Science Button

Photo Creds: Kim Possible

Every mad scientist knows that the culmination of their work should have a large button labeled “Self Destruct”. As (mostly) sane scientists, we should aspire to have a button labeled “Do Science” instead. The ideal to work toward is a single script that you can run to do your entire project.

“But wait Ben, I’m reading the sections in reverse order! Doesn’t the code organization section say to split code into smaller files?”

That’s correct. The trick is to have all of your setup and analysis scripts strung together.

The simplest way to combine everything is to make a single bash script that runs all of your individual scripts one by one. While a bash script is a perfectly valid way of doing things, programs called workflow management systems are designed to make stringing together code easier.

Workflow management systems remove inefficiencies that you might run into with a bash script. For example, if some of the samples in your analysis fail to run correctly, workflow management systems make it simple to rerun the analysis on only the failed samples. They also make it easier to iterate through groups of files with similar names.

Snakemake and Nextflow are two examples of workflow management systems designed with biology in mind. Galaxy covers a more limited set of tasks, but is useful if you would rather work in a graphical environment than with scripts. A minimal Snakemake example is sketched below.
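
To make the idea concrete, here is a minimal Snakemake sketch. The file, script, and URL names are hypothetical, but the structure is real: each rule declares its inputs and outputs, and Snakemake only reruns the rules whose outputs are missing or out of date.

# Snakefile (hypothetical file and script names, for illustration only)
rule all:
    input:
        "output/gc_content.png"

rule download_data:
    output:
        "data/raw/sample_1.fasta"
    shell:
        "curl -o {output} https://example.com/sample_1.fasta"

rule calculate_gc:
    input:
        "data/raw/sample_1.fasta"
    output:
        "output/gc_content.png"
    shell:
        "python code/calculate_gc.py {input} {output}"

Running snakemake from the repository root builds everything in order, and that single command becomes your science button.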

Avoid Hardcoded Paths

If my old code is any reliable estimate, manually setting (hardcoding) paths is the most common way to make code unrunnable. While in_file = open('/home/ben/Desktop/data.txt') worked when I first wrote it, I no longer have the patience to open my code and modify all the paths each time I want to run it somewhere new.

There are three main ways to avoid hardcoding paths, each of which has pros and cons. The first method is to pass paths as command line arguments using packages like argparse for python or argparser for R. This method works best when you only have one or a few input files to keep track of, because while python calculate_gc.py sample_1.fasta is fine, python run_analysis.py --vcf sample_1.vcf --reference hg38 --annotation annotation.ref --out_file out.txt is painful to have to type repeatedly.

In programming, laziness is a virtue, so ideally we want to avoid typing parameters repeatedly. The best way to handle lots of parameters is to use a configuration file. The idea behind a configuration file is that all the paths and settings for all your scripts live in one place, so you only have to change one file to modify an analysis. There are a number of ways to handle config files in both python and R. They offer a broad range of functionality, so just pick one that is comfortable for you. While this method saves a lot of parameter typing, it is less well suited for situations where you want to run the same analysis several times while varying the parameters.

The final way to avoid hardcoded paths is to set all paths relative to the root of the repository. This method works particularly well in cases where you are using publicly available data and your download script specifies the location the data will be stored. Using relative paths is great until it isn’t, though. If you ever want to switch out the data you’re using or move files around you’re back to the original problem of going through each file and changing paths. As a result, the first two methods are preferable to this one.

I’ve added some example code for each method below to help you decide which method is best for you.

# Command line arguments; can be run with python this_file.py ~/repo/data/in_file.csv
import argparse

parser = argparse.ArgumentParser(description='This script reads in a file as an example')
parser.add_argument('in_file_path', help='The path to the file to read in')
args = parser.parse_args()

with open(args.in_file_path) as in_file:
    pass

# settings.cfg
[paths]
in_file_path = ~/repo/data/in_file.csv

# Uses Config file
import configparser
import os

config = configparser.ConfigParser()

# This is the sort of hardcoded path to avoid, I've used it here for brevity
config.read('settings.cfg')

# Values live under the [paths] section; open() won't expand '~', so expanduser does
in_file_path = os.path.expanduser(config['paths']['in_file_path'])

with open(in_file_path) as in_file:
    pass

# Relative Paths
import os

current_directory = os.path.dirname(os.path.abspath(__file__))
in_file_path = os.path.join(current_directory, '../data/in_file.csv')

with open(in_file_path) as in_file:
    pass

README Redux

In the previous post I talked about the components required for a minimal README file, but there are other elements that can be added to make life easier. The main element is a description of how your directory is set up. Describing what each script does can be helpful for users who already have intermediate files, or who want to modify part of the code. Likewise, a description of where everything lives in your repository may be helpful, especially if you use a nonstandard method for directory organization.
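
The names here are made up, but a layout section of a README might look something like this:

Repository Layout
-----------------
code/download_data.py   Downloads the raw expression data
code/calculate_gc.py    Computes GC content for each sample
data/                   Raw and processed data files
output/                 Figures produced by the analyses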

Test Elsewhere

So you’ve gotten everything working! You pushed your science button, everything ran, and the figures that came out at the end are exactly what you expected. Now it’s time to find out whether your code actually works. The best way to do so is to go to another computer, download your code, and run the whole thing again. If you’re like most people and don’t have multiple computers to work with, you can do the same thing by moving to a different directory, removing your Conda environment, and then downloading/rerunning everything.

Rerunning your code helps you find the hardcoded paths you didn’t know were still there, the manual steps you forgot were necessary, and the libraries that turned out to not be in the environment file.

Readable Code

I used to believe that making my code readable wasn’t important for projects only I would use. After going back to old projects with uncommented functions enough times, I realized that I was shooting myself in the foot. I would often sink huge amounts of time into remembering what exactly each part did instead of spending that time doing useful work. This section covers a number of ways to keep that from happening to you.

When thinking about the audience for your code, it’s often easiest to write for a version of yourself a year in the future. If future you wouldn’t be able to follow what’s going on, other users will certainly struggle.

Documentation

Documentation is a broad subject, with entire posts dedicated to it and entire software packages designed to make the result more convenient. As this guide is dedicated to getting the most value with the least effort, I won’t go too far into the weeds. Instead I will cover what I see as the two most important parts: naming and commenting.

Naming

Ideas about the correct way to name functions and variables fall along a sliding scale of brevity. On one end you have the idea that each name should be so perfectly descriptive that comments feel unnecessary. At the other extreme you have the mathematical ideal: every variable is a single letter, with meaning implied by the language the character is pulled from and past code that used the same letter. In this system the variables are explained in comments, but their names are left brief to avoid masking the beauty of the algorithm or something.

As an avid reader of Pat Rothfuss’s books, I lean towards the more verbose side of the spectrum. Descriptive variable names are generally considered to be the better way of operating. Single letter variable names should be reserved for loop control (e.g. for i in range(10)) and for actual mathematical equations.

As an example, g is a bad variable name, genes is an okay variable name, gene_names is better, and housekeeping_gene_names is the best.

Commenting

Commenting code is fairly straightforward, and something that most people know they should do. What you might not know is that there are standards designed to ease some of the cognitive burden of deciding how to structure your comments. These standards typically also integrate with software that converts comments into documentation websites.

In R, roxygen2 implements a standard way of writing comments. It is the only one I am aware of, though others may exist.

In python, the two main commenting standards are reStructuredText and Google/NumPy style. Examples of both are shown below to help you pick a favorite.

# NumPy Style
def get_gene_count(gene_file):
	'''Count the number of genes in a file produced
	by `microarray_rnaseq_gene_intersection.py`

	Parameters
	----------
	gene_file: str or Path
		The path to a file containing the list of genes created by
		microarray_rnaseq_gene_intersection.py

	Returns
	-------
	num_genes: int
		The number of genes present in the gene file
	'''
# reStructuredText
def get_gene_count(gene_file):
	'''Count the number of genes in a file

	Count the number of genes in a file. The expected file format
	is that of a file produced by
	microarray_rnaseq_gene_intersection.py

	:param str gene_file: The path to a file containing genes
	:return: The number of genes present in the gene file
	'''

Regardless of how you choose to style your comments, what is important is that you have a description for each file and function. Inline comments about individual code parts can be used at your discretion.

Repository Organization

With some minor variation, academic publications all have the same sections. Every paper you read will have sections titled “introduction”, “methods”, “results”, and so on. It’s not necessary for papers to be organized this way, but it reduces the difficulty of orienting yourself in a new paper and makes it easier to find things.

Code repositories should be organized for the same reason. Like academic papers, most repositories share the same layout (shown below), but there are some variations. In this diagram, names ending in a slash are directories, while the rest are individual files.

  • <Project Name>/
    • data/
    • code/
    • output/
    • README.txt
    • LICENSE

The data directory is pretty self-explanatory. It stores the data files that your analyses are run on. You can also create subdirectories for raw and processed data files to keep track of both the original data and the data the analyses are actually run on.

The code directory has a lot more variation. For starters, in python projects the code directory will typically have the same name as the project itself. That’s because, with a little extra work, the code can be installed as a module and used in other projects. Code directories are also frequently split into two directories: one code directory and one directory for notebooks. This allows a logical separation between analysis code which goes in notebooks and supporting code which goes in the code directory.

The output directory stores the results of your analyses. Typically for academic projects, the output directory will contain visualizations that become figures in a paper.
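
Combining those variations, a fuller layout might look like this sketch:

  • <Project Name>/
    • data/
      • raw/
      • processed/
    • code/
    • notebooks/
    • output/
    • README.txt
    • LICENSE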

Back in the main directory there is a README file, which we’ve already discussed, and a license file. The license file is important, because otherwise others are unable to use your code. If you don’t know which license to choose, this website may help.

Code Organization

When I started writing code, my projects tended to be a single file with hundreds of lines of code and zero functions. In order to work on the project I needed to have hours of uninterrupted time to hold the entire project in my head and determine the next steps. In addition to being a headache to work with, I ended up making errors because I didn’t fully understand what was going on. Over time, I found that the solution to this monolithic design pattern was to split code into different files and functions.

Functions

Programs are most understandable when they are split into subportions that are progressively less abstract. To demonstrate, let’s make a cake.

# With functions
def mix_solids():
    bowl.add(butter)
    bowl.add(sugar)
    bowl.add(baking_powder)
    bowl.add(salt)
    mixer.mix(bowl)

def mix_liquids():
    bowl.add(eggs)
    bowl.add(vanilla)
    mixer.mix(bowl)

def preheat_oven():
    oven.set_temp(325)

def mix_batter():
    mix_solids()
    mix_liquids()

def bake_cake():
    oven.insert_cake()
    time.sleep(1800)
    oven.remove_cake()

def make_cake():
    preheat_oven()
    mix_batter()
    bake_cake()
# Without functions
oven.set_temp(325)
bowl.add(butter)
bowl.add(sugar)
bowl.add(baking_powder)
bowl.add(salt)
mixer.mix(bowl)
bowl.add(eggs)
bowl.add(vanilla)
mixer.mix(bowl)
oven.insert_cake()
time.sleep(1800)
oven.remove_cake()

It’s much simpler to understand the code with functions than the code without. As the scale of a project grows, functionless code becomes increasingly difficult to work with. For example, if you wanted to bake the cake for a shorter time, it is far more obvious where to make the change in the former code than in the latter. One might go so far as to call functionless code dysfunctional.

The easiest way to end up writing code in functions instead of slabs is to plan what you’re writing ahead of time. Instead of trying to write an entire analysis fully formed, write out each step as a comment, then convert each comment into a function, as in the sketch below. For me, at least, it is much simpler to think about what is involved in mixing a batter than to keep a whole recipe in my head.
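
This hypothetical sketch shows the progression; the step names are made up:

# First pass: the plan, written as comments
# 1. Load the expression data
# 2. Normalize each sample
# 3. Plot the results

# Second pass: each comment becomes a function
def load_expression_data(in_file_path):
    ...

def normalize_samples(expression_df):
    ...

def plot_results(normalized_df, out_file_path):
    ...

def run_analysis(in_file_path, out_file_path):
    expression_df = load_expression_data(in_file_path)
    normalized_df = normalize_samples(expression_df)
    plot_results(normalized_df, out_file_path)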

Files

Usually around the second analysis of a project, I end up needing code stored in the first analysis file. I begin to copy-paste the code, only to hear the words of my undergrad CS teacher echo in my head:

When you press the copy button, say to yourself, “I am.” Then, when you press the paste button, say to yourself, “making a mistake.”

Properly scolded, I take the code and move it to a function called in both analyses. Keeping a single function instead of having the code in two places keeps me from having to fix all the bugs in the code twice.

Eventually I end up with enough functions called in multiple analyses that it’s worth putting them in their own file. By convention, that file is usually called utils or utilities. Once it grows too large, the utilities file can be further split into more specialized files like a data_utils file. These files can be accessed with import utils in python or source("utils.R") in R.
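
As a sketch (the function and file names here are hypothetical), a shared utilities file and an analysis script that uses it might look like this:

# utils.py
def parse_gene_file(gene_file_path):
    '''Read a file containing one gene name per line'''
    with open(gene_file_path) as gene_file:
        return [line.strip() for line in gene_file]

# 1_differential_expression.py
import utils

gene_names = utils.parse_gene_file('data/processed/gene_list.txt')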

On the analysis side, the biggest gain in repo organization is to prepend a number to the script names so users know which order to run them in. Additionally, each analysis should be in its own file with a descriptive name.

Avoiding Magic Numbers

As cool as they sound, magic numbers (unexplained numeric constants sprinkled through your code) are not good programming practice. For example, if you’re accessing a part of a csv file, you should probably read in the header and determine the location of the correct column instead of writing gene = row.split(',')[2]. If the magic number is something that will always be the same, state that in your code.

For example, in pandas instead of using expression_df.iloc[:,2], use expression_df["gene_names"]. Likewise, in R prefer expression_df$gene over expression_df[,2]. If your columns or rows don’t have labels, a construction like the one below will keep your code from breaking if the columns change order.

def get_gene_column(in_file):
    '''Return the index of the column named "gene", or None if it is absent'''
    header = in_file.readline().strip().split(',')
    for i, col_name in enumerate(header):
        if col_name == 'gene':
            return i
    return None

gene_index = get_gene_column(in_file)

column_of_interest = expression_df.iloc[:, gene_index]

Finally, if using a number in your code is unavoidable, be sure to leave a comment to explain what it means.
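
One common pattern is to give the number a name alongside the comment. Here’s a hypothetical sketch, reusing the bake time from the cake example above:

import time

# Unclear: why 1800?
time.sleep(1800)

# Better: the number is named and explained
BAKE_TIME_SECONDS = 1800  # 30 minutes at 325 degrees, per the recipe
time.sleep(BAKE_TIME_SECONDS)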

Conclusion

This post is long, and there are a lot of things to think about. Don’t let the perfect be the enemy of the good. It is far better to use a couple of ideas from this post than to get dejected and use none of them.

Acknowledgements

Thank you to Rachel Ungar for her expertise in Snakemake and for helping edit this post. This post is funded in part by the Gordon and Betty Moore Foundation’s Data-driven Discovery initiative through grant GBMF4552.

BibTeX Citation:

@misc{Heil_2020, title={Reproducible Programming for Biologists Who Code - Part 2: Should Dos}, url={https://autobencoder.com/2020-06-30-shoulddo/}, journal={AutoBenCoding}, author={Heil, Benjamin J.}, year={2020}, month={06}}