2023-03-03
As Ioannidis (2005) argues, most research findings are false.
This has led to the replication crisis in many fields (esp. social psychology), in which seemingly well-established results just… aren’t true.
This is not (usually) malfeasance on the part of a researcher. It’s just bad practice, and a lack of transparency.
As Christensen and Miguel (2018) argue, this is a solvable problem with better research practices.
One part of the solution is replicable research workflows. The key benefits:
Key Drawback: more work if not executed properly.
This presentation: tips on how to make this part of your workflow, and eliminate the drawback.
Each step here should be documented and replicable.
conda
One key issue is making sure you can easily re-create all of the tools (software) you need to do an analysis.
A key solution is environment management.
An environment is a virtualized computational environment with defined, tracked properties.
The key benefits are that:
There are many environment managers out there. The most popular is Anaconda (conda).
conda
The renv package manages R packages.
For Python, the venv module is similar to conda; you shouldn’t mix and match them.
conda
You can install conda from their website, for the UI (which is awful), or install Miniconda for the command line:
https://docs.conda.io/en/latest/miniconda.html
My rationale is that if you’re using conda you need to be at least a little comfortable with the command line.
conda Concepts
In conda, you create virtual environments into which you can install software.
conda
via a package manager
conda Commands
conda create --name MYENV
conda activate MYENV
MYENV
conda install -c conda-forge PACKAGE
conda info --envs
conda list --revisions
then conda install --rev 8
conda create --name R-env
conda activate R-env
conda install -c conda-forge r-essentials
conda install -c r rstudio
conda run rstudio
If you have an external IDE (like RStudio or VSCode), you will also be able to see the R-env environment as one way to “run” R.
conda Environments
Okay, you’ve done a bunch of stuff. You want to send your code and workflow to your PI. How do you do that?
conda activate R-env
conda env export > r-environment.yml
Then on the other computer, you just do:
conda env create -f r-environment.yml
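A full export pins exact builds for the exporting machine, which may not resolve on a different operating system. One possible variation, assuming your PI might be on another platform, is to export only the packages you explicitly asked for. The helper name export_env_portable below is made up for this sketch; it is not a conda command.

```shell
# Sketch: export an environment in a more portable form.
# export_env_portable is a hypothetical helper name, not part of conda.
export_env_portable() {
  env_name="$1"
  out_file="$2"
  if ! command -v conda >/dev/null 2>&1; then
    echo "conda not found on PATH" >&2
    return 1
  fi
  # --from-history keeps only the packages you explicitly requested,
  # so dependencies are re-solved on the receiving machine.
  conda env export --name "$env_name" --from-history > "$out_file"
}

# Usage (assumes the R-env environment from above exists):
# export_env_portable R-env r-environment.yml
```

The trade-off is that the receiving machine may resolve slightly different dependency versions, so the full export remains the stricter record when both machines run the same OS.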
A key issue is when you need a package or software that isn’t available via conda install.
renv)
Usually best to download a specific version using curl and then save that information (version and source).
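As a rough sketch of the "pin a version and save that information" idea: the tool name, version, URL, and log file name below are all placeholders invented for illustration, not a real release.

```shell
# Sketch: download one pinned release with curl and log what was fetched.
# TOOL_NAME, TOOL_VERSION, and TOOL_URL are placeholders, not a real release.
TOOL_NAME="sometool"
TOOL_VERSION="1.2.3"
TOOL_URL="https://example.com/releases/${TOOL_NAME}-${TOOL_VERSION}.tar.gz"
VERSION_LOG="external-software.tsv"

# Append name, version, source URL, and UTC date to a tab-separated log.
record_version() {
  printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$(date -u +%Y-%m-%d)" >> "$VERSION_LOG"
}

# Download, and record the version only if the download succeeded.
fetch_pinned() {
  # -f: fail on HTTP errors, -L: follow redirects, -o: write to the named file
  curl -fL -o "${TOOL_NAME}-${TOOL_VERSION}.tar.gz" "$TOOL_URL" &&
    record_version "$TOOL_NAME" "$TOOL_VERSION" "$TOOL_URL"
}

# Usage: fetch_pinned
```

Committing the log file alongside your code means anyone re-creating the project can see exactly which version was used and where it came from.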
I know, it sucks. Write a short bash script to launch all the stuff you want in your project folder instead. No more scary command line.
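One way such a launcher might look, assuming the R-env environment and the rstudio command from the earlier example. The file name launch.sh is an arbitrary choice. Run this once in the project folder to create the script:

```shell
# Sketch: create a small launcher script in the project folder.
# launch.sh is a hypothetical file name; R-env and rstudio come from the
# earlier example and should be swapped for your own environment and tool.
cat > launch.sh <<'EOF'
#!/usr/bin/env bash
# Launch the project's tools inside its conda environment.
set -euo pipefail
ENV_NAME="R-env"
cd "$(dirname "$0")"   # always start from the project folder
# conda run executes one command inside the named environment,
# so the script does not need an interactive "conda activate".
conda run -n "$ENV_NAME" rstudio
EOF
chmod +x launch.sh
```

After that, ./launch.sh (or a double-click, depending on your file manager) starts the tool in the right environment.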
git
git?
If conda manages software, what manages files or data? The answer: git and version control.
git is a version control system which was developed by Linus Torvalds for use with Linux. It keeps track of the changes made to files. While there are other options (e.g., Mercurial), git is by far the most popular and widely adopted.
Pros
Cons
Fortunately, unless you are working on large projects with lots of co-authors, the basic git workflow is actually easy to use.
The benefits of git are similar to LaTeX: it has a learning curve, but once you get the hang of it, everything else will seem inferior and wrong.
git Concepts and Terminology
In git, there are many new concepts:
main is the “original” or “main” one.
I am working with my colleague on a project. We store all our stuff on a private GitHub remote.
We just worked on a project together!
git
You can install git from the website:
This will give you the software and a Linux-based command-line tool (git bash) and a simple GUI (git gui). You can install other software now:
git Workflow Part 1: Cloning and Branching
git Workflow Part 2: Staging and Committing
A bundle of changes to files is called a commit (git commit). You can select only some files by staging them, which is easiest in a GUI.
You shouldn’t commit things every time you save a file. Only commit stuff that you want to mark as a point to go back to; usually after you’ve done some stuff, or hit a milestone.
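For those who prefer the command line, here is a minimal sketch of staging only some files and committing them. It runs in a throwaway directory; the file names and the identity passed with -c are made up for the demo.

```shell
# Sketch: stage only some files, then commit them as one milestone.
# File names and the -c identity are invented for this demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
git init --quiet

echo "x <- 1" > analysis.R
echo "scratch" > notes.txt

# Stage only the file that belongs in this commit.
git add analysis.R

# -c sets a throwaway identity for this one command only.
git -c user.name="Demo User" -c user.email="demo@example.com" \
  commit --quiet -m "Add first analysis script"

git log --oneline     # one commit so far
git status --short    # notes.txt is still untracked
```

Because notes.txt was never staged, it stays out of the commit and can be committed later, or never.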
git Workflow Part 3: Pulling, Merging, and Pushing
git Workflow Part 4: F-ing it Up
Git is hard: messing up is easy, and figuring out how to fix your mistakes is impossible. Git documentation has this chicken and egg problem where you can’t search for how to get yourself out of a mess, unless you already know the name of the thing you need to know about in order to fix your problem. (Katie Sylor-Miller)
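Two of the more common messes do have fairly safe exits. The sketch below shows them in a throwaway repository. It assumes a reasonably recent git (git restore was added in version 2.23), and the helper g and the file contents are invented for the demo.

```shell
# Sketch: two low-risk ways out of common mistakes, in a throwaway repo.
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
git init --quiet

# g is a hypothetical helper: git with a throwaway identity for the demo.
g() { git -c user.name="Demo User" -c user.email="demo@example.com" "$@"; }

echo "good" > results.txt
git add results.txt
g commit --quiet -m "Add results"

# Mistake 1: an unwanted edit that has not been committed yet.
echo "oops, wrong numbers" > results.txt
git restore results.txt          # discard the edit, back to the last commit

# Mistake 2: a bad change that was already committed.
echo "bad" > results.txt
git add results.txt
g commit --quiet -m "Break results"
g revert --no-edit HEAD          # adds a new commit that undoes the bad one

git log --oneline                # history keeps all three commits
git reflog                       # every position HEAD has been at, as a last resort
```

Both approaches avoid rewriting history, which is why they are safe to use on a branch you share with co-authors.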
git repo they can increase its size exponentially
git Large File Storage system instead (https://git-lfs.com/)
At this point, you have an idea of how to make your work completely reproducible.
Conceptually, this should not matter, but sometimes it does.
To do this, you need a server - but fortunately, GitHub and GitLab exist, and are free.
You can set up a script which pulls a defined image of a software and virtual computer, loads or creates your environment, clones your repo, and then builds your project all in the cloud.
This is the highest level of reproducibility - it’s not even hard.
For example, here’s how to build LaTeX documents in GitHub: https://github.com/marketplace/actions/github-action-for-latex
Overleaf even synchronizes with this too: https://www.overleaf.com/learn/how-to/Using_Git_and_GitHub
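The cloud build described above (pull an image, create the environment, clone the repo, build) might be sketched as a GitHub Actions workflow. The shell snippet below writes one into the current repository. The entry point analysis.R is hypothetical, the two actions named are common choices rather than the only ones, and r-environment.yml and R-env are the file and environment from the earlier export example.

```shell
# Sketch: write a minimal GitHub Actions workflow into the current repo.
# analysis.R is a hypothetical entry point; the action versions shown
# are assumptions and may need updating.
mkdir -p .github/workflows
cat > .github/workflows/build.yml <<'EOF'
name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest            # the defined image
    steps:
      - uses: actions/checkout@v4     # clone the repo
      - uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: r-environment.yml
          activate-environment: R-env
      - name: Build the project
        shell: bash -el {0}           # login shell so the environment is active
        run: Rscript analysis.R
EOF
```

Once this file is committed and pushed, every push re-creates the environment from scratch and re-runs the build, so a broken environment file or script shows up right away.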