I’ve been gradually changing my workflows to be more in line with the principles of reproducible research. It has been life-changing – I can go back and run analyses from last year on a different computer and…it just works. It’s absolutely magical.
I’m not fully reproducible yet (I shudder to think what someone would do if they came across my Projects folder at this exact moment), but I’m getting closer. Here are the main reasons I’m going this route and the tools I’m using to do so.
There are lots of resources for reproducible research, but they are mostly overkill for my level. So, here’s what I’ve taken from the discussion.
A reproducible workflow is one in which each step of the analytical process is clearly documented in such a way that someone — and here it is better to imagine that person is not you — can retrace your steps and verify the exact results that you presented. –Baumer BS. (2017) Lessons from between the white lines for isolated data scientists. PeerJ Preprints 5:e3160v2
Tired of saying "No" each time RStudio asks you if you want to save your workspace upon exit? You can tell RStudio to stop asking with Preferences > General > Save workspace to .RData on exit: Never. #rstats (Sharon Machlis, @sharon000, August 23, 2018)
The primary feature of projects for me is that they allow me to have several data projects going at once in separate instances of R.
The other key feature is that if all of your files are in the project directory, you can use relative paths instead of full paths.
Using full paths is a mess, and if I ever move my folder, everything breaks.
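    # Full path: breaks as soon as the folder moves
    library(readxl)
    mydata <- read_xlsx("C:/HD/MyFiles/MyProjects/BigDataProject/myfile.xlsx")

In a project, though, I can just use short filenames and they will continue to work indefinitely, because RStudio sets the working directory to the project folder.

    # Relative path: resolved against the project folder
    library(readxl)
    mydata <- read_xlsx("myfile.xlsx")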
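If the project grows beyond a single folder, the same idea extends to subfolders. Here is a minimal sketch, assuming a hypothetical `data` subfolder inside the project (the folder name is my invention, not part of the setup described above). It builds the relative path with base R's `file.path()` and only reads the file if it is actually there.

```r
# Hypothetical layout: <project folder>/data/myfile.xlsx
# Inside an RStudio Project the working directory starts at the project folder,
# so a relative path needs no drive letter or user name.
data_path <- file.path("data", "myfile.xlsx")

# Only read the file if it exists, so this sketch runs on any machine.
if (file.exists(data_path)) {
  mydata <- readxl::read_xlsx(data_path)
} else {
  message("No file found at ", normalizePath(data_path, mustWork = FALSE))
}
```

Because nothing in the path mentions where the project lives on disk, moving or sharing the whole folder should not break the script.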
In my workflow, those are the “laws”. However, there are a few more things that I think improve my coding style, making my analyses more reproducible by myself and hopefully others.
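Hadley Wickham coined the term “Tidy Data” to describe data with the following features: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.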
Tidy data is easy to work with, easy to analyze, easy to understand. Getting from messy data to tidy data usually requires a lot of steps. Scripting those steps ensures that I can rerun my “data munging” if the raw data gets updated with new values. Depending on length, I either do these as a standalone code block in an R Notebook, or as a standalone file called tidy.rmd.
To tidy data, the tools in tidyr are invaluable. I wrote a post about tidying data.
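As a small illustration of what a scripted tidying step can look like, here is a sketch using `pivot_longer()` from tidyr. The data frame and its column names (`site`, `y2017`, `y2018`, `count`) are made up for the example and are not from any of my real projects.

```r
library(tidyr)

# Made-up "messy" data: one column per year instead of one row per observation.
messy <- data.frame(
  site  = c("A", "B", "C"),
  y2017 = c(10, 12, 9),
  y2018 = c(11, 15, 8)
)

# pivot_longer() stacks the year columns into a year/count pair of columns,
# stripping the "y" prefix from the old column names.
tidy <- pivot_longer(
  messy,
  cols = c(y2017, y2018),
  names_to = "year",
  names_prefix = "y",
  values_to = "count"
)
```

Since the step is a script rather than a series of manual edits, rerunning it on an updated raw file gives the same tidy shape every time.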
Here’s an example without the pipe. I had to create two extra intermediate variables and it isn’t very readable.

    library(dplyr)
    just4cyl <- filter(mtcars, cyl == 4)
    four_cyl_with_ratio <- mutate(just4cyl, gcratio = gear/carb)
    clean_cyl_with_ratio <- select(four_cyl_with_ratio, hp, wt, gcratio)
And an example using the pipe. So much better, no? The verbs are lined up and it is clear what I did.
    library(dplyr)
    clean_cyl_with_ratio <- mtcars %>%
      filter(cyl == 4) %>%
      mutate(gcratio = gear/carb) %>%
      select(hp, wt, gcratio)
Especially with R Notebooks, adding descriptive info about your thought process is a breeze. Explain to your future self what you were trying to do and what kludgy workarounds you might have employed along the way.
If you find yourself doing the same thing over and over, write a function or save the code as a script. Hadley’s rule of thumb is to do this for anything you do more than twice. For me, weighing the pain of copying and pasting against the pain of writing a function, the cutoff is closer to anything I do more than 10 times.
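To make that concrete, here is a rough sketch of how the mtcars pipeline from earlier could be wrapped in a function. The function name `ratio_for_cyl` and its parameter `n_cyl` are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical helper wrapping the filter/mutate/select steps used earlier.
# `data` is a data frame with cyl, gear, carb, hp and wt columns;
# `n_cyl` is the number of cylinders to keep.
ratio_for_cyl <- function(data, n_cyl) {
  data %>%
    filter(cyl == n_cyl) %>%
    mutate(gcratio = gear / carb) %>%
    select(hp, wt, gcratio)
}

# One call per cylinder count instead of a copied-and-edited pipeline.
four_cyl <- ratio_for_cyl(mtcars, 4)
six_cyl  <- ratio_for_cyl(mtcars, 6)
```

The payoff is that a fix to the ratio calculation only has to be made in one place.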
I only know Git/GitHub, which is baked into RStudio. Having regular backups and a history of my decisions is a lifesaver. Learning it has been a steep curve, and I’m sure I don’t really understand it. My basic habit is to commit a change at the end of every “complete thought”, whether that is an R code block that produces the output I like or a passage of text that stands on its own. I try to only commit when everything is working fine, so I can use each commit as a restore point.