Friday, February 3, 2017

Standing on the shoulders of giants

I am currently teaching a graduate "Stats" course, which is really a historical exploration of statistical issues in ecology, led by grad students. As part of the course, we are also exploring best practices in R and ecological data management. So naturally we covered Brian McGill's 10 commandments for good data management, and his follow-up post, which applies these recommendations to a toy data set.
I decided afterwards to do the challenge, and with our weekly University of Guelph R Users group (UGRU) we walked through the code line by line, discussing why certain lines were included, alternative ways to code them, and the advantages and disadvantages of these alternatives. It took us 3 hours of exploration, and I have captured our discussion in an alternative R script file, where our notes are preceded by "###" to differentiate them from Brian's comments.

Here is a link to this updated script file: https://drive.google.com/open?id=0B6C_pml53BPUQ1JKWV96NDFFVUU

Here is a summary of some of our observations:

  • the tidyverse packages make everything easy
  • read_csv is preferable to read.csv
  • tibbles are the way to go
  • reproducible code is very difficult (paths to files, outdated packages)
  • people have different philosophies about keeping or creating intermediate files, and about the value of long versus short file names
  • the flexibility of ggplot is awesome, and just as in base R, there are multiple ways to reach the same goal
  • and the biggest revelation for some of us: when you are piping, and your code is spread over multiple lines, you can still execute the whole block with one Cmd/Ctrl-Enter, without the need to highlight the block or step through it line by line!
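
To make a few of these points concrete, here is a minimal sketch. It is not taken from Brian's script or from our annotated version: the toy data, the column names, and the object names (toy_path, base_df, tidy_df, site_totals) are all made up for illustration, and it assumes the readr and dplyr packages are installed.

```r
# Assumes readr and dplyr (both part of the tidyverse) are installed.
library(readr)
library(dplyr)

# Hypothetical toy data, written to a temporary file so that the example
# does not depend on any paths on your own machine.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("site,species,count",
             "A,oak,3",
             "A,maple,5",
             "B,oak,2"),
           toy_path)

# Base R: read.csv() returns a plain data.frame.
base_df <- read.csv(toy_path)

# readr: read_csv() returns a tibble and reports the column types it guessed.
tidy_df <- read_csv(toy_path)

# A pipe spread over several lines. In RStudio, put the cursor on any of
# these lines and press Cmd/Ctrl-Enter to run the whole statement.
site_totals <- tidy_df %>%
  group_by(site) %>%
  summarise(total = sum(count))

print(site_totals)
```

Printing base_df and tidy_df side by side is a quick way to see how a tibble differs from a data.frame at the console.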

Thank you Brian for the nice tutorial, RStudio for the functionality, and Maddie for the Cmd/Ctrl-Enter trick in a piping block. Coding will be so much more efficient now.
