My Favorite References for Working in R

Author

K. Lee

Published

April 6, 2020

This is a post that I will likely update later, but for now, I wanted to gather together some of my favorite references for working in R. Some of these assume a certain baseline knowledge of R; others are more about thinking through your data management and organizing your analysis in a way that is easier for Future You to revisit. It’s based on my perspective as someone who has mostly done work for academic-style research purposes and works primarily in R. If you want this information for another language, I’m sorry I can’t help you.

It’s based in part on this thread that I tweeted, but I wanted to expand on a lot of these references and add some other thoughts. Additionally, I wanted to make the information easier for me to find and easier for me to point people to.

First, everyone is always learning. Let’s all collectively try to be supportive and encouraging.

Second, one of the frustrating and amazing things about programming and data analysis is that there are multiple ways to do everything. Sometimes there are good reasons to choose one way over another. Sometimes it is personal preference. Sometimes it’s just the only way you managed to get it to work. Whatever. Accept it. Keep learning new ways to do stuff if you want, or keep using the way that works for you.

Third, I’m going to reference three people a bunch in this post: Past You, Present You, and Future You (or Past Me, Present Me, and Future Me). Right now Present You can’t do anything to fix what Past You did. Present You can only work to make life a little easier for Future You. Present Me usually thinks that Past Me was a lazy jerk who should have documented things better (maybe Past You was diligent and orderly…if so, congrats!). Sometimes Present Me accepts that Past Me didn’t know enough to do things better, or technology has changed, or that Past Me was doing the best we could in that situation, alright? But I like to pause when working to think about what Future Me would wish Present Me was doing for this project right now, and then take steps to make life a little easier for Future Me.

A lot of my work, before going back to graduate school, was made possible by Other People’s Data. I’ve seen great Other People’s Data and I’ve seen confusing and messy Other People’s Data. I’ve assembled confusing and messy datasets (Past Me didn’t know what she was doing!), and I’ve assembled well-documented datasets. For analyses, I often feel like we’re basically all like Wallace and Gromit, putting down tracks as fast as we can to keep moving forward (that is the gif below, which I can’t figure out how to give an image description for).

via GIPHY

Data Management

Keeping track of where your data came from and what you did to it is really important. So much of the data we use (speaking as an anthropologist and researcher in the area of human health) can’t be recreated because of its specific contextual nature. It takes time and money to collect data. Depending on the type of research you do, you’ve been entrusted with personal information from other people. Keeping track of it, and making sure it’s effectively used for the things you described in your proposals and IRBs, is the least you can do, but it takes time and effort and planning.

Present You can document what is going on in the study and organize the data in a way that makes Future You happy. Reflect on what Past You has done and see if there is anything Present You can do differently to make things easier or more clear for Future You. I promise that Future You is not going to remember random details and small decisions Present You is making any better than Present You remembers what happened in That Spreadsheet from 2014.

That being said, here are some of my favorite readings on data organization and data management.

  • Data Organization in Spreadsheets by Karl Broman and Kara Woo. Spreadsheets are not going away. Someday I dream of having a project that includes funding for an actual database with a database administrator. Until then, we’re all still probably going to be entering data from handwritten paper sheets into spreadsheets. But!!! We can do this in ways that make our data safer and easier to use in the future. Give this one a quick read and see if there is ONE THING you could start doing now in your research to make life easier for Future You.
  • Good Enough Practices in Scientific Computing, written together by a slew of awesome folks. There is a lot of information in this, but I’d start with reading the Data Management section if you’re new to doing some of these things. Again, see if there is ONE THING that makes sense to incorporate into your project.
  • Tidy Data by Hadley Wickham is a classic. It is what helped me realize how important it is to plan ahead for analysis when organizing your data.

I find that each time I re-read these articles, I remember that there is something that I started doing because I read about it that now feels natural to my workflow, and there is something else small that I can attempt to incorporate. Present Me can look back and appreciate a lot of tiny steps that Past Me took to learn and implement better data management practices.

Reproducible Research

Making sure your work is reproducible is important for scientific integrity, for Future You who has revisions or new ideas, for whoever gets your project later, etc. Reproducible basically means you could start with the same data and remake the same analyses (plots, results, models, tables, etc.). There are a number of more technical definitions and distinctions, but that’s a good starting point, and I think it makes it more manageable to think about steps you can take to achieve that goal.

Present Me really appreciates the scripted data cleaning that Past Me put into the project. Present Me is trying to better document the decisions that led to specific analyses for Future Me so we won’t go down the same rabbit hole a dozen times.
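To make "scripted data cleaning" a little more concrete, here is a minimal sketch of what that might look like in base R. The data, the column names, and the `clean_survey()` function are all made up for illustration (this is not from any of the references below). The point is only that every cleaning decision lives in code, with a comment saying why, rather than in hand edits to a spreadsheet.

```r
# Hypothetical raw data. In a real project this would be read from a file
# that is never edited by hand, for example read.csv("data/raw/survey.csv")
# (that path is an assumption, not a real file).
raw <- data.frame(
  id     = c(1, 2, 2, 3),
  height = c("170", "165", "165", "n/a"),
  stringsAsFactors = FALSE
)

# All cleaning decisions live in one function so they can be re-run and read.
clean_survey <- function(df) {
  # Decision: drop exact duplicate rows (assumed to be double data entry).
  df <- df[!duplicated(df), ]
  # Decision: treat the text "n/a" as missing, then store height as a number.
  df$height[df$height == "n/a"] <- NA
  df$height <- as.numeric(df$height)
  # Reset row names so they do not carry over the original row positions.
  rownames(df) <- NULL
  df
}

cleaned <- clean_survey(raw)
```

Running the script again on the same raw data gives the same cleaned table every time, and the comments tell Future You why each step is there.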

My favorite readings on Reproducible Research are:

  • Reproducible Research with R and RStudio or compile your own version here. I have the second edition, and I think it is totally worth the cost to have it as a paperback. It outlines a lot of considerations, and walks you through a number of options. But mostly it breaks down the various research tasks you might have (at least the computer-y ones) and helps you identify ways to make them more reproducible. This is the book that helped Past Me figure out how I can take small steps to be more reproducible when it seemed overwhelming.
  • Good Enough Practices in Scientific Computing (yes, again). See if there is ONE THING that makes sense to incorporate into your project.
  • Packaging data analytical work reproducibly using R (and friends) by Ben Marwick, Carl Boettiger, and Lincoln Mullen. This one is a bit more technical at times, but it introduces the idea of a Research Compendium, and gives a few different versions of what that might look like for various types of projects. I like how it lays out several options of different complexity. It also provides some goals for how to share your data and analyses more effectively with collaborators.
  • Report Writing for Data Science in R by Roger Peng. I like this one because when I’m giving updates to my collaborators it helps if I can show them results or summary stats, and then I can update this document for the next meeting based on our discussion. In my mind, this is basically writing them a report. This book walks you through some considerations to make that easier and less tedious. The report-writing isn’t exactly the same as the work that would go into a manuscript, but it can help you have conversations with your team to narrow down the scope of your manuscript, and then you can re-use a lot of that work when it comes time to write your official version.
  • RMarkdown Driven Development (RmdDD) by Emily Riederer. Honestly, this one probably best matches how I tend to do stuff. You don’t always know where your analyses will go, but this gives you some ways to think through re-organizing after you get your footing in a project.

Learning R stuff in general

This is the area that I’d like to add more stuff to over time, but for now here are the main links that I provided in my Twitter thread referenced above. Honestly, the best R reference for you is going to depend on what type of data you work with, how familiar you are with basic programming, how familiar you are with statistics, and how opinionated you are about graphics and layouts.

These are the main places I turn to when I have questions (along with package documentation and vignettes):

  • Cookbook for R and the associated R Graphics Cookbook. Both provide small examples and sample code, and show you how tweaking parts of the code changes what happens.
  • R for Data Science is available completely online, but I also have a paper copy because I find it easier to flip through and reference. This goes a bit more in-depth on some data wrangling and cleaning stuff that is really useful in real life (where our data do not arrive all neat and tidy). Additionally, I find their approach to thinking through the analysis steps and process and workflow useful.
  • Data Visualization: A practical introduction by Kieran Healy. I kinda like this one in part because it deals with social science data, which is often more messy to analyze and interpret than some of the main example datasets often used in R tutorials. It is also based on concepts from Tufte, and it tries to get you familiar with ideas and concepts rather than throwing a TON of code at you.
  • Data Science Specialization at Coursera from Johns Hopkins Biostats department. This is a great collection of classes that go over everything from getting started in R to different types of analyses. I highly recommend it if you prefer to learn from videos instead of books. The classes are free, and I periodically skip around and watch the videos that are most pertinent to my current questions or roadblocks.
  • The Carpentries. There are various free lessons available on a bunch of topics (not just R!) through Data Carpentry, Software Carpentry, and Library Carpentry. Poke around, try lessons. See if there is a workshop happening near you sometime.

In addition to these readings, I find it helpful to look at other people’s code and have people look at my code. Each week on Twitter you can peruse the TidyTuesday hashtag and see a bunch of examples of people exploring a dataset. You can see different approaches people take, different questions, different visualizations.

Finally, please remember that it takes a lot of time and effort to learn new skills. Be kind to yourself. Present You is doing the best you can.

Did I miss anything you think is vital? Do you have questions? Are any of the links broken? Tweet me any feedback you might have, and I’ll try to get back to you and/or fix this post.