productivity with rstudio

ben best <bbest@nceas.ucsb.edu>
2014-04-03 in Santa Barbara, CA USA

overview

  1. data wrangling
    • tool: dplyr
  2. documenting
    • tool: markdown
  3. versioning
    • tool: github

inspiration

[image: cover of Gandrud (2013), Reproducible Research with R and RStudio]

1. data wrangling with dplyr

what is dplyr?

  • dplyr is the next iteration of plyr, focussed on tools for working with data frames.

  • plyr-screenshot.png

old: plyr

library(Lahman)
library(plyr)

games <- ddply(Batting, "playerID", summarise, total = sum(G))
head(arrange(games, desc(total)), 5)
   playerID total
1  rosepe01  3562
2 yastrca01  3308
3 aaronha01  3298
4 henderi01  3081
5  cobbty01  3035

new: dplyr chaining %.%

library(Lahman)
library(dplyr)

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(G)) %.%
  arrange(desc(total)) %.%
  head(5)
Source: local data frame [5 x 2]

   playerID total
1  rosepe01  3562
2 yastrca01  3308
3 aaronha01  3298
4 henderi01  3081
5  cobbty01  3035
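
Later dplyr releases deprecated the `%.%` operator in favour of the `%>%` pipe, so the chain above may not run on a current install. A minimal sketch of the same pattern with `%>%` follows. The `batting` data frame and its values are made up for illustration (they are not Lahman data), so the example runs without the Lahman package:

```r
library(dplyr)

# hypothetical stand-in for Lahman::Batting: one row per player-season
batting <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  G        = c(150, 160, 100, 90, 20),
  stringsAsFactors = FALSE
)

# same steps as the slide: group, total the games, sort, keep the top rows
top <- batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(2)

print(top)
```

Swapping `batting` for `Batting` (after `library(Lahman)`) and `head(2)` for `head(5)` should give the same chain as on the slide.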

dplyr concepts

  • grammar of data manipulation: sequence of operations on a dataset, rather than setting temporary variables or nesting functions inside one another

  • readable: minimal quoting needed, operations are easily aligned with one another (especially with merge())

  • fast: written largely in C++, often faster than the equivalent base R functions

  • generic: regardless of backend storage (data.frame, database, Google BigQuery…)
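
To make the "sequence of operations" idea concrete, here is a small sketch that writes one summary two ways: nested function calls, and a chain. It uses the built-in `mtcars` data and the `%>%` pipe from later dplyr releases; the column name `mean_mpg` is chosen here for illustration.

```r
library(dplyr)

# nested: has to be read from the inside out
nested <- arrange(
  summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg)),
  desc(mean_mpg)
)

# chained: reads top to bottom, one operation per line
chained <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

# both forms produce the same table
stopifnot(isTRUE(all.equal(as.data.frame(nested), as.data.frame(chained))))
```

No temporary variables are needed in the chained form, and each step can be commented out or reordered on its own line.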

2. documenting with markdown

what is markdown?

  • markdown is a plain text formatting syntax for conversion to HTML (with a tool)

  • r markdown enables easy authoring of reproducible web reports from R

  • in rstudio

    markdownOverview.png

embedding r code

  • chunks: text, tables, figures

    markdownChunk.png

  • inline: pi=`r pi` evaluates to “pi=3.1416”
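
As a rough sketch of how a chunk and an inline expression sit together in one file, the R code below writes a tiny R Markdown document to a temporary path. The file name, chunk label `cars-summary` and report text are all made up for illustration; the built-in `cars` data is used so nothing else is required.

```r
# three backticks, built with strrep() so the fence is easy to see in code
fence <- strrep("`", 3)

# lines of a minimal R Markdown document (contents are illustrative only)
rmd <- c(
  "# example report",
  "",
  paste0(fence, "{r cars-summary}"),          # chunk: opens with a fence plus {r label}
  "summary(cars$speed)",                      # R code run when the file is knit
  fence,                                      # chunk: closes with a plain fence
  "",
  "The mean speed is `r mean(cars$speed)`."   # inline: `r ...` inside a sentence
)

# write the document to a temporary .Rmd file
path <- file.path(tempdir(), "example-report.Rmd")
writeLines(rmd, path)

# show the file as it would appear in the editor
cat(readLines(path), sep = "\n")
```

Opening a file like this in RStudio and knitting it runs the chunk and substitutes the inline value into the sentence.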

embedding equations

  • inline

    The arithmetic mean is equal to \( \frac{1}{n} \sum_{i=1}^{n} x_{i} \), or the sum of n numbers divided by n.

embedding equations (2)

  • chunked
$$
\frac{1}{n} \sum_{i=1}^{n} x_{i}
$$
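
The formula can be checked numerically in R. This is a small sketch with made-up values for `x`; it computes the sum divided by n by hand and compares it with the built-in `mean()`.

```r
# made-up sample values
x <- c(2, 4, 6, 8)
n <- length(x)

# arithmetic mean written out as in the formula: (1/n) * sum of x_i
manual_mean <- (1 / n) * sum(x)

# agrees with R's built-in mean()
stopifnot(isTRUE(all.equal(manual_mean, mean(x))))
manual_mean
```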

online friendly

3. versioning with github

why version?

  • backup: offsite archive (if syncing with remote server)

  • rewind: roll back changes, so you can experiment and/or clean up code without worry of loss

  • document: associate changes of code and files with issues and messages

  • collaborate: with others

  • publish: to a web site (github is free for public repositories)

github process

  1. sign up at github.com

  2. install git and github

  3. create or fork a repository, aka “repo”

  4. clone from web to your local desktop

  5. commit changes locally

  6. push changes to your repo

  7. pull request changes from your repo to upstream
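
Steps 4 to 6 can be sketched from R by calling the git command-line client with `system2()`. This assumes git is installed and on the PATH. The folder, file name, commit message and user identity below are hypothetical, and the clone and push steps appear only as comments because they need a real remote repo.

```r
# hypothetical local folder standing in for ~/github/[repo]
repo <- tempfile("demo-repo-")
dir.create(repo)

# helper: run a git command inside the repo and return its output lines
git <- function(...) {
  system2("git", c("-C", shQuote(repo), ...), stdout = TRUE, stderr = TRUE)
}

# step 4 would normally be: git clone https://github.com/[user]/[repo].git
# here an empty local repo stands in for the clone
git("init", "-q")

# step 5: commit changes locally
writeLines("x <- 1", file.path(repo, "analysis.R"))
git("add", "analysis.R")
git("-c", "user.name=example", "-c", "user.email=example@example.com",
    "commit", "-q", "-m", shQuote("add analysis script"))

# step 6 would be: git push  (needs a remote, so it is not run here)

# list the commits made so far, one line each
commits <- git("log", "--oneline")
print(commits)
```

RStudio's Git pane runs the same add, commit and push operations through buttons, so this sketch is only meant to show what happens underneath.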

github process (2)

direction   org web                   user web                   user local
            github.com/[org]/[repo]   github.com/[user]/[repo]   ~/github/[repo]
-> (1x)                               -> fork                    -> clone
<-          merge {admin} <-          <- pull request            <- push, <-> commit

where:

  • [org] is an organization (eg ohi-science)
  • [repo] is a repository in the organization (eg ohicore, ohiprep, etc.)
  • [user] is your github username

github in rstudio

RStudio: File > New Project > Version Control

  • clone

    rstudio_git_clone.png

github in rstudio (2)

  • commit and push

    rstudio_git_commit.png

github in rstudio (3)

rstudio-vcs_diff.png

this presentation

more info