According to John Chambers, one of the creators of R’s precursor S:
Everything that exists in R is an object
Everything that happens in R is a call to a function
Assignment operator: <-, preferred over =
scores <- mtcars
Now we can use scores as an argument to other functions. For example, compute summary statistics for each column in the data:
summary(scores[1:4]) # First four columns
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
Use Alt + - (Win) or Option + - (Mac) in RStudio to quickly insert <-
Scalars (0-dimensional): a = 42, b = a / 7
Vectors (1-dimensional): b = c(12, 14, 16)
b[2] (returns 14)
Matrices (2-dimensional):
mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx
##      [,1] [,2]
## [1,]    3    2
## [2,]    1    3
## [3,]    3    2
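As a small illustration of the dimensions listed above, the sketch below checks lengths and dimensions and indexes a matrix by row and column. The object names `v` and `m` are made up for this example. Note that what is called a scalar here is, in R, simply a vector of length one.

```r
a <- 42                                      # a "scalar" is a vector of length 1
v <- c(12, 14, 16)                           # hypothetical example vector
m <- matrix(c(3, 1, 3, 2, 3, 2), ncol = 2)   # filled column by column

length(a)      # 1
is.vector(a)   # TRUE
v[2]           # 14
dim(m)         # 3 2
m[2, 1]        # 1      (row 2, column 1)
m[, 2]         # 2 3 2  (entire second column)
```

Leaving the row or column index empty, as in `m[, 2]`, selects the whole row or column.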
Be careful not to give your variables the names of existing functions. E.g., c is a bad variable name because c() is a function for combining values. Check its help page with ?c
With auto-completion in RStudio, you don't need to worry about variable name length - make names that are self-explanatory
Follow Hadley Wickham's Tidyverse Style Guide
# numeric: real or decimal numbers, sometimes referred to as "double"
# integer: a subset of numeric in which numbers are stored as integers
a <- 2
# character: sometimes referred to as string data, tend to be surrounded by quotes
a <- "2"
# logical: Boolean data (TRUE and FALSE)
a <- TRUE
Auxiliary functions
class(a)
str(a)
is.numeric(a) # TRUE if the type matches, same with is.character()
as.numeric("2") # Attempt to convert types
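To make "attempt to convert types" concrete, here is a minimal sketch of what happens when a conversion cannot succeed. The input vector `x` is a made-up example; the warning R emits for the failed conversion is suppressed only to keep the output short.

```r
x <- c("2", "3.5", "abc")     # hypothetical character input

class(x)                      # "character"
is.character(x)               # TRUE

# Text that looks like a number is converted; anything else becomes NA
y <- suppressWarnings(as.numeric(x))
y                             # 2.0 3.5 NA
is.na(y)                      # FALSE FALSE TRUE

as.integer(3.9)               # 3 (truncates toward zero, does not round)
as.logical(c("TRUE", "yes"))  # TRUE NA
```

Checking `is.na()` after a conversion is a simple way to spot values that could not be interpreted.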
factor() - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
ordered() - used for ordinal data ("First", "Second", "Second", "Third")
factor(c("Cats", "Cats", "Dogs", "Birds"))
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
ordered(c("First", "Second", "Second", "Third"))
## [1] First  Second Second Third 
## Levels: First < Second < Third
levels() - get levels of a factor. Also, levels is an argument of the factor() function that sets the level order manually
relevel() - reorder factor levels by moving a chosen reference level to the front
is.factor(), as.factor()
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
relevel(a, ref = "Cats")
## [1] Cats  Cats  Dogs  Birds
## Levels: Cats Birds Dogs
levels(a) <- rev(levels(a)) # Caution: this relabels the levels, so the values change too
a
## [1] Cats  Cats  Birds Dogs 
## Levels: Dogs Cats Birds
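The last result above can be surprising: assigning to levels() renames the labels rather than reordering them. The sketch below contrasts that with passing the desired order to factor(), which keeps the data intact. The object names `b` and `d` are illustrative only.

```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))

# Reorder the levels without touching the data
b <- factor(a, levels = c("Dogs", "Cats", "Birds"))
as.character(b)   # "Cats" "Cats" "Dogs" "Birds"  (values unchanged)
levels(b)         # "Dogs" "Cats" "Birds"

# Assigning to levels() relabels instead, so the values themselves change
d <- a
levels(d) <- rev(levels(d))
as.character(d)   # "Cats" "Cats" "Birds" "Dogs"
```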
dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat
##   Column.1 Column.2
## 1        3        2
## 2        1        3
## 3        3        2
dim(dat)
## [1] 3 2
nrow(dat)
## [1] 3
ncol(dat)
## [1] 2
length(dat)
## [1] 2
colnames(dat)
## [1] "Column.1" "Column.2"
rownames(dat)
## [1] "1" "2" "3"
dat[3, 2] # [] contain row/column indices.
## [1] "2"
dat[3, "Column.2"] # Address by column name
## [1] "2"
dat$Column.2[3] # Use $ shortcut to access column by name
## [1] "2"
# Compare column classes
class(dat$Column.1)
## [1] "numeric"
class(dat$Column.2)
## [1] "character"
# Top or bottom of a data frame
head(dat)
tail(dat)
There are several built-in functions that are useful for working with data frames.
Content:
head(): shows the first few rows
tail(): shows the last few rows
Size:
dim(): returns a 2-element vector with the number of rows as the first element and the number of columns as the second element (the dimensions of the object)
nrow(): returns the number of rows
ncol(): returns the number of columns
colnames() (or just names()): returns the column names
str(): structure of the object and information about the class, length and content of each column
summary(): works differently depending on what kind of object you pass to it. Passing a data frame to the summary() function prints out useful summary statistics about each numeric column (min, max, median, mean, etc.)
lst = list(A = rep(2, 5), B = seq(1:10), C = letters)
lst
## $A
## [1] 2 2 2 2 2
## 
## $B
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $C
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
lst[1] (or, lst["A"]): returns a list containing the selected element
lst[1]
## $A
## [1] 2 2 2 2 2
lst[[1]] (or, lst[["A"]], lst$A): returns the element itself
lst[[1]]
## [1] 2 2 2 2 2
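The difference between single and double brackets is easiest to see by checking the class of the result. This is a minimal sketch reusing the list from above (with `1:10` in place of `seq(1:10)`, which gives the same values).

```r
lst <- list(A = rep(2, 5), B = 1:10, C = letters)

class(lst[1])      # "list"     (a list of length 1)
class(lst[[1]])    # "numeric"  (the vector itself)

# Arithmetic works on the extracted vector, not on the sub-list
sum(lst[["A"]])    # 10
length(lst$C)      # 26

# Single brackets can select several elements at once
names(lst[c("A", "C")])   # "A" "C"
```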
R ignores everything after the # sign
# This line is a comment
print("Hello, World!") # This will print the message, but the comment will be ignored
## [1] "Hello, World!"
z <- c(1, 2, 3)
ls()
## [1] "a"      "dat"    "lst"    "mtx"    "scores" "z"
rm(z) # Remove one variable
ls()
## [1] "a"      "dat"    "lst"    "mtx"    "scores"
# Remove everything from the environment
rm(list = ls()) # Not the same as restarting R session
ls()
## character(0)
summary(), mean()
read.csv(file="scores.csv")
read.csv is a function to import a CSV file, and file is an argument that specifies which file to import
R has a large number of built-in functions, and the user can create their own functions
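As a minimal sketch of a user-defined function, the example below wraps two built-in functions into a new one. The function name `coef_var` and its `na.rm` argument are hypothetical choices for this illustration.

```r
# Hypothetical helper: coefficient of variation (standard deviation divided by mean)
coef_var <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

coef_var(c(2, 4, 6))                     # 0.5
coef_var(c(2, 4, 6, NA), na.rm = TRUE)   # 0.5
```

The value of the last evaluated expression is what the function returns, so an explicit return() is optional here.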
From the R console - type the function and hit Enter
Using an R script - a text file that contains all your R functions/code
R scripts allow you to save, edit, reproduce and share your code
R scripts use the .R extension
Run a script with source("script_name.R"), or, from the command line, Rscript script_name.R
All functions belong to packages. The read.csv function is in the utils package.
R comes with about 30 packages (called "base R"), but as of August 2020, there are over 16,000 CRAN packages and over 1,900 Bioconductor packages
Example: ggplot2 is a popular package that adds functions for creating graphs in a different way than what base R provides
To use functions in a package, the package must be installed and loaded. (They're free)
CRAN - Comprehensive R Archive Network, a collection of > 16,000 (September 2020) packages
Bioconductor - genomics-oriented free and open source project hosting > 1,900 specialized R packages (September 2020)
MRAN - Microsoft R Application Network, includes CRAN packages and more
GitHub - code-hosting repository, packages for everyone and by everyone
install.packages - installs packages from CRAN, e.g., install.packages("BiocManager")
remotes package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., remotes::install_github("tidyverse/ggplot2")
BiocManager::install() - install or update Bioconductor, CRAN, or GitHub packages
RStudio point-and-click interface
library() will load the package, e.g., library(readxl) or library("readxl")
install.packages("readxl")
require() will load the package and, if successful, return TRUE. Useful in an if statement, e.g.
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
install.packages("<package_name>") - install from CRAN
install.packages("<package_name.tar.gz>", repos = NULL) - install from a tarball archive
R CMD INSTALL <package_name.tar.gz> - install from a command line
devtools::install_github('mdozmorov/MDmisc') - install from GitHub
BiocManager::install() - install Bioconductor, CRAN, and GitHub packages
https://CRAN.R-project.org/package=BiocManager
library(package_name) - load a package to use its functions
library() vs. require()
require() tries to load the package, returns TRUE or FALSE
library() just loads the package, fails if the package is not available
Use only library(package_name)
https://yihui.name/en/2014/07/library-vs-require/
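When a script does need to test whether a package is installed, one option that avoids require() is base R's requireNamespace(). The sketch below is only one way to do it; the helper name `load_or_stop` is hypothetical.

```r
# Hypothetical helper: fail early with a clear message if a package is missing
load_or_stop <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
  }
  library(pkg, character.only = TRUE)  # attach it once we know it exists
  invisible(TRUE)
}

load_or_stop("stats")   # ships with R, so this succeeds
```

The character.only = TRUE argument is needed because the package name is held in a variable rather than typed literally.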
You can access functions without loading the package using the :: operator, e.g., Hmisc::rcorr()
Entering the function name without parentheses will output its code
> data.frame
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) {
  data.row.names <- if (check.rows && is.null(row.names)) ...
Functions that a package does not export can be accessed with the ::: operator if you know their name
Get an overview of all functions in a package: help(package = "dplyr")
browseVignettes(package = "limma")
Use ?function_name to get help on a function from a loaded package. E.g., ?boxplot (same as help(boxplot))
Use example(boxplot) to see how the function can be used
Use ??function_name to search for the function across all installed packages, even those not loaded. E.g., ??ggplotly
Search engine is your best friend on many things
Base functions: read.table, read.csv, write.table, write.csv
Tidyverse way, readr package: read_table, read_csv, read_tsv, write_csv ...
For fixed-width files, use read.fwf or readr::read_fwf functions
For reading/writing Excel files, use readxl and writexl packages, read_xlsx, write_xlsx functions
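A quick round trip with the base functions shows how writing and reading fit together. This is a sketch using a temporary file; the data frame `df` and its columns are made up for the example.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

path <- tempfile(fileext = ".csv")       # temporary file, removed at the end
write.csv(df, path, row.names = FALSE)   # row.names = FALSE avoids an extra index column

df2 <- read.csv(path)
str(df2)        # inspect the column types that were inferred on reading
df2$id          # 1 2 3

unlink(path)
```

Without row.names = FALSE, the row names are written as an unnamed first column, which then reappears as an extra column when the file is read back.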
.csv is the preferred text-based format that opens in Excel
In R versions before 4.0.0, strings are automatically converted to factors when creating data frames with data.frame() or reading data with read.table(). Since R 4.0.0, the default is stringsAsFactors = FALSE
This behind-the-scenes factor conversion can lead to unpredictable behaviors
Use as.is = TRUE in read.table() to avoid such conversion
In R versions before 4.0.0, better yet, set options(stringsAsFactors = FALSE) at the beginning of your script files
https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/
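To avoid depending on the R version, the argument can be stated explicitly. The sketch below also shows one classic way the hidden conversion bites: calling as.numeric() on a factor. The object names are illustrative only.

```r
d_chr <- data.frame(x = c("a", "b", "a"), stringsAsFactors = FALSE)
d_fct <- data.frame(x = c("a", "b", "a"), stringsAsFactors = TRUE)

class(d_chr$x)   # "character"
class(d_fct$x)   # "factor"

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1  (codes, not the values)
as.numeric(as.character(f))   # 10 20 10
```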
save(), load() - saves/loads R objects to the specified file
x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.rda")
load(file = "xy.rda")
saveRDS(), readRDS() - saves/loads a representation of the object
x <- stats::runif(20)
saveRDS(x, file = "x.rds")
x2 <- readRDS(file = "x.rds")
identical(x, x2, ignore.environment = TRUE)
https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/
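The practical difference between the two pairs is whether object names travel with the file. The following sketch uses temporary files and made-up objects to show it.

```r
x <- c(1.5, 2.5, 3.5)
y <- list(a = 1, b = TRUE)

rda <- tempfile(fileext = ".rda")
rds <- tempfile(fileext = ".rds")

save(x, y, file = rda)     # stores several objects together with their names
saveRDS(x, file = rds)     # stores a single object, without its name

rm(x, y)
restored <- load(rda)      # recreates x and y; returns their names
restored                   # "x" "y"

x_copy <- readRDS(rds)     # the caller chooses the new name
identical(x, x_copy)       # TRUE

unlink(c(rda, rds))
```

Because load() silently overwrites any existing objects with the same names, readRDS() is often the safer choice for a single object.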
R contains many datasets (stored as data frames) that are built-in to the software
data() # All built-in datasets
# ?trees
data(trees) # Load a particular one
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
attach(trees) # You can make R find variables in any data frame by adding the data frame to the search path
search() # .GlobalEnv is your workspace and the package:* entries are attached packages
##  [1] ".GlobalEnv"             "trees"                  "package:xaringanthemer"
##  [4] "package:stats"          "package:graphics"       "package:grDevices"     
##  [7] "package:utils"          "package:datasets"       "package:methods"       
## [10] "Autoloads"              "package:base"
detach(trees) # To remove an object from the search path, use detach()
with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data
## [1] 76
attach() can cause name conflicts and other serious issues. Avoid it
Simple statistical functions: count() (from dplyr, not base R), min(), max(), mean(), median(), sd(), cor(), summary()
These, and many other functions, have settings to properly handle NAs, e.g., mean(x, trim = 0, na.rm = FALSE, ...)
complete.cases() on a matrix/data frame returns a row-wise logical vector with TRUE for rows without NAs
unique() - unique elements in a vector. Combine with length() to get the number of unique elements
table() - contingency table for a vector (the number of elements per unique level)
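The functions above behave differently once missing values are present. Here is a short sketch with made-up data showing the na.rm setting together with complete.cases(), unique() and table().

```r
x <- c(4, 8, NA, 8, 15)

mean(x)                     # NA, because na.rm = FALSE by default
mean(x, na.rm = TRUE)       # 8.75

unique(x)                   # 4 8 NA 15
length(unique(x))           # 4  (NA counts as a distinct value)
table(x)                    # NA is dropped by default
table(x, useNA = "ifany")   # adds a count for NA

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete.cases(df)          # TRUE FALSE FALSE
df[complete.cases(df), ]    # keeps only the first row
```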
data(mtcars) # simple summary
# ?mtcars
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mean(mtcars$mpg) # Try median, sd, var, min, max
## [1] 20.09062
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
quantile(mtcars$mpg, probs = c(.20, .80))
##   20%   80% 
## 15.20 24.08
cor(mtcars$mpg, mtcars$hp) # sample correlation coefficient
## [1] -0.7761684
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14
table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
if, else
for
while
repeat
break
next
Conditional code execution
if (condition) {
  # do something
} else {
  # do something else
}
==: Equal to
!=: Not equal to
>, >=: Greater than, greater than or equal to
<, <=: Less than, less than or equal to
x <- 1:15
if (sample(x, 1) <= 10) {
  print("x is less than or equal to 10")
} else {
  print("x is greater than 10")
}
## [1] "x is less than or equal to 10"
Repetitive code execution
for (i in 1:5) {
  cat(i)
}
## 12345
Compare with
for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
  cat(i)
  cat(" ")
}
## apples oranges bananas strawberries
for (i in 1:4) {
  cat(x[i])
  cat(" ")
}
## apples oranges bananas strawberries
for (i in seq(x)) {
  cat(x[i])
  cat(" ")
}
## apples oranges bananas strawberries
m <- matrix(1:10, 2)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
for (i in seq(nrow(m))) {
  for (j in seq(ncol(m))) {
    print(m[i, j])
  }
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
} # Be sure there is a way to exit out of a while loop
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
repeat {
  # simulations; generate some value, have an expectation;
  # if within some range, then exit the loop
  if ((value - expectation) <= threshold) {
    break
  }
}
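The repeat outline above is pseudocode, so here is one runnable sketch of the same idea. Everything concrete in it is an assumption made for illustration: the simulated quantity (the mean of 20 uniform draws), the values of `expectation` and `threshold`, the use of abs() for the distance, and the `max_iter` safety cap.

```r
set.seed(1)             # arbitrary seed, only for reproducibility
expectation <- 0.5      # hypothetical target value
threshold <- 0.05       # hypothetical tolerance
max_iter <- 1000        # safety cap so the loop always terminates
n_iter <- 0

repeat {
  n_iter <- n_iter + 1
  value <- mean(runif(20))   # simulated quantity
  # exit when close enough to the expectation, or when the cap is reached
  if (abs(value - expectation) <= threshold || n_iter >= max_iter) {
    break
  }
}

n_iter                                   # number of iterations that were needed
abs(value - expectation) <= threshold    # whether the tolerance was met
```

A repeat loop has no condition of its own, so a break (and ideally an iteration cap) is the only way out.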
for (i in 1:20) {
  if (i%%2 == 1) {
    next # skip printing over odd numbers
  } else {
    print(i)
  }
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
Many operations in R are already vectorized, making code more efficient, concise, and easier to read
x <- 1:4; y <- 6:9
x
## [1] 1 2 3 4
y
## [1] 6 7 8 9
x * y
## [1] 6 14 24 36
x / y
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90)
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]
## [1] 40
# excluding second value
ages[-2]
## [1] 30 50 60 70 80 90
# extracting first and third values
ages[c(1, 3)]
## [1] 30 50
apply family of functions
Writing for and while loops in R is often inefficient; where possible, vectorize the computation instead.
apply() - apply a function over the margins of an array
lapply() - loop over a list and evaluate a function on each element
sapply() - same as lapply() but tries to simplify the result; if the result is a list where every element has length 1, a vector is returned
mapply() - multivariate version of sapply()
tapply() - apply a function over subsets of a vector
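To make the comparison concrete, the sketch below computes the same squares three ways: with an explicit loop, with vectorized arithmetic, and with sapply(). The variable names are illustrative only.

```r
x <- 1:5

# Explicit loop: preallocate the result, then fill it element by element
sq_loop <- numeric(length(x))
for (i in seq_along(x)) {
  sq_loop[i] <- x[i]^2
}

sq_vec <- x^2                              # vectorized arithmetic
sq_sapply <- sapply(x, function(v) v^2)    # apply-family equivalent

sq_vec                                     # 1 4 9 16 25
all.equal(sq_loop, sq_vec)                 # TRUE
all.equal(sq_loop, sq_sapply)              # TRUE
```

When a vectorized operator exists, as with `^` here, it is usually both the shortest and the fastest of the three.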
x <- 1:4
lapply(x, runif)
## [[1]]
## [1] 0.7103501
## 
## [[2]]
## [1] 0.3951038 0.4184131
## 
## [[3]]
## [1] 0.3217766 0.1780726 0.8919266
## 
## [[4]]
## [1] 0.60705926 0.05831400 0.09485927 0.83428037
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)
##          a          b          c 
##  2.5000000 -0.1394352  0.9592841
# If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))
##           [,1]      [,2] [,3]
## 25% -0.4512676 0.3318313    0
## 75%  0.7023871 0.7364010    1
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)
##  [1] -1.8750636 -5.7395261 -1.3897511  2.8761932  0.6523308 -1.2598896
##  [7]  3.6839290 -3.9213112  6.7922807 -1.2603816  1.9749202 -1.6921166
## [13] -0.9821136  3.3275777  3.5328862  1.8707953  9.2495361  1.7392953
## [19] -2.0882412  0.9703799
apply(x, 2, mean)
##  [1]  0.45098333 -0.19774551 -0.13381431  0.50200787  0.13881266 -0.27666766
##  [7]  0.28958605  0.07532345 -0.12001035  0.09461095
For sums and means of matrix dimensions, we have some shortcuts
rowSums(x)  # same as apply(x, 1, sum)
rowMeans(x) # same as apply(x, 1, mean)
colSums(x)  # same as apply(x, 2, sum)
colMeans(x) # same as apply(x, 2, mean)
Check ?rowSums for help on these base R functions
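The equivalence between the shortcuts and apply() can be checked directly on a small matrix. In this sketch `m` is a made-up 2 x 3 example.

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3 example matrix, filled column by column

rowSums(m)                   # 9 12
colMeans(m)                  # 1.5 3.5 5.5

all.equal(rowSums(m),  apply(m, 1, sum))    # TRUE
all.equal(colMeans(m), apply(m, 2, mean))   # TRUE
```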
tapply(): apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
X is a vector
INDEX is a factor or a list of factors (or else they are coerced to factors)
FUN is a function to be applied
... contains other arguments to be passed to FUN
simplify, should we simplify the result?
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
tapply(x, f, mean)
##          1          2          3 
## -0.7883378  0.4986981  1.1900419
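Because the example above uses random numbers, its output changes from run to run. A deterministic sketch with made-up values (`weights` and `group` are hypothetical names) makes the grouping easier to follow.

```r
weights <- c(10, 12, 20, 22, 30)
group <- factor(c("a", "a", "b", "b", "c"))

# One mean per level of the grouping factor
res <- tapply(weights, group, mean)
res            # a = 11, b = 21, c = 30
res[["b"]]     # 21

# Any summary function works, e.g. the group sizes
tapply(weights, group, length)   # a = 2, b = 2, c = 1
```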
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN is a function to apply
... contains arguments to apply over
MoreArgs is a list of other arguments to FUN
SIMPLIFY indicates whether the result should be simplified
mapply(rep, 1:4, 4:1)
mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5))
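The first call above is deterministic, so its result can be inspected directly. The sketch below also shows when mapply() simplifies its result and how MoreArgs is used; the anonymous functions and the argument name `p` are made up for illustration.

```r
# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1): lengths differ, so a list is returned
r <- mapply(rep, 1:4, 4:1)
r[[1]]        # 1 1 1 1
lengths(r)    # 4 3 2 1

# Equal-length results are simplified to a vector
mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9

# MoreArgs passes arguments that stay fixed across calls
mapply(function(a, b, p) (a + b)^p, 1:3, 4:6, MoreArgs = list(p = 2))   # 25 49 81
```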