
R preliminaries

Mikhail Dozmorov

Virginia Commonwealth University

09-24-2020

1 / 50

R expressions, function calls, and objects

According to John Chambers, one of the creators of R’s precursor S:

  • Everything that exists in R is an object

  • Everything that happens in R is a call to a function

2 / 50

Assignment operator

  • We often need to save a function's result or output. For this we use the assignment operator: <-, preferred over =
scores <- mtcars

Now we can use scores as an argument to other functions. For example, compute summary statistics for each column in the data:

summary(scores[1:4]) # First four columns
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0

Use Alt + - (Win) or Option + - (Mac) in RStudio to quickly insert <-

3 / 50

Variables

  • Scalars (0-dimensional): a = 42, b = a / 7

  • Vectors (1-dimensional): b = c(12, 14, 16)

    • Access vector element as b[2] (returns 14)
  • Matrices (2-dimensional):

mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx
## [,1] [,2]
## [1,] 3 2
## [2,] 1 3
## [3,] 3 2
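
As a minimal sketch of how these three kinds of variables behave, the example below reuses the slide's names a, b, and mtx; the extra calls are only illustrative.

```r
# In R, a "scalar" is simply a vector of length 1
a <- 42
length(a)        # 1
is.vector(a)     # TRUE

# Vectors are indexed starting from 1
b <- c(12, 14, 16)
b[2]             # 14

# matrix() fills column by column unless byrow = TRUE is given
mtx <- matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
dim(mtx)         # 3 2
mtx[2, 1]        # row 2, column 1: 1
mtx[, 2]         # entire second column: 2 3 2
```

Leaving the row or column position empty, as in mtx[, 2], selects everything along that dimension.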
4 / 50

Variable names

  • Avoid naming variables after existing functions. E.g., c is a bad variable name because c() is a function for combining values. Check its help page with ?c

  • With auto-completion in RStudio, you don't need to worry about variable name length - make names that are self-explanatory

Follow Hadley Wickham's Tidyverse Style Guide

5 / 50

Variable types

# numeric: real or decimal numbers, sometimes referred to as "double"
# integer: a subset of numeric in which numbers are stored as integers (e.g., 2L)
a <- 2
# character: sometimes referred to as string data, tend to be surrounded by quotes
a <- "2"
# logical: Boolean data (TRUE and FALSE)
a <- TRUE
  • complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
  • raw: bytes of data (machine-readable, but not human readable)

Auxiliary functions

class(a)
str(a)
is.numeric(a) # TRUE if the type matches; same with is.character()
as.numeric("2") # Attempt to convert types
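
The type checks and conversions above can be tried on one value of each basic type. This is only a sketch; the variable names x_num, x_int, x_chr, and x_lgl are made up for the example.

```r
x_num <- 2      # numeric, stored as double
x_int <- 2L     # the L suffix creates an integer
x_chr <- "2"    # character
x_lgl <- TRUE   # logical

class(x_num)          # "numeric"
typeof(x_num)         # "double"
class(x_int)          # "integer"
is.numeric(x_int)     # TRUE: integers count as numeric
is.character(x_chr)   # TRUE

as.numeric(x_chr)     # 2
as.numeric(x_lgl)     # 1

# A conversion that cannot succeed yields NA (with a warning)
bad <- suppressWarnings(as.numeric("two"))
is.na(bad)            # TRUE
```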
6 / 50

Factors

  • Factors are how R represents categorical data
  • There are two kinds of factors:
    • factor() - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
    • ordered() - used for ordinal data ("First", "Second", "Second", "Third")
factor(c("Cats", "Cats", "Dogs", "Birds"))
## [1] Cats Cats Dogs Birds
## Levels: Birds Cats Dogs
ordered(c("First", "Second", "Second", "Third"))
## [1] First Second Second Third
## Levels: First < Second < Third
7 / 50

Factors Auxiliary functions

  • levels() - get the levels of a factor. levels is also an argument of factor() that sets the level order manually
  • relevel() - move a chosen reference level to the first position
  • is.factor(), as.factor()
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a
## [1] Cats Cats Dogs Birds
## Levels: Birds Cats Dogs
relevel(a, ref = "Cats")
## [1] Cats Cats Dogs Birds
## Levels: Cats Birds Dogs
levels(a) <- rev(levels(a)) # Caution: this relabels the levels, so the values change
a
## [1] Cats Cats Birds Dogs
## Levels: Dogs Cats Birds
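
Assigning to levels() renames the levels rather than reordering them, which is why the values above changed. A small sketch of the difference follows; a_reordered and a_relabeled are illustrative names.

```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))

# Reorder the levels; the values themselves stay the same
a_reordered <- factor(a, levels = rev(levels(a)))
levels(a_reordered)          # "Dogs" "Cats" "Birds"
as.character(a_reordered)    # "Cats" "Cats" "Dogs" "Birds"

# Assigning to levels() relabels instead, so the values change
a_relabeled <- a
levels(a_relabeled) <- rev(levels(a_relabeled))
as.character(a_relabeled)    # "Cats" "Cats" "Birds" "Dogs"
```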
8 / 50

Data frames

  • Data frames: tables or 2-dimensional arrays. Think matrices that can hold different data types
    • The column names should be non-empty
    • Columns should be the same length
    • The row names should be unique
    • Columns can hold numeric, factor, or character data (among other types)
dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat
## Column.1 Column.2
## 1 3 2
## 2 1 3
## 3 3 2
9 / 50

Data frames Auxiliary functions

dim(dat)
## [1] 3 2
nrow(dat)
## [1] 3
ncol(dat)
## [1] 2
length(dat)
## [1] 2
colnames(dat)
## [1] "Column.1" "Column.2"
rownames(dat)
## [1] "1" "2" "3"
10 / 50

Addressing elements in a data frame

dat[3, 2] # [] contain row/column indices.
## [1] "2"
dat[3, "Column.2"] # Address by column name
## [1] "2"
dat$Column.2[3] # Use $ shortcut to access column by name
## [1] "2"
# Compare column classes
class(dat$Column.1)
## [1] "numeric"
class(dat$Column.2)
## [1] "character"
# Top or bottom of a data frame
head(dat)
tail(dat)
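
Beyond single cells, the same bracket syntax selects whole rows and columns. The sketch below rebuilds the slide's dat so that it runs on its own; the variable names threes and col1_df are illustrative.

```r
dat <- data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))

# A logical condition in the row position keeps the matching rows
threes <- dat[dat$Column.1 == 3, ]
nrow(threes)                       # 2

# A single column selected with [ , ] is simplified to a vector by default
dat[, "Column.1"]                  # 3 1 3

# drop = FALSE keeps the result as a one-column data frame
col1_df <- dat[, "Column.1", drop = FALSE]
is.data.frame(col1_df)             # TRUE

# [[ ]] extracts one column as a vector, like $
identical(dat[["Column.2"]], dat$Column.2)   # TRUE
```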
11 / 50

Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

  • Content:

    • head(): shows the first few rows
    • tail(): shows the last few rows
  • Size:

    • dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(): returns the number of rows
    • ncol(): returns the number of columns
12 / 50

Inspecting data.frame objects

  • colnames() (or just names()): returns the column names

  • str(): structure of the object and information about the class, length and content of each column

  • summary(): works differently depending on what kind of object you pass to it. Passing a data frame to summary() prints summary statistics for each numeric column (min, max, median, mean, etc.)

13 / 50

Lists

  • Lists: objects containing elements of different types
    • Each list element can be of different length
lst = list(A = rep(2, 5), B = 1:10, C = letters)
lst
## $A
## [1] 2 2 2 2 2
##
## $B
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $C
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
14 / 50

Addressing elements in a list

  • Address any element as lst[1] (or lst["A"]); the result is still a list
lst[1]
## $A
## [1] 2 2 2 2 2
  • Address the content of any element as lst[[1]] (or lst[["A"]], lst$A)
lst[[1]]
## [1] 2 2 2 2 2
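
The difference between single and double brackets shows up in the class of the result. A brief sketch, rebuilding lst so that it runs on its own:

```r
lst <- list(A = rep(2, 5), B = 1:10, C = letters)

class(lst[1])       # "list": single brackets return a sub-list
class(lst[[1]])     # "numeric": double brackets return the element itself
identical(lst[["A"]], lst$A)   # TRUE

# Single brackets can select several elements at once
names(lst[c("A", "C")])        # "A" "C"

# Elements can be added or removed by name
lst$D <- TRUE
lst$B <- NULL
names(lst)                     # "A" "C" "D"
```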
15 / 50

Comments

R ignores everything after the # sign

# This line is a comment
print("Hello, World!") # This will print the message, but the comment will be ignored
## [1] "Hello, World!"
16 / 50

Clean up your environment

z <- c(1, 2, 3)
ls()
## [1] "a" "dat" "lst" "mtx" "scores" "z"
rm(z) # Remove one variable
ls()
## [1] "a" "dat" "lst" "mtx" "scores"
# Remove everything from the environment
rm(list = ls()) # Not the same as restarting R session
ls()
## character(0)
17 / 50

Functions

  • A function is a set of statements organized together to perform a specific task
    • Name - the actual name of the function, e.g., summary(), mean()
    • Arguments - values passed to the functions. Argument-less functions exist
    • Code - actual code of the function
    • Return value - the result of the function's code execution
read.csv(file="scores.csv")

read.csv is a function to import a CSV file, and file is an argument that specifies which file to import

R has a large number of built-in functions, and the user can create their own functions
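
To illustrate the four parts listed above, here is a minimal user-defined function. The name standardize() and its arguments are hypothetical, chosen only for this example.

```r
# Name: standardize; Arguments: x and na.rm (with a default value)
standardize <- function(x, na.rm = TRUE) {
  # Code: centre on the mean and scale by the standard deviation.
  # Return value: the last evaluated expression is returned
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

z <- standardize(c(2, 4, 6))
z   # -1 0 1
```

An explicit return() call is optional in R; it is mostly used for early exits.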

18 / 50

Running functions

  • From the R console - type the function and hit Enter

    • One function at a time, not efficient
  • Using an R script - a text file that contains all your R functions/code

    • R scripts allow you to save, edit, reproduce and share your code
    • R scripts are stored in files with the .R extension
    • Run the whole script as source("script_name.R"), or, from command line, Rscript script_name.R
    • In RStudio, you can run individual lines, code chunks, or source whole scripts. Keyboard shortcuts are available
19 / 50

Packages

  • All functions belong to packages. The read.csv function is in the utils package.

  • R comes with about 30 packages (called "base R"), but as of August 2020, there are over 16,000 CRAN packages and over 1,900 Bioconductor packages

  • Example: ggplot2 is a popular package that adds functions for creating graphs in a different way than what base R provides

  • To use functions in a package, the package must be installed and loaded. (They're free)

  • You only install a package once
  • You load a package whenever you want to use its functions
20 / 50

Package repositories

  • CRAN - Comprehensive R Archive Network – a collection of > 16,000 (September 2020) packages

  • Bioconductor – genomics-oriented free and open source project hosting > 1,900 specialized R packages (September 2020)

  • MRAN - Microsoft R Application Network, included CRAN packages and more (retired by Microsoft in 2023)

  • GitHub – code-hosting repository, packages for everyone and by everyone

21 / 50

Installing packages

  • install.packages - installs packages from CRAN, e.g., install.packages("BiocManager")

  • remotes package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., remotes::install_github("tidyverse/ggplot2")

  • BiocManager::install() - Install or update Bioconductor, CRAN, or GitHub packages

  • RStudio point-and-click interface

22 / 50

Loading packages

  • library() will load the package, e.g., library(readxl) or library("readxl")

    • But when installing packages, always use quotes, e.g., install.packages("readxl")
  • require() will load the package and, if successful, return TRUE. Useful in an if statement, e.g.,

if (!require(ggplot2)) {
install.packages("ggplot2")
}
23 / 50

Installing packages

  • install.packages("<package_name>") – install from CRAN

  • install.packages("<package_name.tar.gz>", repos = NULL) – install from a tarball archive

  • R CMD INSTALL <package_name.tar.gz> - install from a command line

  • devtools::install_github('mdozmorov/MDmisc') – install from GitHub

  • BiocManager::install() - install Bioconductor, CRAN, and GitHub packages

https://CRAN.R-project.org/package=BiocManager

24 / 50

Loading packages

  • library(package_name) – load a package to use its functions

  • library() vs. require()

    • require() tries to load the package, returns TRUE or FALSE
    • library() just loads the package, fails if the package is not available
  • Use only library(package_name)

https://yihui.name/en/2014/07/library-vs-require/

25 / 50

Using functions from other packages

  • You can access functions without loading the package using the :: operator, e.g., Hmisc::rcorr()

  • Entering the function name without parentheses will output its code

> data.frame
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
{
data.row.names <- if (check.rows && is.null(row.names))
...
  • You can access internal functions of a package with the ::: operator if you know their name
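
A short sketch of the :: operator, using only packages that ship with R so that nothing needs to be installed:

```r
# Call a function by its fully qualified name; no library() call is needed
stats::median(c(1, 5, 9))     # 5

# :: also makes explicit which package a function comes from,
# which helps when two loaded packages export the same name
utils::head(letters, 3)       # "a" "b" "c"

# Inspect a function without calling it
names(formals(stats::sd))     # "x" "na.rm"
is.function(stats::sd)        # TRUE
```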
26 / 50

Getting help

  • Get an overview of all functions in a package: help(package = "dplyr")

    • Bioconductor packages have vignettes, short tutorials on package-specific tasks. Browse them, e.g., browseVignettes(package = "limma")
  • Use ?function_name to get help on a function from a loaded package. E.g., ?boxplot (same as help(boxplot))

    • Use example(boxplot) to see how the function can be used
  • Use ??function_name to search for the function across all installed packages, even those not loaded. E.g., ??ggplotly

  • A search engine is your best friend for many things

27 / 50

Useful ways of getting data in and out of R

  • Base functions: read.table, read.csv, write.table, write.csv

  • Tidyverse way, readr package: read_table, read_csv, read_tsv, write_csv ...

  • For fixed-width files, use the read.fwf or readr::read_fwf functions

  • For reading/writing Excel files, use readxl and writexl packages, read_xlsx, write_xlsx functions

    • Remember that .csv is the preferred text-based format that opens in Excel
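
A minimal round trip with the base functions might look like the sketch below. The data frame df and the temporary file are made up for illustration.

```r
# A small example data frame
df <- data.frame(id = 1:3, group = c("a", "b", "a"),
                 stringsAsFactors = FALSE)

# Write to a temporary file; row.names = FALSE avoids an extra index column
csv_file <- tempfile(fileext = ".csv")
write.csv(df, file = csv_file, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
df2 <- read.csv(csv_file, stringsAsFactors = FALSE)
str(df2)

unlink(csv_file)  # remove the temporary file
```

Without row.names = FALSE, write.csv() adds the row names as an unnamed first column, which then shows up as an extra column when the file is read back.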
28 / 50

The stringsAsFactors curse

  • When creating data frames with data.frame() or reading data with read.table(), strings are automatically converted to factors in R versions before 4.0.0 (since 4.0.0 the default is stringsAsFactors = FALSE)

  • This behind-the-scenes factor conversion can lead to unpredictable behaviors

  • Use as.is = TRUE in read.table() to avoid such conversion

  • Better yet, set options(stringsAsFactors = FALSE) at the beginning of your script files

https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/
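
Setting stringsAsFactors explicitly makes the behavior independent of the R version. The sketch below also shows one well-known surprise with numbers stored as factors; all variable names are illustrative.

```r
chr_df <- data.frame(pet = c("Cats", "Dogs"), stringsAsFactors = FALSE)
fct_df <- data.frame(pet = c("Cats", "Dogs"), stringsAsFactors = TRUE)
class(chr_df$pet)   # "character"
class(fct_df$pet)   # "factor"

# A classic pitfall: numbers that ended up stored as a factor
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # 1 2 1 (the internal level codes)
as.numeric(as.character(f))    # 10 20 10
```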

29 / 50

Save/load R objects

  • save(), load() - saves/loads R objects to the specified file

    x <- stats::runif(20)
    y <- list(a = 1, b = TRUE, c = "oops")
    save(x, y, file = "xy.rda")
    load(file = "xy.rda")
  • saveRDS(), readRDS() - saves/loads a representation of the object

    x <- stats::runif(20)
    saveRDS(x, file = "x.rds")
    x2 <- readRDS(file = "x.rds")
    identical(x, x2, ignore.environment = TRUE)

https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/

30 / 50

R datasets

R contains many datasets (stored as data frames) that are built into the software

data() # All built-in datasets
# ?trees
data(trees) # Load a particular one
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
31 / 50

Accessing data in datasets

attach(trees) # You can make R find variables in any data frame by adding the data frame to the search path
search() # .GlobalEnv is your workspace; the package:* entries are attached packages
## [1] ".GlobalEnv" "trees" "package:xaringanthemer"
## [4] "package:stats" "package:graphics" "package:grDevices"
## [7] "package:utils" "package:datasets" "package:methods"
## [10] "Autoloads" "package:base"
detach(trees) # To remove an object from the search path, use detach()
with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data
## [1] 76

attach() can cause name clashes (masking) and other serious issues. Avoid it

32 / 50

Summary statistics

  • Simple statistical functions: length(), min(), max(), mean(), median(), sd(), cor(), summary()

  • These, and many other functions, have arguments to handle NAs, e.g., mean(x, trim = 0, na.rm = FALSE, ...); set na.rm = TRUE to ignore NAs

  • complete.cases() on a matrix/data frame returns row-wise logical with TRUE for rows without NAs

  • unique() - unique elements in a vector. Combine with length() to get the number of unique elements

  • table() - contingency table for a vector (the number of elements per unique level)
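
The NA-related behavior of these functions can be seen on a tiny example. This is a sketch with made-up data; x and df are illustrative names.

```r
x <- c(4, 8, NA, 8)
mean(x)                     # NA: missing values propagate by default
mean(x, na.rm = TRUE)       # 6.666667
length(unique(x))           # 3: NA counts as a distinct value
table(x)                    # counts for 4 and 8; NA is excluded by default
table(x, useNA = "ifany")   # adds a count for NA

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)          # TRUE FALSE FALSE
df[complete.cases(df), ]    # keep only rows without NAs
```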

33 / 50

Summary statistics

data(mtcars) # simple summary
# ?mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mean(mtcars$mpg) # Try median, sd, var, min, max
## [1] 20.09062
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
34 / 50

Summary statistics

quantile(mtcars$mpg, probs = c(.20, .80))
## 20% 80%
## 15.20 24.08
cor(mtcars$mpg, mtcars$hp) # sample (Pearson) correlation coefficient
## [1] -0.7761684
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32
##
## 4 6 8
## 0.34375 0.21875 0.43750
35 / 50

Control structures inside R/functions

  • if, else
  • for
  • while
  • repeat
  • break
  • next
36 / 50

If-else statement

Conditional code execution

if (condition) {
# do something
} else {
# do something else
}
  • ==: Equal to
  • !=: Not equal to
  • >, >=: Greater than, greater than or equal to
  • <, <=: Less than, less than or equal to
x <- 1:15
value <- sample(x, 1) # draw one random element of x
if (value <= 10) {
  print("value is at most 10")
} else {
  print("value is greater than 10")
}
## [1] "value is at most 10"
37 / 50

For loop

Repetitive code execution

for (i in 1:5) {
cat(i)
}
## 12345

Compare with

for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
38 / 50

More uses of For loops

x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
cat(i); cat(" ")
}
## apples oranges bananas strawberries
for (i in 1:4) {
cat(x[i]); cat(" ")
}
## apples oranges bananas strawberries
for (i in seq_along(x)) {
cat(x[i]); cat(" ")
}
## apples oranges bananas strawberries
39 / 50

Nested For loops

m <- matrix(1:10, 2)
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
for (i in seq(nrow(m))) {
for (j in seq(ncol(m))) {
print(m[i, j])
}
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
40 / 50

while, repeat loops

i <- 1
while (i < 10) {
print(i)
i <- i + 1
} # Be sure there is a way to exit out of a while loop
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
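# Hypothetical target and tolerance, chosen only for illustration
expectation <- 0.5
threshold <- 0.05
repeat {
  # simulate a value; exit the loop once it is close enough to the expectation
  value <- runif(1)
  if (abs(value - expectation) <= threshold) {
    break
  }
}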
41 / 50

Combine any statements/functions

for (i in 1:20) {
if (i%%2 == 1) {
next # skip printing over odd numbers
} else {
print(i)
}
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
42 / 50

Vectorized operation

Many operations in R are already vectorized, making code more efficient, concise, and easier to read

x <- 1:4; y <- 6:9
x
## [1] 1 2 3 4
y
## [1] 6 7 8 9
x * y
## [1] 6 14 24 36
x / y
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
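
For comparison, here is the same element-wise product written as an explicit loop and as a vectorized expression. The names prod_loop and prod_vec are illustrative.

```r
x <- 1:4; y <- 6:9

# Loop version: pre-allocate the result, then fill it element by element
prod_loop <- numeric(length(x))
for (i in seq_along(x)) {
  prod_loop[i] <- x[i] * y[i]
}

# Vectorized version: one expression, no explicit loop
prod_vec <- x * y
all.equal(prod_loop, prod_vec)   # TRUE

# A shorter operand is recycled, and comparisons are vectorized too
x * 2          # 2 4 6 8
x > 2          # FALSE FALSE TRUE TRUE
sum(x > 2)     # 2
```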
43 / 50

Manipulating vectors

ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90)
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]
## [1] 40
# excluding second value
ages[-2]
## [1] 30 50 60 70 80 90
# extracting first and third values
ages[c(1, 3)]
## [1] 30 50
44 / 50

apply family of functions

Explicit for and while loops in R are often slower and more verbose than the alternatives, so we prefer to vectorize computation where possible.

  • apply() - apply a function over the margins of an array

  • lapply() - loop over a list and evaluate a function on each element

  • sapply() - same as lapply(), but tries to simplify the result: if every element of the resulting list has length 1, a vector is returned

  • mapply() - multivariate version of sapply()

  • tapply() - apply a function over subsets of a vector

45 / 50

apply examples

x <- 1:4
lapply(x, runif)
## [[1]]
## [1] 0.7103501
##
## [[2]]
## [1] 0.3951038 0.4184131
##
## [[3]]
## [1] 0.3217766 0.1780726 0.8919266
##
## [[4]]
## [1] 0.60705926 0.05831400 0.09485927 0.83428037
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)
## a b c
## 2.5000000 -0.1394352 0.9592841
46 / 50

apply examples

# If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))
## [,1] [,2] [,3]
## 25% -0.4512676 0.3318313 0
## 75% 0.7023871 0.7364010 1
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)
## [1] -1.8750636 -5.7395261 -1.3897511 2.8761932 0.6523308 -1.2598896
## [7] 3.6839290 -3.9213112 6.7922807 -1.2603816 1.9749202 -1.6921166
## [13] -0.9821136 3.3275777 3.5328862 1.8707953 9.2495361 1.7392953
## [19] -2.0882412 0.9703799
apply(x, 2, mean)
## [1] 0.45098333 -0.19774551 -0.13381431 0.50200787 0.13881266 -0.27666766
## [7] 0.28958605 0.07532345 -0.12001035 0.09461095
47 / 50

apply examples

For sums and means of matrix dimensions, we have some shortcuts

rowSums(x)  # same as apply(x, 1, sum)
rowMeans(x) # same as apply(x, 1, mean)
colSums(x)  # same as apply(x, 2, sum)
colMeans(x) # same as apply(x, 2, mean)

Check ?rowSums help on these base R functions
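
The equivalence between the shortcut functions and apply() can be checked directly. This is a sketch; the seed is arbitrary and only makes the example reproducible.

```r
set.seed(1)
x <- matrix(rnorm(200), 20, 10)

all.equal(rowSums(x),  apply(x, 1, sum))    # TRUE
all.equal(rowMeans(x), apply(x, 1, mean))   # TRUE
all.equal(colSums(x),  apply(x, 2, sum))    # TRUE
all.equal(colMeans(x), apply(x, 2, mean))   # TRUE

length(rowSums(x))   # 20: one value per row
length(colSums(x))   # 10: one value per column
```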

48 / 50

tapply

Apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.

function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
X is a vector
INDEX is a factor or a list of factors (or else they are coerced to factors)
FUN is a function to be applied
... contains other arguments to be passed FUN
simplify, should we simplify the result?
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
tapply(x, f, mean)
## 1 2 3
## -0.7883378 0.4986981 1.1900419
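
For a deterministic illustration, tapply() can be run on the built-in mtcars data, grouping fuel economy by the number of cylinders. The variable names below are illustrative.

```r
# Mean mpg within each cylinder group
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

# Any function works, e.g., group sizes (compare with table(mtcars$cyl))
n_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, length)
n_by_cyl

# A list of factors gives one cell per combination of levels
mpg_by_cyl_am <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)
dim(mpg_by_cyl_am)   # 3 2
```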
49 / 50

mapply

mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN is a function to apply
... contains arguments to apply over
MoreArgs is a list of other arguments to FUN.
SIMPLIFY indicates whether the result should be simplified
mapply(rep, 1:4, 4:1)
mapply(rnorm,mean=1:3,sd=1:3,n=seq(5,15,by=5))
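
The first call above returns a list because the results have different lengths. The sketch below inspects it and adds two small cases with anonymous functions, which are made up for illustration.

```r
# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1): lengths differ, so a list is returned
res <- mapply(rep, 1:4, 4:1)
res[[1]]        # 1 1 1 1
lengths(res)    # 4 3 2 1

# Results of equal length 1 are simplified to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums            # 5 7 9

# MoreArgs passes arguments that stay the same in every call
shifted <- mapply(function(x, y, z) x + y + z, 1:3, 4:6,
                  MoreArgs = list(z = 10))
shifted         # 15 17 19
```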
50 / 50
