R preliminaries

Mikhail Dozmorov
Virginia Commonwealth University
09-24-2020

R expressions, function calls, and objects

According to John Chambers, one of the creators of R’s precursor S:

Everything that exists in R is an object
Everything that happens in R is a call to a function

Assignment operator

We often need to save a function's result or output. For this we use the assignment operator: <-, preferred over =

scores <- mtcars

Now we can use scores as an argument to other functions. For example, compute summary statistics for each column in the data:

summary(scores[1:4]) # First four columns

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0

Use Alt + - (Win) or Option + - (Mac) in RStudio to quickly insert <-
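
One practical reason to prefer <- is that = inside a function call names an argument instead of creating a variable. A minimal sketch of the difference; the helper half() and the variable n are made up for illustration:

```r
half <- function(value) value / 2   # hypothetical helper

# `=` inside a call only matches the argument; no variable `value` is created
r1 <- half(value = 10)
value_created <- exists("value", inherits = FALSE)

# `<-` inside a call assigns `n` in the calling environment, then passes it on
r2 <- half(n <- 10)
n_created <- exists("n", inherits = FALSE)

c(r1 = r1, r2 = r2)
c(value_created = value_created, n_created = n_created)
```

At the top level of a script, both operators assign; the difference only shows up inside function calls.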

Variables

Scalars (0-dimensional): a = 42, b = a / 7
Vectors (1-dimensional): b = c(12, 14, 16)
- Access vector element as b[2] (returns 14)
Matrices (2-dimensional):

mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx

##      [,1] [,2]
## [1,]    3    2
## [2,]    1    3
## [3,]    3    2
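
Matrix elements are addressed as [row, column]. A short sketch using the same data as above; the byrow variant and the name mtx_by_row are additions for illustration:

```r
mtx <- matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)  # filled column by column

dim(mtx)     # number of rows, then number of columns
mtx[2, 1]    # row 2, column 1
mtx[, 2]     # the whole second column
mtx[3, ]     # the whole third row

# byrow = TRUE fills the matrix row by row instead
mtx_by_row <- matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2, byrow = TRUE)
mtx_by_row[1, ]
```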

Variable names

Avoid naming variables after existing functions. E.g., c is a bad variable name because c() is the function for combining values into a vector. Check its help page with ?c
With auto-completion in RStudio, you don't need to worry about variable name length - make names that are self-explanatory

Follow Hadley Wickham's Tidyverse Style Guide

Variable types

# numeric: real or decimal numbers, sometimes referred to as "double"
# integer: a subset of numeric in which numbers are stored as integers
a <- 2
# character: sometimes referred to as string data, tend to be surrounded by quotes
a <- "2" 
# logical: Boolean data (TRUE and FALSE)
a <- TRUE

complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
raw: bytes of data (machine-readable, but not human readable)

Auxiliary functions

class(a)
str(a)
is.numeric(a) # TRUE if the type matches (FALSE here, since a is logical); is.character() works the same way
as.numeric("2") # Attempt to convert types
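
The type checks and conversions above can be tried on one value of each type. A minimal sketch; the variable names are made up, and the L suffix for integers is an addition not shown in the slide:

```r
x_dbl <- 2      # numeric, stored as double
x_int <- 2L     # the L suffix creates an integer
x_chr <- "2"    # character
x_lgl <- TRUE   # logical

class(x_dbl); typeof(x_dbl)   # "numeric", "double"
class(x_int); typeof(x_int)   # "integer", "integer"
is.numeric(x_int)             # TRUE: integers count as numeric
is.numeric(x_chr)             # FALSE

as.numeric(x_chr)             # 2
as.numeric(x_lgl)             # 1 (TRUE becomes 1, FALSE becomes 0)

# A failed conversion returns NA with a warning (suppressed here)
bad <- suppressWarnings(as.numeric("two"))
is.na(bad)                    # TRUE
```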

Factors

Factors are how R represents categorical data
There are two kinds of factors:
- factor() - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
- ordered() - used for ordinal data ("First", "Second", "Second", "Third")

factor(c("Cats", "Cats", "Dogs", "Birds"))

## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs

ordered(c("First", "Second", "Second", "Third"))

## [1] First  Second Second Third 
## Levels: First < Second < Third

Factors Auxiliary functions

levels() - get the levels of a factor. levels is also an argument of factor() that sets the level order manually
relevel() - move a chosen reference level to the first position
is.factor(), as.factor()

a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a

## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs

relevel(a, ref = "Cats")

## [1] Cats  Cats  Dogs  Birds
## Levels: Cats Birds Dogs

levels(a) <- rev(levels(a)) # Caution: this relabels the levels, so the values change (Dogs becomes Birds)
a

## [1] Cats  Cats  Birds Dogs 
## Levels: Dogs Cats Birds
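
Note that assigning to levels() changed the data: the third element went from Dogs to Birds. A sketch of how to reorder levels while keeping the values, next to the relabelling behavior; the names a_reordered and a_relabelled are made up for illustration:

```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))

# Reorder the levels without touching the data: pass the desired order to factor()
a_reordered <- factor(a, levels = c("Dogs", "Cats", "Birds"))
levels(a_reordered)          # "Dogs" "Cats" "Birds"
as.character(a_reordered)    # "Cats" "Cats" "Dogs" "Birds" (values unchanged)

# Assigning to levels() relabels instead: the underlying integer codes stay put
a_relabelled <- a
levels(a_relabelled) <- rev(levels(a_relabelled))
as.character(a_relabelled)   # "Cats" "Cats" "Birds" "Dogs" (values changed)
```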

Data frames

Data frames: tables or 2-dimensional arrays. Think of them as matrices whose columns can hold different data types
- The column names should be non-empty
- Columns should be the same length
- The row names should be unique
- Each column can be numeric, factor, character, or another type (e.g., logical)

dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat

##   Column.1 Column.2
## 1        3        2
## 2        1        3
## 3        3        2

Data frames Auxiliary functions

dim(dat)

## [1] 3 2

nrow(dat)

## [1] 3

ncol(dat)

## [1] 2

length(dat)

## [1] 2

colnames(dat)

## [1] "Column.1" "Column.2"

rownames(dat)

## [1] "1" "2" "3"

Addressing elements in a data frame

dat[3, 2]          # [] contain row/column indices.

## [1] "2"

dat[3, "Column.2"] # Address by column name

## [1] "2"

dat$Column.2[3]    # Use $ shortcut to access column by name

## [1] "2"

# Compare column classes
class(dat$Column.1)

## [1] "numeric"

class(dat$Column.2)

## [1] "character"

# Top or bottom of a data frame
head(dat)
tail(dat)
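
Beyond single cells, the same bracket syntax selects rows by condition and whole columns. A minimal sketch on the same data; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version, and Column.3 is a made-up column:

```r
dat <- data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"),
                  stringsAsFactors = FALSE)

dat[dat$Column.1 == 3, ]          # rows where Column.1 equals 3
dat[, "Column.1"]                 # one column, returned as a vector
dat[, "Column.1", drop = FALSE]   # the same column, kept as a data frame

# Add a column computed from an existing one
dat$Column.3 <- as.numeric(dat$Column.2) * 10
dat$Column.3
```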

Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

Content:
- head(): shows the first few rows
- tail(): shows the last few rows
Size:
- dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(): returns the number of rows
- ncol(): returns the number of columns

Inspecting data.frame objects

colnames() (or just names()): returns the column names
str(): structure of the object and information about the class, length and content of each column
summary(): works differently depending on the kind of object you pass to it. Passing a data frame to summary() prints summary statistics for each numeric column (min, max, median, mean, etc.)

Lists

Lists: objects containing elements of different types
- Each list element can be of different length

lst = list(A = rep(2, 5), B = 1:10, C = letters)
lst

## $A
## [1] 2 2 2 2 2
## 
## $B
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $C
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

Addressing elements in a list

Address any element as lst[1] (or lst["A"]). The result is still a list, of length one

lst[1]

## $A
## [1] 2 2 2 2 2

Address the content of any element as lst[[1]] (or lst[["A"]], lst$A)

lst[[1]]

## [1] 2 2 2 2 2
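
The difference between single and double brackets is easiest to see through class(). A short sketch on a list like the one above; the element D is made up for illustration:

```r
lst <- list(A = rep(2, 5), B = 1:10, C = letters)

class(lst["A"])            # "list": single brackets keep the list wrapper
class(lst[["A"]])          # "numeric": double brackets extract the content
sum(lst[["A"]])            # arithmetic works on the extracted vector
length(lst[c("A", "B")])   # single brackets can select several elements

# Add and remove elements by name
lst$D <- TRUE
lst$C <- NULL
names(lst)                 # "A" "B" "D"
```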

Comments

R ignores everything after the # sign

# This line is a comment
print("Hello, World!") # This will print the message, but the comment will be ignored

## [1] "Hello, World!"

Clean up your environment

z <- c(1, 2, 3)
ls()

## [1] "a"      "dat"    "lst"    "mtx"    "scores" "z"

rm(z) # Remove one variable
ls()

## [1] "a"      "dat"    "lst"    "mtx"    "scores"

# Remove everything from the environment
rm(list = ls()) # Not the same as restarting R session
ls()

## character(0)

Functions

A function is a set of statements organized together to perform a specific task. It has four parts:
- Name - the actual name of the function, e.g., summary(), mean()
- Arguments - values passed to the functions. Argument-less functions exist
- Code - actual code of the function
- Return value - the result of the function's code execution

read.csv(file="scores.csv")

read.csv is a function to import a CSV file, and file is an argument that specifies which file to import

R has a large number of built-in functions, and the user can create their own functions
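
The four parts of a function map directly onto a user-defined one. A minimal sketch; rescale01 is a hypothetical name chosen for illustration, not a built-in function:

```r
# Name: rescale01; arguments: x and na.rm (with a default value)
rescale01 <- function(x, na.rm = TRUE) {
  # Code: shift and scale x so that it spans the interval [0, 1]
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])   # Return value: the last evaluated expression
}

rescale01(c(10, 15, 20))   # 0.0 0.5 1.0
rescale01(mtcars$mpg)[1:3]
```

An explicit return() is optional in R; a function returns the value of its last evaluated expression.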

Running functions

From the R console - type the function and hit Enter
- One function at a time, not efficient
Using an R script - a text file that contains all your R functions/code
- R scripts allow you to save, edit, reproduce and share your code
- R scripts are stored in files with the .R extension
- Run the whole script as source("script_name.R"), or, from command line, Rscript script_name.R
- In RStudio, you can run individual lines, code chunks, or source whole scripts. Keyboard shortcuts are available

Packages

All functions belong to packages. The read.csv function is in the utils package.
R comes with about 30 packages (called "base R"), but as of August 2020, there are over 16,000 CRAN packages and over 1,900 Bioconductor packages
Example: ggplot2 is a popular package that adds functions for creating graphs in a different way than what base R provides
To use functions in a package, the package must be installed and loaded. (They're free)
You only install a package once
You load a package whenever you want to use its functions

Package repositories

CRAN - Comprehensive R Archive Network – a collection of > 16,000 (September 2020) packages
Bioconductor – genomics-oriented free and open source project hosting > 1,900 specialized R packages (September 2020)
MRAN - Microsoft R Application Network, includes CRAN packages and more
GitHub – code-hosting repository, packages for everyone and by everyone

https://cran.r-project.org/web/packages/

https://www.bioconductor.org/

https://mran.microsoft.com/

https://github.com/

Installing packages

install.packages - installs packages from CRAN, e.g., install.packages("BiocManager")
remotes package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., remotes::install_github("tidyverse/ggplot2")
BiocManager::install() - Install or update Bioconductor, CRAN, or GitHub packages
RStudio point-and-click interface

Loading packages

library() will load the package, e.g., library(readxl) or library("readxl")
- But, when installing packages, always use quotes, e.g., install.packages("readxl")
require() will load the package and, if successful, return TRUE. Useful in an if statement, e.g.

if (!require(ggplot2)) {
  install.packages("ggplot2")
}
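
The pattern above installs a missing package but does not load it afterwards. One possible variant, sketched here, checks availability with requireNamespace() and then loads with library(). The package name "stats" is only a stand-in so the sketch runs without installing anything:

```r
pkg <- "stats"   # stand-in package name; replace with the package you need

# requireNamespace() checks that the package is available without attaching it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE is needed when the package name is stored in a variable
library(pkg, character.only = TRUE)
```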

Installing packages

install.packages("<package_name>") – install from CRAN
install.packages("<package_name.tar.gz>", repos = NULL) – install from a tarball archive
R CMD INSTALL <package_name.tar.gz> - install from a command line
devtools::install_github('mdozmorov/MDmisc') – install from GitHub
BiocManager::install() - install Bioconductor, CRAN, and GitHub packages

https://CRAN.R-project.org/package=BiocManager

Loading packages

library(package_name) – load a package to use its functions
library() vs. require()
- require() tries to load the package, returns TRUE or FALSE
- library() just loads the package, fails if the package is not available
Prefer library(package_name) in scripts: a missing package then fails immediately instead of causing a confusing error later

https://yihui.name/en/2014/07/library-vs-require/

Using functions from other packages

You can access functions without loading the package using the :: operator, e.g., Hmisc::rcorr()
Entering the function name without parentheses will output its code

> data.frame
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
    fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) 
{
    data.row.names <- if (check.rows && is.null(row.names)) 
...

You can access internal functions of a package with the ::: operator if you know their name
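
A short sketch of the :: operator using the stats package, which ships with R. The mention of dplyr is only an example of a name clash; dplyr is not needed to run this:

```r
# Call a function through its namespace, without library()
stats::sd(c(2, 4, 4, 4, 5, 5, 7, 9))

# :: also disambiguates when two packages export the same name,
# e.g., stats::filter() versus dplyr::filter()
smoothed <- stats::filter(1:5, rep(1 / 3, 3))   # centered moving average of width 3
as.numeric(smoothed)   # NA at both ends, where the window does not fit
```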

Getting help

Get an overview of all functions in a package: help(package = "dplyr")
- Bioconductor packages have vignettes, short tutorials on package-specific tasks. Browse them, e.g., browseVignettes(package = "limma")
Use ?function_name to get help on a function from a loaded package. E.g., ?boxplot (same as help(boxplot))
- Use example(boxplot) to see how the function can be used
Use ??function_name to search for the function across all installed packages, including those not loaded. E.g., ??ggplotly
A search engine is your best friend for many questions

Useful ways of getting data in and out of R

Base functions: read.table, read.csv, write.table, write.csv
Tidyverse way, readr package: read_table, read_csv, read_tsv, write_csv ...
For fixed-width files, use read.fwf or readr::read_fwf functions
For reading/writing Excel files, use readxl and writexl packages, read_xlsx, write_xlsx functions
- Remember that .csv is the preferred text-based format that opens in Excel

https://readr.tidyverse.org/

https://readxl.tidyverse.org/

https://CRAN.R-project.org/package=writexl
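
A minimal round trip with the base functions, writing a small data frame to CSV and reading it back. The data and the temporary file are made up for illustration:

```r
dat <- data.frame(id = 1:3, group = c("a", "b", "a"), stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")        # temporary path, so nothing is overwritten
write.csv(dat, csv_file, row.names = FALSE)   # row.names = FALSE avoids an extra index column

dat_back <- read.csv(csv_file, stringsAsFactors = FALSE)
str(dat_back)
all.equal(dat, dat_back)   # TRUE when the round trip preserved the data

unlink(csv_file)           # remove the temporary file
```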

The stringsAsFactors curse

In R versions before 4.0.0, strings were automatically converted to factors when creating data frames with data.frame() or reading data with read.table(). Since R 4.0.0, the default is stringsAsFactors = FALSE
This behind-the-scenes factor conversion can lead to unpredictable behavior
Use as.is = TRUE in read.table() to avoid such conversion
In older R versions, you can also set options(stringsAsFactors = FALSE) at the beginning of your script files

https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/
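
Setting stringsAsFactors explicitly shows both behaviors side by side, whatever the default of the installed R version. A sketch with made-up data, including one typical surprise when a factor holds digits:

```r
animals <- c("Cats", "Dogs", "Cats")

df_chr <- data.frame(animal = animals, stringsAsFactors = FALSE)
df_fct <- data.frame(animal = animals, stringsAsFactors = TRUE)

class(df_chr$animal)   # "character"
class(df_fct$animal)   # "factor"

# Converting a factor of digits to numbers returns the level codes, not the values
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1 (the codes)
as.numeric(as.character(f))   # 10 20 10 (the values)
```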

Save/load R objects

save(), load() - saves/loads R objects to the specified file

x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.rda")
load(file = "xy.rda")

saveRDS(), readRDS() - saves/loads a representation of the object

x <- stats::runif(20)
saveRDS(x, file = "x.rds")
x2 <- readRDS(file = "x.rds")
identical(x, x2, ignore.environment = TRUE)

https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/

R datasets

R contains many datasets (stored as data frames) that are built-in to the software

data() # All built-in datasets
# ?trees
data(trees) # Load a particular one
head(trees)

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7

Accessing data in datasets

attach(trees)   # You can make R find variables in any data frame by adding the data frame to the search path
search()        # .GlobalEnv is your workspace; entries starting with "package:" are attached packages

##  [1] ".GlobalEnv"             "trees"                  "package:xaringanthemer"
##  [4] "package:stats"          "package:graphics"       "package:grDevices"     
##  [7] "package:utils"          "package:datasets"       "package:methods"       
## [10] "Autoloads"              "package:base"

detach(trees)   # Use detach() to remove an object from the search path
with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data

## [1] 76

attach() can cause name clashes (attached variables masking, or being masked by, other objects) and other serious issues. Avoid it

Summary statistics

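Simple statistical functions: min(), max(), mean(), median(), sd(), cor(), summary(). Note that count() is not part of base R (dplyr provides one); use length() or table() for counting
Many of these functions have an na.rm argument that controls how NAs are handled, e.g., mean(x, trim = 0, na.rm = FALSE, ...). With the default na.rm = FALSE, any NA in x makes the result NA
complete.cases() on a matrix/data frame returns a logical vector with one value per row: TRUE for rows without NAs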
unique() - unique elements in a vector. Combine with length() to get the number of unique elements
table() - contingency table for a vector (the number of elements per unique level)
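
A small sketch of how these functions treat missing values; the vector x and the data frame df are made up for illustration:

```r
x <- c(4, 8, NA, 8, 15)

mean(x)                     # NA: missing values propagate by default
mean(x, na.rm = TRUE)       # 8.75
sum(is.na(x))               # number of missing values
length(unique(x))           # 4: NA counts as a distinct value
table(x, useNA = "ifany")   # counts per value, including NA

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete.cases(df)          # TRUE FALSE FALSE
df[complete.cases(df), ]    # keep only the rows without NAs
```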

Summary statistics

data(mtcars)    # Load the built-in mtcars dataset
# ?mtcars
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mean(mtcars$mpg) # Try median, sd, var, min, max

## [1] 20.09062

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Summary statistics

quantile(mtcars$mpg, probs = c(.20, .80))

##   20%   80% 
## 15.20 24.08

cor(mtcars$mpg, mtcars$hp) # sample correlation coefficient

## [1] -0.7761684

table(mtcars$cyl)

## 
##  4  6  8 
## 11  7 14

table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32

## 
##       4       6       8 
## 0.34375 0.21875 0.43750

Control structures inside R/functions

if, else
for
while
repeat
break
next

If-else statement

Conditional code execution

if (condition) {
  # do something
} else {
  # do something else
}

==: Equal to
!=: Not equal to
>, >=: Greater than, greater than or equal to
<, <=: Less than, less than or equal to

x <- 1:15
value <- sample(x, 1) # Draw one random element of x
if (value <= 10) {
  print("value is less than or equal to 10")
} else {
  print("value is greater than 10")
}

## [1] "value is less than or equal to 10"

For loop

Repetitive code execution

for (i in 1:5) {
  cat(i)
}

## 12345

Compare with

for (i in 1:5) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

More uses of For loops

x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
  cat(i); cat(" ")
}

## apples oranges bananas strawberries

for (i in 1:4) {
  cat(x[i]); cat(" ")
}

## apples oranges bananas strawberries

for (i in seq_along(x)) {
  cat(x[i]); cat(" ")
}

## apples oranges bananas strawberries

Nested For loops

m <- matrix(1:10, 2)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}

## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10

while, repeat loops

i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
} # Be sure there is a way to exit out of a while loop

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9

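set.seed(1)          # Assumed seed, for reproducibility
expectation <- 0.5   # Assumed target value
threshold <- 0.05    # Assumed tolerance
repeat {
  # Simulation: generate a value and check whether it is close enough to the expectation
  value <- runif(1)
  if (abs(value - expectation) <= threshold) {
    break            # Within range: exit the loop
  }
}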

Combine any statements/functions

for (i in 1:20) {           
  if (i%%2 == 1) {
    next                # skip printing over odd numbers
  } else {
    print(i)
  }
}

## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20

Vectorized operation

Many operations in R are already vectorized, making code more efficient, concise, and easier to read

x <- 1:4; y <- 6:9
x

## [1] 1 2 3 4

y

## [1] 6 7 8 9

x * y

## [1]  6 14 24 36

x / y

## [1] 0.1666667 0.2857143 0.3750000 0.4444444
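
For comparison, the element-wise product written as an explicit loop and as a vectorized expression, followed by recycling of a shorter vector. A sketch; prod_loop and prod_vec are made-up names:

```r
x <- 1:4
y <- 6:9

# Loop version: preallocate the result, then fill it element by element
prod_loop <- numeric(length(x))
for (i in seq_along(x)) {
  prod_loop[i] <- x[i] * y[i]
}

# Vectorized version: one expression, same result
prod_vec <- x * y
all.equal(prod_loop, prod_vec)   # TRUE

# Recycling: a shorter vector is reused to match the longer one
x * 2           # 2 4 6 8
x + c(10, 20)   # 11 22 13 24
```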

Manipulating vectors

ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90) 
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]

## [1] 40

# excluding second value
ages[-2]

## [1] 30 50 60 70 80 90

# extracting first and third values
ages[c(1, 3)]

## [1] 30 50

`apply` family of functions

Explicit for and while loops in R are often slower and more verbose than vectorized code, so we try to vectorize computation where possible.

apply() - apply a function over the margins of an array
lapply() - loop over a list and evaluate a function on each element
sapply() - same as lapply(), but tries to simplify the result. If the result is a list where every element has length 1, a vector is returned
mapply() - multivariate version of sapply()
tapply() - apply a function over subsets of a vector
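
The same computation, squaring the numbers 1 to 5, written as a loop and with the apply family. This is a sketch; vapply() is not in the list above and is included as a stricter alternative to sapply() that checks the type of each result:

```r
# Explicit loop with a preallocated result
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

squares_lapply <- lapply(1:5, function(i) i^2)   # returns a list of 5 elements
squares_sapply <- sapply(1:5, function(i) i^2)   # simplified to a numeric vector

# vapply() requires each result to match FUN.VALUE (here, one number)
squares_vapply <- vapply(1:5, function(i) i^2, FUN.VALUE = numeric(1))

squares_sapply   # 1 4 9 16 25
```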

apply examples

x <- 1:4
lapply(x, runif)

## [[1]]
## [1] 0.7103501
## 
## [[2]]
## [1] 0.3951038 0.4184131
## 
## [[3]]
## [1] 0.3217766 0.1780726 0.8919266
## 
## [[4]]
## [1] 0.60705926 0.05831400 0.09485927 0.83428037

x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)

##          a          b          c 
##  2.5000000 -0.1394352  0.9592841

apply examples

# If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))

##           [,1]      [,2] [,3]
## 25% -0.4512676 0.3318313    0
## 75%  0.7023871 0.7364010    1

x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)

##  [1] -1.8750636 -5.7395261 -1.3897511  2.8761932  0.6523308 -1.2598896
##  [7]  3.6839290 -3.9213112  6.7922807 -1.2603816  1.9749202 -1.6921166
## [13] -0.9821136  3.3275777  3.5328862  1.8707953  9.2495361  1.7392953
## [19] -2.0882412  0.9703799

apply(x, 2, mean)

##  [1]  0.45098333 -0.19774551 -0.13381431  0.50200787  0.13881266 -0.27666766
##  [7]  0.28958605  0.07532345 -0.12001035  0.09461095

apply examples

For sums and means of matrix dimensions, we have some shortcuts

rowSums(x)  # same as apply(x, 1, sum)
rowMeans(x) # same as apply(x, 1, mean)
colSums(x)  # same as apply(x, 2, sum)
colMeans(x) # same as apply(x, 2, mean)

See ?rowSums for help on these base R functions

tapply

Apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.

function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
X is a vector
INDEX is a factor or a list of factors (or else they are coerced to factors)
FUN is a function to be applied
... contains other arguments to be passed to FUN
simplify indicates whether the result should be simplified

x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
tapply(x, f, mean)

##          1          2          3 
## -0.7883378  0.4986981  1.1900419
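
A more concrete use of tapply() on the built-in mtcars data: the mean miles per gallon within each cylinder group, checked against an equivalent split() plus sapply() version. A sketch; the variable names are made up:

```r
# Mean mpg within each level of cyl (4, 6, and 8 cylinders)
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

# The same grouping done in two steps: split the vector, then summarize each piece
mpg_by_cyl2 <- sapply(split(mtcars$mpg, mtcars$cyl), mean)
all.equal(as.vector(mpg_by_cyl), as.vector(mpg_by_cyl2))   # TRUE
```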

mapply

mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN is a function to apply
... contains arguments to apply over
MoreArgs is a list of other arguments to FUN.
SIMPLIFY indicates whether the result should be simplified

mapply(rep, 1:4, 4:1)
mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5))
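
To see what mapply() does with its arguments, the first call can be compared with the list it is equivalent to. A sketch; the names res, sums, and reps are made up for illustration:

```r
res <- mapply(rep, 1:4, 4:1)
# Element-wise pairing: rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
res

# When every call returns a value of the same length, the result is simplified to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums   # 5 7 9

# MoreArgs holds arguments that stay fixed across all calls
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = "a"))
reps
```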
