R preliminaries

class: center, middle, inverse, title-slide

# R preliminaries
### Mikhail Dozmorov
### Virginia Commonwealth University
### 09-24-2020

---

## R expressions, function calls, and objects

According to John Chambers, one of the creators of R’s precursor S:

- Everything that exists in R is an **object**

- Everything that happens in R is a **call to a function**
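
A minimal sketch of both statements, using only base R. Even the `+` operator is a function that can be called by name, and a function is itself an object (the name `add` is illustrative):

```r
# Infix operators are ordinary functions: these two lines are equivalent
1 + 2
`+`(1, 2)

# Functions are objects too: they have a class and can be stored in variables
class(sum)    # "function"
add <- `+`    # 'add' is an illustrative name
add(1, 2)     # 3
```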

---
## Assignment operator

- We often need to save a function's result or output. For this we use the assignment operator: ` <- `, preferred over ` = `

```r
scores <- mtcars
```
 
Now we can use `scores` as an argument to other functions. For example, compute summary statistics for each column in the data:

```r
summary(scores[1:4]) # First four columns
```

```
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
```

Use `Alt + -` (Win) or `Option + -` (Mac) in RStudio to quickly insert ` <- `

---
## Variables

- **Scalars** (0-dimensional): `a = 42`, `b = a / 7`

- **Vectors** (1-dimensional): `b = c(12, 14, 16)`
    - Access vector element as `b[2]` (returns 14)

- **Matrices** (2-dimensional):

```r
mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx
```

```
##      [,1] [,2]
## [1,]    3    2
## [2,]    1    3
## [3,]    3    2
```
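
Matrix elements are addressed by row and column, in the same spirit as `b[2]` for vectors. A short sketch using the same `mtx` as above:

```r
mtx <- matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx[2, 1]   # row 2, column 1: 1
mtx[, 2]    # whole second column: 2 3 2
mtx[3, ]    # whole third row: 3 2
dim(mtx)    # number of rows and columns: 3 2
```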

---
## Variable names

- Avoid giving variables the names of existing functions. E.g., `c` is a bad variable name because `c()` is a function for combining values. Check its help page with `?c`

- With auto-completion in RStudio, you don't need to worry about variable name length - make names that are self-explanatory

Follow [Hadley Wickham's Tidyverse Style Guide](http://adv-r.had.co.nz/Style.html)

---
## Variable types

```r
# numeric: real or decimal numbers, sometimes referred to as "double"
a <- 2
# integer: a subset of numeric in which numbers are stored as integers (note the L suffix)
a <- 2L
# character: sometimes referred to as string data, always surrounded by quotes
a <- "2" 
# logical: Boolean data (TRUE and FALSE)
a <- TRUE 
```

- complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
- raw: bytes of data (machine-readable, but not human readable)

Auxiliary functions

``` r
class(a)
str(a)
is.numeric(a) # TRUE if the type matches, same with is.character()
as.numeric("2") # Attempt to convert types
```
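
A small sketch of how type checks and conversions behave, assuming nothing beyond base R (the names `b` and `bad` are illustrative). Conversion is only attempted: a string that is not a number becomes `NA`.

```r
a <- "2"
class(a)              # "character"
is.character(a)       # TRUE
b <- as.numeric(a)    # conversion succeeds: 2
class(b)              # "numeric"

# A string that is not a number becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("two"))
is.na(bad)            # TRUE

# Integers need the L suffix; a plain 2 is stored as a double
class(2L)             # "integer"
class(2)              # "numeric"
```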

---
## Factors

- Factors are how R represents categorical data
- There are two kinds of factors:
    - `factor()` - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
    - `ordered()` - used for ordinal data ("First", "Second", "Second", "Third")

```r
factor(c("Cats", "Cats", "Dogs", "Birds"))
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
ordered(c("First", "Second", "Second", "Third"))
```

```
## [1] First  Second Second Third 
## Levels: First < Second < Third
```

---
## Factors Auxiliary functions

- `levels()` - get the levels of a factor. `levels` is also an argument of `factor()` that sets the level order manually
- `relevel()` - reorder factor levels by moving the reference level to the first position
- `is.factor()`, `as.factor()`

```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
relevel(a, ref = "Cats") 
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Cats Birds Dogs
```

```r
levels(a) <- rev(levels(a)) # Caution: this relabels the levels, so the values change too
a
```

```
## [1] Cats  Cats  Birds Dogs 
## Levels: Dogs Cats Birds
```
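
Assigning to `levels()` relabels the existing levels rather than reordering them. One way to reorder while keeping the values intact is to rebuild the factor with the `levels` argument. A short sketch (the names `pets`, `reordered`, and `renamed` are illustrative):

```r
pets <- factor(c("Cats", "Cats", "Dogs", "Birds"))
levels(pets)               # "Birds" "Cats" "Dogs"

# Reorder the levels; the values themselves are unchanged
reordered <- factor(pets, levels = c("Dogs", "Cats", "Birds"))
levels(reordered)          # "Dogs" "Cats" "Birds"
as.character(reordered)    # "Cats" "Cats" "Dogs" "Birds"

# Assigning to levels() renames instead, so the values change
renamed <- pets
levels(renamed) <- rev(levels(renamed))
as.character(renamed)      # "Cats" "Cats" "Birds" "Dogs"
```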

---
## Data frames

- **Data frames**: tables or 2-dimensional arrays. Think matrices that can hold different data types
    - The column names should be non-empty
    - Columns should be the same length
    - The row names should be unique
    - Columns can hold numeric, factor, or character data (among other types)

```r
dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat
```

```
##   Column.1 Column.2
## 1        3        2
## 2        1        3
## 3        3        2
```

---
## Data frames Auxiliary functions

```r
dim(dat)
```

```
## [1] 3 2
```

```r
nrow(dat)
```

```
## [1] 3
```

```r
ncol(dat)
```

```
## [1] 2
```

```r
length(dat)
```

```
## [1] 2
```

```r
colnames(dat)
```

```
## [1] "Column.1" "Column.2"
```

```r
rownames(dat)
```

```
## [1] "1" "2" "3"
```

---
## Addressing elements in a data frame

```r
dat[3, 2]          # [] contain row/column indices. 
```

```
## [1] "2"
```

```r
dat[3, "Column.2"] # Address by column name 
```

```
## [1] "2"
```

```r
dat$Column.2[3]    # Use $ shortcut to access column by name
```

```
## [1] "2"
```

```r
# Compare column classes
class(dat$Column.1)
```

```
## [1] "numeric"
```

```r
class(dat$Column.2)
```

```
## [1] "character"
```

``` r
# Top or bottom of a data frame
head(dat)
tail(dat)
```

---
## Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

* Content:
    * `head()`: shows the first few rows
    * `tail()`: shows the last few rows

* Size:
    * `dim()`: returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    * `nrow()`: returns the number of rows
    * `ncol()`: returns the number of columns

---
## Inspecting data.frame objects

* `colnames()` (or just `names()`): returns the column names

* `str()`: structure of the object and information about the class, length and content of each column

* `summary()`: works differently depending on what kind of object you pass to it. Passing a data frame to `summary()` prints summary statistics for each numeric column (min, max, median, mean, etc.)
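
A brief sketch of these inspection functions applied to the small `dat` data frame from the earlier slides:

```r
dat <- data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
str(dat)                # class, dimensions, and type of each column
summary(dat$Column.1)   # min, quartiles, mean, max of a numeric column
names(dat)              # "Column.1" "Column.2"
dim(dat)                # 3 2
```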

---
## Lists

- **Lists**: objects containing elements of different types
    - Each list element can be of different length

```r
lst = list(A = rep(2, 5), B = 1:10, C = letters)
lst
```

```
## $A
## [1] 2 2 2 2 2
## 
## $B
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $C
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
```

---
## Addressing elements in a list

- Address any element as `lst[1]` (or, `lst["A"]`); the result is itself a list

```r
lst[1]
```

```
## $A
## [1] 2 2 2 2 2
```

- Address _the content of any element_ as `lst[[1]]` (or, `lst[["A"]]`, `lst$A`)

```r
lst[[1]]
```

```
## [1] 2 2 2 2 2
```
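
The difference between `[ ]` and `[[ ]]` matters as soon as the result is passed to another function. A minimal sketch with the same `lst`:

```r
lst <- list(A = rep(2, 5), B = 1:10, C = letters)
class(lst[1])                  # "list": still a list, of length 1
class(lst[[1]])                # "numeric": the vector stored in element A
identical(lst[["A"]], lst$A)   # TRUE
lst$B[3]                       # third value of element B: 3
mean(lst[[1]])                 # 2
# mean(lst[1]) would return NA with a warning, because a list is not numeric
```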

---
## Comments

R ignores everything after the `#` sign

```r
# This line is a comment
print("Hello, World!") # This will print the message, but the comment will be ignored
```

```
## [1] "Hello, World!"
```

---
## Clean up your environment

```r
z <- c(1, 2, 3)
ls()
```

```
## [1] "a"      "dat"    "lst"    "mtx"    "scores" "z"
```

```r
rm(z) # Remove one variable
ls()
```

```
## [1] "a"      "dat"    "lst"    "mtx"    "scores"
```

```r
# Remove everything from the environment
rm(list = ls()) # Not the same as restarting R session
ls()
```

```
## character(0)
```

---
## Functions

- A function is a set of statements organized together to perform a specific task
    - **Name** - the actual name of the function, e.g., `summary()`, `mean()`
    - **Arguments** - values passed to the function. Argument-less functions exist
    - **Code** - actual code of the function
    - **Return value** - the result of the function's code execution

``` r
read.csv(file="scores.csv")
```

`read.csv` is a function to import a CSV file, and `file` is an argument that specifies which file to import

R has a large number of built-in functions, and the user can create their own functions
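
As a sketch of a user-defined function showing the four parts listed above (the function name `standardize` and its `center` argument are illustrative, not part of base R):

```r
# Name: standardize; arguments: x and center (with a default value)
standardize <- function(x, center = TRUE) {
  # Code: optionally subtract the mean, then divide by the standard deviation
  if (center) {
    x <- x - mean(x)
  }
  x / sd(x)   # Return value: the last evaluated expression
}

z <- standardize(c(2, 4, 6))
z             # -1 0 1
```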

---
## Running functions

- From the R console - type the function and hit Enter
    - One function at a time, not efficient

- Using an `R` script - a text file that contains all your `R` functions/code
    - `R` scripts allow you to save, edit, reproduce and share your code
    - `R` scripts are stored in files with the `.R` extension
    - Run the whole script as `source("script_name.R")`, or, from command line, `Rscript script_name.R`
    - In RStudio, you can run individual lines, code chunks, or source whole scripts. Keyboard shortcuts are available

---
## Packages

- All functions belong to *packages*. The `read.csv` function is in the `utils` package.

- `R` comes with about 30 packages (called "base `R`"), but as of August 2020, there are over 16,000 CRAN packages and over 1,900 Bioconductor packages

- Example: `ggplot2` is a popular package that adds functions for creating graphs in a different way than what base `R` provides

- To use functions in a package, the package must be installed and loaded. (They're free)
- You only _install_ a package once
- You _load_ a package whenever you want to use its functions

---
## Package repositories

- `CRAN` - Comprehensive R Archive Network – a collection of > 16,000 (September 2020) packages

- `Bioconductor` – genomics-oriented free and open source project hosting > 1,900 specialized R packages (September 2020)

- `MRAN` - Microsoft R Application Network, includes CRAN packages and more

- `GitHub` – code-hosting repository, packages for everyone and by everyone

.small[ https://cran.r-project.org/web/packages/

https://www.bioconductor.org/

https://mran.microsoft.com/

https://github.com/  ]

---
## Installing packages

- `install.packages` - installs packages from CRAN, e.g., `install.packages("BiocManager")`

- `remotes` package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., `remotes::install_github("tidyverse/ggplot2")`

- `BiocManager::install()` - Install or update Bioconductor, CRAN, or GitHub packages

- RStudio point-and-click interface

---
## Loading packages

- `library()` will load the package, e.g., `library(readxl)` or `library("readxl")`
    - But, when installing packages, always use quotes, e.g., `install.packages("readxl")`

- `require()` will load the package and, if successful, return TRUE. Useful in an `if` statement, e.g.

``` r
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
```

---
## Installing packages

- `install.packages("<package_name>")` – install from CRAN

- `install.packages("<package_name.tar.gz>", repos = NULL)` – install from a tarball archive

- `R CMD INSTALL <package_name.tar.gz>` - install from a command line

- `devtools::install_github('mdozmorov/MDmisc')` – install from GitHub

- `BiocManager::install()` - install Bioconductor, CRAN, and GitHub packages

.small[ https://CRAN.R-project.org/package=BiocManager ]

---
## Loading packages

- `library(package_name)` – load a package to use its functions

- `library()` vs. `require()`
    - `require()` _tries_ to load the package, returns TRUE or FALSE
    - `library()` just loads the package, fails if the package is not available

- Use only `library(package_name)`

.small[ https://yihui.name/en/2014/07/library-vs-require/ ]
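
A small sketch of the difference, assuming a package that is not installed (the package name below is hypothetical):

```r
pkg <- "notARealPackage123"   # hypothetical name, assumed not to be installed

# require() returns FALSE (with a warning) and lets the script continue
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok                            # FALSE

# library() stops with an error, so the problem is visible immediately
res <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() failed"
)
res                           # "library() failed"
```

With `require()`, a missing package may go unnoticed until a later function call fails, which is one reason to prefer `library()` in scripts.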

---
## Using functions from other packages

- You can access functions without loading the package using the `::` operator, e.g., `Hmisc::rcorr()`

- Entering the function name without parentheses will output its code

``` r
> data.frame
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
    fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) 
{
    data.row.names <- if (check.rows && is.null(row.names)) 
...
```

- You can access internal functions of a package with the `:::` operator if you know their name

---
## Getting help

- Get an overview of all functions in a package: `help(package = "dplyr")`
    - Bioconductor packages have vignettes, short tutorials on package-specific tasks. Browse them, e.g., `browseVignettes(package = "limma")`

- Use `?function_name` to get help on a function from a _loaded_ package. E.g., `?boxplot` (same as `help(boxplot)`)
    - Use `example(boxplot)` to see how the function can be used

- Use `??function_name` to search for the function across all installed packages, even those not loaded. E.g., `??ggplotly`

- A search engine is your best friend on many things

---
## Useful ways of getting data in and out of R

- Base functions: `read.table`, `read.csv`, `write.table`, `write.csv`

- Tidyverse way, `readr` package: `read_table`, `read_csv`, `read_tsv`, `write_csv` ...

- For fixed-width files, use `read.fwf` or `readr::read_fwf` functions

- For reading/writing Excel files, use `readxl` and `writexl` packages, `read_xlsx`, `write_xlsx` functions
    - Remember that `.csv` is the preferred text-based format that opens in Excel

.small[https://readr.tidyverse.org/

https://readxl.tidyverse.org/

https://CRAN.R-project.org/package=writexl]
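
A minimal round trip with the base functions, writing to a temporary file so nothing is left behind (the names `path` and `small` are illustrative):

```r
path <- tempfile(fileext = ".csv")        # temporary file path
write.csv(head(mtcars, 3), file = path)   # row names are written as the first column
small <- read.csv(path, row.names = 1)    # read the first column back as row names
dim(small)                                # 3 11
rownames(small)                           # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
unlink(path)                              # delete the temporary file
```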

---
## The stringsAsFactors curse

- Before R 4.0.0, `data.frame()` and `read.table()` automatically converted strings to factors; since R 4.0.0, the default is `stringsAsFactors = FALSE`

- This behind-the-scenes factor conversion can lead to unpredictable behavior

- Use `as.is = TRUE` in `read.table()` to avoid such conversion

- Better yet, on older R versions, set `options(stringsAsFactors = FALSE)` at the beginning of your script files

.small[https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/]
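
A short sketch of why the conversion matters. Setting `stringsAsFactors` explicitly gives the same result regardless of the R version; the names `d_chr`, `d_fct`, and `f` are illustrative:

```r
d_chr <- data.frame(pet = c("Cats", "Dogs"), stringsAsFactors = FALSE)
d_fct <- data.frame(pet = c("Cats", "Dogs"), stringsAsFactors = TRUE)
class(d_chr$pet)               # "character"
class(d_fct$pet)               # "factor"

# A classic pitfall: converting a factor of digits to numbers
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # 1 2 1 (internal level codes, not the values)
as.numeric(as.character(f))    # 10 20 10
```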

---
## Save/load R objects

- `save()`, `load()` - save R objects to, and load them from, the specified file
``` r
x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.rda")
load(file = "xy.rda")
```

- `saveRDS()`, `readRDS()` - saves/loads a _representation_ of the object
``` r
x <- stats::runif(20)
saveRDS(x, file = "x.rds")
x2 <- readRDS(file = "x.rds")
identical(x, x2, ignore.environment = TRUE)
```

.small[https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/]

---
## R datasets

R contains many datasets (stored as data frames) that are built-in to the software

```r
data() # All built-in datasets
# ?trees
data(trees) # Load a particular one
head(trees)
```

```
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
```

---
## Accessing data in datasets

```r
attach(trees)   # You can make R find variables in any data frame by adding the data frame to the search path
search()        # .GlobalEnv is your workspace and the "package:" entries are attached packages
```

```
##  [1] ".GlobalEnv"             "trees"                  "package:xaringanthemer"
##  [4] "package:stats"          "package:graphics"       "package:grDevices"     
##  [7] "package:utils"          "package:datasets"       "package:methods"       
## [10] "Autoloads"              "package:base"
```

```r
detach(trees)   # To remove an object from the search path, use detach()
with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data
```

```
## [1] 76
```

`attach()` can cause name clashes (masking) and other serious issues. Avoid it

---
## Summary statistics

- Simple statistical functions: `length()` (count of elements), `min()`, `max()`, `mean()`, `median()`, `sd()`, `cor()`, `summary()`

- These, and many other functions, have settings to properly handle NAs, e.g., `mean(x, trim = 0, na.rm = FALSE, ...)`; set `na.rm = TRUE` to drop NAs before computing

- `complete.cases()` on a matrix/data frame returns a logical vector, one element per row, with TRUE for rows without NAs

- `unique()` - unique elements in a vector. Combine with `length()` to get the number of unique elements

- `table()` - contingency table for a vector (the number of elements per unique level)
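
A minimal sketch of NA handling with these functions, on made-up values (the names `x` and `d` are illustrative):

```r
x <- c(4, 8, NA, 8, 15)
mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 8.75
length(unique(x))        # 4 (NA counts as a distinct value)
table(x)                 # counts per value; NA is dropped by default

d <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(d)        # TRUE FALSE FALSE
```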

---
## Summary statistics

```r
data(mtcars)    # Load the built-in mtcars dataset
# ?mtcars
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

```r
mean(mtcars$mpg) # Try median, sd, var, min, max
```

```
## [1] 20.09062
```

```r
summary(mtcars$mpg)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
```

---
## Summary statistics

```r
quantile(mtcars$mpg, probs = c(.20, .80))
```

```
##   20%   80% 
## 15.20 24.08
```

```r
cor(mtcars$mpg, mtcars$hp) # sample correlation coefficient
```

```
## [1] -0.7761684
```

```r
table(mtcars$cyl)
```

```
## 
##  4  6  8 
## 11  7 14
```

```r
table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32
```

```
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
```

---
## Control structures inside R/functions

- `if, else`
- `for`
- `while`
- `repeat`
- `break`
- `next`

---
## If-else statement

Conditional code execution

``` r
if (condition) {
  # do something
} else {
  # do something else
}
```

- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than, greater than or equal to
- `<`, `<=`: Less than, less than or equal to

```r
x <- sample(1:15, 1) # Draw one random number between 1 and 15
if (x <= 10) {
  print("x is less than or equal to 10")
} else {
  print("x is greater than 10")
}
```

```
## [1] "x is less than or equal to 10"
```

---
## For loop

Repetitive code execution

```r
for (i in 1:5) {
  cat(i)
}
```

```
## 12345
```

Compare with

```r
for (i in 1:5) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---
## More uses of For loops

```r
x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {
  cat(i); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

```r
for (i in 1:4) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

```r
for (i in seq_along(x)) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

---
## Nested For loops

```r
m <- matrix(1:10, 2)
m
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

```r
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}
```

```
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
```

---
## while, repeat loops

```r
i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
} # Be sure there is a way to exit out of a while loop
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
```

``` r
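set.seed(1)          # for reproducibility
expectation <- 0.5   # illustrative target value
threshold <- 0.01    # illustrative acceptable distance from the target
repeat {
  value <- runif(1)  # simulation: generate some value
  # exit the loop once the value is within the threshold of the expectation
  if (abs(value - expectation) <= threshold) {
    break
  }
}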
```

---
## Combine any statements/functions

```r
for (i in 1:20) {           
  if (i%%2 == 1) {
    next                # skip odd numbers
  } else {
    print(i)
  }
}
```

```
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
```

---
## Vectorized operation

Many operations in R are already vectorized, making code more efficient, concise, and easier to read

```r
x <- 1:4; y <- 6:9
x
```

```
## [1] 1 2 3 4
```

```r
y
```

```
## [1] 6 7 8 9
```

```r
x * y
```

```
## [1]  6 14 24 36
```

```r
x / y
```

```
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
```
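
For comparison, a sketch of the same multiplication written as an explicit loop (the names `prod_loop` and `prod_vec` are illustrative):

```r
x <- 1:4; y <- 6:9

# Loop version: preallocate the result, then fill it element by element
prod_loop <- numeric(length(x))
for (i in seq_along(x)) {
  prod_loop[i] <- x[i] * y[i]
}

# Vectorized version: one expression, same result
prod_vec <- x * y
all.equal(prod_loop, prod_vec)   # TRUE
```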

---
## Manipulating vectors

```r
ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90) 
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]
```

```
## [1] 40
```

```r
# excluding second value
ages[-2]
```

```
## [1] 30 50 60 70 80 90
```

```r
# extracting first and third values
ages[c(1, 3)] 
```

```
## [1] 30 50
```
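
Besides positions, a vector can also be indexed by a logical condition. A short sketch on the same `ages` values:

```r
ages <- c(30, 40, 50, 60, 70, 80, 90)
ages[ages > 60]          # values above 60: 70 80 90
ages[ages %% 20 == 0]    # values divisible by 20: 40 60 80
which(ages == 50)        # position of the value 50: 3
ages[length(ages)]       # last element: 90
```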

---
## `apply` family of functions

Explicit `for` and `while` loops in R can be slow and verbose. The `apply` family of functions expresses the same computations more concisely.

- `apply()` - apply a function over the margins of an array

- `lapply()` - loop over a list and evaluate a function on each element

- `sapply()` - same as `lapply()`, but tries to simplify the result. If the result is a list where every element has length 1, a vector is returned

- `mapply()` - multivariate version of `sapply()`

- `tapply()` -  apply a function over subsets of a vector

---
## apply examples

```r
x <- 1:4
lapply(x, runif)
```

```
## [[1]]
## [1] 0.7103501
## 
## [[2]]
## [1] 0.3951038 0.4184131
## 
## [[3]]
## [1] 0.3217766 0.1780726 0.8919266
## 
## [[4]]
## [1] 0.60705926 0.05831400 0.09485927 0.83428037
```

```r
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)
```

```
##          a          b          c 
##  2.5000000 -0.1394352  0.9592841
```

---
## apply examples

```r
# If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))
```

```
##           [,1]      [,2] [,3]
## 25% -0.4512676 0.3318313    0
## 75%  0.7023871 0.7364010    1
```

```r
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)
```

```
##  [1] -1.8750636 -5.7395261 -1.3897511  2.8761932  0.6523308 -1.2598896
##  [7]  3.6839290 -3.9213112  6.7922807 -1.2603816  1.9749202 -1.6921166
## [13] -0.9821136  3.3275777  3.5328862  1.8707953  9.2495361  1.7392953
## [19] -2.0882412  0.9703799
```

```r
apply(x, 2, mean)
```

```
##  [1]  0.45098333 -0.19774551 -0.13381431  0.50200787  0.13881266 -0.27666766
##  [7]  0.28958605  0.07532345 -0.12001035  0.09461095
```

---
## apply examples

For sums and means of matrix dimensions, we have some shortcuts

```r
rowSums(x)   # same as apply(x, 1, sum)
rowMeans(x)  # same as apply(x, 1, mean)
colSums(x)   # same as apply(x, 2, sum)
colMeans(x)  # same as apply(x, 2, mean)
```

See `?rowSums` for help on these base R functions

---
## tapply

Apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.

``` r
# tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
# X        - a vector
# INDEX    - a factor or a list of factors (or else they are coerced to factors)
# FUN      - a function to be applied
# ...      - other arguments to be passed to FUN
# simplify - should the result be simplified?
```

```r
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
tapply(x, f, mean)
```

```
##          1          2          3 
## -0.7883378  0.4986981  1.1900419
```
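
A deterministic sketch with the built-in `mtcars` data, grouping miles per gallon by the number of cylinders (the name `mpg_by_cyl` is illustrative):

```r
# Mean miles per gallon for each number of cylinders
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 2)
#     4     6     8
# 26.66 19.74 15.10

# The same grouping with a different function: group sizes
tapply(mtcars$mpg, mtcars$cyl, length)   # 11 7 14
```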

---
## mapply

`mapply()` is a multivariate version of `sapply()`. It applies `FUN` to the first elements of each `...` argument, then the second elements, the third elements, and so on. Arguments are recycled if necessary.

``` r
# mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
# FUN      - a function to apply
# ...      - arguments to apply over
# MoreArgs - a list of other arguments to FUN
# SIMPLIFY - should the result be simplified?
```

```r
mapply(rep, 1:4, 4:1)
mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5))
```
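
A small sketch of what these calls return, using deterministic inputs (the names `reps`, `sums`, and `fixed` are illustrative):

```r
# Results of different lengths cannot be simplified, so a list is returned
reps <- mapply(rep, 1:4, 4:1)
reps[[2]]                        # 2 2 2

# Results of equal length are simplified to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums                             # 5 7 9

# MoreArgs holds arguments that stay the same for every call
fixed <- mapply(rep, times = 1:3, MoreArgs = list(x = "a"))
lengths(fixed)                   # 1 2 3
```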