class: center, middle, inverse, title-slide

# R preliminaries
### Mikhail Dozmorov
### Virginia Commonwealth University
### 09-24-2020

---
## R expressions, function calls, and objects

According to John Chambers, one of the creators of R’s precursor S:

- Everything that exists in R is an **object**
- Everything that happens in R is a **call to a function**

---
## Assignment operator

- We often need to save a function's result or output. For this we use the assignment operator: ` <- `, preferred over ` = `

```r
scores <- mtcars
```

Now we can use `scores` as an argument to other functions. For example, compute summary statistics for the first four columns of the data:

```r
summary(scores[1:4]) # First four columns
```

```
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
```

Use `Alt + -` (Win) or `Option + -` (Mac) in RStudio to quickly insert ` <- `

---
## Variables

- **Scalars** (0-dimensional): `a = 42`, `b = a / 7`
- **Vectors** (1-dimensional): `b = c(12, 14, 16)`
    - Access a vector element as `b[2]` (returns 14)
- **Matrices** (2-dimensional):

```r
mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx
```

```
##      [,1] [,2]
## [1,]    3    2
## [2,]    1    3
## [3,]    3    2
```

---
## Variable names

- Avoid giving variables the same names as functions. E.g., `c` is a bad variable name because `c()` is a function for combining values.
Check its help page with `?c`
- With auto-completion in RStudio, you don't need to worry about variable name length; make names that are self-explanatory

Follow [Hadley Wickham's Tidyverse Style Guide](http://adv-r.had.co.nz/Style.html)

---
## Variable types

```r
# numeric: real or decimal numbers, sometimes referred to as “double”
# integer: a subset of numeric in which numbers are stored as integers
a <- 2
# character: sometimes referred to as string data, tend to be surrounded by quotes
a <- "2"
# logical: Boolean data (TRUE and FALSE)
a <- TRUE
```

- complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
- raw: bytes of data (machine-readable, but not human-readable)

Auxiliary functions

``` r
class(a)
str(a)
is.numeric(a)   # TRUE if the type matches; same idea with is.character()
as.numeric("2") # Attempt to convert types
```

---
## Factors

- Factors are how R represents categorical data
- There are two kinds of factors:
    - `factor()` - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
    - `ordered()` - used for ordinal data ("First", "Second", "Second", "Third")

```r
factor(c("Cats", "Cats", "Dogs", "Birds"))
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
ordered(c("First", "Second", "Second", "Third"))
```

```
## [1] First  Second Second Third 
## Levels: First < Second < Third
```

---
## Factors

Auxiliary functions

- `levels()` - get the levels of a factor. `levels` is also an argument of `factor()` that sets the level order manually
- `relevel()` - move a chosen reference level to the first position
- `is.factor()`, `as.factor()`

```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
relevel(a, ref = "Cats")
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Cats Birds Dogs
```

```r
# Caution: assigning to levels() relabels the values, it does not just reorder the levels
levels(a) <- rev(levels(a))
a
```

```
## [1] Cats  Cats  Birds Dogs 
## Levels: Dogs Cats Birds
```

---
## Data frames

- **Data frames**: tables or 2-dimensional arrays.
Think of matrices that can hold different data types
- The column names should be non-empty
- Columns should be the same length
- The row names should be unique
- The data stored in a data frame can be of numeric, factor, or character type

```r
dat = data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat
```

```
##   Column.1 Column.2
## 1        3        2
## 2        1        3
## 3        3        2
```

---
## Data frames

Auxiliary functions

```r
dim(dat)
```

```
## [1] 3 2
```

```r
nrow(dat)
```

```
## [1] 3
```

```r
ncol(dat)
```

```
## [1] 2
```

```r
length(dat)
```

```
## [1] 2
```

```r
colnames(dat)
```

```
## [1] "Column.1" "Column.2"
```

```r
rownames(dat)
```

```
## [1] "1" "2" "3"
```

---
## Addressing elements in a data frame

```r
dat[3, 2] # [] contain row/column indices
```

```
## [1] "2"
```

```r
dat[3, "Column.2"] # Address by column name
```

```
## [1] "2"
```

```r
dat$Column.2[3] # Use $ shortcut to access column by name
```

```
## [1] "2"
```

```r
# Compare column classes
class(dat$Column.1)
```

```
## [1] "numeric"
```

```r
class(dat$Column.2)
```

```
## [1] "character"
```

``` r
# Top or bottom of a data frame
head(dat)
tail(dat)
```

---
## Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

* Content:
    * `head()`: shows the first few rows
    * `tail()`: shows the last few rows
* Size:
    * `dim()`: returns a 2-element vector with the number of rows as the first element and the number of columns as the second element (the dimensions of the object)
    * `nrow()`: returns the number of rows
    * `ncol()`: returns the number of columns

---
## Inspecting data.frame objects

* `colnames()` (or just `names()`): returns the column names
* `str()`: structure of the object and information about the class, length and content of each column
* `summary()`: works differently depending on what kind of object you pass to it.
Passing a data frame to the `summary()` function prints out useful summary statistics about each numeric column (min, max, median, mean, etc.)

---
## Lists

- **Lists**: objects containing elements of different types
- Each list element can be of different length

```r
lst = list(A = rep(2, 5), B = 1:10, C = letters)
lst
```

```
## $A
## [1] 2 2 2 2 2
## 
## $B
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $C
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
```

---
## Addressing elements in a list

- Address any element as `lst[1]` (or, `lst["A"]`); the result is still a list, of length 1

```r
lst[1]
```

```
## $A
## [1] 2 2 2 2 2
```

- Address _the content of any element_ as `lst[[1]]` (or, `lst[["A"]]`, `lst$A`)

```r
lst[[1]]
```

```
## [1] 2 2 2 2 2
```

---
## Comments

R ignores everything after the `#` sign

```r
# This line is a comment
print("Hello, World!") # This will print the message, but the comment will be ignored
```

```
## [1] "Hello, World!"
```

---
## Clean up your environment

```r
z <- c(1, 2, 3)
ls()
```

```
## [1] "a"      "dat"    "lst"    "mtx"    "scores" "z"
```

```r
rm(z) # Remove one variable
ls()
```

```
## [1] "a"      "dat"    "lst"    "mtx"    "scores"
```

```r
# Remove everything from the environment
rm(list = ls()) # Not the same as restarting R session
ls()
```

```
## character(0)
```

---
## Functions

- A function is a set of statements organized together to perform a specific task
- **Name** - the actual name of the function, e.g., `summary()`, `mean()`
- **Arguments** - values passed to the function.
Argument-less functions exist
- **Code** - actual code of the function
- **Return value** - the result of the function's code execution

``` r
read.csv(file="scores.csv")
```

`read.csv` is a function to import a CSV file, and `file` is an argument that specifies which file to import

R has a large number of built-in functions, and users can create their own functions

---
## Running functions

- From the R console - type the function and hit Enter
    - One function at a time, not efficient
- Using an `R` script - a text file that contains all your `R` functions/code
    - `R` scripts allow you to save, edit, reproduce and share your code
    - `R` scripts are stored in files with the `.R` extension
    - Run the whole script as `source("script_name.R")`, or, from the command line, `Rscript script_name.R`
- In RStudio, you can run individual lines, code chunks, or source whole scripts. Keyboard shortcuts are available

---
## Packages

- All functions belong to *packages*. The `read.csv` function is in the `utils` package.
- `R` comes with about 30 packages (called "base `R`"), but as of August 2020, there are over 16,000 CRAN packages and over 1,900 Bioconductor packages
- Example: `ggplot2` is a popular package that adds functions for creating graphs in a different way than what base `R` provides
- To use functions in a package, the package must be installed and loaded.
(They're free)
- You only _install_ a package once
- You _load_ a package whenever you want to use its functions

---
## Package repositories

- `CRAN` - Comprehensive R Archive Network – a collection of > 16,000 (September 2020) packages
- `Bioconductor` – genomics-oriented free and open source project hosting > 1,900 specialized R packages (September 2020)
- `MRAN` - Microsoft R Application Network, includes CRAN packages and more
- `GitHub` – code-hosting repository, packages for everyone and by everyone

.small[
https://cran.r-project.org/web/packages/

https://www.bioconductor.org/

https://mran.microsoft.com/

https://github.com/
]

---
## Installing packages

- `install.packages` - installs packages from CRAN, e.g., `install.packages("BiocManager")`
- `remotes` package - installs R packages from GitHub, GitLab, Bitbucket, Bioconductor, or plain 'subversion' or 'git' repositories. E.g., `remotes::install_github("tidyverse/ggplot2")`
- `BiocManager::install()` - install or update Bioconductor, CRAN, or GitHub packages
- RStudio point-and-click interface

---
## Loading packages

- `library()` will load the package, e.g., `library(readxl)` or `library("readxl")`
- But, when installing packages, always quote the package name, e.g., `install.packages("readxl")`
- `require()` will load the package and, if successful, return TRUE. Useful in an `if` statement, e.g.

``` r
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2) # load the package once it has been installed
}
```

---
## Installing packages

- `install.packages("<package_name>")` – install from CRAN
- `install.packages("<package_name.tar.gz>", repos = NULL)` – install from a tarball archive
- `R CMD INSTALL <package_name.tar.gz>` - install from the command line
- `devtools::install_github('mdozmorov/MDmisc')` – install from GitHub
- `BiocManager::install()` - install Bioconductor, CRAN, and GitHub packages

.small[
https://CRAN.R-project.org/package=BiocManager
]

---
## Loading packages

- `library(package_name)` – load the package to use its functions
- `library()` vs.
`require()`
    - `require()` _tries_ to load the package, returns TRUE or FALSE
    - `library()` just loads the package, fails if the package is not available
- Use only `library(package_name)`

.small[
https://yihui.name/en/2014/07/library-vs-require/
]

---
## Using functions from other packages

- You can access functions without loading the package using the `::` operator, e.g., `Hmisc::rcorr()`
- Entering the function name without parentheses will output its code

``` r
> data.frame
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
    fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
{
    data.row.names <- if (check.rows && is.null(row.names))
    ...
```

- You can access internal functions of a package with the `:::` operator if you know their names

---
## Getting help

- Get an overview of all functions in a package: `help(package = "dplyr")`
- Bioconductor packages have vignettes, short tutorials on package-specific tasks. Browse them, e.g., `browseVignettes(package = "limma")`
- Use `?function_name` to get help on a function from a _loaded_ package. E.g., `?boxplot` (same as `help(boxplot)`)
- Use `example(boxplot)` to see how the function can be used
- Use `??function_name` to search for the function across all installed packages, including those not loaded. E.g., `??ggplotly`
- A search engine is your best friend for many things

---
## Useful ways of getting data in and out of R

- Base functions: `read.table`, `read.csv`, `write.table`, `write.csv`
- Tidyverse way, `readr` package: `read_table`, `read_csv`, `read_tsv`, `write_csv` ...
- For fixed-width files, use `read.fwf` or `readr::read_fwf` functions
- For reading/writing Excel files, use `readxl` and `writexl` packages, `read_xlsx`, `write_xlsx` functions
- Remember that `.csv` is the preferred text-based format that opens in Excel

.small[https://readr.tidyverse.org/

https://readxl.tidyverse.org/

https://CRAN.R-project.org/package=writexl]

---
## The stringsAsFactors curse

- In R versions before 4.0.0, strings are automatically converted to factors when creating data frames with `data.frame()` or reading data with `read.table()` (R 4.0.0 and later default to `stringsAsFactors = FALSE`)
- This behind-the-scenes factor conversion can lead to unpredictable behaviors
- Use `as.is = TRUE` in `read.table()` to avoid such conversion
- Better yet, when working with older R versions, set `options(stringsAsFactors = FALSE)` at the beginning of your script files

.small[https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/]

---
## Save/load R objects

- `save()`, `load()` - save R objects to, or load them from, the specified file

``` r
x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.rda")
load(file = "xy.rda")
```

- `saveRDS()`, `readRDS()` - save/load a _representation_ of a single object

``` r
x <- stats::runif(20)
saveRDS(x, file = "x.rds")
x2 <- readRDS(file = "x.rds")
identical(x, x2, ignore.environment = TRUE)
```

.small[https://fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/]

---
## R datasets

R contains many datasets (stored as data frames) that are built into the software

```r
data()      # All built-in datasets
# ?trees
data(trees) # Load a particular one
head(trees)
```

```
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
```

---
## Accessing data in datasets

```r
attach(trees) # You can make R find variables in any data frame by adding the data frame to the search path
search()      # .GlobalEnv is your workspace and the package:* entries are loaded packages
```

```
##  [1] ".GlobalEnv"             "trees"                  "package:xaringanthemer"
##  [4] "package:stats"          "package:graphics"       "package:grDevices"     
##  [7] "package:utils"          "package:datasets"       "package:methods"       
## [10] "Autoloads"              "package:base"
```

```r
detach(trees) # To remove an object from the search path, use detach()
with(trees, mean(Height)) # Evaluate an R expression in an environment constructed from data, possibly modifying (a copy of) the original data
```

```
## [1] 76
```

`attach()` can cause name clashes (masking) and other serious issues. Avoid it

---
## Summary statistics

- Simple statistical functions: `count()` (from the `dplyr` package), `min()`, `max()`, `mean()`, `median()`, `sd()`, `cor()`, `summary()`
- These, and many other functions, have settings to properly handle NAs, e.g., `mean(x, trim = 0, na.rm = FALSE, ...)`; set `na.rm = TRUE` to ignore NAs
- `complete.cases()` on a matrix/data frame returns a row-wise logical vector with TRUE for rows without NAs
- `unique()` - unique elements in a vector. Combine with `length()` to get the number of unique elements
- `table()` - contingency table for a vector (the number of elements per unique level)

---
## Summary statistics

```r
data(mtcars) # simple summary
# ?mtcars
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

```r
mean(mtcars$mpg) # Try median, sd, var, min, max
```

```
## [1] 20.09062
```

```r
summary(mtcars$mpg)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
```

---
## Summary statistics

```r
quantile(mtcars$mpg, probs = c(.20, .80))
```

```
##   20%   80% 
## 15.20 24.08
```

```r
cor(mtcars$mpg, mtcars$hp) # sample correlation coefficient
```

```
## [1] -0.7761684
```

```r
table(mtcars$cyl)
```

```
## 
##  4  6  8 
## 11  7 14
```

```r
table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32
```

```
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
```

---
## Control structures inside R/functions

- `if`, `else`
- `for`
- `while`
- `repeat`
- `break`
- `next`

---
## If-else statement

Conditional code execution

``` r
if (condition) {
  # do something
} else {
  # do something else
}
```

- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than, greater than or equal to
- `<`, `<=`: Less than, less than or equal to

```r
x <- 1:15
if (sample(x, 1) <= 10) {
  print("x is less than 10")
} else {
  print("x is greater than 10")
}
```

```
## [1] "x is less than 10"
```

---
## For loop

Repetitive code execution

```r
for (i in 1:5) {
  cat(i)
}
```

```
## 12345
```

Compare with

```r
for (i in 1:5) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---
## More uses of For loops

```r
x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
  cat(i); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

```r
for (i in 1:4) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

```r
for (i in seq_along(x)) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

---
## Nested For loops

```r
m <- matrix(1:10, 2)
m
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

```r
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}
```

```
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
```

---
## while, repeat loops

```r
i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
} # Be sure there is a way to exit out of a while loop
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
```

``` r
# Illustrative values (assumed for this sketch)
expectation <- 0.5
threshold <- 0.01
repeat {
  # simulation: generate some value
  value <- runif(1)
  # if it is within some range of the expectation, exit the loop
  if (abs(value - expectation) <= threshold) {
    break
  }
}
```

---
## Combine any statements/functions

```r
for (i in 1:20) {
  if (i%%2 == 1) {
    next # skip odd numbers
  } else {
    print(i)
  }
}
```

```
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
```

---
## Vectorized operation

Many operations in R are already vectorized, making code more efficient, concise, and easier to read

```r
x <- 1:4; y <- 6:9
x
```

```
## [1] 1 2 3 4
```

```r
y
```

```
## [1] 6 7 8 9
```

```r
x * y
```

```
## [1]  6 14 24 36
```

```r
x / y
```

```
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
```

---
## Manipulating vectors

```r
ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90)
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]
```

```
## [1] 40
```

```r
# excluding second value
ages[-2]
```

```
## [1] 30 50 60 70 80 90
```

```r
# extracting first and third values
ages[c(1, 3)]
```

```
## [1] 30 50
```

---
## `apply` family of functions

Explicit `for` and `while` loops in R are often less efficient and more verbose than vectorized code, so we prefer to vectorize computations where possible.
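
As a rough sketch of this point, the same element-wise product can be computed with an explicit loop, with a single vectorized expression, or with `sapply()`. The object names (`z_loop`, `z_vec`, `z_sapply`) are illustrative and not part of the original slides.

```r
# Illustrative vectors (assumed for this sketch)
x <- 1:4
y <- 6:9

# Loop version: preallocate the result, then fill it element by element
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] * y[i]
}

# Vectorized version: one expression, no explicit loop
z_vec <- x * y

# sapply() version: apply a function to each index and simplify to a vector
z_sapply <- sapply(seq_along(x), function(i) x[i] * y[i])

# All three approaches give the same values
all.equal(z_loop, z_vec)
all.equal(z_vec, z_sapply)
```

The vectorized form is usually the shortest and the easiest to read; the `apply` family is most useful when the per-element work is a function call rather than simple arithmetic.
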
- `apply()` - apply a function over the margins of an array
- `lapply()` - loop over a list and evaluate a function on each element
- `sapply()` - same as `lapply()` but tries to simplify the result; if the result is a list where every element has length 1, a vector is returned
- `mapply()` - multivariate version of `sapply()`
- `tapply()` - apply a function over subsets of a vector

---
## apply examples

```r
x <- 1:4
lapply(x, runif)
```

```
## [[1]]
## [1] 0.7103501
## 
## [[2]]
## [1] 0.3951038 0.4184131
## 
## [[3]]
## [1] 0.3217766 0.1780726 0.8919266
## 
## [[4]]
## [1] 0.60705926 0.05831400 0.09485927 0.83428037
```

```r
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)
```

```
##          a          b          c 
##  2.5000000 -0.1394352  0.9592841
```

---
## apply examples

```r
# If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))
```

```
##           [,1]      [,2] [,3]
## 25% -0.4512676 0.3318313    0
## 75%  0.7023871 0.7364010    1
```

```r
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)
```

```
##  [1] -1.8750636 -5.7395261 -1.3897511  2.8761932  0.6523308 -1.2598896
##  [7]  3.6839290 -3.9213112  6.7922807 -1.2603816  1.9749202 -1.6921166
## [13] -0.9821136  3.3275777  3.5328862  1.8707953  9.2495361  1.7392953
## [19] -2.0882412  0.9703799
```

```r
apply(x, 2, mean)
```

```
##  [1]  0.45098333 -0.19774551 -0.13381431  0.50200787  0.13881266 -0.27666766
##  [7]  0.28958605  0.07532345 -0.12001035  0.09461095
```

---
## apply examples

For sums and means of matrix dimensions, we have some shortcuts

```r
rowSums(x)  # equivalent to apply(x, 1, sum)
rowMeans(x) # equivalent to apply(x, 1, mean)
colSums(x)  # equivalent to apply(x, 2, sum)
colMeans(x) # equivalent to apply(x, 2, mean)
```

Check the `?rowSums` help page for these base R functions

---
## tapply

Apply a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors.
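
To make "ragged" concrete: the groups do not need to be the same size. A minimal sketch using the built-in `mtcars` data, where the cylinder groups contain 11, 7, and 14 cars; the object names `group_sizes` and `group_means` are illustrative.

```r
# Group sizes differ across cylinder counts, so the grouping is "ragged"
group_sizes <- tapply(mtcars$mpg, mtcars$cyl, length)
group_sizes

# Mean miles per gallon within each cylinder group
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_means
```

Here `mtcars$cyl` is not a factor, so `tapply()` coerces it to one before grouping.
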
``` r
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
# X is a vector
# INDEX is a factor or a list of factors (or else they are coerced to factors)
# FUN is a function to be applied
# ... contains other arguments to be passed to FUN
# simplify: should we simplify the result?
```

```r
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
tapply(x, f, mean)
```

```
##          1          2          3 
## -0.7883378  0.4986981  1.1900419
```

---
## mapply

`mapply()` is a multivariate version of `sapply()`. It applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

``` r
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
# FUN is a function to apply
# ... contains arguments to apply over
# MoreArgs is a list of other arguments to FUN
# SIMPLIFY indicates whether the result should be simplified
```

```r
mapply(rep, 1:4, 4:1)
mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5))
```
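
A short sketch of what the first call does, written out by hand. The object names (`res`, `manual`, `sums`) are illustrative and not part of the original slides.

```r
# mapply() pairs up the i-th elements of each argument:
# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
res <- mapply(rep, 1:4, 4:1)

# The same result written out call by call
manual <- list(rep(1L, 4L), rep(2L, 3L), rep(3L, 2L), rep(4L, 1L))
all.equal(res, manual)

# The pieces have different lengths, so the result stays a list
lengths(res)

# When every call returns a single value, the result is simplified to a vector
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
sums
```

The second call in the slide follows the same pattern: it draws 5, 10, and 15 normal values with means 1, 2, 3 and standard deviations 1, 2, 3, and returns them as a list because the lengths differ.
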