Assignment 5: Data management/code best practices
Due by 5:00 PM on Tuesday, October 13, 2020
To do yourself
Read the vignette for the DataExplorer R package to learn what it is capable of
Read the vignette for the doParallel R package and work through the code examples there
To submit on Blackboard
Question One
Spreadsheets, Summarization, and Naming Conventions
Part A
- Create an Excel spreadsheet with 4 columns and 10 rows, following proper spreadsheet conventions discussed in class.
- Let the first column be an ID number.
- Let the second column be a set of integers between 20 and 30, denoting the number of teeth someone has.
- Let the third column be a set of real numbers between 1.0 and 4.0, denoting GPA in school.
- Let the fourth column be a mix of “H” and “T” character values, denoting the tosses of a coin.
- Read it into R. What is the class of this object?
- Create a new object with the same content, but with the class list.
- Address the fourth element of the list object. Then address the content of the fourth element.
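The last three steps can be sketched as follows. This is only an illustration under stated assumptions: the data frame `survey` and its values are hypothetical stand-ins built in R rather than read from your spreadsheet, and the column names are made up. If you import with `readxl::read_excel()`, the object will be a tibble rather than a plain data frame, so the class you see will differ.

```r
# Hypothetical stand-in for the imported spreadsheet (values are made up).
set.seed(1)
survey <- data.frame(
  id    = 1:10,
  teeth = sample(20:30, 10, replace = TRUE),
  gpa   = round(runif(10, min = 1.0, max = 4.0), 2),
  toss  = sample(c("H", "T"), 10, replace = TRUE),
  stringsAsFactors = FALSE
)
class(survey)                       # class of the imported-style object

# Same content, stored as a plain list with one element per column.
survey_list <- as.list(survey)
class(survey_list)

# Single brackets return a sub-list holding the fourth element.
fourth_element <- survey_list[4]

# Double brackets return the content of the fourth element itself.
fourth_content <- survey_list[[4]]
```

Comparing `str(fourth_element)` with `str(fourth_content)` shows the difference between addressing an element and addressing its content.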
Question Two
Summarizing Quickly
Load the ChickWeight dataset that comes with R. Install the skimr package. Use this package to summarize the dataset in one line of code.
Explore at least one of the following functions and compare it with the skimr functionality. Which one would you prefer?
- describe from the Hmisc package
- stat.desc from the pastecs package
- describe from the psych package
- descr and dfSummary from the summarytools package
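A minimal sketch of the one-line summary, assuming skimr is already installed (for example with `install.packages("skimr")`). The `requireNamespace()` guard and the base-R `summary()` fallback are additions for illustration only, so that the snippet still runs on a machine without skimr.

```r
# ChickWeight ships with R in the datasets package.
data("ChickWeight")

if (requireNamespace("skimr", quietly = TRUE)) {
  # The one-line summary: skim() reports per-column statistics by type.
  print(skimr::skim(ChickWeight))
} else {
  # Fallback for illustration only; not part of skimr.
  print(summary(ChickWeight))
}
```

The functions in the list above are called in much the same way, with the data frame as the first argument, which makes a side-by-side comparison straightforward.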
Question Four
Efficiency and Timing
Make an empty vector of length 10 million. Set a seed for reproducibility. Populate the vector in two ways and compare:
- Use a for loop to enter a randomly-generated Exponential variable with rate parameter 3 in each position in the vector.
- Use a more efficient method, possibly writing a function, to populate the vector with randomly-generated numbers from the same distribution.
- Try parallelization approaches.
Time all methods and draw a conclusion.
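One possible shape for the comparison is sketched below. Several choices here are assumptions rather than requirements: `n` is reduced to 1e5 so the sketch runs in a few seconds (set it to 1e7 for the assignment), the seed value 2020 is arbitrary, two workers are used, and the base `parallel` package stands in for doParallel so that nothing extra needs to be installed.

```r
library(parallel)   # ships with R; used here in place of doParallel

n    <- 1e5         # assumption: reduced from 1e7 so the sketch runs quickly
rate <- 3

# Method 1: fill a preallocated vector one position at a time.
set.seed(2020)
x_loop <- numeric(n)
t_loop <- system.time(
  for (i in seq_len(n)) x_loop[i] <- rexp(1, rate = rate)
)

# Method 2: a single vectorized call.
set.seed(2020)
t_vec <- system.time(x_vec <- rexp(n, rate = rate))

# Method 3: split the work across worker processes.
n_workers <- 2
cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, iseed = 2020)        # reproducible parallel streams
chunk_sizes <- rep(n / n_workers, n_workers)
t_par <- system.time(
  x_par <- unlist(parLapply(cl, chunk_sizes,
                            function(m, rate) rexp(m, rate = rate),
                            rate = rate))
)
stopCluster(cl)

# Elapsed wall-clock time in seconds for each method.
elapsed <- c(loop       = t_loop[["elapsed"]],
             vectorized = t_vec[["elapsed"]],
             parallel   = t_par[["elapsed"]])
print(elapsed)
```

The parallel timing includes the cost of sending work to the workers and collecting results, which is worth keeping in mind when drawing your conclusion.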