Assignment 5: Data management/code best practices

Due by 5:00 PM on Tuesday, October 13, 2020

To do yourself

To submit on Blackboard

Question One

Spreadsheets, Summarization, and Naming Conventions

Part A
  • Create an Excel spreadsheet with 4 columns and 10 rows, following proper spreadsheet conventions discussed in class.
    • Let the first column be an ID number.
    • Let the second column be a set of integers between 20 and 30, denoting the number of teeth someone has.
    • Let the third column be a set of real numbers between 1.0 and 4.0, denoting GPA in school.
    • Let the fourth column be a mix of “H” and “T” character values, denoting the tosses of a coin.
  • Read it into R. What is the class of this object?
  • Create a new object with the same content, but of class list.
  • Access the fourth element of the list object. Then access the content of the fourth element.
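
As a rough illustration of the class and indexing questions in Part A, the sketch below builds a small stand-in data frame directly in R instead of reading the Excel file. The column names and values are made up for illustration. With a real spreadsheet, a reader such as `readxl::read_excel()` would typically take the place of `data.frame()`, and the resulting class may then differ (for example, a tibble).

```r
# Hypothetical stand-in for the spreadsheet: 10 rows, 4 columns.
set.seed(1)
dat <- data.frame(
  id    = 1:10,
  teeth = sample(20:30, 10, replace = TRUE),
  gpa   = round(runif(10, min = 1.0, max = 4.0), 2),
  toss  = sample(c("H", "T"), 10, replace = TRUE),
  stringsAsFactors = FALSE
)

class(dat)            # class of the object as read/created

# Same content, stored as a plain list (one element per column).
dat_list <- as.list(dat)
class(dat_list)

# Single brackets return a one-element list that still wraps the column...
fourth_element <- dat_list[4]
# ...while double brackets return the content itself (a character vector).
fourth_content <- dat_list[[4]]

str(fourth_element)
str(fourth_content)
```

The difference between `[` and `[[` is the point of the last bullet: the first keeps the list wrapper, the second extracts what is inside it.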
Part B
  • In one line of code, calculate the mean of the second column and the third column. Do the same for the list object.
  • Print out the frequency distribution of the fourth column.
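
One possible way to approach Part B is sketched below, again on a made-up stand-in data frame (the names `dat` and `dat_list` and all values are illustrative assumptions, not the required solution). `colMeans()` handles several numeric columns of a data frame in one call, `sapply()` applies `mean()` to each selected list element, and `table()` counts the distinct values.

```r
# Hypothetical stand-in data, as in a 10-row spreadsheet.
set.seed(1)
dat <- data.frame(
  id    = 1:10,
  teeth = sample(20:30, 10, replace = TRUE),
  gpa   = round(runif(10, min = 1.0, max = 4.0), 2),
  toss  = sample(c("H", "T"), 10, replace = TRUE),
  stringsAsFactors = FALSE
)
dat_list <- as.list(dat)

# Means of the second and third columns in one line (data frame).
means_df <- colMeans(dat[, 2:3])

# The same for the list object: apply mean() to elements 2 and 3.
means_list <- sapply(dat_list[2:3], mean)

# Frequency distribution of the fourth column.
toss_counts <- table(dat[[4]])

print(means_df)
print(means_list)
print(toss_counts)
```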
Part C
  • Rename the columns to something descriptive using R code. Which variable naming convention are you using (refer to the Code Organization Best Practices slides)?
  • Save the data in .csv format and submit it along with the code file.
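
A minimal sketch of the renaming and saving steps follows. The new column names (snake_case here, chosen only as an example convention), the file name, and the use of a temporary directory are all assumptions made for illustration.

```r
# Hypothetical data with uninformative default-style column names.
dat <- data.frame(
  V1 = 1:3,
  V2 = c(28L, 30L, 25L),
  V3 = c(3.2, 2.7, 3.9),
  V4 = c("H", "T", "H"),
  stringsAsFactors = FALSE
)

# Rename all columns at once; snake_case is one common convention.
names(dat) <- c("participant_id", "tooth_count", "school_gpa", "coin_toss")

# Write to a .csv file (a temporary location is used here for the demo).
out_file <- file.path(tempdir(), "teeth_gpa_coin.csv")
write.csv(dat, out_file, row.names = FALSE)

# Read it back to confirm the names survived the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(names(check))
```

Setting `row.names = FALSE` avoids writing an extra unnamed index column into the file.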

Question Two

Summarizing Quickly

Load the ChickWeight dataset that comes with R. Install the skimr package. Use this package to summarize the dataset in one line of code.

Explore at least one of the following functions and compare it with the skimr functionality. Which one would you prefer, and why?

  • describe from the Hmisc package
  • stat.desc from the pastecs package
  • describe from the psych package
  • descr and dfSummary from the summarytools package
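
The one-line summary might look like the sketch below. It assumes skimr has already been installed (for example with `install.packages("skimr")`); the fallback to base R's `summary()` is added here only so the snippet still runs where the package is missing, and is not part of the assignment.

```r
# ChickWeight ships with R in the datasets package.
data("ChickWeight")

if (requireNamespace("skimr", quietly = TRUE)) {
  # The one-line summary: one row of statistics per column.
  chick_summary <- skimr::skim(ChickWeight)
} else {
  # Fallback for this demo only, when skimr is not installed.
  chick_summary <- summary(ChickWeight)
}

print(chick_summary)
```

The alternatives listed above are called in much the same way, for example `Hmisc::describe(ChickWeight)` or `psych::describe(ChickWeight)`, which makes a side-by-side comparison straightforward.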

Question Three

File Save Naming Conventions

List five file names that follow good naming conventions (they could have an R file extension, a SAS file extension, etc.).

Let the files represent tasks in some project. Explain why your naming conventions are good and helpful to future you.
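
To make the idea concrete, here is a small sketch that generates file names for a made-up project. The task names, extensions, and the pattern (zero-padded step number, underscore, hyphenated description) are illustrative assumptions, not a prescribed convention.

```r
# Hypothetical project tasks, listed in the order they are run.
tasks <- c("import-raw-data", "clean-data", "fit-models",
           "make-figures", "write-report")
extensions <- c("R", "R", "R", "R", "Rmd")

# Zero-padded prefix + descriptive, space-free name + extension.
file_names <- sprintf("%02d_%s.%s", seq_along(tasks), tasks, extensions)
print(file_names)

# Sorting alphabetically keeps the files in run order.
print(identical(sort(file_names), file_names))
```

Names built this way sort in workflow order in any file browser, contain no spaces, and say what each file does without opening it.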

Question Four

Efficiency and Timing

Make an empty vector of length 10 million. Set a seed for reproducibility. Populate the vector in two ways and compare:

  • Use a for loop to enter a randomly generated Exponential variable with rate parameter 3 in each position in the vector.
  • Use a more efficient method, possibly writing a function, to populate the vector with randomly generated numbers from the same distribution.
  • Try parallelization approaches.
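
The first two approaches could be timed along the lines of the sketch below. The vector length is scaled down to `1e5` so the example runs quickly (the assignment asks for 10 million), and the seed value and object names are arbitrary choices made for illustration.

```r
n    <- 1e5   # scaled down for a quick run; the assignment uses 1e7
rate <- 3

# Method 1: pre-allocate, then fill one position at a time in a for loop.
set.seed(2020)
x_loop <- numeric(n)
time_loop <- system.time(
  for (i in seq_len(n)) {
    x_loop[i] <- rexp(1, rate = rate)
  }
)

# Method 2: a single vectorized call that draws all n values at once.
set.seed(2020)
time_vec <- system.time(
  x_vec <- rexp(n, rate = rate)
)

# Elapsed wall-clock seconds for each method.
print(c(loop       = time_loop[["elapsed"]],
        vectorized = time_vec[["elapsed"]]))

# Sanity check: both sample means should be near 1 / rate.
print(c(mean(x_loop), mean(x_vec)))
```

`system.time()` reports user, system, and elapsed time; elapsed is usually the figure to compare. For very fast expressions a single run can be noisy, so repeating the measurement is worthwhile.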

Time all methods and draw a conclusion.
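
For the parallelization bullet, one option among several is the base `parallel` package, sketched below. The worker count of 2, the scaled-down length, and the even split into chunks are assumptions for the demo. `clusterSetRNGStream()` is used in place of `set.seed()` because each worker needs its own reproducible random-number stream.

```r
library(parallel)

n         <- 1e5   # scaled down; the assignment uses 1e7
rate      <- 3
n_workers <- 2     # assumed worker count for this sketch

# Split the work evenly (assumes n is divisible by n_workers).
chunk_sizes <- rep(n %/% n_workers, n_workers)

cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, iseed = 2020)   # reproducible per-worker streams

time_par <- system.time(
  x_par <- unlist(
    parLapply(cl, chunk_sizes,
              function(m, rate) rexp(m, rate = rate),
              rate = rate)
  )
)

stopCluster(cl)

print(time_par[["elapsed"]])
print(mean(x_par))   # should be near 1 / rate
```

Because a vectorized `rexp()` call is already fast, the cost of starting workers and moving data back can outweigh any gain, so the parallel version is not guaranteed to win; the timings are what settle it.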