Complete Guide to DTSC 650 Final Project Part 1 in R
Date: Dec 12, 2024
title: 'DTSC 650: Data Analytics In R'
subtitle: 'CodeGrade Final Project Part 1'
output: html_notebook
editor_options:
  chunk_output_type: inline

Student Info
Name:
Term:
Date:

General Instructions

Academic Integrity
We expect all work conducted at Eastern University to reflect honest and ethical behavior. The primary rule for ensuring academic integrity is that all submitted work must be original and produced by the individual student.
To uphold academic integrity, it is not permitted to view another student's work prior to submitting your own, nor should you provide your work to someone who has not yet submitted their work.
Collaborating with other students is only allowed when seeking clarification on a topic or assignment instructions. Collaborating on quizzes and exams is not permitted.
See the syllabus for more information.

Name of File
Name your assignment file Final_Project_Part1.qmd. This is a Quarto "markdown" file, which has the extension '.qmd'.

Allowable packages
On this part of the project, the only allowable packages are tidyverse, psych, and lm.beta. You should not use any other packages on this part because CodeGrade is not set up to accept them.
• If the allowable packages are not installed on your local computer, you'll need to do a one-time installation from the Console Window in RStudio for each package like this:
install.packages('<package name>')
Do not attempt to install packages in code that you submit to CodeGrade. CodeGrade will crash if you do so.

Do / Do not
• Do use tidyverse functions for all of the questions where possible.
• Do use comments in your code and written descriptions outside the code chunks to remind yourself of what you've tried or ideas you have.
• Do not use the print() function anywhere in the notebook. If you do this while working on the project, please be sure to comment those lines out before submitting.
• Do not rearrange dataframe outputs unless specified by the question instructions.
• Do not create multiple copies of the BRFSS dataset in your notebook. Creating too many copies of the dataset can cause CodeGrade to crash.

Data Set
• These data come from the Centers for Disease Control and Prevention.
• To answer these questions you will need to use the codebook on Brightspace, called BRFSS_2021 Codebook. For part 2 of the project, please note that not all of the variables listed in the codebook are included in the .csv file to be downloaded from Brightspace.
• Download the brfss2021.csv file from Brightspace and place it in the same folder/directory as your script file. Then in RStudio, set your Working Directory to your Source File location: in the menus choose Session | Set Working Directory | To Source File Location. You will most likely see some warnings after the file loads, because read_csv() tries to guess each column type and, with so many rows, it won't read enough of them to guess accurately.
• You must use the read_csv() function when loading the .csv file. Do not use read.csv().
• Do not rename the .csv file that you download from Brightspace.
• Do not edit the .csv file.

Pipe Notation
You may use the tidyverse pipe %>% or the new base R pipe |>. For a comparison, see here.
You are expected to use pipe notation in all of the CodeGrade assignments. Although there are alternate ways to filter, subset, and summarize data sets, using the pipe creates more readable code and is an important skill to develop.

Rounding requirement
Round all float/dbl values to two decimal places, unless otherwise noted.

Dataframe vs. Tibble
Typically, in CodeGrade assignments, we expect output to be dataframes, not tibbles, unless otherwise noted.

Preliminaries
{r}
### Run this cell. Do not make any changes.
rm(list = ls())
library(tidyverse)
library(psych)
library(lm.beta)
# This will take a few moments to load since the file is so large.
brf <- read_csv("brfss2021.csv", show_col_types = FALSE)

Questions

Q1: We will be analyzing three variables (described below) in part 1 of this project. Identify the names of the variables indicated below using the CodeBook provided on Brightspace. Using the brfss2021.csv data provided on Brightspace, create a dataframe named brf_Q1 with only three columns (in the order listed below). Do not rename the variables. Store the first 10 rows in Q1.
• a variable that measures how often the respondent eats fruit (not including juices).
• a variable that records the length of time since last routine medical checkup.
• a variable that records the general health of the respondent.
We encourage you to explore both the Codebook and the Questionnaire on Brightspace and take note of the values of each of these three variables and familiarize yourself with them before continuing.
Note: Your brf_Q1 dataframe should have the same number of rows as the original brf but now only 3 columns.
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q1 #
### TYPE YOUR CODE BELOW ###
# Keep every row of brf; only the set of columns changes.
brf_Q1 <- brf %>%
  select(FRUIT2, CHECKUP1, GENHLTH)
# Only Q1 holds the first 10 rows, returned as a data frame rather than a tibble.
Q1 <- brf_Q1 %>%
  head(10) %>%
  as.data.frame()
### VIEW OUTPUT ###
Q1

Q2: Clean the dataframe brf_Q1 by removing the respondents who "refused", said "don't know/not sure" and any NAs from both the health variable and the length of time variable. See the CodeBook for details on what the values of the variables mean. Store this cleaned version in a new dataframe named brf_Q2 (we'll use this later). Sort brf_Q2 by the general health variable (from excellent health to poor health) and assign the first 10 rows to Q2.
Hint: The resulting brf_Q2 dataframe is 431,750 x 3.
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q2 #
### TYPE YOUR CODE BELOW ###
# The columns hold numeric codes, not text labels. Codes 7 (don't know/not sure)
# and 9 (refused) are read from the BRFSS codebook; confirm them there.
brf_Q2 <- brf_Q1 %>%
  filter(!(GENHLTH %in% c(7, 9, NA)),
         !(CHECKUP1 %in% c(7, 9, NA))) %>%
  arrange(GENHLTH)  # ascending codes: 1 = excellent through 5 = poor
Q2 <- brf_Q2 %>%
  head(10) %>%
  as.data.frame()
### VIEW OUTPUT ###
Q2

Q3: How many people (and what percentage) reported that, in general, their health is either good or very good? Your answer should be a dataframe with two values: the number and the percentage. Round the percentage to the nearest tenth. Store it as Q3.
The percentage is out of the total number of observations for the brf_Q2 dataset.
Hint: The answer should look like this (note the column names):
Count    Percent
<value>  <value>
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q3 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q3

Q4: Create a dataframe showing the number and the proportion of individuals who said their health is excellent, very good or good for each of the different lengths of time since last checkup. Store as a dataframe named Q4. Round to three decimal places.
The proportion is out of the total number of observations for the brf_Q2 dataset. If your proportion does not match below, double check your Q2 cleaning.
Hint: The 5x3 dataframe should look like this. The [...] is the name of the length of time variable. Be sure to match the column names.
[...]  n        proportion
1      <value>  <value>
2      <value>  <value>
3      <value>  <value>
4      <value>  0.038
8      <value>  <value>
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q4 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q4

Q5a: Now we will clean the variable that measures how often the respondent ate fruit per day, per week, or per month. Create a new dataframe named brf_Q5a that is the same as brf_Q2 except that you'll add a new variable to it named FRTDAY that converts all of the responses into fruits eaten per day. Be sure to account for 0 days. Use 30 days per month, 7 days per week, and 0.02 for less than once a month in your conversion calculations. Be sure to round any conversion calculations to two decimal places. Place the new column as the first column in the dataframe.
The resulting dataframe should still have NAs for FRTDAY at this point.
Do not do anything with respondents who said "don't know/not sure" or refused to respond. We will handle those responses in the next section.
Hint: The resulting dataframe is 431,750 x 4.
Hint: Select values for rows 1, 65, and 117 are shown below.
     FRTDAY  FRUIT2  CHECKUP1  GENHLTH
1    1.00    101     2         1
...
65   NA      NA      2         1
...
117  0.14    201     3         1
{r}

Q5b: Create a new dataframe named brf_Q5 that is the same as brf_Q5a except that you'll update the FRTDAY column by removing the respondents who said "don't know/not sure" or refused to respond and then drop the original fruit variable from the brf_Q5 dataframe (but keep FRTDAY). Sort by GENHLTH. Store the first 10 rows of the dataframe as Q5.
Hint: The resulting dataframe is 422,747 x 3. Be sure the variables are in this order (left to right): FRTDAY, the length of time variable, then the general health variable.
Hint: Select values for rows 1, 6, 10 are shown below.
    FRTDAY   CHECKUP1  GENHLTH
1   <value>  2         <value>
...
6   0.02     4         <value>
...
10  <value>  <value>   1
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q5 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q5

Q6: Using the brf_Q5 dataframe, create a 5 x 5 dataframe with the mean, median, standard deviation, and count of FRTDAY split by the general health of the respondent. Store the result as Q6.
Hint: The resulting 5 x 5 dataframe should look like this. The [...] is the name of the health variable. Be sure to match the column names.
[...]  Mean     Median   SD       Count
1      <value>  <value>  5.20     <value>
2      <value>  <value>  <value>  <value>
3      <value>  <value>  <value>  <value>
4      <value>  <value>  <value>  <value>
5      <value>  <value>  <value>  <value>
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q6 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q6
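
As a rough illustration of the per-day conversion described in Q5a and the grouped summary described in Q6, here is a minimal sketch on a made-up data frame. The FRUIT2 code ranges used here (1xx for times per day, 2xx for times per week, 300 for less than once a month, 3xx for times per month, 555 for never) are assumptions to check against the codebook. The names `toy`, `toy_day`, and `toy_summary` are hypothetical, and none of the values come from the BRFSS file.

```r
library(dplyr)

# Toy rows only; these are not BRFSS responses.
toy <- data.frame(
  FRUIT2  = c(101, 203, 300, 315, 555, 777, NA),
  GENHLTH = c(1, 1, 2, 2, 3, 3, 1)
)

# Convert each response code to "times per day".
# The code ranges are assumptions; verify them in the codebook.
toy_day <- toy %>%
  mutate(
    FRTDAY = case_when(
      FRUIT2 >= 101 & FRUIT2 <= 199 ~ FRUIT2 - 100,         # times per day
      FRUIT2 >= 201 & FRUIT2 <= 299 ~ (FRUIT2 - 200) / 7,   # times per week
      FRUIT2 == 300                 ~ 0.02,                 # less than once a month
      FRUIT2 >= 301 & FRUIT2 <= 399 ~ (FRUIT2 - 300) / 30,  # times per month
      FRUIT2 == 555                 ~ 0,                    # never
      TRUE                          ~ NA_real_              # other codes and NA stay NA
    ),
    FRTDAY = round(FRTDAY, 2)
  ) %>%
  relocate(FRTDAY)  # move the new column to the front

# Summary statistics per health level, ignoring NA values of FRTDAY.
# n() counts every row in the group, including rows where FRTDAY is NA.
toy_summary <- toy_day %>%
  group_by(GENHLTH) %>%
  summarise(
    Mean   = round(mean(FRTDAY, na.rm = TRUE), 2),
    Median = round(median(FRTDAY, na.rm = TRUE), 2),
    SD     = round(sd(FRTDAY, na.rm = TRUE), 2),
    Count  = n()
  ) %>%
  as.data.frame()

toy_day
toy_summary
```

Whether the count should include rows with a missing FRTDAY is a design choice this sketch does not settle; swapping `n()` for `sum(!is.na(FRTDAY))` would count only the non-missing values.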
Q7: After a visual analysis of the FRTDAY variable in the brf_Q5 dataframe, an analyst recommends removing all values greater than 8 as "outliers". This isn't following a specific outlier rule, but a decision based on the graphical display and the number of observations lost by using this cut-off. The graded question is part (c) below, but we have included parts (a) and (b) to illustrate some of the analyst's work.

a) Create a boxplot of the FRTDAY variable (not scored by CG).
{r}

b) Count how many values will be lost if all values of FRTDAY that are greater than 8 are removed. What proportion of observations will be lost by this removal? (not scored by CG) The resulting boxplot will still have outliers, but not as extreme.
{r}

c) Create a new dataframe named brf_Q7c based on brf_Q5 that follows the analyst's recommendation.
Hint 1: The resulting dataframe is 420,720 x 3.
Hint 2: Check that NAs still exist in the FRTDAY variable.
{r}

d) Finally, create a new dataframe named brf_Q7 from brf_Q7c that imputes the median number of fruits per day to the "not asked or missing" values. With the new brf_Q7 dataframe, use describe() from the psych package to output statistics for each variable. Store as Q7.
The summary output is a 3 x 13 dataframe. The first few columns look like this:
        vars  n       mean     sd       ...  ...
FRTDAY  1     420720  1.04     <value>  ...  ...
<name>  2     420720  <value>  <value>  ...  ...
<name>  3     420720  <value>  <value>  ...  ...
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q7 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q7

Q8: Using the brf_Q7 dataframe, create a regression predicting general health based on both number of fruits consumed per day and the length of time since last checkup (in that order). Analyze the summary of the model. How would you assess this model? Store the summary of the standardized regression coefficients in Q8.
Hint: The resulting summary output should look like this:
Call:
[...]

Residuals:
    Min       1Q   Median       3Q      Max
<value>  <value>  <value>  <value>  <value>

Coefficients:
              Estimate  Standardized  Std. Error  t value  Pr(>|t|)
(Intercept)   2.672440  <value>       <value>     <value>  <value>
FRTDAY       -0.107117  <value>       <value>     <value>  <value>
[...]2        <value>   <value>       <value>     <value>  <value>
[...]3        <value>   <value>       <value>     <value>  <value>
[...]4        <value>   <value>       <value>     <value>  <value>
[...]8        <value>   <value>       <value>     <value>  <value>
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: <value> on <value> degrees of freedom
Multiple R-squared: <value>, Adjusted R-squared: <value>
F-statistic: <value> on <value> and <value> DF, p-value: < <value>
{r}
### Do not edit the following line. It is used by CodeGrade.
# CG Q8 #
### TYPE YOUR CODE BELOW ###
### VIEW OUTPUT ###
Q8

Q9: Create a new dataframe named brf_Q9 based on brf_Q7 with two new columns with binary data. For any health level of very good or excellent, set the value to 1; otherwise, 0. Call this binHealth. For any person attending a checkup within the past year, set the value to 1; otherwise, 0. Call this binCheckup. With the updated dataframe, perform a logistic regression to predict the likelihood of an individual's general health being very good or excellent based on FRTDAY and binCheckup. Store the summary of the regression coefficients in Q9.
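
The binary recoding and logistic regression described in Q9 can be sketched on simulated data as follows. This is only an illustration under stated assumptions: GENHLTH codes 1 and 2 are taken to mean excellent and very good, CHECKUP1 code 1 is taken to mean a checkup within the past year, and `toy`, `toy_bin`, `toy_fit`, and `toy_coefs` are hypothetical names. Reading "summary of the regression coefficients" as the coefficient table of `summary()` is one interpretation, not a confirmed requirement.

```r
library(dplyr)

set.seed(1)  # simulated toy data; not BRFSS values
toy <- data.frame(
  FRTDAY   = round(runif(200, 0, 4), 2),
  CHECKUP1 = sample(c(1, 2, 3, 4, 8), 200, replace = TRUE),
  GENHLTH  = sample(1:5, 200, replace = TRUE)
)

# Recode to 0/1 indicators. The code meanings are assumptions to
# confirm in the codebook.
toy_bin <- toy %>%
  mutate(
    binHealth  = if_else(GENHLTH %in% c(1, 2), 1, 0),  # assumed: 1 = excellent, 2 = very good
    binCheckup = if_else(CHECKUP1 == 1, 1, 0)          # assumed: 1 = within the past year
  )

# Logistic regression: family = binomial fits a logit link by default.
toy_fit <- glm(binHealth ~ FRTDAY + binCheckup,
               data = toy_bin, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value.
toy_coefs <- summary(toy_fit)$coefficients
round(toy_coefs, 3)
```

The estimates are on the log-odds scale; `exp(coef(toy_fit))` would convert them to odds ratios if that reading is more useful.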