STAT 301 Week 6 LO

1. List model metrics that are suitable for evaluating a statistical model developed to make inferences about the data-generating mechanism (e.g., R2, AIC, F-tests, likelihood ratio test) and how they are calculated.

The glance() function provides model metrics (e.g., R2, adjusted R2, AIC, the F-statistic). For a linear regression with an intercept, estimated by least squares:
- The R2 (coefficient of determination) measures the part of the variation in the response explained by the estimated model.
- The adjusted R2 can be used to compare the fit of estimated models of different sizes.

Metric – Formula – What we want:
- R2 (does not provide a sense of how good the model is at predicting out-of-sample cases): R2 = 1 - RSS/TSS. Want: between 0 and 1; a negative value can occur when the regression has no intercept.
- Adjusted R2: 1 - (1 - R2)(n - 1)/(n - p - 1), where p is the number of independent variables. Want: higher is better; the (n - p - 1) term penalizes the fit so RSS cannot be improved simply by including more variables in the model.
- Residual: y_i - ŷ_i, where y_i is the actual value, ŷ_i the predicted value, and ȳ the average of the response. Want: low ideally (predicted is close to actual).
- MSE (mean squared error): (1/n) Σ (y_i - ŷ_i)².
- ESS (explained sum of squares): Σ (ŷ_i - ȳ)².
- RSS (residual sum of squares): Σ (y_i - ŷ_i)².
- TSS (total sum of squares) = ESS + RSS: Σ (y_i - ȳ)². Want: TSS (residuals from the null model) much larger than RSS (residuals from the fitted model) indicates a good fit.
- AIC: -2 ln(L̂) + 2k, where L̂ is the maximized value of the likelihood and k is the number of estimated parameters. Want: low ideally; it evaluates how well a model fits the data it was generated from.
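A minimal R sketch of how these metrics can be computed, assuming a data frame dat with a response y and two predictors x1 and x2 (these object names are illustrative, not from the worksheet):

R code (sketch):
library(broom)
full_model <- lm(y ~ x1 + x2, data = dat)
glance(full_model)                        # r.squared, adj.r.squared, AIC, F statistic, p.value, logLik, ...
# the same quantities by hand:
y_hat <- fitted(full_model)
RSS <- sum((dat$y - y_hat)^2)             # residual sum of squares
TSS <- sum((dat$y - mean(dat$y))^2)       # total sum of squares
ESS <- TSS - RSS                          # explained sum of squares
R2  <- 1 - RSS / TSS
n <- nrow(dat); p <- 2                    # p = number of predictors in this sketch
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
MSE <- RSS / n
aic <- AIC(full_model)                    # -2*logLik + 2k (k counts the coefficients plus sigma)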
F-tests help compare whether your regression model provides a better fit than a model that contains no independent variables. The difference between the reduced and the full model is the term with β1: we are testing whether β1 is different from zero, i.e., H0: β1 = 0.
- The null hypothesis states that the model with no independent variables fits the data as well as your model.
- The alternative hypothesis says that your model fits the data better than the intercept-only model.
If the p-value is less than the significance level, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables.
R code (the F-ratio is the 'statistic' in glance()):
reduced_model <- lm(y ~ 1, data = …)
full_model <- lm(y ~ x + …, data = …)
anova(reduced_model, full_model)
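A short hedged follow-up, assuming the broom package and the full_model object above: the same overall F-test can be read directly from glance() without running anova() by hand.

R code (sketch):
library(broom)
glance(full_model)$statistic   # overall F-ratio of the fitted model
glance(full_model)$p.value     # p-value of the test against the intercept-only model
# anova(reduced_model, full_model) reports the same F-ratio and p-value when the reduced model is intercept-only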
Likelihood ratio test
H0: The full model and the nested model fit the data equally well; thus, you should use the nested model.
HA: The full model fits the data significantly better than the nested model; thus, you should use the full model.
If we reject the null hypothesis (for a large LR), we can conclude that the full model offers a better fit.
LR = -2(ℓ_Null - ℓ_Proposed), where ℓ_Null is the log-likelihood of the null model and ℓ_Proposed is the log-likelihood of the proposed model.

2. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.
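A minimal sketch of the LR statistic in base R, assuming the reduced_model and full_model objects defined above (the chi-squared reference distribution is the usual large-sample approximation):

R code (sketch):
ll_null <- logLik(reduced_model)                     # log-likelihood of the null model
ll_full <- logLik(full_model)                        # log-likelihood of the proposed model
LR <- -2 * (as.numeric(ll_null) - as.numeric(ll_full))
df <- attr(ll_full, "df") - attr(ll_null, "df")      # difference in number of estimated parameters
p_value <- pchisq(LR, df = df, lower.tail = FALSE)
LR
p_value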
In add_mrna_results, the null hypothesis is that the coefficient for mrna is zero; this indicates whether mrna has a significant impact on the response variable when controlling for the gene variable. In Ftest_3genes_add_mrna, the null hypothesis is that adding mrna to the model does not improve the fit compared to using just gene. In both cases, the p-value suggests that mrna does not significantly contribute to predicting prot. Therefore, we do not have enough evidence to conclude that mrna is a significant predictor in this case.
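A hedged sketch of how those two objects could be produced; the data frame name and exact worksheet code are assumptions, only the variables prot, gene, and mrna come from the notes.

R code (sketch):
# hypothetical data frame prot_data with columns prot, gene, mrna
lm_3genes <- lm(prot ~ gene, data = prot_data)                     # model with gene only
lm_3genes_add_mrna <- lm(prot ~ gene + mrna, data = prot_data)     # model adding mrna
add_mrna_results <- broom::tidy(lm_3genes_add_mrna)                # t-test on the mrna coefficient
Ftest_3genes_add_mrna <- anova(lm_3genes, lm_3genes_add_mrna)      # F-test comparing the nested models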
Post-inference problem: the training set is used to select the model, so it cannot be used again to assess the final significance of the model; usually our sample is too small to divide into training and testing sets.

3. Explain the algorithms for the following variable selection methods: (1) test to compare nested models, (2) forward selection, and (3) backward selection.

(1) Test to compare nested models = F-test = anova() between both models. These F-tests can be used to select variables since we are comparing and testing how the fit changes as we add predictors/variables.
(2) Forward selection: start from the intercept-only model and, at each step, add the input variable that most improves the fit (a sketch of one step follows below).
(3) Backward selection: start from the full model and, at each step, remove the input variable that contributes least to the fit.
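A minimal sketch of one forward-selection step using F-tests in base R; the data frame dat and predictors x1, x2, x3 are illustrative, and regsubsets() below automates this kind of search.

R code (sketch):
m0 <- lm(y ~ 1, data = dat)                      # start from the intercept-only model
add1(m0, scope = ~ x1 + x2 + x3, test = "F")     # F-test for adding each candidate predictor
# add the predictor with the best improvement (largest F / smallest p-value), refit, and repeat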
Both selection algorithms are implemented in R by the function regsubsets():
- The argument x of regsubsets() is analogous to formula in lm().
- The argument nvmax indicates the maximum number of variables to be used in the variable selection.
This function identifies the subset of input variables that provides the best model for each model size, and then the best among those is selected.
R code:
selection <- regsubsets(y ~ ., nvmax = …, data = …, method = "forward")   # or method = "backward"
selection_summary <- summary(selection)
selection_summary_df <- tibble(
  n_input_variables = 1:nvmax,
  RSQ = selection_summary$rsq,
  RSS = selection_summary$rss,
  ADJ.R2 = selection_summary$adjr2
)
cp_min <- which.min(selection_summary$cp)             # model size with the smallest Mallows' Cp
selected_var <- names(coef(selection, cp_min))[-1]    # names of the selected input variables (drop the intercept)
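A hedged follow-up using the selection and selection_summary objects above: the best model size can also be picked by maximizing adjusted R2 instead of minimizing Cp.

R code (sketch):
adjr2_max <- which.max(selection_summary$adjr2)              # size with the largest adjusted R2
selected_var_adjr2 <- names(coef(selection, adjr2_max))[-1]  # selected variables for that size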