University of Waterloo
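STAT 371 Midterm – S24 Solutions ( /40)

Instructions and information
1. Do not remove this cover page.
2. This test is out of 40 total marks. The marks for each question are indicated.
3. No student questions are permitted.
4. Answer the questions in the spaces provided.
5. The last page is for rough work and will not be marked.
6. When consulting the t and F tables, always use the closest available degrees of freedom.
7. SHOW YOUR WORK. Your grade will be influenced by how clearly you express your ideas, and how well you organize your solutions.

Some relevant expressions and relationships:

SLR:
$r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\hat{\beta}_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$\hat{\beta}_1 \sim N\left(\beta_1,\ \sigma^2 / \sum (x_i - \bar{x})^2\right)$

Multiple Regression:
$Y = X\beta + \varepsilon,\quad \varepsilon \sim N(0, \sigma^2 I)$
$\hat{\beta} = (X^T X)^{-1} X^T y$
$\hat{\beta} \sim N\left(\beta,\ \sigma^2 (X^T X)^{-1}\right)$
$\hat{\sigma}^2 = \dfrac{\sum e_i^2}{n - (p+1)}$
$\hat{\mu}_{new} \pm t_{n-(p+1),\,1-\alpha/2}\ \hat{\sigma} \sqrt{x_{new}^T (X^T X)^{-1} x_{new}}$
$\hat{y}_{new} \pm t_{n-(p+1),\,1-\alpha/2}\ \hat{\sigma} \sqrt{1 + x_{new}^T (X^T X)^{-1} x_{new}}$
$F = \dfrac{SS(Reg)/p}{SS(Res)/(n-(p+1))} = \dfrac{MS(Reg)}{MS(Res)}$
$F = \dfrac{(SS(Res)_{red} - SS(Res)_{full}) / (df_{red} - df_{full})}{SS(Res)_{full} / df_{full}}$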
1) The linear model $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$ is fit to a random sample of insured US adults to investigate the relationship between medical charges ($) billed by the individual's health insurance and the following explanatory variates:
• age: age of individual (years)
• gender: gender of individual (one of female or male for this dataset)
• bmi: body mass index (kg/m²)
• smoker: smoker status (yes/no)
• region: residential area in the US: northeast, southeast, southwest, northwest.

Regression output is given below.

    Call:
    lm(formula = charges ~ age + gender + bmi + smoker + region)

    Residuals:
         Min       1Q   Median       3Q      Max
    -11175.1  -3254.7   -481.6   2156.5  21392.2

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)     -10101.31    2864.57  -3.526 0.000603
    age                296.47      *****   7.243 5.04e-11
    gender_male       -333.10    1055.85  -0.315 0.752956
    bmi                282.70      83.65   3.379 0.000988
    smokeryes        22678.26    1195.76  18.966  < 2e-16
    regionnorthwest  -2879.13    1500.10  ****** ********
    regionsoutheast  -3853.71    1485.72  -2.594 0.010703
    regionsouthwest    197.65    1572.09   0.126 0.900167
    ---
    Residual standard error: 5797 on 117 degrees of freedom
    Multiple R-squared: 0.7902, Adjusted R-squared: 0.7777
    F-statistic: 62.96 on *** and *** DF, p-value: < 2.2e-16

a) Give the matrix dimensions (in numerical values) of the following:

i) [2] $(X^T X)^{-1} X^T$
$(p+1) \times n = 8 \times 125$

ii) [2] $\varepsilon^T \varepsilon$
$1 \times 1$

iii) [2] $H^T H y$, where $H$ is the hat matrix defined in lectures.
$125 \times 1$
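The dimensions in part a) can be checked by tracking shapes through each product. The following is a minimal sketch, not part of the original solutions; the helper names `matmul_shape` and `transpose_shape` are hypothetical, and n = 125 is taken as the 117 residual degrees of freedom plus the 8 estimated parameters.

```python
def matmul_shape(a, b):
    """Shape of the product of an a[0] x a[1] matrix and a b[0] x b[1] matrix."""
    if a[1] != b[0]:
        raise ValueError(f"incompatible shapes {a} and {b}")
    return (a[0], b[1])


def transpose_shape(a):
    """Shape of the transpose."""
    return (a[1], a[0])


k = 8              # p + 1: intercept plus 7 slope parameters
n = 117 + k        # residual df = n - (p + 1), so n = 125

X = (n, k)
XtX = matmul_shape(transpose_shape(X), X)   # (8, 8); its inverse has the same shape

# i) (X^T X)^{-1} X^T
part_i = matmul_shape(XtX, transpose_shape(X))

# ii) eps^T eps, with eps an n x 1 vector
eps = (n, 1)
part_ii = matmul_shape(transpose_shape(eps), eps)

# iii) H^T H y, with H = X (X^T X)^{-1} X^T and y an n x 1 vector
H = matmul_shape(matmul_shape(X, XtX), transpose_shape(X))
y = (n, 1)
part_iii = matmul_shape(matmul_shape(transpose_shape(H), H), y)

print(part_i, part_ii, part_iii)
```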
b) [3] Interpret the value of $\hat{\beta}_3$ (the bmi parameter estimate) in the context of the study.

After accounting for the other variates, each 1 kg/m² increase in bmi is associated with an estimated mean increase of $282.70 in medical charges.

c) [3] From the given p-value and parameter estimate, state the conclusion associated with the southeast region in the context of the study.

After accounting for the other variates, mean medical charges in the southeast are significantly lower than mean medical charges in the northeast.

d) [3] Give a 99% confidence interval for $\beta_1$. You do not have to interpret the interval.

The interval is $\hat{\beta}_1 \pm t_{117,\,.995}\, SE(\hat{\beta}_1)$. The standard error is recovered from the t value:
$t = \dfrac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \rightarrow SE(\hat{\beta}_1) = \dfrac{\hat{\beta}_1}{t} = \dfrac{296.47}{7.243} = 40.93$
$296.47 \pm 2.617(40.93) = 296.47 \pm 107.11 = (189.36,\ 403.58)$

e) [3] From the t-table, give a range of values for the p-value associated with $H_0: \beta_5 = 0$ (the northwest region parameter). Show your work.

$t = \dfrac{\hat{\beta}_5}{SE(\hat{\beta}_5)} = \dfrac{-2879.13}{1500.10} = -1.92$
p-value $= 2P(t_{120} \le -1.92)$
From the t table: $P(t_{120} \le -1.98) = .025$ and $P(t_{120} \le -1.658) = .05$
.05 < p-value < .10
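The arithmetic in parts d) and e) can be reproduced with a short script. This is a sketch for checking the numbers, not part of the original solutions. The critical values 2.617, 1.658 and 1.980 are assumed to be the t-table entries for 120 degrees of freedom (the closest tabulated value to 117), and intermediate results are rounded to two decimals as in the worked solution.

```python
# Part d): recover SE(beta1_hat) from the printed estimate and t value.
beta1_hat = 296.47
t_value_age = 7.243
se_beta1 = round(beta1_hat / t_value_age, 2)

# Assumed t-table value: 0.995 quantile with 120 df.
t_crit_995 = 2.617
margin = round(t_crit_995 * se_beta1, 2)
ci_99 = (round(beta1_hat - margin, 2), round(beta1_hat + margin, 2))

# Part e): t statistic for the northwest parameter.
beta5_hat = -2879.13
se_beta5 = 1500.10
t_northwest = round(beta5_hat / se_beta5, 2)

# Assumed t-table entries for 120 df: (critical value, upper-tail probability).
crit_lo, tail_lo = 1.658, 0.05
crit_hi, tail_hi = 1.980, 0.025

# A two-sided p-value doubles the one-tail probability, so bracketing |t|
# between two tabulated critical values brackets the p-value as well.
p_range = None
if crit_lo < abs(t_northwest) < crit_hi:
    p_range = (2 * tail_hi, 2 * tail_lo)

print(se_beta1, margin, ci_99)
print(t_northwest, p_range)
```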
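f) [4] A 95% confidence interval for the mean medical charges of a 50-year-old female smoker from the northeast region and with a bmi = 35 has an upper limit of $40567. Calculate $SE(\hat{\mu}_{new})$ to the nearest dollar. (You will first need to calculate $\hat{\mu}_{new}$.)

Female and northeast are the baseline categories, so only the intercept, age, bmi and smoker terms contribute:
$\hat{\mu}_{new} = \hat{\beta}_0 + \hat{\beta}_1(50) + \hat{\beta}_3(35) + \hat{\beta}_4 = -10101.31 + 296.47(50) + 282.7(35) + 22678.26 = 37295$
$\hat{\mu}_{new} + t_{117,\,.975}\, SE(\hat{\mu}_{new}) = 40567 \rightarrow t_{117,\,.975}\, SE(\hat{\mu}_{new}) = 40567 - 37295 = 3272$
$SE(\hat{\mu}_{new}) = \dfrac{3272}{t_{117,\,.975}} = \dfrac{3272}{1.98} = \$1653$

g) Consider the test summarized in the last line of the output (underlined).

i) [2] Give the null hypothesis associated with this test in the form $H_0: A\beta = 0$.

$H_0: \beta_1 = \beta_2 = \cdots = \beta_7 = 0 \;\rightarrow\; H_0: A\beta = 0$, where
$A = \begin{pmatrix} 0&1&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0 \\ 0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&1&0 \\ 0&0&0&0&0&0&0&1 \end{pmatrix}, \quad \beta = (\beta_0, \beta_1, \ldots, \beta_7)^T, \quad 0 = (0, \ldots, 0)^T \ (7 \times 1)$

ii) [4] For the reduced model given by $y_i = \beta_0 + \varepsilon_i$, $i = 1, 2, \ldots, n$, derive the least squares estimate of $\beta_0$. Show your work.

We wish to minimize the sum of squares of the errors, given by
$S(\beta_0) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0)^2$
Taking the derivative with respect to $\beta_0$ and setting it to zero:
$\dfrac{dS}{d\beta_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0) = 0$
$\sum_{i=1}^{n} y_i - n\hat{\beta}_0 = 0$
$\hat{\beta}_0 = \dfrac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$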
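The calculation in part f) can be verified numerically. The sketch below is illustrative and not part of the original solutions; it takes the coefficient estimates from the regression output and assumes 1.98 is the t-table value for the 0.975 quantile with 120 degrees of freedom (closest to 117).

```python
# Coefficient estimates from the regression output.
b_intercept = -10101.31
b_age = 296.47
b_bmi = 282.70
b_smoker = 22678.26

# Female and northeast are baseline levels, so their indicators are all zero.
age, bmi = 50, 35
mu_new = b_intercept + b_age * age + b_bmi * bmi + b_smoker
mu_new_rounded = round(mu_new)

upper_limit = 40567
t_crit_975 = 1.98   # assumed t-table value (120 df, 0.975 quantile)

# upper limit = mu_new + t * SE, so SE = (upper limit - mu_new) / t
se_mu_new = (upper_limit - mu_new_rounded) / t_crit_975

print(mu_new_rounded, round(se_mu_new))
```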
h) Consider the anova output below.

    > anova(ins.red.lm, ins.lm)
    Model 1: charges ~ age + gender + smoker + region
    Model 2: charges ~ age + gender + smoker + region + bmi
      Res.Df        RSS  Df Sum of Sq     F Pr(>F)
    1   **** 4315517878
    2   **** 3931739391 *** ********* ***** *****

i) [4] Give the value of the F statistic and p-value. Show your work.

$F = \dfrac{(SS(Res)_{red} - SS(Res)_{full}) / (df_{red} - df_{full})}{SS(Res)_{full} / df_{full}} = \dfrac{(4315517878 - 3931739391)/1}{3931739391/117} = 11.42$
From the F table: $P(F_{1,117} \ge 3.92) \approx P(F_{1,120} \ge 3.92) = .05 \rightarrow P(F_{1,117} \ge 11.42) < .05$
(From the test of $H_0: \beta_3 = 0$ in the summary output above: p-value = 0.000988.)

ii) [2] State the conclusion in the context of the study.

After accounting for age, gender, smoking status, and region, body mass index is significantly (positively) related to medical charges.

i) Suppose the region variate was remodelled with the indicator variates:
$x_5$ = {1 if northeast; 0 otherwise}; $x_6$ = {1 if northwest; 0 otherwise}; $x_7$ = {1 if southeast; 0 otherwise}

i) [2] Give the value of $\hat{\beta}_7$.

Modelled in this way, southwest is the baseline region, so $\hat{\beta}_7$ represents the estimated mean difference in medical charges between the southeast and southwest regions, after accounting for the other variates. In the original model, this value is represented by
$\hat{\beta}_6 - \hat{\beta}_7 = -3853.71 - 197.65 = -4051.36$

ii) [1] Give the p-value associated with $H_0: \beta_5 = 0$.

This hypothesis tests whether there is a difference in mean medical charges between the northeast and southwest regions. From the original model, we see that the p-value for this test is .900167.

2) [3] Show (i.e. prove) that for the multiple regression model given by $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$, the least squares estimator of $\beta$ is unbiased (you do not have to derive the form of the estimator).

$Y = X\beta + \varepsilon,\ \varepsilon \sim N(0, \sigma^2 I) \rightarrow E(Y) = E(X\beta + \varepsilon) = X\beta + E(\varepsilon) = X\beta$
$E(\hat{\beta}) = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T X \beta = \beta$
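The F statistic in part h) and the coefficient difference in the remodelled-region part can be checked with a few lines of arithmetic. This is a sketch for verification only, not part of the original solutions. The residual degrees of freedom of the reduced model are assumed to be 118, since it drops the single bmi parameter from the full model with 117.

```python
# Residual sums of squares from the anova output.
rss_reduced = 4315517878
rss_full = 3931739391

df_full = 117
df_reduced = df_full + 1   # assumption: reduced model omits one parameter (bmi)

f_stat = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)

# With one restriction, F equals the square of the t value for bmi,
# up to rounding in the printed output.
t_bmi = 3.379
t_squared = t_bmi ** 2

# Southeast minus southwest, using the original model's estimates.
b_southeast = -3853.71
b_southwest = 197.65
beta7_remodelled = round(b_southeast - b_southwest, 2)

print(round(f_stat, 2), round(t_squared, 2), beta7_remodelled)
```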