BUS 41100: Applied Regression Analysis
Instructor: Panagiotis (Panos) Toulis
Email: panos.toulis@chicagobooth.edu

The R companion

R (https://www.r-project.org/about.html) has become the de facto programming language for statistical analysis. Its main advantages include a wide variety of libraries and seamless integration with visualization tools. It is also supported by the wide and vibrant Crantastic community. Many researchers publish their methodologies written in R, which ensures that the language remains at the forefront of modern scientific research.

This short document is meant to be a (very) quick and practical introduction to R. It is loosely based on the R Cookbook by Paul Teetor, but it has significant additions tailored to BUS 41100 at Booth. Another great resource is the free e-book "R for Data Science"; see https://r4ds.had.co.nz/.
1  Getting help

Google is your friend. When in trouble you can copy-paste the error message and search it on Google. Chances are the answer will be in the first search result. For example,

> x = rnorm(10, mean=k)
Error in rnorm(10, mean = k) : object 'k' not found

The error here is pretty obvious: we have not defined the variable k. But a Google search for "r object k not found" brings up a page on stackoverflow with the answer.

Because functions are important, here are two useful commands:

> ?rnorm

brings up a page with information on rnorm, which samples normal random variables. In this documentation, Arguments has information about the input arguments of the function, and Value has information about its output. The example command is also useful to get examples of how to use a function:

> example(mean)
mean> x = c(0:10, 50)
mean> xm = mean(x)
mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
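For completeness, here is a tiny sketch (not in the original notes) showing that defining k before the call removes the error; the value 3 is just an illustrative choice:

> k = 3                    # any numeric value works
> x = rnorm(10, mean=k)    # now valid: 10 draws from a normal with mean k
> head(x)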
Table 1: Basic commands (run one after the other for practice)

Command    Description                             Example
c()        Create vector                           > x = c(1, 2, 3, 4)
                                                   > y = c(-1, 0, 1, 2)
x[i]       Access i'th element                     > x[1] + x[2]
                                                   3
x[i:j]     Access i'th to j'th element             > x[1:3]
                                                   1 2 3
head       Get first elements of vector            > head(x, 3)
                                                   1 2 3
tail       Get last elements of vector             > tail(x, 3)
                                                   2 3 4
rev        Reverse vector                          > rev(x)
                                                   4 3 2 1
sum        Sum of all elements                     > sum(x)
                                                   10
prod       Product of all elements                 > prod(x)
                                                   24
x + y      Sum of vectors                          > x + y
                                                   0 2 4 6
x * y      Product of vectors                      > x * y
                                                   -1 0 3 8
x^k        Raise elements to k'th power            > x^2
                                                   1 4 9 16
sqrt       Square root of values                   > sqrt(x)
                                                   1.00 1.41 1.73 2.00
mean       Average value of vector                 > mean(x)
                                                   2.5
median     Median value of vector                  > median(x)
                                                   2.5
min        Minimum value of vector                 > min(x)
                                                   1
max        Maximum value of vector                 > max(x)
                                                   4
sd         Standard deviation of data in vector    > sd(x)
                                                   1.291
var        Variance of data in vector              > var(x)
                                                   1.67
read.csv   Load dataset                            > x = read.csv("pickup.csv")
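A quick sketch (not in the original) that combines a few of the commands in Table 1 as a consistency check on the same x:

> x = c(1, 2, 3, 4)
> sum(x) / length(x)    # 2.5, agrees with mean(x)
> sqrt(var(x))          # 1.291, agrees with sd(x)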
2  Data frames

The data frame is the most important data structure in R. It is like a spreadsheet with rows and columns. An example data frame that we will use in this document for illustration is iris, which exists by default in your R installation:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The iris data are about certain species of the flower iris, and this is one of the most popular datasets in statistics and machine learning, especially for classification (see https://en.wikipedia.org/wiki/Iris_flower_data_set). Use ?iris for more.

To get the names of the columns of a data frame:

> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
[5] "Species"

To access a column of a data frame use $ and the column name:

> head(iris$Sepal.Length)
[1] 5.1 4.9 4.7 4.6 5.0 5.4

This is now a vector, and you can use the basic commands in Table 1 to do more with it. You could access the same vector by using bracket notation. To access the first column (Sepal.Length):

> head(iris[, 1])
[1] 5.1 4.9 4.7 4.6 5.0 5.4

To access the first two columns:

> head(iris[, 1:2])
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

To access the first row:

> head(iris[1, ])
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

To access the first two rows:

> head(iris[1:2, ])
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

We can treat all these subsets as data frames. For example, to get the first column (Sepal.Length) of the first two rows:

> x = iris[1:2, ]
> x$Sepal.Length
[1] 5.1 4.9

2.1  The subset function

There are two important tasks to do with a data frame. First, we often need to get a subset of the data. To get the data corresponding to the species setosa:

> x = subset(iris, Species=="setosa")
> head(x)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We observe that the new dataset x contains all data only for species setosa.
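A quick check (not in the original) that the subset worked as intended:

> table(x$Species)   # only setosa should have a nonzero count
> nrow(x)            # 50, since iris contains 50 flowers of each species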
To get the data that are of species setosa and have sepal length more than 5.5:

> x = subset(iris, Species=="setosa" & Sepal.Length > 5.5)
> x
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
19          5.7         3.8          1.7         0.3  setosa

We used the logical operator '&', which you can read as "and".

You can use subset to select specific columns. For example, to select the first two columns:

> x = subset(iris, select=c(Sepal.Length, Sepal.Width))
> head(x)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

This gets the same result as the command iris[, 1:2] we saw earlier.

To get a new data frame with one column removed we can use the '-' sign:

> x = subset(iris, select=-Species)
> head(x)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

We see that the last column (Species) was removed. To remove more columns we can use the c() command; e.g., subset(iris, select=-c(Species, Petal.Width)) removes both specified columns.

2.2  The apply function

The other important task with data frames is to transform their data. Most commonly we want to apply a function on each column, or on each row. To illustrate, let's first remove the Species column since it is non-numeric:

> x = subset(iris, select=-Species)
Now let's calculate the sum of each column separately:

> fn = function(b) { sum(b) }
> apply(x, 2, fn)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       876.5        458.6        563.7        179.9

This is probably the toughest part of the companion, but it is also very useful, so stick with me. There are two important things happening here:

• We defined fn, which is a function, i.e., it expects some input and returns an output. Here, the input is assumed to be some vector b and the output is the sum of all elements of the vector. You should check that, e.g.,

  > fn(c(1, 2, 3))
  [1] 6

• Next we wrote apply(x, 2, fn). What this does is take the data frame x (which is iris without Species), take every column of the frame, and apply the function fn. Thus, apply(x, 2, fn) = (fn(column1), fn(column2), ...), as a vector.

To gain more experience with this, suppose we want to sum the squares of the elements for each column of the data frame. What we have to do is:

> fn = function(b) { sum(b^2) }
> apply(x, 2, fn)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
     5223.85      1430.40      2582.71       302.33

You can always check your work through quick tests:

> sum(x$Sepal.Length^2)
[1] 5223.85

This number agrees with the number from using apply.

To sum the elements of the data frame across rows is a minor change (here we first switch fn back to plain sums so that the output below matches):

> fn = function(b) { sum(b) }
> s = apply(x, 1, fn)
> head(s)
   1    2    3    4    5    6
10.2  9.5  9.4  9.4 10.2 11.4
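As a quick follow-up (a small sketch, not in the original notes), the same column-wise idea works with any function, and R has built-in shortcuts for the common cases:

> apply(x, 2, mean)   # average of each column of x (iris without Species)
> colMeans(x)         # built-in equivalent; returns the same numbers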
3  Data wrangling with tidyverse

The tidyverse package in R is a collection of tools that aims to streamline the manipulation of data frames. The key concept is the "pipe" operator %>%, which allows a sequence of data manipulations (similar to pipes in Unix/Linux).

Some key commands here are filter, group_by, summarise, arrange and mutate. As mentioned before, these commands can be connected together through pipes, which is where the tidyverse shines.

Suppose for example we want to calculate the average Petal.Width for each species of iris:

> library(tidyverse)
> iris %>% group_by(Species) %>% summarize(avg_petal=mean(Petal.Width))
# A tibble: 3 x 2
  Species    avg_petal
  <fct>          <dbl>
1 setosa         0.246
2 versicolor      1.33
3 virginica       2.03

Note that you can almost read the commands in sequence, where the steps are separated by %>%: first, we pass iris to a step where we group the data according to Species; then, we summarize the data according to the average Petal.Width for each Species.

What if we only wanted sepal lengths > 5? Easy. Just add an extra pipe:

> iris %>% filter(Sepal.Length > 5) %>% group_by(Species) %>%
    summarize(avg_petal=mean(Petal.Width))
# A tibble: 3 x 2
  Species    avg_petal
  <fct>          <dbl>
1 setosa         0.277
2 versicolor      1.35
3 virginica       2.03

To add a new variable that adds sepal length and petal length, use mutate:

> df = iris %>% mutate(total=Petal.Length + Sepal.Length)
> head(df)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species total
1          5.1         3.5          1.4         0.2  setosa   6.5
2          4.9         3.0          1.4         0.2  setosa   6.3
...

This form of data wrangling is extremely practical. I would highly recommend that you spend some time on it. You can find more examples at https://r4ds.had.co.nz/tidy-data.html.
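Here is one more pipe (a small sketch, not from the original; the column names n and avg_sepal are illustrative choices) that also uses arrange, which was listed above but not demonstrated:

> iris %>%
    group_by(Species) %>%
    summarize(n = n(), avg_sepal = mean(Sepal.Length)) %>%   # count and average per species
    arrange(desc(avg_sepal))                                 # sort by average sepal length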
4  Graphics

To plot a histogram of a variable:

> hist(iris$Sepal.Length, breaks=20)

By changing the value of breaks the histogram can be more or less detailed.

To plot a boxplot of a variable:

> boxplot(iris$Sepal.Length)

To have boxplots broken down by some variable we use the ~ notation (in general ~ means "with respect to" or "as a function of", and will be useful in other tasks such as regression):

> boxplot(iris$Sepal.Length ~ iris$Species)

In most functions we can avoid the $ notation by declaring which data frame we are using. For example, we could get the exact same boxplot as before using:

> boxplot(Sepal.Length ~ Species, data=iris)

To get a scatterplot use plot:

> plot(iris$Sepal.Width, iris$Sepal.Length)

We can use the ~ notation as well:

> plot(Sepal.Length ~ Sepal.Width, data=iris)
To color the data points we can specify the col argument:

> plot(Sepal.Length ~ Sepal.Width, col=Species, data=iris)

We can see that the color adds another dimension to the plot, and thus makes it more informative. We can add a legend to show which species each color corresponds to:

> legend("topright", legend=levels(iris$Species), fill=1:3)

We could do even more by making the size of the points proportional to the petal width:

> plot(Sepal.Length ~ Sepal.Width, col=Species, cex=Petal.Width, data=iris)

4.1  Common graphical parameters

• main: title of the plot;
• xlab and ylab: labels of the x-axis and y-axis, respectively;
• xlim and ylim: limits of the x-axis values and the y-axis values, respectively;
• cex: size of points in the plot relative to the default (e.g., use cex=2 to double the size);
• col: color of points;
• par: defines the plot grid; e.g., par(mfrow=c(2, 2)) defines a 2x2 plot grid.

For example,

> plot(Sepal.Length ~ Sepal.Width,
       main = "Sepal length and width for iris dataset",
       xlab = "sepal width", ylab = "sepal length",
       xlim = c(1, 5), ylim = c(3, 9),
       col = "red", cex = 1.5, data = iris)
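For instance, here is a small sketch (not in the original) that uses par from the list above to place two of the earlier plots side by side:

> par(mfrow=c(1, 2))                                 # 1x2 plot grid
> hist(iris$Sepal.Length, breaks=20, main="Histogram", xlab="sepal length")
> boxplot(Sepal.Length ~ Species, data=iris, main="Sepal length by species")
> par(mfrow=c(1, 1))                                 # back to a single plot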
5  Simple linear regression

The main function is lm, which stands for linear model:

> fit = lm(Petal.Length ~ Petal.Width, data=iris)
> fit

Call:
lm(formula = Petal.Length ~ Petal.Width, data = iris)

Coefficients:
(Intercept)  Petal.Width
      1.084        2.230

Our linear model is therefore: Petal.Length = 1.08 + 2.23 · Petal.Width.

To see the fitted line:

> plot(Petal.Length ~ Petal.Width, data=iris)
> abline(fit, col="red")

The most important variables of fit are the residuals and the fitted values. For example, here is how we plot the residuals with respect to the explanatory variable:

> plot(fit$residuals ~ iris$Petal.Width)

To plot the true values against the fitted values:

> plot(iris$Petal.Length ~ fit$fitted.values)

Another important variable of the lm output is coefficients, which is a vector with the intercept and slope of the regression line:

> fit$coefficients
(Intercept) Petal.Width
   1.083558    2.229940

5.1  summary output of lm fit

Suppose we have the model

    Petal.Length = β0 + β1 · Petal.Width + ε,    ε ∼ N(0, σ²).

We can estimate the unknown model parameters β0, β1, σ² using least squares, and report details of the model fit using the function summary:

> summary(fit)

5.2  Confidence intervals

To create the 95% confidence intervals for the model parameters:

> fit = lm(Petal.Length ~ Petal.Width, data=iris)
> confint(fit, level = 0.95)
                2.5 %   97.5 %
(Intercept) 0.9393664 1.227750
Petal.Width 2.1283752 2.331506

If we want to check the hypothesis that β1 = 0 we can look at whether 0 is in the interval for β1 (Petal.Width). Here, it is not in the interval, so we reject the hypothesis that β1 = 0.

5.3  Prediction intervals

To predict new values:

> xf = data.frame(Petal.Width=c(0.2, 0.4))
> predict(fit, newdata = xf, interval="prediction")
       fit      lwr      upr
1 1.529546 0.575991 2.483101
2 1.975534 1.023927 2.927142

For example, when petal width is 0.4 our prediction for the petal length is 1.97, with prediction interval [1.024, 2.93].

To get an idea about the shape of the prediction intervals, here is some sample code:

set.seed(123)
# parameters that control prediction intervals
sx2 = .1       # X-spread; increase this value to reduce curvature
sigma = 20     # population sd
n = 100        # no. of datapoints

# Generate data.
X = runif(n) * sqrt(12 * sx2)    # so that Var(X) = sx2
Y = 1 + 2 * X + rnorm(n, sd=sigma)
plot(X, Y)
ols = lm(Y ~ X)    # run LS

# New X to predict
Xf = data.frame(X=seq(-5, 20, length.out = 1000))
# prediction intervals
pi = predict(ols, newdata = Xf, interval = "prediction")
# min/max of prediction interval
m1 = min(c(pi[,2], pi[,3]))
m2 = max(c(pi[,2], pi[,3]))
# plot intervals.
plot(Xf$X, pi[, 2], type="l", ylim=c(m1, m2))
lines(Xf$X, pi[, 3], col="red")
abline(v=mean(X), lwd=2, col="blue")
text(mean(X)-1.5, m1+1, "mean(X)")

This code will generate Figure 1.
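A related sketch (not in the original text): the same predict call with interval="confidence" gives intervals for the mean response at each new petal width; these are narrower than the prediction intervals above because they do not account for the noise of a new observation.

> predict(fit, newdata = xf, interval = "confidence")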
Figure 1: Upper (red) line is the upper end of the prediction interval for a given value of Xf. Lower (black) line is the lower end of the prediction interval. The blue line denotes X̄.

6  Multiple Linear Regression

To run linear regression we use lm and formula notation:

> fit = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris)
> summary(fit)
[abbreviated]
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.24914    0.24797    9.07 7.04e-16 ***
Sepal.Width   0.59552    0.06933    8.59 1.16e-14 ***
Petal.Length  0.47192    0.01712   27.57  < 2e-16 ***

Here, we regressed Sepal.Length on just two variables, namely Sepal.Width and Petal.Length. To fit all variables we can use "." and to remove one variable from the regression we use the "-" notation:

> fit = lm(Sepal.Length ~ . - Species, data=iris)

This command regresses Sepal.Length on all other variables in the iris dataset except Species.

6.1  Interactions

Let's try to predict petal length using sepal width in the iris dataset:
> fit = lm(Petal.Length ~ Sepal.Width, data=iris)
> summary(fit)
[...]
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.0632     0.9289   9.757  < 2e-16 ***
Sepal.Width  -1.7352     0.3008  -5.768 4.51e-08 ***

We can see that the slope appears to be negative and significant. Let's plot to see how the data look (always do!):

> plot(Petal.Length ~ Sepal.Width, col=Species, data=iris, pch=20)

We see that the relationship differs a lot between species, and so the lm output was misleading. So let's use interactions to get a more nuanced analysis!

> fit2 = lm(Petal.Length ~ Sepal.Width * Species, data=iris)
> summary(fit2)
[...]
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    1.18292    0.50072   2.362  0.01949 *
Sepal.Width                    0.08141    0.14520   0.561  0.57589
Speciesversicolor              0.75200    0.69983   1.075  0.28437
Speciesvirginica               2.32798    0.71507   3.256  0.00141 **
Sepal.Width:Speciesversicolor  0.75797    0.22770   3.329  0.00111 **
Sepal.Width:Speciesvirginica   0.60490    0.22408   2.699  0.00778 **

The interactions here define different intercepts and slopes for each species. The picture is now completely reversed: there is a positive slope for all three species! But for the setosa species the slope is not significant. It seems that this species (setosa, black dots on the lower-right of the plot) dragged down the other two.

Note that interactions can be defined ad hoc. For example, suppose we would like to run the regression such that we get two different LS lines, one for the problematic setosa and another line that is common for the other two species. We could do this as follows:

> S <- as.factor(iris$Species == "setosa")
> fit2 = lm(Petal.Length ~ Sepal.Width * S, data=iris)
> summary(fit2)
[...]
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.20208    0.51444   2.337   0.0208 *
Sepal.Width         1.28967    0.17795   7.248 2.27e-11 ***
STRUE              -0.01916    0.92244  -0.021   0.9835
Sepal.Width:STRUE  -1.20825    0.28454  -4.246 3.85e-05 ***

Here, S is an ad-hoc variable that is TRUE when the species is setosa, and FALSE otherwise. When we interact this with Sepal.Width we get two regression lines. For setosa the line is (1.20 − 0.02) + (1.29 − 1.21) × Sepal.Width = 1.18 + 0.08 × Sepal.Width, which uncovers the small slope for this species we saw in the plot. On the other hand, the (common) line for the other two species is 1.20 + 1.29 × Sepal.Width, which uncovers the significant positive slope we saw in the plot and which is approximately shared between the two species.

6.2  Diagnostics

Diagnostics are checks of the main assumptions in a linear regression model Y = β0 + β1 X + ε: linearity of the conditional expectation E(Y|X), and normality and constant variance of the error ε. The easiest way to get diagnostic plots is to use the plot function:

> fit = lm(Sepal.Length ~ . - Species, data=iris)
> par(mfrow=c(2, 2))
> plot(fit, pch=20)

Figure 2: Diagnostics with plot.

Here is a quick summary of how to read such plots (Figure 2):

• Top-left: Here we plot residuals versus fitted values. Note that we could plot residuals against each one of the Xj, but using the fitted values, which are a linear combination of the Xj, is a good summarization.

• Top-right: This is the Q-Q plot, where the distribution of the standardized residuals is compared against the standard normal. We would like the points in this plot to lie on the 45-degree line, which would indicate that the two distributions have similar quantiles, and hence are similar themselves.

• Bottom-left: We look at this plot for heteroskedasticity. It is similar to the top-left plot, but has the square root of the standardized residuals instead of just the residuals. The idea is to check whether the dispersion of points differs across different fitted values. R plots a red line as a visual aid to trace a pattern. Here, this line is mostly horizontal, so we are OK with the constant variance assumption.

• Bottom-right: This is a plot to look for influential data points, since on the x-axis we report the leverage. Higher leverage means more influential points. R will give you a visual aid by adding a red dashed line, so that points that are outside of the line will be very influential.
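The same checks can also be produced by hand, which can help demystify the panels. Here is a small sketch (not in the original notes), reusing the fit object from above:

> plot(fit$fitted.values, fit$residuals, pch=20)    # residuals vs fitted (top-left panel)
> abline(h=0, lty=2)                                # residuals should scatter evenly around 0
> qqnorm(rstandard(fit)); qqline(rstandard(fit))    # Q-Q plot of standardized residuals (top-right)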
7  The log-log model

The log-log model is widely used in multiplicative models, such as when modeling sales with price. Here is the model in action using beer sales (in # of cases) and price.

> beer = read.csv("beer.csv")
> fit = lm(CASES.12PK ~ PRICE.12PK, data=beer)
> log_fit = lm(log(CASES.12PK) ~ log(PRICE.12PK), data=beer)
> par(mfrow=c(1, 2))
> plot(beer$PRICE.12PK, beer$CASES.12PK, pch=20)
> lines(beer$PRICE.12PK, fit$fitted.values, col="red")
> plot(log(beer$PRICE.12PK), log(beer$CASES.12PK), pch=20)
> lines(log(beer$PRICE.12PK), log_fit$fitted.values, col="red")
> AIC(fit)
[1] 677.0566
> AIC(log_fit)
[1] 26.13003

Figure 3: Left: data and LS line for the model sales ~ price. Right: data and LS line for the log-log model log(sales) ~ log(price).

We observe that the data points are aligned better on a line after the transform. Furthermore, by using the AIC criterion we see that the log-log model offers a much better fit to the data using the same number of parameters compared to the classical SLR model. A more detailed comparison of model fit could be done using diagnostic plots for both models.
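Since the slope of the log-log model is interpreted as an elasticity (see Section 7.1 below), a quick look at the estimated coefficients is often the next step; a small sketch, assuming the log_fit object from above:

> coef(log_fit)    # the slope on log(PRICE.12PK) is the estimated price elasticity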
7.1  Interpretation of slope

In the log-log regression model, E(Y|X) = αX^β, and β is the slope. As we have seen in class, to interpret this parameter we need to isolate it in an equation like β = ... . Here is how we can proceed. We increase X by 1% so that it becomes 1.01X, and see what happens to the average Y:

    % change in expected Y = 100 · [E(Y | 1.01X) − E(Y | X)] / E(Y | X)
                           = 100 · [α(1.01X)^β − αX^β] / (αX^β)
                           = 100 · α(1.01^β − 1)X^β / (αX^β)
                           = 100 · (1.01^β − 1)
                           ≈ 100 · (1 + 0.01β − 1)
                           = β.

So, a 1% increase in X implies a β% change in the expectation of Y. The derivation relies on the approximation 1.01^β ≈ 1 + 0.01β, which is actually pretty accurate for small values of β (you can try numerical values in R).

8  Generalized linear models

Here we will illustrate the use of multinomial regression, which is an extension of logistic regression. Our goal will be to build a model for the Species class of the flowers in the iris data set.

8.1  Splitting data into training and testing

To start, let's create the training dataset. First we sample the data points that will comprise the training data set (80% of the original data).

> N = nrow(iris)
> train = sample(1:N, size=0.8 * N, replace=F)
> head(train)
[1]  70  51  41   7 119   5

You will get different results in your sample. In my run data points 70, 51, etc., were included in the training data.

Now we fit the multinomial regression model:

> require(nnet)
> model = multinom(Species ~ ., data=iris[train, ])
# weights:  18 (10 variable)
initial  value 131.833475
iter  10 value 8.069023
...
converged

Now model is an object that contains the fitted model, and as in lm we can use summary to see inside. Here we are interested more in prediction, so let's keep going and look at the fitted probabilities instead:

> phat = model$fitted.values
> head(phat, 3)
         setosa   versicolor    virginica
70 8.157812e-08 9.999989e-01 9.803395e-07
51 2.084786e-06 9.999611e-01 3.678849e-05
41 1.000000e+00 2.864312e-08 1.164649e-30

Row i corresponds to data point i and column j corresponds to the probability that this particular data point is classified as j. For example, data point 51 has 0.999... probability of being classified as versicolor. Thus, the sums of all rows have to be one. Indeed:

> head(rowSums(phat))
 70  51  41   7 119   5
  1   1   1   1   1   1

8.2  Classification and accuracy in training data

To do classification based on the fitted probabilities we simply scan each row using apply and then take the column that has the maximum probability:

> cl = apply(phat, 1, function(row) which.max(row))
> head(cl)
 70  51  41   7 119   5
  2   2   1   1   3   1

You should confirm that this is the output that we are expecting here. It is more convenient to transform these column ids to the actual Species names they correspond to:

> cl_names = colnames(phat)[cl]
> head(cl_names)
[1] "versicolor" "versicolor" "setosa"     "setosa"
[5] "virginica"  "setosa"

We can tabulate the classification accuracy in the training set:

> table(cl_names, iris[train,]$Species)

cl_names     setosa versicolor virginica
  setosa         40          0         0
  versicolor      0         43         1
  virginica       0          1        35
We see that the model has great in-sample classification accuracy.

8.3  Accuracy in testing data

Finally we can tabulate the classification accuracy in the testing set:

> Xf = subset(iris[-train, ], select=-Species)
> cl_names_test = predict(model, newdata = Xf)
> table(cl_names_test, iris[-train,]$Species)

cl_names_test setosa versicolor virginica
   setosa         10          0         0
   versicolor      0          6         0
   virginica       0          0        14

We see that we get perfect classification accuracy in the testing set. This is because the data are linearly separable (see the earlier plots of this data set, e.g., in Section 4).
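To summarize these tables with a single number, here is a small sketch (not in the original) that computes the classification accuracy directly:

> mean(cl_names == iris[train, ]$Species)         # in-sample accuracy (118/120 in the run above)
> mean(cl_names_test == iris[-train, ]$Species)   # out-of-sample accuracy (30/30 in the run above)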
9  Special topics

9.1  Bootstrap

The bootstrap is a method to understand the standard errors of estimators, that is, the standard deviation of their sampling distribution. An estimator is always a function of data, say f(data). But we know data are random, that is, they are distributed according to some model, data ∼ model, and so we want to know the variance (equivalently, standard deviation) of f(data) with respect to the randomness in the data. Let's illustrate through an example.

Suppose we have 100 iid samples from a random variable X that follows the normal distribution: X ∼ N(µ, 1); say x1, ..., x100 are the samples. Set µ = 10 and now use the neuralyzer (https://en.wikipedia.org/wiki/Neuralyzer) to forget this value.

> mu = 10    # and then forget
> x = rnorm(100, mean=mu)

Our goal is to estimate the unknown quantity µ and produce a CI. What is a reasonable estimate of µ? Clearly, one is the sample mean: µ̂ = (1/100) Σ xi.

> mu_hat = mean(x)
> mu_hat
[1] 9.828599

That's close. But we know the sample mean is random because the 100-sample dataset is random. We need to know the standard error of this estimate. That is, to produce a CI for µ we need to write something like:

    mu_hat ± 2 · se(mu_hat).

We know mu_hat, so we only need to find out its standard error. The bootstrap tells us that we can find out the standard error with the following simple procedure:

1. Generate a new 100-sample dataset by sampling with replacement from the original one.
2. Calculate your estimator (here, the sample mean) on the new dataset.
3. Repeat M times.
4. Calculate the standard deviation of the M values you get from above.

To sample with replacement we need to do:

> xnew = sample(x, replace=T)

To repeat a piece of code M times we can use replicate(M, {code..}). Putting the pieces together we get the following bootstrap code:
> muhat_boot = replicate(1e3, { xnew = sample(x, replace=T); mean(xnew) })
> sd(muhat_boot)
[1] 0.08663554

Now, 0.0866 is the standard error of the sample mean in this case. So the CI is:

    9.82 ± 2 · 0.0866 = [9.65, 10.0]

Success! The true value (µ = 10) is indeed in the CI we constructed.

Final notes. The bootstrap is a very general method and is routinely used when the standard error of f(data) cannot be obtained mathematically. The idea is the same in all situations: just resample your data and do the same analysis (e.g., calculating an estimator) on every resampled dataset. The outputs will jointly form the sampling distribution of your estimator, and from that you can obtain standard errors, or other summaries.

9.2  Cross-validation

Cross-validation is a general method to estimate the generalization error of a statistical procedure, i.e., to estimate how well the procedure will do on unseen data. The idea is similar to the bootstrap, and creates "unseen data" by splitting the observed data. The procedure works as follows:

1. Split the data in K folds.
2. Take fold 1 and create train_data from all other folds;
3. create test_data using fold 1.
4. Train the model on train_data.
5. Compute the MSE on test_data.
6. Repeat for folds 2, ..., K.
7. Average over all MSEs. This is the generalization error of the procedure.

Here is an example to illustrate. Suppose that we have 10 predictor variables, X1, ..., X10, and 100 samples for each. We generate Y data as Y = 1 + 2·X1 + ε, so that only X1 is important for the response. Using R:

> n = 100
> X = matrix(rnorm(n * 10), nrow=100)    # 10 variables, 100 datapoints each.
> Y = 1 + 2 * X[, 1] + rnorm(n)          # Y = 1 + 2 * X1 + noise
> D = data.frame(X=X, Y=Y)               # all data

Now we implement the cross-validation procedure described above. In particular, we will compare two models in terms of their CV error:

    Y ~ X1
    Y ~ X1 + X2 + ... + X10

Let's see how these models do in-sample:

> reg0 = lm(Y ~ X.1, data=D)
> reg1 = lm(Y ~ ., data=D)
> mean((reg0$fitted.values - D$Y)^2)
[1] 1.04499
> mean((reg1$fitted.values - D$Y)^2)
[1] 1.023788

Clearly, the bigger model is doing better in terms of in-sample MSE, as expected. But how do they compare in terms of out-of-sample MSE? This is captured by their CV error. Because the small model is the true one (this is how the data were generated), we expect it to win over the bigger model, regardless of in-sample error performance. Hopefully, the CV error can capture this.

To calculate the CV error of the big model we run the following code:

mse = c()
for(i in 1:5) {    # 5-fold validation
  # every fold has n/5 size
  # fold i starts at 1 + (i-1) * n/5 and ends at n/5 * i
  fold_i = seq(1 + (i-1) * n/5, n/5 * i)
  # Train data
  train_data = D[-fold_i, ]
  # Test data
  test_data = D[fold_i, ]
  # Train on all folds but i.
  fit = lm(Y ~ ., data=train_data)    # train big model here.
  # Predict at fold i.
  yhat = predict(fit, newdata=test_data)
  y = test_data$Y
  # Store MSE on fold i.
  mse = c(mse, mean((y - yhat)^2))
}
mean(mse)

In my experiment the CV error for the big model (as given by mean(mse)) is equal to:

> mean(mse)
[1] 1.297907

To get the MSE for the smaller (and more accurate) model we simply change the line in the loop where we train the model to:
fit = lm(Y ~ X.1, data=train_data)

Then, we run the same code again (starting from mse = c()) to get the CV error for the small model. In my experiment this was equal to:

> mean(mse)
[1] 0.9146495

Success!! The CV procedure figured out that the small model generalizes better than the bigger one. This is really important (and impressive!). Note that the CV procedure had no knowledge of which model was the true one. By training on a subset of the data and then testing on another held-out subset, the procedure estimated that the small model generalizes better, even though the big model was doing better in-sample.

9.3  Ordinal regression

Ordinal regression refers to regression models where Y is categorical but where the categories have a natural ordering. For instance, in the IMDB dataset, Y is the movie rating. This is a categorical variable from 1-9 "stars", and we also understand that these ratings are ordered: 2 stars are better than 1, 3 are better than 2, etc.

Let's analyze the IMDB data using ordinal regression. First, we pre-process the data as follows:

> imdb = read.csv("imdb_large.csv", stringsAsFactors = T)
> imdb = imdb %>% filter(country=="USA")
> imdb = imdb %>% select(imdb_score, budget, actor_1_facebook_likes, director_facebook_likes)
> imdb$imdb_score = round(imdb$imdb_score)
> I = c(which(imdb$imdb_score==6), which(imdb$imdb_score==7))
> I1 = sample(I, length(I)/2, replace=T)
> I0 = setdiff(1:nrow(imdb), I)
> imdb = imdb[c(I1, I0),]
> table(imdb$imdb_score)
> imdb = imdb %>% mutate(log_budget=log(budget, base=2),
    log_actor_likes=log(actor_1_facebook_likes+1, base=2),
    log_director_likes=log(director_facebook_likes+1, base=2)) %>%
    select(-budget, -actor_1_facebook_likes, -director_facebook_likes)

Note: The above code removes a few "6" and "7" ratings because these are over-represented in our data. In a real application, we shouldn't do such data filtering as it may negatively affect the model's generalization error.

After the pre-processing, the dataset looks as follows:

     imdb_score log_budget log_actor_likes log_director_likes
2632          6   22.93157        8.507795           4.321928
151           7   27.16039        9.262095           9.250298
2636          6   22.93157        9.921841           0.000000
1196          7   25.06085       13.425347           6.022368
1406          7   24.83846       13.550867           4.459432
3433          7   19.93157        9.967226           6.820179

The goal is to model the effect of the social media popularity of the main actor and the director of a movie on the movie's IMDB score. This could potentially be useful before production in selecting, say, actors/directors, deciding on ad campaigns, etc.

To fit an ordinal regression model install the MASS package (install.packages("MASS")) and then execute the following commands:

> library(MASS)
> imdb = load_imdb()    # load the pre-processed data, as above
> fit_ord = polr(imdb_score ~ ., data=imdb)
> summary(fit_ord)
...
Coefficients:
                      Value Std. Error t value
log_budget         -0.06337    0.01767  -3.586
log_actor_likes     0.15135    0.01726   8.766
log_director_likes  0.07233    0.01108   6.526

Intercepts:
      Value     Std. Error t value
2|3    -4.5475    0.4537   -10.0233
3|4    -3.0715    0.3938    -7.7991
4|5    -1.7119    0.3807    -4.4968
5|6    -0.4098    0.3788    -1.0818
6|7     0.7146    0.3786     1.8872
7|8     1.8932    0.3791     4.9932
8|9     5.0501    0.4197    12.0321

Residual Deviance: 7705.727
AIC: 7725.727
(223 observations deleted due to missingness)

> exp(coef(fit_ord))
        log_budget    log_actor_likes log_director_likes
         0.9385996          1.1634015          1.0750061

The interpretation of ordinal regression is a bit strange. The basic underlying assumption is that the following quantities:

    the change in the odds-ratio of score {10} over {9, 8, 7, ..., 1}, when X ← X + 1, all else fixed;
    the change in the odds-ratio of score {10, 9} over {8, 7, ..., 1}, when X ← X + 1, all else fixed;
    the change in the odds-ratio of score {10, 9, 8} over {7, ..., 1}, when X ← X + 1, all else fixed;
    ...
    the change in the odds-ratio of score {10, 9, 8, ..., 2} over {1}, when X ← X + 1, all else fixed;

are all equal. That is, ordinal regression compares the odds-ratios (OR) between "adjacent" (HIGH, LOW) set-pairs of scores. One comparison is HIGH = {10} and LOW = {9, 8, ..., 1}; another is HIGH = {10, 9} and LOW = {8, 7, ..., 1}, and so on. The assumption is that the
change in the odds-ratio between any HIGH-LOW pair of sets is constant for a one-unit change in any given explanatory variable X.

In our application, this means that:

    The odds-ratio of a HIGH score set over the corresponding LOW score set (as defined above) decreases by 100 · (1 − 0.938) ≈ 6.2% for a one-unit increase in "log budget", all else fixed.

Because we log'ed the variables using base 2 (see the pre-processing above), a one-unit increase in the log-space means doubling in the natural scale. This may be more intuitive. So, an equivalent interpretation is:

    The odds-ratio of a HIGH score set over the LOW score set (as defined above) decreases by 6.2% as the budget doubles, all else fixed.

Similarly,

    The odds-ratio of a HIGH score set over the LOW score set (as defined above) increases by 100 · (1.163 − 1) ≈ 16.3% as the number of "actor FB likes" doubles, all else fixed.

And

    The odds-ratio of a HIGH score set over the LOW score set (as defined above) increases by 100 · (1.075 − 1) ≈ 7.5% as the number of "director FB likes" doubles, all else fixed.

How different is this from multinomial regression? Not so much. Here is an analysis of the same data using multinom, with some statistics on how the model fits in-sample:

> library(nnet)
> options(digits=3)
> fit_mul = multinom(imdb_score ~ ., data=imdb)
...
> summary(fit_mul)
Call:
multinom(formula = imdb_score ~ ., data = imdb)

Coefficients:
  (Intercept) log_budget log_actor_likes log_director_likes
3       7.119     -0.389           0.322           -0.02868
4       5.128     -0.190           0.212           -0.06713
5       3.990     -0.142           0.302           -0.06128
6       3.126     -0.128           0.364           -0.01950
7       4.703     -0.222           0.409            0.00554
8       4.631     -0.268           0.476            0.04355
9       0.883     -0.324           0.649            0.07411
Std. Errors:
...

> exp(coef(fit_mul))
  (Intercept) log_budget log_actor_likes log_director_likes
3     1235.82      0.678            1.38              0.972
4      168.75      0.827            1.24              0.935
5       54.05      0.868            1.35              0.941
6       22.78      0.880            1.44              0.981
7      110.23      0.801            1.50              1.006
8      102.63      0.765            1.61              1.045
9        2.42      0.723            1.91              1.077

We see that we get quantitatively similar results: when the budget increases, the odds-ratio of some higher score over the baseline (score = 2) decreases; for more popular actors the odds-ratios increase. These relations are uniform across scores. One difference of the multinomial model from the ordinal regression model concerns the results on director popularity. The multinomial model tells us that, as director popularity increases, the odds-ratio of a higher score over the baseline actually decreases for scores < 7, and it increases only for scores ≥ 7, all else fixed. We didn't have such differentiation in the ordinal regression model because ordinal regression, by definition, compares sets of lower to higher scores, and does not compare individual scores.

In practice, ordinal regression gives very similar results to multinomial regression. So, it is not uncommon to fit a multinomial regression model even when Y is ordered, to avoid the interpretation complexities of ordinal regression.
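To compare the two fits in-sample, one option (a small sketch, not in the original notes, assuming the fit_ord and fit_mul objects from above) is to look at their AIC values and at the distribution of each model's most likely score:

> AIC(fit_ord)              # penalized in-sample fit of the ordinal model
> AIC(fit_mul)              # same for the multinomial model, which uses more parameters
> table(predict(fit_ord))   # most likely score per movie under the ordinal model
> table(predict(fit_mul))   # and under the multinomial model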
Appendix

A  Normal distribution

To sample from N(µ, σ²) use the rnorm command (you will see slightly different numbers):

> x = rnorm(1000, mean=2, sd=1.5)
> mean(x)
[1] 2.015342
> var(x)
[1] 2.112895

To sample from the multivariate normal you first need the package mvtnorm. To install it run

> install.packages("mvtnorm")

For the multivariate normal we need to define the mean vector and then the covariance matrix. Let's sample from a bivariate normal with individual means (1, 10), individual variances 1, and correlation ρ = 0.9:

> library(mvtnorm)
> mu = c(1, 10)
> S = matrix(c(1, 0.9, 0.9, 1), nrow=2)
> S
     [,1] [,2]
[1,]  1.0  0.9
[2,]  0.9  1.0
> w = rmvnorm(1000, mean=mu, sigma=S)
> head(w)
           [,1]      [,2]
[1,]  2.0834550 10.642602
[2,]  0.6916816  9.335202
[3,]  1.3614966 10.172921
[4,] -0.1216368  9.498492
[5,]  1.9057834 10.264121
[6,]  0.9062696 10.022309

We see that w is a matrix with 1000 rows, and each row is a two-element vector. One row corresponds to one sampled random vector (with two elements) from the bivariate normal. One column corresponds to all samples of the corresponding element, i.e., it is the marginal distribution of that element. Thus, to get a summary of the marginal for the first element:

> w1 = w[, 1]
> summary(w1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0710  0.3763  1.0300  1.0160  1.6440  3.9860
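A quick sanity check (not in the original): the sample correlation of the two columns should be close to the ρ = 0.9 we specified.

> cor(w[, 1], w[, 2])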
We can estimate conditional quantities using R's indexing capabilities. For example, to estimate E(w1 | w2 > 12), where w2 is the second element:

> w1 = w[, 1]
> w2 = w[, 2]
> mean(w1[w2 > 12])
[1] 3.053291

We see that when w2 is large relative to its mean (12 vs 10), then the expected value of w1 is also large relative to its mean (3.05 vs 1), since w1 and w2 are correlated.

B  Why do the regression lines go through (X̄, Ȳ)?

In class we saw a picture of the prediction intervals in which it was striking that both the blue line (the least-squares line) and the red line (the true regression line) seem to go through the point (X̄, Ȳ). Here, we will show with simple arguments that the least-squares line (blue line) goes exactly through the point (X̄, Ȳ), whereas the true line (red line) passes very near it (and increasingly so as the data size increases).

Think about the LS line first. We can write down the line equation for all data points:

    Y1 = b0 + b1 X1 + e1
    Y2 = b0 + b1 X2 + e2
    ...
    Yn = b0 + b1 Xn + en.
If we sum up all these equations we get:

    Σi Yi = n·b0 + b1 Σi Xi + Σi ei.

All the summations are from i = 1 to i = n. Now divide both sides by n. We get:

    Ȳ = b0 + b1 X̄ + ē.

But we know from Week 1 that the average of the LS residuals is zero, and so ē = 0. Thus,

    Ȳ = b0 + b1 X̄.

This means that the point (X̄, Ȳ) is exactly on the LS line. Now, we follow a similar argument for the true regression line (red line). This will lead us to the following equation:

    Ȳ = β0 + β1 X̄ + ε̄.

What changed is that we moved from the estimates b0, b1 to the true parameters β0, β1; also, we now have the average of the errors ε̄ instead of the average of the residuals. In contrast to LS, the average of the errors is not exactly zero. However, due to the Law of Large Numbers this average gets closer and closer to zero as n increases (you can try that with R). So we can write ε̄ ≈ 0 and

    Ȳ ≈ β0 + β1 X̄.

Thus, the red line is also very near the point (X̄, Ȳ), and gets closer and closer to it as the number of data points increases.

C  Derivation of the standard error of the LS slope

Here, we derive the standard error of the slope as described in class. For simplicity, consider the model:

    yi = β1 xi + εi,

where there is no intercept and x̄ = 0. For a dataset with n datapoints, the least-squares loss is:

    L(b) = (1/2) Σi (yi − xi b)².

We can differentiate this with respect to b and set the derivative to zero. Doing so will yield the following expression for the LS slope:

    b1 = argmin_b L(b) = Σi yi xi / Σi xi².
From the definition of SLR above:

    b1 = Σi xi (β1 xi + εi) / Σi xi² = β1 + Σi xi εi / [(n−1) s²x],

where we set Σi xi² = (n−1) s²x. Because the εi are i.i.d. ∼ N(0, σ²), it follows that Σi xi εi is also normal. Its mean is zero since all the εi have mean zero. Its variance is equal to:

    Var(Σi xi εi) = Σi xi² Var(εi) = Σi xi² σ² = (n−1) s²x σ².

The first equality above follows from the independence of the εi and the second follows from the constant variance assumption. It follows that the entire random variable

    Σi xi εi / [(n−1) s²x]

has mean 0 and variance σ² / [(n−1) s²x]. When we combine all these results with the solution for b1, we get:

    b1 = β1 + Σi xi εi / [(n−1) s²x]  ∼  N( β1, σ² / [(n−1) s²x] ).

D  Hypothesis testing and degrees of freedom

In hypothesis testing, there are some tricks. The following patterns usually repeat:

    Statistic                                        Distribution
    sample mean                                      Normal
    sample variance with k d.f.                      χ²_k
    sample mean / sqrt(sample variance / k)          t_k
    (sample variance / k1) / (sample variance / k2)  F_{k1,k2}

Here:

    * χ²_k = "chi-squared on k degrees of freedom".
    * t_k = "t-distribution on k degrees of freedom".
    * F_{k1,k2} = "F-distribution on (k1, k2) degrees of freedom".

But, seriously, what is d.f.?

When you hear the term "degrees of freedom" in statistics it means that somewhere we are trying to estimate a variance, e.g., σ² in OLS. This estimation is usually done through a sum of squared terms. However, these terms may be "linearly constrained". Then,
    d.f. = #terms − #linear constraints
         = #terms − #parameters (in MLR)

The number of linear constraints determines how many additional parameters we have to estimate before we can estimate the variance. Here are some examples:

• Example 1: Suppose Zi ∼ N(0, σ²). Then, we can estimate σ² simply by

    σ̂² = (1/n) (Z1² + Z2² + ... + Zn²),    where Z1² + ... + Zn² ∼ σ² · χ²_n.

This expression involves n squared terms that are completely unconstrained. Hence, there are "n degrees of freedom". Because the Zi are normal, the entire sum has a chi-squared distribution with n degrees of freedom.

• Example 2: What if Zi ∼ N(µ, σ²), but we don't know µ? Then, we need to estimate µ (using Z̄), and thus "lose 1 degree of freedom":

    σ̂² = [1/(n−1)] [(Z1 − Z̄)² + (Z2 − Z̄)² + ... + (Zn − Z̄)²],    where the sum of squares ∼ σ² · χ²_{n−1}.

This is, in fact, the familiar sample variance formula. It involves n squared terms, but there is one linear constraint on them. Specifically, the deviations have to add up to 0, since:

    Σi (Zi − Z̄) = Σi Zi − n Z̄ = 0.

Hence, there are "n − 1 degrees of freedom". Because the Zi are normal, the entire sum has a chi-squared distribution with n − 1 degrees of freedom. Now you know why in the sample variance we divide by n − 1!

• Example 3: In OLS we need to estimate β0, β1, ..., β_{d−1} before we can estimate the variance. This is done by solving d linear equations in the LS minimization problem. Since ei = Yi − b0 − b1 X1,i − ... − b_{d−1} X_{d−1,i}, then

    e1² + e2² + ... + en² ∼ σ² · χ²_{n−d}.

This explains why in OLS we estimate σ² through

    σ̂² = [1/(n−d)] (e1² + ... + en²).
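As a small companion to Example 2 (a simulation sketch, not in the original notes), you can verify in R that the 1/(n−1) factor makes the sample variance unbiased for σ²:

set.seed(1)
n = 5; sigma2 = 4
# 10,000 sample variances from samples of size n with true variance sigma2
v = replicate(1e4, { z = rnorm(n, mean=3, sd=sqrt(sigma2)); var(z) })
mean(v)                # close to sigma2 = 4, since var() divides by n-1
mean(v * (n-1) / n)    # the 1/n version is biased downward, about 4 * (n-1)/n = 3.2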