MATH4432 Final Exam, Fall 2022

Course Code: MATH4432    Section No.: 01    Time Allowed: 3 Hour(s)
Course Title: Statistical Machine Learning    Total Number of Pages: 4

INSTRUCTIONS:
1. Answer ALL of the following questions.
2. The full mark for this examination is 100.
3. Answers without sufficient explanations/steps receive no or partial marks.
4. A calculator is allowed during the exam. Internet access (e.g., Google search) is NOT allowed.
5. Open book and open notes, but each student should work on it independently.

1. (5 marks)
Suppose that we have collected data $D = \{x_i\}_{i=1,\dots,n}$, where $n$ is the sample size. Consider the following optimization problem:

$$\max_{\mu,\sigma^2} \sum_{i=1}^{n} w_i \log N(x_i \mid \mu, \sigma^2),$$

where $w_i$ is the known nonnegative weight for the $i$-th data point $x_i$, and $N(x_i \mid \mu, \sigma^2)$ denotes the density function of the univariate Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Find the optimal solution of the above problem.

2. (10 marks)
Consider a $K$-class classification problem. Each sample $(x_i, y_i)$ is generated independently from the following probabilistic model: $\pi_k = p(y = k)$ and $p(X \mid y = k) = f_k(X)$, where $X = [X_1, \dots, X_d]$ denotes the $d$-dimensional features, and $k \in \{1, 2, \dots, K\}$ is the index of a class label. The class prior $\pi_k$ is known but the class density function $f_k(\cdot)$ is unknown.

Suppose that we have collected $n$ samples as the training dataset $D = \{x_i, y_i\}_{i=1,\dots,n}$, where $x_i = [x_{i1}, x_{i2}, \dots, x_{id}]^T$ and $y_i \in \{1, 2, \dots, K\}$. Based on the training dataset $D$, we have correctly and satisfactorily fitted a model to estimate the conditional probabilities $p(y = k \mid X = x)$ as $\hat{p}_k(x)$ for all $k$.

Now we are asked to deploy the fitted model in a testing population, where the class prior $p(y = k)$ is known to have changed from $\pi_k$ to $\tilde{\pi}_k$ but $f_k(\cdot)$ remains unchanged. In this setting, a student proposes to simply adjust the estimated conditional probabilities $\hat{p}_k(x)$ as follows:

(1) Define $\hat{g}_k(x) = \log(\hat{p}_k(x)) + \log(\pi_k) - \log(\tilde{\pi}_k)$.
(2) Then obtain
$$\tilde{p}_k = \frac{\exp \hat{g}_k(x)}{\sum_{\ell=1}^{K} \exp \hat{g}_\ell(x)}$$
as the conditional probabilities for the testing population.

Do you agree with this student?
If yes, please provide your reason; otherwise, propose your own solution with justification.

The Hong Kong University of Science and Technology
Fall EXAMINATION, 2022-2023
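For Question 2, any proposed adjustment can be checked numerically against a case where the true posteriors are known. The sketch below is only an illustration under assumed values: two classes with unit-variance Gaussian class densities, and hypothetical names (`means`, `train_priors`, `test_priors`, `reweight`) that are not part of the exam. It compares the student's log-shift, and the opposite log-shift, against the exact test-population posterior given by Bayes' rule.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a univariate Gaussian with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normalize(weights):
    """Scale nonnegative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def posterior(x, priors, means):
    """Exact p(y = k | x), proportional to prior_k * f_k(x)."""
    return normalize([p * normal_pdf(x, m) for p, m in zip(priors, means)])

def reweight(p_hat, log_shift):
    """Softmax of log(p_hat_k) + log_shift_k, computed stably."""
    g = [math.log(p) + s for p, s in zip(p_hat, log_shift)]
    g_max = max(g)
    return normalize([math.exp(v - g_max) for v in g])

# Assumed toy setting (not from the exam): K = 2, f_k = N(mean_k, 1).
means = [0.0, 2.0]
train_priors = [0.7, 0.3]   # pi_k
test_priors = [0.2, 0.8]    # pi-tilde_k
x = 0.5

p_train = posterior(x, train_priors, means)        # plays the role of p-hat_k(x)
p_test_exact = posterior(x, test_priors, means)    # target under the new priors

# Student's proposal: add log(pi_k) - log(pi-tilde_k).
student = reweight(
    p_train,
    [math.log(a) - math.log(b) for a, b in zip(train_priors, test_priors)],
)
# Opposite shift: add log(pi-tilde_k) - log(pi_k).
opposite = reweight(
    p_train,
    [math.log(b) - math.log(a) for a, b in zip(train_priors, test_priors)],
)

print("exact   :", p_test_exact)
print("student :", student)
print("opposite:", opposite)
```

In this toy case the opposite shift reproduces the exact test posterior, while the student's shift does not; a written answer still needs the general justification via Bayes' rule.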

3. (5 marks)
Suppose we aim to find the optimal solution of $\min_x \exp(x) + x^2 + 2x$. Derive an algorithm based on Newton's method. (Hint: you are allowed to use calculators here.)

4. (10 marks)
(a) Find the optimal solution of $\min_w 0.5\,(y - cw)^2 + w$, subject to $w \ge 0$, where $y$ and $c$ are known scalars. (5 marks)

(b) Suppose we have collected a data set $\{x_i, y_i\}_{i=1,\dots,n}$, where $x_i = [x_{i1}, x_{i2}, \dots, x_{ip}]^T \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, $n$ is the sample size and $p$ is the number of variables. Consider the following optimization problem:

$$\min_{\{w_1,\dots,w_p\}} \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} w_j \beta_j \right)^2 + \lambda \sum_{j=1}^{p} w_j, \quad \text{subject to } w_j \ge 0,$$

where $\beta_j$, $j = 1, \dots, p$, are known constants and $\lambda$ is a given positive scalar. Derive the coordinate descent algorithm to optimize $\{w_1, \dots, w_p\}$. (5 marks)

5. (15 marks)
Suppose that we have a dataset $\{y, X\}$ with seven samples, where $y = (y_1, \dots, y_7)^T \in \mathbb{R}^{7 \times 1}$ is the known true label vector, $y_i \in \{1, 0\}$ with 1 meaning positive and 0 meaning negative, $X = (x_1, \dots, x_7)^T$ is the feature matrix, and $x_i$ represents the features of the $i$-th sample. We also have a fitted classifier that outputs the probability $\hat{p}(y_i = 1 \mid x_i)$ for each sample. Assume the following table is the case.

Sample ID                  1     2     3     4     5     6     7
True label                 1     0     0     0     1     1     0
$\hat{p}(y = 1 \mid x)$    0.80  0.20  0.45  0.30  0.70  0.40  0.70

(a) Given the classification threshold $\tau = 0.5$, a case with $\hat{p}(y = 1 \mid x) \ge \tau$ will be classified as 1, and 0 otherwise. Please calculate the type I and type II error rates. (5 marks)

(b) Plot the receiver operating characteristic (ROC) curve and calculate the area under the ROC curve (AUC). (10 marks)

6. (15 marks)
Consider ridge regression for a data set $D = \{x_i, y_i\}_{i=1,\dots,n}$ with $x_i^T = [x_{i1}, \dots, x_{ip}]$:

$$\min_{\beta_1,\dots,\beta_p} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (1)$$

where $\lambda$ is given.

(a) Derive the closed form solution to the above problem. (5 marks)

(b) Describe a fast algorithm that can efficiently compute the entire solution path $\hat{\beta}(\lambda) = [\beta_1(\lambda), \beta_2(\lambda), \dots, \beta_p(\lambda)]^T$ for a given sequence of $\lambda$. (5 marks)

(c) What are the degrees of freedom for the above ridge regression? Justify your answer. (5 marks)

7. (20 marks)
Consider a data set $D = \{x_i, y_i\}$, $i = 1, \dots, n$, for supervised learning, where $x_i = [x_{i1}, \dots, x_{ip}] \in \mathbb{R}^p$, $y_i$ is a non-negative integer, and $n$ is the number of samples. Suppose we choose to minimize the exponential loss function $L(y, F) = -[yF(x) - \exp(F(x))]$, where $F(x) = f_0 + \sum_{m=1}^{M} f_m(x)$ is an additive model, $f_0$ is the intercept term and $f_m(x)$ is to be fitted by a regression tree.

(a) Estimate $f_0$ and justify your result. (5 marks)

(b) Suppose we have fitted $F(x)$ as $\hat{F}_m(x)$ at the $m$-th step. Using the gradient boosting approach, we would like to find $f_{m+1}(x)$ by minimizing

$$\frac{1}{n} \sum_{i=1}^{n} \left( -g_{m,i} - f_{m+1}(x_i) \right)^2,$$

where $g_{m,i}$ is the functional gradient evaluated at the current step. Derive the closed form of $-g_{m,i}$. (5 marks)

(c) Suppose we have fitted a tree with $J$ terminal nodes by solving the optimization problem in (b). Let $\hat{f}_m(x) = \sum_{j=1}^{J} \hat{c}_j I(x \in S_j)$ be the fitted regression tree in (b), where $S_j$ is the $j$-th partition region and $I(\cdot)$ is the indicator function. To improve the minimization of the loss function, please re-adjust the constant $\hat{c}_j$ by solving the following optimization problem, given the fitted $\hat{F}_m(x)$ and the partition $\{S_j\}_{j=1,\dots,J}$:

$$\min_{c_j} \frac{1}{n} \sum_{i=1}^{n} \left[ L\left( y_i, \hat{F}_m(x_i) + \sum_{j=1}^{J} c_j I(x_i \in S_j) \right) \right].$$

(5 marks)

(d) Suppose that we have fitted a regression tree in (c) with adjusted constants $\hat{c}_{j,\mathrm{adj}}$, denoted as $\hat{f}_{\mathrm{adj}}(x) = \sum_{j=1}^{J} \hat{c}_{j,\mathrm{adj}} I(x \in S_j)$. Then we are going to update the current model as

$$\hat{F}_{m+1}(x) = \hat{F}_m(x) + \lambda \cdot \hat{f}_{\mathrm{adj}}(x).$$

Why is $\lambda$ often set to be a small positive value (e.g., $\lambda = 0.01$) rather than $\lambda = 1$? (5 marks)

8. (10 marks)
(a) What is the major difference between Bagging trees (Bootstrap aggregating of many trees) and Random Forest?
Why does Random Forest often outperform Bagging trees? (5 marks)

(b) Why is Bootstrap aggregating often applied to large trees rather than small trees? (5 marks)

(Hint: these are conceptual questions. You don't need to prove anything.)
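One standard way to reason about Question 8(a) is the variance of an average of correlated predictors. The sketch below assumes, purely for illustration, that the tree predictions at a fixed input are identically distributed with common variance `sigma2` and pairwise correlation `rho`; these names and the function `variance_of_average` are hypothetical and not part of the exam. Under that assumption the variance of the average of `n_trees` predictions is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees identically distributed predictions
    with common variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

# Illustrative values only: the first term does not shrink as trees are added,
# so a lower pairwise correlation gives a lower floor on the variance.
for rho in (0.6, 0.2, 0.0):
    for n_trees in (10, 100, 1000):
        v = variance_of_average(1.0, rho, n_trees)
        print(f"rho={rho:.1f}  n_trees={n_trees:4d}  variance={v:.4f}")
```

The formula shows that adding more trees only shrinks the second term, which is why reducing the correlation between trees matters once the ensemble is large.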

9. (10 marks)
Suppose you want to perform forward stepwise linear regression based on a training dataset $D_{\mathrm{train}} = \{x_i, y_i\}_{i=1,\dots,n}$, and you plan to apply the fitted model to make predictions for a testing dataset $D_{\mathrm{test}} = \{x_{i'}\}_{i'=1,\dots,n_{\mathrm{test}}}$, where $x_i \in \mathbb{R}^p$ and $x_{i'} \in \mathbb{R}^p$, and $n$ and $n_{\mathrm{test}}$ are the sample sizes of the training dataset and the testing dataset, respectively. Please describe how you are going to apply cross-validation to pick the model that is used for prediction of a testing sample $x_{i'}$.

- END -