DDA2020 Machine Learning
Final Exam

1 Multiple-choice questions (2 points per question) (30 points)

Note that there might be one or more correct option(s). If there is more than one correct answer, you should select all the correct options in order to get full marks. If your answer is correct but incomplete, you will only get partial marks. If any incorrect option is chosen, you will get zero marks.

1.1 Which one(s) are sub-areas of artificial intelligence?
A. Computer Vision
B. Robotics
C. Natural Language Processing
D. Machine Learning
E. Optimization

1.2 Suppose in a 2-class classification problem, our standard logistic regression model achieves 99% accuracy on the training set, but only 51% accuracy on the test set. Which of the following modifications might potentially improve our algorithm's test accuracy?
A. Use regularized logistic regression.
B. Use a polynomial hypothesis function to replace the original hypothesis function $w^\top x + b$.
C. Add more training data.
D. Add more testing data.

1.3 Which of the following statement(s) is/are correct?
A. Backpropagation consists of a forward pass that computes the error, and a backward pass that adjusts the weights.
B. In a feedforward neural network, the information always moves in one direction.
C. A convolutional neural network is a fully connected network.
D. When training a multi-layer perceptron, we compute the error using a loss function at the output layer and pass the gradients from the output layer backwards to the input layer to update the weights.

1.4 Which of the following statement(s) is/are correct?
A. Toss a coin 4 times, then there will be 4 possible events.
B. If the probability of Event A is 0.1 and the probability of Event B is 0.01, then the information of A is smaller than that of B.
C. KL divergence is positive and non-symmetric.
D. Both Binomial and Gaussian distributions are continuous distributions.

1.5 Which of the following statement(s) is/are correct?
A. When the number of training samples n is very large and the feature dimension d is very small, we prefer to use the closed-form solution rather than the gradient descent method.
B. Linear regression can only be used for regression tasks.
C. The decision boundary of Ridge regression is non-linear in the original feature space.
D. The decision boundary of polynomial linear regression is non-linear in the original feature space.
E. If standard linear regression is overfitting, we can try Ridge regression.
F. If standard linear regression is overfitting, we can try polynomial linear regression.

1.6 Which of the following statement(s) is/are correct?
A. Standard SVM cannot handle non-linearly separable data.
B. SVM with slack variables can handle non-linearly separable data, and its training error could be 0.
C. Kernel SVM can perfectly fit non-linearly separable data.
D. The margin is the distance from the closest point of the positive and negative classes to the decision boundary.

1.7 Which of the following statement(s) is/are correct?
A. PCA can give a non-linear projection of data.
B. Dimensionality reduction tasks can be solved by supervised or unsupervised learning methods.
C. When we project N-dimensional data points to a k-dimensional space, the dimension of the reconstructed points is lower than that of the original data.
D. In PCA, we choose the k eigenvectors with the top k largest eigenvalues to form a new matrix.

1.8 Which of the following statement(s) is/are correct?
A. Decision trees can only handle training data with numerical attributes.
B. When we set the minimal size of a leaf node to be a large value, then we will grow a deep decision tree.
C. When we set the maximal depth to be a small value, then we will grow a shallow decision tree.
D. When we increase the number of decision trees, the training performance of Bagging will often be better.

1.9 Which of the following would you apply unsupervised learning to?
A. Given the data of body dimensions collected from 1,000 consumers, determine the sizes of the clothes to be produced.
B. Develop a model to predict the stock market.
C. Given a large dataset of medical records of patients including the disease name, predict the disease of a new patient by the symptoms.
D. Compress a high-resolution image to a low-resolution image.

1.10 A class has 10 students. They received marks for their mid-term quiz as follows.

Student ID  01  02  03  04  05  06  07  08  09  10
Marks       90  50  66  71  82  72  75  68  99  60

To group the students into two tutorial groups according to their marks, we use k-means. We pick student 03 as the initial centroid for Group A, and student 07 for Group B, and assign the students to the two groups using Euclidean distance (a quick computational check of this assignment step is sketched after this question).
A. We will have 4 students in Group A.
B. We will have 5 students in Group B.
C. If we change the initial centroids, the clustering result of k-means will not be changed.
D. With the above group assignments, we re-estimate the new centroids for the two groups. The new centroid of Group A is 60.
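The following is a minimal sketch (not part of the original exam) that reproduces the first assignment step of question 1.10 in plain Python. The only assumptions are that the initial centroids are the marks of student 03 (66) and student 07 (75), as stated in the question, and that a tie in distance goes to Group A.

```python
# Quick check of the first k-means assignment step in question 1.10.
# Initial centroids: mark of student 03 (66) for Group A, student 07 (75) for Group B.
marks = {"01": 90, "02": 50, "03": 66, "04": 71, "05": 82,
         "06": 72, "07": 75, "08": 68, "09": 99, "10": 60}
centroid_a, centroid_b = 66, 75

# 1-D Euclidean distance is just the absolute difference; ties go to Group A (assumption).
group_a = [s for s, m in marks.items() if abs(m - centroid_a) <= abs(m - centroid_b)]
group_b = [s for s, m in marks.items() if abs(m - centroid_a) > abs(m - centroid_b)]

print("Group A:", group_a, "size:", len(group_a))
print("Group B:", group_b, "size:", len(group_b))

# Re-estimated centroid of Group A after this assignment (mean of its marks).
print("New centroid of Group A:", sum(marks[s] for s in group_a) / len(group_a))
```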
1.11 Which of the following statement(s) is/are correct?
A. When the TPR is plotted on the y-axis and the FPR is plotted on the x-axis, the plot is called the ROC curve.
B. TPR + FPR = 1.
C. Accuracy = (TPR + TNR)/2.
D. As the classification threshold increases, the FNR increases.

1.12 Which of the following statement(s) is/are correct?
A. When we increase the model complexity, bias drops and variance grows on the training set.
B. When we increase the model complexity, bias drops and variance grows on the testing set.
C. When we increase the model complexity, both bias and variance drop on the training set.
D. When we increase the model complexity, bias grows and variance drops on the testing set.

1.13 Which of the following statement(s) is/are correct?
A. If we fix the covariance matrix as the identity matrix $I$ when fitting a GMM using EM, then it is equivalent to the standard K-means algorithm.
B. It is guaranteed that the EM algorithm can improve the log-likelihood function.
C. The latent variable in latent variable models must be discrete.
D. Suppose that $f$ is a concave function and $X$ is a random variable; then $f(E(X)) \ge E(f(X))$.

1.14 Given a linear system $Xw = y$, where $w \in \mathbb{R}^d$ is the variable we want to solve, and $X \in \mathbb{R}^{m \times d}$ and $y \in \mathbb{R}^m$ are the data provided, which of the following statement(s) is/are correct?
A. When $m = d$, it is guaranteed to obtain a unique solution.
B. When $m > d$, this linear system is called an under-determined system, and there is no solution.
C. When $m > d$, this linear system is called an over-determined system, and there is no solution.
D. When $m < d$, this linear system is called an under-determined system, and there are infinitely many solutions.
E. When $m < d$, this linear system is called an over-determined system, and there are infinitely many solutions.

1.15 Suppose we want to train a classifier in a supervised learning manner, in order to automatically evaluate student assignments. In this setting, which of the following statement(s) is/are correct?
A. The collected student assignments which have not been graded by teachers can be used as the experience.
B. The task is to predict the grades of the student assignments.
C. The accuracy of the predicted grades can be used as the performance measure.
D. The accuracy of the students' answers to the assignments can be used as the performance measure.

2 Calculations and Derivations (70 points)

2.1 (5 points) Consider two discrete random variables $X$ and $Y$, with $X \in \{0, 1, 2\}$ and $Y \in \{2, 3\}$. $P(Y = 2) = 0.4$, $P(Y = 3) = 0.6$. Given $Y$, $X$ follows a binomial distribution, i.e., $P(X = k \mid Y = y) = \frac{n!}{k!(n-k)!}\left(\frac{1}{y}\right)^k\left(1 - \frac{1}{y}\right)^{n-k}$, where $n = 2$.

1. Calculate the probability distribution of $X$, i.e., $P(X)$. (1 point)
2. Calculate the conditional distribution $P(Y \mid X = 1)$. (Hint: use Bayes' rule.) (2 points)
3. Calculate $E(X)$ and $E(Y \mid X = 1)$. (2 points)

2.2 (15 points) We define a CNN model as

$f_{\mathrm{CNN}}(X) = \mathrm{Softmax}(\mathrm{FC_1}(\mathrm{Conv_2}(\mathrm{MP_1}(\mathrm{Relu_1}(\mathrm{Conv_1}(X))))))$.   (1)

The size of the input data $X$ is $36 \times 36 \times 3$. The first convolutional layer Conv1 includes 10 filters of size $8 \times 8 \times 3$, stride = 2, padding = 1; Relu1 indicates the first Relu layer; MP1 is a $2 \times 2$ max pooling layer, stride = 2; the second convolutional layer Conv2 includes 100 filters of size $5 \times 5 \times 10$, stride = 1, padding = 0; FC1 indicates the fully connected layer, which has 10 output neurons; Softmax denotes the Softmax activation function. The ground-truth label of $X$ is denoted as $t$, and the loss function used for training this CNN model is denoted as $L(y, t)$.

1. Compute the feature map sizes after Relu1 and Conv2. (2 points)
2. Calculate the number of parameters of this CNN model. (Hint: don't forget the bias parameters in the convolutional and fully connected layers.) (4 points)
3. Plot the computational graph (CG) of the forward pass of this CNN model. (3 points) (Hint: use $z_1, z_2, z_3, z_4, z_5, z_6$ to denote the activated values after Conv1, Relu1, MP1, Conv2, FC1, Softmax.)
4. Based on the plotted CG, write down the formulations of the back-propagation algorithm, including the forward and backward pass. (6 points) (Hint: for the forward pass, write down the process of how to get the value of the loss function $L(y, t)$; for the backward pass, write down the process of computing the partial derivative of each parameter, such as $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial b_1}$.)

Solution:

1. After Conv1, the output width and height are $W_1 = H_1 = \frac{36 + 2 \times 1 - 8}{2} + 1 = 16$, and the output depth is $D_1 = 10$, so the output size is $16 \times 16 \times 10$. After Relu1, the output size does not change. After MP1, the output width and height are $W_2 = H_2 = \frac{16 - 2}{2} + 1 = 8$, and the output depth is $D_2 = D_1 = 10$, so the output size is $8 \times 8 \times 10$. After Conv2, the output width and height are $W_3 = H_3 = \frac{8 - 5}{1} + 1 = 4$, and the output depth is $D_3 = 100$, so the output size is $4 \times 4 \times 100$. (Grading: feature map size after Conv1, 0.5 point; after Relu1, 0.5 point; after MP1, 0.5 point; after Conv2, 0.5 point.)

2. The number of parameters is $(8 \times 8 \times 3 + 1) \times 10 + (5 \times 5 \times 10 + 1) \times 100 + (1600 + 1) \times 10 = 43040$. (4 points)

3. $z_1, z_2, z_3, z_4, z_5, z_6$ denote the activated values after Conv1, Relu1, MP1, Conv2, FC1, Softmax (1.5 points); $w_1, b_1$, $w_3, b_3$, and $w_4, b_4$ denote the parameters in Conv1, Conv2, and FC1, respectively (1.5 points).

4. The forward pass: $z_1 = w_1 * X + b_1$, $z_2 = \max(0, z_1)$, $z_3 = \mathrm{MP_1}(z_2)$, $z_4 = w_3 * z_3 + b_3$, $z_5 = w_4 z_4 + b_4$, $z_6 = \mathrm{Softmax}(z_5)$, $y = z_6$, and finally the loss $L(y, t)$. (3 points)

The backward pass: $\frac{\partial L}{\partial z_6}$, $\frac{\partial z_6}{\partial z_5}$, $\frac{\partial L}{\partial z_5} = \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial z_5}$; $\frac{\partial z_5}{\partial w_4}$, $\frac{\partial z_5}{\partial b_4}$, $\frac{\partial L}{\partial w_4} = \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial z_5}\frac{\partial z_5}{\partial w_4}$, $\frac{\partial L}{\partial b_4} = \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial z_5}\frac{\partial z_5}{\partial b_4}$; $\frac{\partial z_5}{\partial z_4}$, $\frac{\partial z_4}{\partial w_3}$, $\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial z_5}\frac{\partial z_5}{\partial z_4}\frac{\partial z_4}{\partial w_3}$, $\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_5}\frac{\partial z_5}{\partial z_4}\frac{\partial z_4}{\partial b_3}$; $\frac{\partial z_4}{\partial z_1}$ (through $z_3$ and $z_2$), $\frac{\partial z_1}{\partial w_1}$, $\frac{\partial z_1}{\partial b_1}$, and finally $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial b_1}$. (3 points)
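As a sanity check on parts 1 and 2 of this solution, here is a minimal sketch (not part of the original exam) that builds the same layer shapes in PyTorch and prints the feature-map sizes and the total parameter count. The layer hyper-parameters are taken from the question; the random input is only a placeholder.

```python
# Minimal shape/parameter check for question 2.2 (assumes PyTorch is available).
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 10, kernel_size=8, stride=2, padding=1)   # 10 filters, 8x8x3
relu1 = nn.ReLU()
mp1 = nn.MaxPool2d(kernel_size=2, stride=2)                    # 2x2 max pooling
conv2 = nn.Conv2d(10, 100, kernel_size=5, stride=1, padding=0) # 100 filters, 5x5x10
fc1 = nn.Linear(4 * 4 * 100, 10)                                # 10 output neurons

x = torch.randn(1, 3, 36, 36)       # placeholder input of size 36x36x3
z2 = relu1(conv1(x))                # expected 16x16x10
z3 = mp1(z2)                        # expected 8x8x10
z4 = conv2(z3)                      # expected 4x4x100
z5 = fc1(z4.flatten(1))             # expected 10 logits
print(z2.shape, z3.shape, z4.shape, z5.shape)

# Total number of learnable parameters (weights + biases).
n_params = sum(p.numel() for p in
               list(conv1.parameters()) + list(conv2.parameters()) + list(fc1.parameters()))
print(n_params)                     # should match the 43040 computed in the solution
```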
2.3 (17 points) Derivation of SVM's objective function and solution according to the dual problem. Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$ and denoting the parameters as $w, b$, the objective function of SVM is formulated as follows:

$\min_{w,b} \ \frac{1}{2}\|w\|^2$   (2)
s.t. $y_i(w^\top x_i + b) \ge 1, \ \forall i$

1. Derive the above objective from the perspective of the large margin. (Hint: The margin is defined as the closest distance from the data points to the hyperplane. We wish to find a hyperplane that separates the data while having the largest margin among all hyperplanes. The distance of a point $x$ to the hyperplane given by $f_{w,b}(x) := w^\top x + b = 0$ is $\frac{|f_{w,b}(x)|}{\|w\|}$.) (6 points)
- Write the problem formulation of max margin. (2 points)
- Argue that we can use scaling to fix the minimum absolute value of $y_i(w^\top x_i + b)$ to 1. (2 points)
- Derive the formulation of SVM. (2 points)

2. Derive the above objective from the perspective of the hinge loss. (Hint: The objective function of SVM using the hinge loss is given by $C\sum_{i}^{m} \mathrm{hingeloss}(x_i, y_i; w, b) + \frac{1}{2}\|w\|^2$. You can first write down the explicit expression of the hinge loss.) (5 points)
- Write down the hinge loss. (3 points)
- Derive the problem from the hinge loss. (2 points)

3. Derive the solution of $w, b$ according to the Lagrangian function and the KKT conditions, and state when a training data point is called a support vector. (Hint: first write down the Lagrangian function, then the KKT conditions, then use the KKT conditions to derive the solution.) (6 points)
- Write down the Lagrangian. (1 point)
- Write down the KKT conditions. (1 point)
- Write down the variable $w$ using the KKT conditions. (3 points)
- Point out the condition for being a support vector. (1 point)
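The KKT stationarity condition asked for in part 3 yields $w = \sum_i \alpha_i y_i x_i$, where only the support vectors have $\alpha_i > 0$. The following is a small numerical sketch (not part of the original exam) that illustrates this on a made-up, linearly separable toy dataset, using scikit-learn's linear SVC with a large C to approximate the hard-margin problem; the data and all names are assumptions for illustration only.

```python
# Numerical illustration of w = sum_i alpha_i * y_i * x_i and of support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))    # toy positive class
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))  # toy negative class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors, so summing
# (alpha_i * y_i) * x_i over support vectors recovers the primal weight vector w.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))     # expected: True

# Support vectors satisfy y_i (w^T x_i + b) = 1 (up to numerical tolerance).
margins = y[clf.support_] * (X[clf.support_] @ clf.coef_.ravel() + clf.intercept_)
print(margins)
```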
2.4 (10 points) Consider the figure below. The red circles represent the data points in the positive class and the blue squares represent the negative class. We construct a model to do the prediction and get a decision boundary shown as the black curve. The points on the top right of the decision boundary are predicted to be positive, and the other points at the bottom left of the boundary are predicted to be negative.

1. Write down the confusion matrix of this classification. (2 points)
2. Calculate the precision, recall and accuracy of this classification method. (3 points)
3. Calculate the FNR and FPR. (2 points)
4. Suppose that the posterior probabilities of the 6 positive data points $p(y_i = + \mid x_i)$ are 0.2, 0.5, 0.7, 0.7, 0.8 and 0.9, respectively, and the posterior probabilities of the 4 negative data points $p(y_i = + \mid x_i)$ are 0.1, 0.3, 0.5 and 0.6, respectively. Calculate the AUC of this model. (3 points)

2.5 (15 points) Suppose that a random variable $x$ follows the Gaussian mixture distribution $p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\theta = \{\pi, \mu, \Sigma\}$, $\pi = \{\pi_1, \ldots, \pi_K\}$, $\mu = \{\mu_1, \ldots, \mu_K\}$, $\Sigma = \{\Sigma_1, \ldots, \Sigma_K\}$, and the latent variable $z \sim \mathrm{Categorical}(\pi)$, where $\pi_k \ge 0$ and $\sum_k \pi_k = 1$.

1. Likelihood Decomposition. Show that $\ln p(\mathcal{D}; \theta) \ge \mathcal{L}(q; \theta)$ for all $q, \theta$, where
$\mathcal{L}(q; \theta) = \sum_{n=1}^{N} \mathbb{E}_{q_n(z^{(n)})}\left[\ln \frac{p(x^{(n)}, z^{(n)}; \theta)}{q_n(z^{(n)})}\right]$,
and $q(z) = \prod_{n=1}^{N} q_n(z^{(n)})$. Also write down the gap between $\ln p(\mathcal{D}; \theta)$ and $\mathcal{L}(q; \theta)$. (6 points)
2. E-Step Derivation. With given $\theta = \{\pi, \mu, \Sigma\}$, update $q(z)$. (3 points)
3. M-Step Derivation. Given $q(z)$, update $\theta = \{\pi, \mu, \Sigma\}$. (6 points)

Solution:

1. Two methods:
(a) By Jensen's inequality and the concavity of the $\ln$ function,
$\mathbb{E}_{q(z)}\left[\ln \frac{p(x, z; \theta)}{q(z)}\right] \le \ln \mathbb{E}_{q(z)}\left[\frac{p(x, z; \theta)}{q(z)}\right] = \ln \sum_{k}^{K} q(z = k)\,\frac{p(x, z = k; \theta)}{q(z = k)} = \ln p(x; \theta)$.
Thus we can easily show that
$\ln p(\mathcal{D}; \theta) = \sum_{n=1}^{N} \ln p(x^{(n)}; \theta) \ge \sum_{n=1}^{N} \mathbb{E}_{q_n(z^{(n)})}\left[\ln \frac{p(x^{(n)}, z^{(n)}; \theta)}{q_n(z^{(n)})}\right] = \mathcal{L}(q; \theta)$.
(b) By the non-negativity of the KL divergence,
$\ln p(\mathcal{D}; \theta) = \sum_{n=1}^{N} \mathbb{E}_{q_n(z^{(n)})}\left[\ln \frac{p(x^{(n)}, z^{(n)}; \theta)}{q_n(z^{(n)})}\right] + \sum_{n=1}^{N} \mathbb{E}_{q_n(z^{(n)})}\left[\ln \frac{q_n(z^{(n)})}{p(z^{(n)} \mid x^{(n)}; \theta)}\right] = \mathcal{L}(q; \theta) + \mathrm{KL}\big(q(z)\,\|\,p(z \mid \mathcal{D}; \theta)\big)$.
Since $\mathrm{KL}\big(q(z)\,\|\,p(z \mid \mathcal{D}; \theta)\big) \ge 0$, we finish the proof; the gap is exactly this KL divergence.

2. $\gamma_k = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}$.
(If the result is correct, full marks. If the result is incorrect but the Bayes-rule derivation is shown correctly, 2 points.)

3. $\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}$, $\quad \Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_k^{(n)} \big(x^{(n)} - \mu_k\big)\big(x^{(n)} - \mu_k\big)^\top$, $\quad \pi_k = \frac{N_k}{N}$, with $N_k = \sum_{n=1}^{N} \gamma_k^{(n)}$.
(If the result is correct, 2 points for each parameter. If the result is incorrect but the maximization derivation of the likelihood is shown correctly, 1 to 1.5 points for each parameter, depending on whether the derivations are shown in detail.)
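To make the E-step and M-step updates above concrete, here is a minimal NumPy sketch (not part of the original exam) of a single EM iteration for a GMM, following the update formulas in the solution of 2.5; the toy data, K = 2, and the initialization are assumptions made purely for illustration.

```python
# One EM iteration for a GMM: E-step (responsibilities gamma), then M-step (mu, Sigma, pi).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(4, 1, size=(50, 2))])       # N x d toy data (assumption)
N, d = X.shape
K = 2
pi = np.full(K, 1.0 / K)
mu = X[rng.choice(N, K, replace=False)]                # K x d, random initialization
Sigma = np.array([np.eye(d) for _ in range(K)])        # K x d x d

# E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                        for k in range(K)])            # N x K
gamma = dens / dens.sum(axis=1, keepdims=True)

# M-step: N_k = sum_n gamma[n, k]; then mu_k, Sigma_k, pi_k as in the solution.
Nk = gamma.sum(axis=0)
mu = (gamma.T @ X) / Nk[:, None]
for k in range(K):
    diff = X - mu[k]
    Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
pi = Nk / N
print(pi, mu)
```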
2.6 (8 points) Consider a binary classification task with the following training data. Each example is described by 3 attributes: Color, Size, Shape.

Example  Color  Shape   Size   Class
1        Red    Square  Big    +
2        Blue   Square  Big    +
3        Red    Square  Big    +
4        Red    Circle  Big    +
5        Green  Circle  Big    -
6        Blue   Circle  Small  -
7        Blue   Circle  Small  -
8        Green  Square  Big    -

1. Use the entropy method to find the best attribute for the root node. (Hint: calculate the entropy and the information gain.) (4 points)
2. Build a complete tree according to the entropy method. The tree should include the class label and the training data in each node. (2 points)
3. Compute the classification error of each node of the tree. (2 points)
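For part 1, the following is a short sketch (not part of the original exam) that computes the entropy of the class label and the information gain of each attribute on the 8 examples above; the attribute with the largest gain would be the root split.

```python
# Entropy and information gain for the 8 training examples in question 2.6.
from collections import Counter
from math import log2

data = [  # (Color, Shape, Size, Class)
    ("Red",   "Square", "Big",   "+"),
    ("Blue",  "Square", "Big",   "+"),
    ("Red",   "Square", "Big",   "+"),
    ("Red",   "Circle", "Big",   "+"),
    ("Green", "Circle", "Big",   "-"),
    ("Blue",  "Circle", "Small", "-"),
    ("Blue",  "Circle", "Small", "-"),
    ("Green", "Square", "Big",   "-"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [row[3] for row in data]
H = entropy(labels)                  # 4 "+" and 4 "-" examples give 1.0 bit
print("H(Class) =", H)

for idx, name in enumerate(["Color", "Shape", "Size"]):
    cond = 0.0
    for v in set(row[idx] for row in data):
        subset = [r[3] for r in data if r[idx] == v]
        cond += len(subset) / len(data) * entropy(subset)   # weighted child entropy
    print(f"Gain({name}) = {H - cond:.3f}")
```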