Tutorial 1: Oct. 4, 2024

Part I: Background

Suppose we have a probability distribution or density $p(x; \theta)$, where $x$ may be discrete or continuous depending on the problem we are interested in. $\theta$ specifies the parameters of this distribution, such as the mean and the variance of a one-dimensional Gaussian. Different settings of the parameters imply different distributions over $x$. The available data, when interpreted as samples $x_1, \ldots, x_n$ from one such distribution, should favor one setting of the parameters over another. We need a formal criterion for gauging how well any potential distribution $p(\cdot \mid \theta)$ "explains" or "fits" the data. Since $p(x \mid \theta)$ is the probability of reproducing any observation $x$, it seems natural to try to maximize this probability. This gives rise to the Maximum Likelihood estimation criterion for the parameters $\theta$:

$$\hat{\theta}_{ML} = \arg\max_{\theta} L(x_1, \ldots, x_n; \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) \qquad (1)$$

where we have assumed that each data point $x_i$ is drawn independently from the same distribution, so that the likelihood of the data is $L(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$. The likelihood is viewed primarily as a function of the parameters, a function that depends on the data. The above expression can be quite complicated (depending on the family of distributions we are considering), which makes maximization technically challenging. However, any monotonically increasing function of the likelihood has the same maxima. One such function is the log-likelihood $\log L(x_1, \ldots, x_n; \theta)$; taking the log turns the product into a sum, making derivatives significantly simpler. We will therefore maximize the log-likelihood instead of the likelihood.
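To make the criterion concrete before turning to the problems, here is a minimal numerical sketch (an illustration only, not part of the assignment): it maximizes the Gaussian log-likelihood of a synthetic sample with a generic optimizer and compares the result with the closed-form sample statistics. The function name, the synthetic data, and the use of scipy.optimize.minimize are choices of this sketch, not something prescribed by the tutorial.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood of i.i.d. data; params = (mu, log_sigma)."""
    mu, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)      # parameterize by log(sigma) to keep sigma > 0
    n = x.size
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2.0 * sigma2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)   # synthetic sample

# Maximize the log-likelihood numerically by minimizing its negative.
res = minimize(gaussian_neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_hat, sigma2_hat = res.x[0], np.exp(2.0 * res.x[1])

# The numerical optimum should agree with the closed-form ML estimates (the sample
# mean and the 1/n sample variance), which Problem 1 asks you to derive analytically.
print(mu_hat, sigma2_hat)
print(x.mean(), x.var())                      # np.var uses the 1/n convention by default
```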
Problem 1: Maximum Likelihood Estimation

Consider a sample of $n$ real numbers $x_1, x_2, \ldots, x_n$ drawn independently from the same distribution that needs to be estimated. Assuming that the underlying distribution belongs to one of the following parametrized families, the goal is to estimate its parameters (each family should be treated separately):

$$\text{Uniform:} \quad p(x; a) = \begin{cases} 1/a & \text{for } x \in [0, a] \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

$$\text{Exponential:} \quad p(x; \eta) = \frac{1}{\eta} \exp(-x/\eta), \quad \eta > 0 \qquad (3)$$

$$\text{Gaussian:} \quad p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \qquad (4)$$

1. (10 points) Derive the maximum likelihood estimators $\hat{a}_{ML}$, $\hat{\eta}_{ML}$, $\hat{\mu}_{ML}$, $\hat{\sigma}^2_{ML}$. The estimators should be obtained by maximizing the log-likelihood of the dataset under each of the families, and should be a function of $x_1, x_2, \ldots, x_n$ only.

To assess how well an estimator $\hat{\theta}$ recovers the underlying value of the parameter $\theta$, we study its bias and variance. The bias is defined as the expectation of the deviation from the true value under the true distribution of the sample $(X_1, X_2, \ldots, X_n)$:

$$\mathrm{bias}(\hat{\theta}) = \mathbb{E}_{X_i \sim P(X \mid \theta)}\left[\hat{\theta}(X_1, X_2, \ldots, X_n)\right] - \theta \qquad (5)$$

Biased (i.e. with a non-zero bias) estimators systematically under-estimate or over-estimate the parameter. The variance of the estimator

$$\mathrm{var}(\hat{\theta}) = \mathbb{E}_{X_i \sim P(X \mid \theta)}\left[\left(\hat{\theta}(X_1, X_2, \ldots, X_n) - \mathbb{E}\left[\hat{\theta}(X_1, X_2, \ldots, X_n)\right]\right)^2\right] \qquad (6)$$

measures the anticipated uncertainty in the estimated value due to the particular selection $(x_1, x_2, \ldots, x_n)$ of the sample. Note that the concepts of bias and variance of estimators are similar to the concepts of structural and approximation errors, respectively. Estimators that minimize both bias and variance are preferred, but typically there is a trade-off between bias and variance.

2. (10 points) Show that $\hat{a}_{ML}$ is biased (no need to compute the actual value of the bias), and that $\hat{\eta}_{ML}$ and $\hat{\mu}_{ML}$ are unbiased.

3. (optional) Show that $\hat{\sigma}^2_{ML}$, equal to the sample variance $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, is biased. Show that the ML estimator of the variance becomes unbiased after multiplication with $n/(n-1)$. Let $\hat{\sigma}^2_{n-1}$ be this new estimator.
4. (optional) A standard way to balance the tradeoff between bias and variance is to choose estimators of lower mean squared error: $\mathrm{MSE}(\hat{\theta}) = \mathbb{E}_{X_i \sim P(X \mid \theta)}\left[(\hat{\theta} - \theta)^2\right]$. Show that $\mathrm{MSE}(\hat{\theta}) = \mathrm{bias}(\hat{\theta})^2 + \mathrm{var}(\hat{\theta})$, and that $\mathrm{MSE}(\hat{\sigma}^2_{ML}) < \mathrm{MSE}(\hat{\sigma}^2_{n-1})$ even though $\hat{\sigma}^2_{ML}$ is biased.

Problem 2: Maximum A-Posteriori Estimation

We want to determine the bias of an unfair coin for "heads" or "tails" from observing the outcome of a series of tosses. We model the coin by a single parameter $\theta$ that represents the probability of tossing heads. Given $n$ independent observed tosses $D = \{x_1, \ldots, x_n\}$, out of which $n_H$ are "heads", the likelihood function is:

$$p(D \mid \theta) = \theta^{n_H}(1 - \theta)^{n - n_H} \qquad (7)$$

1. (5 points) Show that $\hat{\theta}_{ML} = n_H / n$. Thus, if we toss the coin only once and we see "tails" ($n = 1$ and $n_H = 0$), according to maximum likelihood flipping the coin should always result in "tails".

While the maximum likelihood estimator is accurate on large training samples, if data is very scarce the estimated value is not that meaningful (on small samples the variance of the estimator is very high and it overfits easily). In contrast, in Maximum A-Posteriori (MAP) estimation we compensate for the lack of information due to limited observations with an a priori preference on the parameters based on prior knowledge we might have. In the case of the coin toss, for instance, even without seeing any tosses we can assume that the coin should be able to show both "heads" and "tails" ($\theta \neq 0, 1$).

We express the prior preference/knowledge about $\theta$ by a distribution $p(\theta)$ (the prior). Assuming that $\theta$ and the observed sample are characterized by an underlying joint probability $p(\theta, D)$, we can use Bayes' rule to express our adjusted belief about the parameters after observing the trials (the posterior):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \qquad (8)$$

where $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ normalizes the posterior. Maximization of the posterior distribution gives rise to the Maximum A-Posteriori (MAP) estimate of the parameters:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} p(D \mid \theta)\, p(\theta) \qquad (9)$$

As in maximum likelihood, to compute the MAP estimate it is often easier to maximize the logarithm $\log p(\theta) + \log p(D \mid \theta)$.
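As an illustration of this log-posterior recipe (not part of the graded problems), the sketch below evaluates $\log p(\theta) + \log p(D \mid \theta)$ for the coin-toss likelihood (7) on a grid of $\theta$ values and picks the maximizer. The prior used here is a Beta-shaped density, which also appears below as $p_2(\theta)$; the specific counts, the hyperparameter values, and the grid search are assumptions of this sketch rather than part of the tutorial.

```python
import numpy as np
from scipy.stats import beta

def coin_log_likelihood(theta, n, n_heads):
    """log p(D | theta) for n tosses with n_heads heads, as in Eq. (7)."""
    return n_heads * np.log(theta) + (n - n_heads) * np.log(1.0 - theta)

# Example data: a single toss that comes up "tails", the case where ML collapses to 0.
n, n_heads = 1, 0

# Hypothetical prior choice for this sketch: a Beta-shaped prior peaked near 0.5.
alpha, beta_param = 5.0, 5.0

theta_grid = np.linspace(1e-4, 1.0 - 1e-4, 10_000)   # avoid log(0) at the endpoints
log_posterior = coin_log_likelihood(theta_grid, n, n_heads) \
                + beta.logpdf(theta_grid, alpha, beta_param)

theta_map = theta_grid[np.argmax(log_posterior)]      # grid approximation of the MAP
theta_ml = n_heads / n

print(f"ML estimate:  {theta_ml:.3f}")    # 0.000 -- "the coin always lands tails"
print(f"MAP estimate: {theta_map:.3f}")   # pulled toward the prior's peak at 0.5
```

A closed-form expression for the MAP estimate under a Beta prior is exactly what part 3 below asks you to derive; the grid search here only approximates it numerically.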
For the coin toss we will consider separately each of the following priors:

$$\text{Discrete:} \quad p_1(\theta) = \begin{cases} 0.5 & \text{if } \theta = 0.5 \\ 0.5 & \text{if } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

$$\text{Beta:} \quad p_2(\theta) = \frac{1}{Z}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \qquad (11)$$

Here $\alpha$ and $\beta$ are hyperparameters that should be given, not estimated, and $Z$ is a normalization constant needed to make $p_2(\theta)$ integrate to 1, whose actual value is not important.

2. (10 points) Prior $p_1(\theta)$ translates into a strong belief that the coin is either fair, or biased towards "tails" with a "heads" probability of 0.4. Express the MAP estimate $\hat{\theta}^{1}_{MAP}$ under this prior as a function of $n_H/n$.

3. (10 points) The Beta prior expresses the belief that $\theta$ is likely to be near $\alpha/(\alpha + \beta)$. The larger $\alpha + \beta$ is, the more peaked the prior, and the stronger the bias that $\theta$ is close to $\alpha/(\alpha + \beta)$. Derive $\hat{\theta}^{2}_{MAP}$ under the Beta prior and show that when $n$ approaches infinity the MAP estimate approaches the ML estimate; thus the prior becomes irrelevant given a large number of observations.

4. (optional) Compare qualitatively $\hat{\theta}^{1}_{MAP}$ and $\hat{\theta}_{ML}$. Assuming that the coin has a true "heads" probability of 0.41, which of the two estimators is likely to learn it faster? If data is sufficient, which of the two estimators is better?
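For the optional comparison above, one way to build intuition (a simulation sketch under assumed settings, not a required or definitive answer) is to repeatedly draw samples from a coin with true "heads" probability 0.41 and measure the average error of the ML estimate and of the MAP estimate under the discrete prior $p_1$. The MAP estimate is computed here by directly comparing the posterior at the only two values with non-zero prior mass; the sample sizes and repetition counts are arbitrary choices of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.41                      # true "heads" probability from the question above

def map_discrete_prior(n_heads, n):
    """MAP under prior p1: compare the log-posterior at the two candidates 0.5 and 0.4."""
    candidates = np.array([0.5, 0.4])
    # prior mass is 0.5 on each candidate, so it does not change which one wins
    log_post = n_heads * np.log(candidates) + (n - n_heads) * np.log(1.0 - candidates)
    return candidates[np.argmax(log_post)]

for n in [10, 100, 10_000]:
    trials = 2_000
    heads = rng.binomial(n, true_theta, size=trials)   # n_H for each simulated sample
    ml_err = np.mean(np.abs(heads / n - true_theta))
    map_err = np.mean(np.abs([map_discrete_prior(h, n) - true_theta for h in heads]))
    print(f"n={n:6d}  mean |error|  ML: {ml_err:.4f}   MAP (p1): {map_err:.4f}")
```

The printed error averages give one empirical data point for thinking about the question; the qualitative comparison itself is left to the problem.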