STA365 Module 9

Bayes9_GaussianProcesses
March 27, 2024

Class Activity 9: Mixture Model Residual Quiz (10 minutes)

1. Describe what a sample from a Dirichlet Process Mixture distribution is and how to produce it. Hint: how does a Dirichlet Process Mixture distribution sample differ from a Dirichlet Process sample?

2. How can the following be updated to specify a nonparametric residual distribution for a linear regression model?

```python
with pm.Model(coords={"cluster": range(k)}) as model:
    m = pm.Normal("m", mu=mus, sigma=sigmas, initval=initvals, dims="cluster",
                  transform=pm.distributions.transforms.univariate_ordered)
    s = pm.HalfNormal("s", sigma=1, dims="cluster")
    weights = pm.Dirichlet("w", 2*np.ones(k), dims="cluster")
    v = pm.Categorical('v', p=weights, size=n)
    # v = pm.Multinomial('v', n=1, p=p, size=n)
    pm.Normal('y', mu=m[v], sigma=s[v], observed=y)
    # pm.NormalMixture("y", w=weights, mu=m, sigma=s, observed=y)
```

3. Could it be further changed to provide a fully nonparametric model for $E[y] = f(x)$ if $f$ is a monotonically increasing function rising from a minimum value of $\lim_{x\to-\infty} f(x) = 0$ to a maximum value of $\lim_{x\to\infty} f(x) = 1$?

Class Activity 9: Mixture Model Residual Review (10 minutes)

1. A Dirichlet Process sample is a discrete distribution, while a sample from a Dirichlet Process Mixture distribution is a continuous distribution (a sampled mixture model distribution):

$$p \sim DP(\alpha, p_0), \qquad p = \sum_{j=1}^{\infty} w_j \delta_{y_j}, \qquad w_j = \beta_j \prod_{k=1}^{j-1}(1-\beta_k), \qquad \beta_j \sim \textrm{Beta}(1, \alpha=2), \qquad y_j \sim p_0 = N(0,1)$$

$$f_\theta = N(\theta, \sigma=0.3), \qquad f \sim DPM(\alpha, p_0, N(\cdot, \sigma=0.3)), \qquad f = \int f_\theta\, p(\theta)\, d\theta \approx \sum_{i=1}^{n} w_i\, N(y_i, \sigma=0.3)$$

(the first parameter of the Beta is 1 and the second is the DP concentration $\alpha=2$). So a DP sample places stick-breaking weights $w_j$ on atoms $y_j$ drawn from the base distribution $p_0$, while a DPM sample smooths each atom with a continuous kernel $N(y_j, \sigma=0.3)$ and mixes those kernels with the same weights; see the short simulation sketch just below.
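To make answer 1 concrete, here is a minimal truncated stick-breaking sketch (not from the notes; the truncation level `J`, the random seed, and the plotting grid are assumptions) that draws one DP sample and the corresponding DPM density using the $\textrm{Beta}(1, \alpha=2)$, $p_0 = N(0,1)$, and $N(\cdot, \sigma=0.3)$ choices above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(365)
alpha, J = 2, 50                                  # DP concentration and (assumed) truncation level

# Dirichlet Process sample: a DISCRETE distribution p = sum_j w_j * delta_{y_j}
beta = rng.beta(1, alpha, size=J)                 # beta_j ~ Beta(1, alpha)
w = beta * np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])   # w_j = beta_j * prod_{k<j}(1 - beta_k)
atoms = rng.normal(0, 1, size=J)                  # y_j ~ p_0 = N(0, 1)

# Dirichlet Process Mixture sample: a CONTINUOUS density mixing N(y_j, sigma=0.3) kernels
grid = np.linspace(-4, 4, 400)
dpm_pdf = (w[:, None] * stats.norm(atoms[:, None], 0.3).pdf(grid)).sum(axis=0) / w.sum()
```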
2. How can the following be updated to specify a nonparametric residual distribution for a linear regression model? Add regression coefficients $b$ and move the linear predictor $Xb$ into the location of $y$, so the ordered normal mixture indexed by $v$ now models the residuals:

```python
with pm.Model(coords={"cluster": range(k)}) as model:
    m = pm.Normal("m", mu=mus, sigma=sigmas, initval=initvals, dims="cluster",
                  transform=pm.distributions.transforms.univariate_ordered)
    s = pm.HalfNormal("s", sigma=1, dims="cluster")
    weights = pm.Dirichlet("w", 2*np.ones(k), dims="cluster")
    v = pm.Categorical('v', p=weights, size=n)
    # v = pm.Multinomial('v', n=1, p=p, size=n)
    b = pm.Normal("b", mu=beta_mus, sigma=beta_sigmas, shape=q)
    pm.Normal('y', mu=m[v] + X@b, sigma=s[v], observed=y)
```

3. Could it be further changed to provide a fully nonparametric model for $E[y] = f(x)$ if $f$ is a monotonically increasing function rising from a minimum value of $\lim_{x\to-\infty} f(x) = 0$ to a maximum value of $\lim_{x\to\infty} f(x) = 1$? Yes: a mixture CDF is exactly such a function, so $f$ can itself be modeled as the CDF of a (Dirichlet process) normal mixture, as the model below does. First, simulate data whose true mean function is an average of normal CDFs:

[7]:
```python
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

n = 100
x = np.linspace(-10, 10, n)
y_true = (stats.norm(-5,1).cdf(x) + stats.norm(0,1).cdf(x) + stats.norm(5,1).cdf(x))/3
y = y_true + 0.05*(stats.gamma(1,1).rvs(n) - stats.gamma(1,1).mean())
plt.plot(x, y, '.'); plt.plot(x, y_true);
```
[150]:
```python
import pymc as pm
import pytensor.tensor as pt

def stick_breaking(beta):
    portion_remaining = pt.concatenate([[1], pt.extra_ops.cumprod(1 - beta)[:-1]])
    return beta * portion_remaining

k = 10; mus = np.linspace(-5, 5, k); sigmas = np.array(k*[1]); initvals = mus.copy()

with pm.Model(coords={"f_cluster": range(k), "e_cluster": range(k)}) as model:
    alpha_f = pm.Gamma("alpha_f", 1.0, 1.0)
    beta_f = pm.Beta("beta_f", 1.0, alpha_f, dims="f_cluster")
    w_f = pm.Deterministic("w_f", stick_breaking(beta_f), dims="f_cluster")
    m_f = pm.Normal("m_f", mu=mus, sigma=sigmas, initval=initvals, dims="f_cluster",
                    transform=pm.distributions.transforms.univariate_ordered)
    s_f = pm.HalfNormal("s_f", sigma=1, dims="f_cluster")
    p_f = pm.NormalMixture("p_f", w=w_f/w_f.sum(), mu=m_f, sigma=s_f)
    f = pm.Deterministic("f", pm.math.exp(pm.logcdf(p_f, x)))
    alpha_e = pm.Gamma("alpha_e", 1.0, 1.0)
    beta_e = pm.Beta("beta_e", 1.0, alpha_e, dims="e_cluster")
    w_e = pm.Deterministic("w_e", stick_breaking(beta_e), dims="e_cluster")
    m_e = pm.Normal("m_e", mu=mus/50+0.05, sigma=sigmas/5, initval=initvals/50+0.05, dims="e_cluster",
                    transform=pm.distributions.transforms.univariate_ordered)
    s_e = pm.HalfNormal("s_e", sigma=sigmas/5, dims="e_cluster")
    v = pm.Categorical('v', p=w_e/w_e.sum(), size=n)
    # v = pm.Multinomial('v', n=1, p=p, size=n)
    pm.Normal('y', mu=m_e[v] + f, sigma=s_e[v], observed=y)
    # e = pm.NormalMixture("y", w=w_e, mu=m_e, sigma=s_e)
    # llike = pm.Potential("llike", pm.logp(e, y-f))

with model:
    idata = pm.sample()  # (tune=5000, init="advi", target_accept=0.95)
```

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [alpha_f, beta_f, m_f, s_f, p_f, alpha_e, beta_e, m_e, s_e]
>CategoricalGibbsMetropolis: [v]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 111 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
There were 136 divergences after tuning. Increase `target_accept` or reparameterize.

[151]:
```python
import arviz as az
fig, ax = plt.subplots(2, 3, figsize=(18,5))
az.plot_trace(idata, var_names=["m_e","s_e","w_e"], axes=ax.T);
```

[152]:
```python
for i in range(100):
    plt.plot(x, idata.posterior['f'].values[0][i,:], 'k', alpha=0.1)
plt.plot(x, y, '.'); plt.plot(x, y_true)
```
[154]:
```python
support = np.linspace(-0.15, 0.3, 100); ave_mixture = 0*support; reps = 100
for j in range(reps):
    chain, draw = np.random.choice([0,1,2,3]), np.random.choice(np.linspace(0,999,1000,dtype=int))
    mixture = stats.norm(idata.posterior['m_e'][chain,draw],
                         idata.posterior['s_e'][chain,draw]).pdf(np.array([support]*k).reshape(k,100).T)
    mixture = (mixture*idata.posterior['w_e'][chain,draw].values).sum(axis=1)
    plt.plot(support, mixture, c='gray', alpha=0.1); ave_mixture += mixture/reps
plt.plot(support, ave_mixture, 'k', lw=2)
plt.hist(y-y_true, density=True);
```
[155]:
```python
fig, ax = plt.subplots(3, 4)
for c in range(4):
    ax[0,c].bar(height=idata.posterior['w_e'].values[c].mean(axis=0), x=np.arange(10)); ax[0,c].set_title('w_e'+'chain'+str(c))
    ax[1,c].bar(height=idata.posterior['m_e'].values[c].mean(axis=0), x=np.arange(10)); ax[1,c].set_title('m_e'+'chain'+str(c))
    ax[2,c].bar(height=idata.posterior['s_e'].values[c].mean(axis=0), x=np.arange(10)); ax[2,c].set_title('s_e'+'chain'+str(c))
plt.tight_layout()
```
[156]:
```python
fig, ax = plt.subplots(3, 4)
for c in range(4):
    ax[0,c].bar(height=idata.posterior['w_f'].values[c].mean(axis=0), x=np.arange(10)); ax[0,c].set_title('w_f'+'chain'+str(c))
    ax[1,c].bar(height=idata.posterior['m_f'].values[c].mean(axis=0), x=np.arange(10)); ax[1,c].set_title('m_f'+'chain'+str(c))
    ax[2,c].bar(height=idata.posterior['s_f'].values[c].mean(axis=0), x=np.arange(10)); ax[2,c].set_title('s_f'+'chain'+str(c))
plt.tight_layout()
```
Gaussian Process Regression (15 minutes)

Suppose $y_i = f(x_i) + \epsilon_i$ for $\epsilon_i \sim N(0, \sigma^2)$ and smooth function $f$. Gaussian Process Regression uses a Gaussian Process (GP) model for the smooth function $f$:

$$f(x) \sim GP(\mu, K) = N(\mu = \mu(x), \Sigma = K(x, x))$$

so $y \sim N(f(x), I\sigma^2)$ with

$$\mu = \underbrace{[\mu(x_1), \ldots, \mu(x_n)]}_{\text{The "regression"}} \quad \text{and} \quad K = \underbrace{\begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n) \end{bmatrix}}_{\text{The "covariance kernel matrix"}}$$

Random variables whose dependency structure and instantiation occur over an indexable domain (time, space, etc.) are called stochastic processes; so, a GP is a multivariate Gaussian random variable (MVN) which is a stochastic process realized over an indexable domain $\mathcal{X}$, defining dependency structure through a so-called covariance kernel $K$.

Example covariance kernels are available here, with the common exponential quadratic kernel specification being
$$K(x_1, x_2) = \eta^2 \exp\left(-\frac{\|x_1 - x_2\|^2}{\lambda^2}\right)$$

for amplitude $\eta$ and length scale $\lambda$, so the "farther apart" you are on the indexable domain $\mathcal{X}$, the smaller the covariance.

What we see of GPs are MVNs (multivariate normal distributions) (8 minutes)

A GP is the collection of random variables $\{y_x \mid x \in \mathcal{X}\}$ defined over an indexable domain $\mathcal{X}$ for which any arbitrary finite subset $\{x_1, x_2, \ldots, x_n\} \subset \mathcal{X}$, $\{y_{x_1}, y_{x_2}, \ldots, y_{x_n}\}$ follows a MVN (further detailed in this video and textbook).

- GPs are infinite collections of random variables for infinite $\mathcal{X}$, slightly generalizing MVNs, which are finite collections of random variables.
- All that's needed to define a GP (assuming a mean-0 specification) is a covariance kernel defining a symmetric and positive-definite covariance matrix for $\{y_{x_1}, y_{x_2}, \ldots, y_{x_n}\}$ as a function of $\{x_1, x_2, \ldots, x_n\}$.

[64]:
```python
import numpy as np; from scipy import stats; import matplotlib.pyplot as plt

np.random.seed(365); sigma2, _lambda = 1, 1; n = 70
x = np.sort(stats.uniform(-5, 10).rvs(size=n))[:, np.newaxis]
K = sigma2*np.exp(-(x - x.T)**2/_lambda**2); K = (K + K.T)/2
plt.figure(figsize=(18,4))
plt.plot(x, stats.multivariate_normal(cov=K, allow_singular=True).rvs(1), '.');
```

What we see of GPs are MVNs (multivariate normal distributions) (7 minutes)

The covariance kernel matrix $K$ implied by the sampled $x$ values can be visualized directly:
[73]:
```python
K = sigma2*np.exp(-(x - x.T)**2/_lambda**2); K = (K + K.T)/2
fig, ax = plt.subplots(1, 3, figsize=(15,15)); ax[1].imshow(K)
ax[0].imshow(K[:10,:10]); ax[2].imshow(K[-10:,-10:])
xi, xj = np.meshgrid(np.arange(10, dtype=int), np.arange(10, dtype=int))
for i, j, s in zip(xi.flatten(), xj.flatten(), K[:10,:10].round(2).flatten()):
    ax[0].text(x=i, y=j, s=s, verticalalignment='center', horizontalalignment='center')
```

Bayesian GPs (10 minutes)

GPs readily admit a hierarchical Bayesian (generative modeling) specification:

$$
\begin{align}
f(x) &\sim GP(\mu, K) = N(\mu = \mu(x), \Sigma = K_{\eta,\lambda}(x, x)) \\
\eta &\sim p(\eta), \quad \lambda \sim p(\lambda) \\
x &\sim p(X) \propto 1 \quad \text{or just taken as given and "fixed"} \\
y &\sim N(E[y] = f(x), \text{Var}[y] = I\sigma^2) \quad \text{with } \sigma^2 \sim p(\sigma^2) \\
&\qquad \text{for an independent homoskedastic variance prior}
\end{align}
$$

so

$$p(f(x) \mid -) \propto N(E[y] = f(x), \text{Var}[y] = I\sigma^2) \times N(\mu = \mu(x), \Sigma = K(x, x)) \times p(\eta, \lambda, \sigma^2, x)$$

which even has a closed analytical posterior for MVN conjugate priors.

PyMC has many GP tutorials, e.g., on covariance functions and latent (versus marginalized) GP implementations.
[86]:
```python
k, b, x_min, x_max, n = 1, 5, -50, 50, 150  # Haining Tan

def aleatoric(x):  # variability function for a particular x
    r = (x - x_min)/(x_max - x_min)
    return 2*r

def generate_data(n):
    x = (x_max - x_min)*np.random.rand(n) + x_min
    noise = np.random.randn(n)*aleatoric(x)
    y = (k*x*(1 + np.sin(x)) + b) + noise
    x = x[..., np.newaxis]
    return x, y
```

PyMC GPs (5 minutes)

[96]:
```python
x_train, y_train = generate_data(n); x_test, y_test = generate_data(n)
import matplotlib.gridspec as gridspec
fig = plt.figure(tight_layout=True, figsize=(16,4)); gs = gridspec.GridSpec(1, 4)
ax = fig.add_subplot(gs[0]); ax.plot(x_train, y_train, "b.")
ax = fig.add_subplot(gs[1:]); ax.plot(x_train, y_train, "w.");
```

[131]:
```python
import pymc as pm  # adapted from Haining Tan

with pm.Model() as gp_model:
    ita = pm.HalfNormal('ita', sigma=5); iota = pm.Gamma('iota', alpha=4, beta=2)
    sq_exp = ita**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=iota)  # exponential quadratic kernel
    gp = pm.gp.Marginal(cov_func=sq_exp)
    sigma = pm.HalfNormal('sigma', sigma=5, initval=1)
    y_ = gp.marginal_likelihood('y', X=x_train, y=y_train, sigma=sigma)

with gp_model:
    tr = pm.sample(target_accept=0.95, return_inferencedata=True)
```

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [ita, iota, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 731 seconds.
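The cell above uses the marginalized implementation (`pm.gp.Marginal` plus `marginal_likelihood`), which integrates $f$ out analytically. For contrast with the "latent (versus marginalized)" implementations mentioned earlier, here is a minimal sketch (not from the notes; the priors and variable names are assumptions) in which $f$ is kept as an explicit random variable:

```python
import pymc as pm

with pm.Model() as latent_gp_model:
    ita = pm.HalfNormal('ita', sigma=5)
    iota = pm.Gamma('iota', alpha=4, beta=2)
    cov = ita**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=iota)
    gp = pm.gp.Latent(cov_func=cov)            # latent (non-marginalized) GP
    f = gp.prior("f", X=x_train)               # f(x_train) is sampled explicitly
    sigma = pm.HalfNormal('sigma', sigma=5)
    pm.Normal("y", mu=f, sigma=sigma, observed=y_train)
    # sampling explores f directly: slower, but the likelihood need not be normal
```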
PyMC GPs (5 minutes)

[159]:
```python
import arviz as az
fig, ax = plt.subplots(2, 3, figsize=(18,5))
az.plot_trace(tr, var_names=['ita','iota','sigma'], axes=ax.T); plt.tight_layout()
```

[161]:
```python
Xnew = np.linspace(-75, 75, 200)
with gp_model:
    predy = gp.conditional("predy", Xnew=Xnew[:, np.newaxis])
    ppred = pm.sample_posterior_predictive(tr, var_names=['predy'])
```
Sampling: [predy]

PyMC GPs (5 minutes)

[162]:
```python
fig = plt.figure(figsize=(18,5)); ax = plt.gca()
pm.gp.util.plot_gp_dist(ax, ppred.posterior_predictive['predy'].values.reshape(-1,4000,200)[0,:],
                        Xnew, plot_samples=True, palette='Blues')
ax.plot(x, y, '.');
```

[143]:
```python
Xnew = np.linspace(-50, 50, 200)
with gp_model:
    pred = gp.conditional("pred", Xnew=Xnew[:, np.newaxis])
    ppred = pm.sample_posterior_predictive(tr, var_names=['pred'])
```
Sampling: [pred]

[154]:
```python
fig = plt.figure(figsize=(18,5)); ax = plt.gca()
pm.gp.util.plot_gp_dist(ax, ppred.posterior_predictive['pred'].values.reshape(-1,4000,200)[0,:],
                        Xnew, plot_samples=True, palette='Blues')
ax.plot(x, y, '.');
```

Conditional MVN distributions (are MVN distributions) (15 minutes)

Given $x^*$ and observed $y$, the GP posterior $p(f(x^*) \mid -)$ characterizing uncertainty over smooth functions passing closely through $(x, y)$, according to the noise inference $p(\sigma^2 \mid -)$, is a conditional MVN distribution.

This follows since $f(\tilde{x} = (x, x^*))$ (given all model parameters) is MVN:

$$p\left(\tilde{x} \,\middle|\, \mu = \begin{bmatrix} \mu_x \\ \mu_{x^*} \end{bmatrix},\; \Sigma = \begin{bmatrix} \Sigma_{x} & \Sigma_{x,x^*} \\ \Sigma_{x^*,x} = \Sigma_{x,x^*}^T & \Sigma_{x^*} \end{bmatrix}\right) = \frac{\exp\left(-\frac{1}{2}(\tilde{x}-\mu)^T \Sigma^{-1} (\tilde{x}-\mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}}$$

where $\mu_x$ is $n \times 1$, $\mu_{x^*}$ is $m \times 1$, $\Sigma_x$ is $n \times n$, $\Sigma_{x,x^*}$ is $n \times m$, and $\Sigma_{x^*}$ is $m \times m$, and the conditional distribution $f(x^* \mid x)$ of a MVN is also MVN:

$$x^* \mid x, \mu, \Sigma \;\sim\; \text{MVN}\left(\bar{\mu} = \mu_{x^*} + \Sigma_{x^*,x}\, \Sigma_{x}^{-1} (x - \mu_x),\;\; \bar{\Sigma} = \Sigma_{x^*} - \Sigma_{x^*,x}\, \Sigma_{x}^{-1}\, \Sigma_{x,x^*}\right)$$

But if $\Sigma_{x^*,x} \approx 0_{m \times n}$ then $x^*$ is essentially independent of $x$, so there is no information in the updated prediction and GP samples are just informed by the prior specification of the GP.
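As a minimal sketch of these conditional MVN formulas (not from the notes; the kernel parameters, toy data, and noise level are assumptions), the GP posterior mean and covariance at new points can be computed directly with NumPy:

```python
import numpy as np

def expquad(a, b, amp=1.0, ls=1.0):
    # exponential quadratic kernel K(x1, x2) = amp^2 * exp(-||x1 - x2||^2 / ls^2)
    return amp**2 * np.exp(-(a[:, None] - b[None, :])**2 / ls**2)

def gp_conditional(x, y, x_star, sigma2=0.1):
    # conditional MVN (zero prior mean): mu_bar  = S_*x (S_x + sigma2*I)^{-1} y
    #                                    Sig_bar = S_*  - S_*x (S_x + sigma2*I)^{-1} S_x*
    S_x = expquad(x, x) + sigma2*np.eye(len(x))   # covariance of the noisy observed block
    S_sx = expquad(x_star, x)                     # cross-covariance block
    S_ss = expquad(x_star, x_star)                # covariance of the new-points block
    mu_bar = S_sx @ np.linalg.solve(S_x, y)
    Sig_bar = S_ss - S_sx @ np.linalg.solve(S_x, S_sx.T)
    return mu_bar, Sig_bar

x_obs = np.linspace(-3, 3, 25); y_obs = np.sin(x_obs) + 0.1*np.random.randn(25)  # toy data (assumed)
x_star = np.linspace(-5, 5, 100)
mu_bar, Sig_bar = gp_conditional(x_obs, y_obs, x_star)
# far from the data S_sx is ~0, so mu_bar reverts to the prior mean 0 and Sig_bar to the prior variance
```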
Variational Inference: Approximate Bayesian Inference (10 minutes)

Make Bayesian inference an optimization problem (like MLE inference) rather than a sampling problem (via MCMC).

KL divergence provides an objective function for producing an approximate Bayesian inference through optimization:

$$\min_q KL[\,q(\theta)\,\|\,p(\theta|y)\,] \quad \longrightarrow \quad \min_\phi KL[\,q_\phi(\theta)\,\|\,p(\theta|y)\,]$$

Kullback-Leibler (KL) divergence

Kullback-Leibler (KL) divergence measures a "distance between distributions" (remember the notation: $p(\theta)$ is "on the bottom"):

$$KL[\,q(\theta)\,\|\,p(\theta)\,] = \int \log\frac{q(\theta)}{p(\theta)}\, q(\theta)\, d\theta$$

which is nonnegative. KL divergence does not satisfy the triangle inequality, so it is not a metric:

$$KL[\,q(\theta)\,\|\,p(\theta)\,] \;\not\leq\; KL[\,q(\theta)\,\|\,r(\theta)\,] + KL[\,r(\theta)\,\|\,p(\theta)\,]$$

Variational Inference: Approximate Bayesian Inference (10 minutes)

The posterior $p(\theta|y)$ is approximated based on a KL divergence objective function, where $q(\theta)$ is often $q_\phi(\theta)$, a multivariate normal distribution with variational parameters $\phi = (\mu, \Sigma)$, a location and a constrained covariance matrix:

$$\min_q KL[\,q(\theta)\,\|\,p(\theta|y)\,] \quad \longrightarrow \quad \min_\phi KL[\,q_\phi(\theta)\,\|\,p(\theta|y)\,]$$

Instead of optimizing the KL divergence directly though, approximate Bayesian inference often indirectly optimizes the

$$\text{ELBO}(q) = \underbrace{E_{q(\theta)}[\log p(y|\theta)] - KL[\,q(\theta)\,\|\,p(\theta)\,]}_{\text{the Evidence Lower BOund}} \;\leq\; \log p(y)$$

which finds $q(\theta)$ in terms of a (unsurprisingly) balanced contribution from the prior and the likelihood, since

$$
\begin{align}
\underbrace{\log p(y)}_{\text{a fixed constant}} &= \int \log p(y)\, q(\theta)\, d\theta = \int \log \frac{p(\theta, y)}{p(\theta|y)} \frac{q(\theta)}{q(\theta)}\, q(\theta)\, d\theta = \int \log \frac{p(y|\theta)\, p(\theta)}{p(\theta|y)} \frac{q(\theta)}{q(\theta)}\, q(\theta)\, d\theta \\
&= \underbrace{\int \log p(y|\theta)\, q(\theta)\, d\theta}_{E_{q(\theta)}[\log p(y|\theta)]} + KL[\,q(\theta)\,\|\,p(\theta|y)\,] - KL[\,q(\theta)\,\|\,p(\theta)\,]
\end{align}
$$
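As a hedged illustration (not part of the original notes): in PyMC the ELBO-maximizing mean-field normal $q_\phi(\theta)$ described above can be fit with ADVI via `pm.fit`, giving an optimization-based alternative to `pm.sample`. The toy model, data, and iteration count below are assumptions.

```python
import numpy as np
import pymc as pm

y_obs = np.random.normal(1.0, 2.0, size=100)     # toy data (assumed)
with pm.Model() as toy_model:
    mu = pm.Normal("mu", 0, 10)
    sigma = pm.HalfNormal("sigma", 5)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)
    approx = pm.fit(n=20_000, method="advi")     # gradient-based minimization of -ELBO(q_phi)
idata_vi = approx.sample(1000)                   # draws from the fitted q_phi(theta), not from MCMC
```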
Variational Inference: Approximate Bayesian Inference (5 minutes)

Approximate Bayesian posterior inference is thus equivalently

$$\min_\phi KL[\,q_\phi(\theta)\,\|\,p(\theta|y)\,] \quad \text{or} \quad \max_\phi \text{ELBO}(q_\phi) \quad \text{or} \quad \min_\phi -\text{ELBO}(q_\phi)$$

where the ELBO$(q_\phi)$ equivalence may be derived directly from the desired Bayesian optimization:

$$
\begin{align}
KL[\,q_\phi(\theta)\,\|\,p(\theta|y)\,] &= \int \log\frac{q(\theta)}{p(\theta|y)}\, q(\theta)\, d\theta = \int \log\frac{q(\theta)\, p(y)}{p(y|\theta)\, p(\theta)}\, q(\theta)\, d\theta \\
&= \underbrace{\log p(y)}_{\text{fixed}} + \int \log\frac{q(\theta)}{p(\theta)}\, q(\theta)\, d\theta - \int \log p(y|\theta)\, q(\theta)\, d\theta \\
&= \underbrace{\log p(y)}_{\text{constant}} + \underbrace{KL[\,q(\theta)\,\|\,p(\theta)\,] - E_{q_\phi(\theta)}[\log p(y|\theta)]}_{-\text{ELBO}(q_\phi)}
\end{align}
$$

Variational Inference: Approximate Bayesian Inference (5 minutes)

Approximate Bayesian posterior inference is thus equivalently

$$\min_\phi KL[\,q_\phi(\theta)\,\|\,p(\theta|y)\,] \quad \text{or} \quad \max_\phi \text{ELBO}(q_\phi) \quad \text{or} \quad \min_\phi -\text{ELBO}(q_\phi)$$

or ELBO$(q_\phi)$ may be derived with Jensen's inequality, with expectations and no KL terms:

$$
\begin{align}
\log p(y) &= \log \int p(y,\theta) \frac{q(\theta)}{q(\theta)}\, d\theta = \log \int p(y|\theta)\, p(\theta) \frac{q(\theta)}{q(\theta)}\, d\theta = \log E_{q(\theta)}\!\left[\frac{p(y,\theta)}{q(\theta)}\right] = \log E_{q(\theta)}\!\left[\frac{p(y|\theta)\, p(\theta)}{q(\theta)}\right] \\
&\geq E_{q(\theta)}\!\left[\log \frac{p(y,\theta)}{q(\theta)}\right] = E_{q(\theta)}\!\left[\log \frac{p(y|\theta)\, p(\theta)}{q(\theta)}\right] \\
&= \underbrace{E_{q_\phi(\theta)}[\log p(y,\theta) - \log q(\theta)]}_{\text{ELBO}(q_\phi)\text{ (now use the EM algorithm...)}} = \underbrace{E_{q_\phi(\theta)}[\log p(y|\theta)] - KL[\,q_\phi(\theta)\,\|\,p(\theta)\,]}_{\text{ELBO}(q_\phi)}
\end{align}
$$

EM algorithm General Proof (ELBO is a special case) (10 minutes)

The EM algorithm is typically presented as a methodology for fitting mixture models, but of course (as you've been seeing) Bayesians don't need the EM algorithm to use (even nonparametric) mixture models.
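Before the general proof, here is a minimal numerical sketch of the EM updates for a two-component Gaussian mixture (not from the notes; the toy data, fixed unit variances, and starting values are assumptions), where the M-step maximizes the $Q(\phi \mid \phi^{(t)})$ objective defined in the derivation that follows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(365)
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])   # toy data (assumed)

pi_, mu = 0.5, np.array([-1.0, 1.0])   # phi = (pi, mu_0, mu_1); unit variances held fixed for brevity
for t in range(100):
    # E-step: responsibilities p(z_i = j | y_i, phi^(t)) for latent cluster indicators z_i
    dens = np.stack([(1 - pi_)*stats.norm(mu[0], 1).pdf(y),
                     pi_*stats.norm(mu[1], 1).pdf(y)])
    r = dens / dens.sum(axis=0)
    # M-step: maximize Q(phi | phi^(t)) = E_{p(z|y,phi^(t))}[log p(y, z | phi)] in closed form
    pi_ = r[1].mean()
    mu = (r*y).sum(axis=1) / r.sum(axis=1)

print(pi_, mu)   # approaches the mixing weight 0.4 and means (-2, 3) used to simulate y
```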
$$
\begin{align}
\log p(y|\phi) &= \log p(y|\phi) + \log p(\theta|y,\phi) - \log p(\theta|y,\phi) = \log p(y,\theta|\phi) - \log p(\theta|y,\phi) \\
&\quad \text{(now integrate } \theta \text{ out against } p(\theta|y,\phi^{(t)})\text{)} \\
&= \underbrace{E_{p(\theta|y,\phi^{(t)})}[\log p(y,\theta|\phi)]}_{Q(\phi|\phi^{(t)})} - \underbrace{E_{p(\theta|y,\phi^{(t)})}[\log p(\theta|y,\phi)]}_{H(\phi|\phi^{(t)})}
\end{align}
$$

$$\log p(y|\phi) - \log p(y|\phi^{(t)}) = Q(\phi|\phi^{(t)}) - H(\phi|\phi^{(t)}) - Q(\phi^{(t)}|\phi^{(t)}) + H(\phi^{(t)}|\phi^{(t)})$$

but since $H(\phi^{(t)}|\phi^{(t)}) \geq H(\phi|\phi^{(t)})$,

$$\log p(y|\phi) - \log p(y|\phi^{(t)}) \;\geq\; Q(\phi|\phi^{(t)}) - Q(\phi^{(t)}|\phi^{(t)})$$

so if $Q(\phi^{(t+1)}|\phi^{(t)}) > Q(\phi^{(t)}|\phi^{(t)})$ then $\log p(y|\phi^{(t+1)}) > \log p(y|\phi^{(t)})$, where $Q(\phi^{(t+1)}|\phi^{(t)}) = E_{p(\theta|y,\phi^{(t)})}[\log p(y,\theta|\phi^{(t+1)})]$.

Variational Autoencoders ("Nonlinear PCA") (5 minutes)

Let $q_\phi$ be some function which transforms $x \in \mathbb{R}^n$ to $v \in \mathbb{R}^m$ for $m < n$, while $g_\gamma$ is some function which transforms that $v \in \mathbb{R}^m$ back to $x' \in \mathbb{R}^n$.

Autoencoders seek to define $q_\phi$ and $g_\gamma$ such that $g_\gamma(q_\phi(x)) \approx x$ via

$$\min_{q_\phi, g_\gamma} \|x - g_\gamma(q_\phi(x))\|_2^2$$

Variational autoencoders approximate the "posterior" conditional distribution $p(v|x)$ with the variational distribution $q_{\phi(x)}(v)$ while taking $p(x|v) = g_{\gamma(v)}(x)$, based on variational inference:

$$\min_\phi KL\big[\,q_{\phi(x)}(v)\,\|\,p(v|x)\,\big] \quad \longrightarrow \quad \max_{\phi,\gamma} E_{q_{\phi(x)}(v)}[\log p(x|v) = g_{\gamma(v)}(x)] - KL\big[\,q_{\phi(x)}(v)\,\|\,p(v)\,\big]$$

Note though that this is not "Bayesian" posterior inference on the parameters $\phi$ and $\gamma$.

Variational Autoencoders ("Nonlinear PCA") (5 minutes)

$$\min_\phi KL\big[\,q_{\phi(x)}(v)\,\|\,p(v|x)\,\big] \quad \longrightarrow \quad \max_{\phi,\gamma} E_{q_{\phi(x)}(v)}[\log p(x|v) = g_{\gamma(v)}(x)] - KL\big[\,q_{\phi(x)}(v)\,\|\,p(v)\,\big]$$

Use neural networks $\phi_\mu(x)$, $\phi_\sigma(x)$, and $\gamma(v)$ parameterizing the distributions $q$ and $g$:

$$q_{\phi(x)}(v) = N(\mu = \phi_\mu(x), \sigma = \phi_\sigma(x)) \quad \text{and} \quad p(v) = N(\mu_0, \sigma_0)$$

so the KL term has a known analytical form for normal distributions.

Use a reparameterization trick

$$v = \phi_\mu(x) + \phi_\sigma(x)\,\epsilon = \mu_x + \sigma_x\,\epsilon, \qquad \epsilon \sim N(0,1)$$

so

$$\partial\, E_{q_{\phi(x)}(v)}[\log g_{\gamma(v)}(x)] = E_\epsilon[\partial \log g_{\gamma(v)}(x)] = E_\epsilon\Big[\underbrace{\tfrac{\partial}{\partial v}\log g_{\gamma(v)}(x)\,\partial v + \partial \log g_{\gamma(v)}(x)}_{\text{total derivative}}\Big]$$

Use a single sample $v_0 \sim q_{\phi(x)}$ for Monte Carlo estimation, $\partial E[\cdots] = E[\partial \cdots] \approx \partial \log g_{\gamma(v_0)}(x)$ and $\partial\big(E_{q_{\phi(x)}(v)}[\log g_{\gamma(v)}(x)] - KL\big[\,q_{\phi(x)}(v_0)\,\|\,p(v_0)\,\big]\big)$,
to provide estimates for gradient descent on the neural network parameters $\phi$ and $\gamma$.

Bayesian Deep Learning (BDL) (10 minutes)

- Variation from Dropout in a fitted neural network approximates a Gaussian process posterior distribution...
- Trajectories of stochastic gradient descent are "samples" from variational inference "approximate posteriors"...
- Stochastic NN Layers / Bayes by Backprop provide variational inference on neural network parameters...

The classic $\sigma(x^T w + b)$ layer for input $x$ (hereafter omitted for brevity) is a standard part of a neural network $p(y|w)$ and admits variational inference approximation of $p(w|y)$ with $q_\theta(w)$:

$$\min_\theta KL[\,q_\theta(w)\,\|\,p(w|y)\,] \quad \longrightarrow \quad \max_\theta E_{q_\theta(w)}\Big[\log\big(\underbrace{p(y|w)\,\overbrace{p(w) \div q_\theta(w)}^{\text{KL after expectation}}}_{f(w,\,y)}\big)\Big] = \max_{\theta=(\mu,\sigma)} E_\epsilon\Big[\underbrace{\log p(y|w) + \log p(w = \mu + \epsilon\sigma) - \log q_\theta(w = \mu + \epsilon\sigma)}_{\text{with the reparameterization trick } w_i = \mu_i + \epsilon_i\sigma_i,\; \epsilon_i \sim N(0,1)}\Big]$$

$$\frac{\partial}{\partial\theta} E_\epsilon[\cdots] = E_\epsilon\!\left[\frac{\partial}{\partial\theta} f(w, y)\right] \quad \text{and (total derivative)} \quad \frac{\partial}{\partial\theta} f(w, y) = \frac{\partial f(w, y)}{\partial w}\frac{\partial w}{\partial\theta} + \frac{\partial f(w, y)}{\partial\theta}$$

which is estimated with a single-sample Monte Carlo draw that optimizes $q_\theta(w) = N(\mu, \sigma) \approx p(w|y)$.
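As a small numerical check of the reparameterization-trick gradient estimator above (not from the notes; the toy likelihood $y \sim N(w, 1)$ and the parameter values are assumptions), single-sample gradients of $E_{q}[\log p(y|w)]$ with $q(w) = N(\mu, \sigma)$ average out to the exact gradients:

```python
import numpy as np

rng = np.random.default_rng(365)
y, mu, sigma = 2.0, 0.0, 1.0                # toy observation and variational parameters (assumed)

# reparameterization: w = mu + sigma*eps with eps ~ N(0,1); log p(y|w) = -0.5*(y - w)**2 + const
reps = 100_000
eps = rng.standard_normal(reps)
w = mu + sigma*eps
grad_mu = (y - w)                           # d/d mu    of log p(y|w) via the chain rule (dw/dmu = 1)
grad_sigma = (y - w)*eps                    # d/d sigma of log p(y|w) via the chain rule (dw/dsigma = eps)

print(grad_mu.mean(), y - mu)               # single-sample estimates average to the exact gradient y - mu
print(grad_sigma.mean(), -sigma)            # ... and to the exact gradient -sigma
```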