STA365 Module 9

.pdf

School

University of Toronto**We aren't endorsed by this school

Course

STA 365

Subject

Statistics

Date

Dec 28, 2024

Pages

Uploaded by MajorParrot3163

Bayes9_GaussianProcessesMarch 27, 2024Class Activty 9: Mixture Model Residual Quiz (10 minutes)1. Describe what a sample from aDirichlet Process Mixture distributionis and how toproduce it.Hint: how does aDirichlet Process Mixture distributionsample differ fromaDirichlet Processsample?2. How can the following be updated specify anonparametricresidual distribution for a linearregression model?withpm.Model(coords={"cluster": range(k)}) as model:m=pm.Normal("m", mu=mus, sigma=sigmas, initval=initvals, dims="cluster",transform=pm.distributions.transforms.univariate_ordered)s=pm.HalfNormal("s", sigma=1, dims="cluster")weights=pm.Dirichlet("w",2*np.ones(k), dims="cluster")v=pm.Categorical('v', p=weights, size=n)#v = pm.Multinomial('v', n=1, p=p, size=n)pm.Normal('y', mu=m[v], sigma=s[v], observed=y)#pm.NormalMixture("y", w=weights, mu=m, sigma=s, observed=y)3. Could it be further changed to provide a fullynonparametricmodel forE[y] =f(x)iffisa monotonical function increasing from a minimum value oflimx!-1f(x) = 0to a maximumvalue oflimx!1f(x) = 1?Class Activty 9: Mixture Model Residual Reivew (10 minutes)1. ADirichlet Processsample is a discrete distribution while a sample from aDirichletProcess Mixture distributionis a continuous distribution (which is a sampled mixturemodel distribution).✓i⇠p✓i⇠p=P1j=1wjδyjf✓i=N(✓i,σ= 0.3)⇠DPM(↵, p0, N(·,σ= 0.3))p⇠DP(↵, p0)wj=βjQj-1k=1(1-βk)yj⇠N(0,1)f=⇠DMPZf✓ip(✓i)d✓i⇡nXi=1wiN(yi,σ= 0.3)2,N(0,1)βj⇠Beta(↵[for Beta]= 1,β=↵= 2[from DP])2. How can the following be updated specify anonparametricresidual distribution for a linearregression model?withpm.Model(coords={"cluster": range(k)}) as model:m=pm.Normal("m", mu=mus, sigma=sigmas, initval=initvals, dims="cluster",transform=pm.distributions.transforms.univariate_ordered)s=pm.HalfNormal("s", sigma=1, dims="cluster")1

weights=pm.Dirichlet("w",2*np.ones(k), dims="cluster")v=pm.Categorical('v', p=weights, size=n)#v = pm.Multinomial('v', n=1, p=p, size=n)b=pm.Normal("b", mu=beta_mus, sigma=beta_sigmas, ims=q)pm.Normal('y', mu=m[v]+X@b, sigma=s[v], observed=y)3. Could it be further changed to provide a fullynonparametricmodel forE[y] =f(x)iffisa monotonical function increasing from a minimum value oflimx!-1f(x) = 0to a maximumvalue oflimx!1f(x) = 1?[7]:fromscipyimportstatsimportmatplotlib.pyplotaspltimportnumpyasnpn=100x=np.linspace(-10,10,n)y_true=(stats.norm(-5,1).cdf(x)+stats.norm(0,1).cdf(x)+stats.norm(5,1).cdf(x))/,!3y=y_true+ 0.05*(stats.gamma(1,1).rvs(n)-stats.gamma(1,1).mean())plt.plot(x,y,'.'); plt.plot(x,y_true);[150]:importpymcaspmimportpytensor.tensoraspt2

defstick_breaking(beta):portion_remaining=pt.concatenate([[1], pt.extra_ops.cumprod(1 -beta)[:,!-1]])returnbeta*portion_remainingk= 10; mus=np.linspace(-5,5,k); sigmas=np.array(k*[1]); initvals=mus.copy()withpm.Model(coords={"f_cluster":range(k),"e_cluster":range(k)})asmodel:alpha_f=pm.Gamma("alpha_f",1.0,1.0); beta_f=pm.Beta("beta_f",1.0,,!alpha_f, dims="f_cluster")w_f=pm.Deterministic("w_f", stick_breaking(beta_f), dims="f_cluster")m_f=pm.Normal("m_f", mu=mus, sigma=sigmas, initval=initvals,,!dims="f_cluster",transform=pm.distributions.transforms.univariate_ordered)s_f=pm.HalfNormal("s_f", sigma=1, dims="f_cluster")p_f=pm.NormalMixture("p_f", w=w_f/w_f.sum(), mu=m_f, sigma=s_f)f=pm.Deterministic("f", pm.math.exp(pm.logcdf(p_f,x)))alpha_e=pm.Gamma("alpha_e",1.0,1.0); beta_e=pm.Beta("beta_e",1.0,,!alpha_e, dims="e_cluster")w_e=pm.Deterministic("w_e", stick_breaking(beta_e), dims="e_cluster")m_e=pm.Normal("m_e", mu=mus/50+0.05, sigma=sigmas/5, initval=initvals/50+0.,!05, dims="e_cluster",transform=pm.distributions.transforms.univariate_ordered)s_e=pm.HalfNormal("s_e", sigma=sigmas/5, dims="e_cluster")v=pm.Categorical('v', p=w_e/w_e.sum(), size=n)#v = pm.Multinomial('v',,!n=1, p=p, size=n)pm.Normal('y', mu=m_e[v]+f, sigma=s_e[v], observed=y)#e = pm.NormalMixture("y", w=w_e, mu=m_e, sigma=s_e)#llike = pm.Potential("llike", pm.logp(e,y-f))withmodel:idata=pm.sample()#(tune=5000, init="advi", target_accept=0.95)Multiprocess sampling (4 chains in 4 jobs)CompoundStep>NUTS: [alpha_f, beta_f,m_f,s_f, p_f, alpha_e, beta_e,m_e,s_e]>CategoricalGibbsMetropolis: [v]<IPython.core.display.HTML object><IPython.core.display.HTML object>Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 drawstotal) took 111 seconds.The rhat statistic is larger than 1.01 for some parameters. This indicatesproblems during sampling. See https://arxiv.org/abs/1903.08008 for detailsThe effective sample size per chain is smaller than 100 for some parameters.Ahigher number is needed for reliable rhat and ess computation. See3

https://arxiv.org/abs/1903.08008 for detailsThere were 136 divergences after tuning. Increasetarget_acceptorreparameterize.[151]:importarvizasaz;fig,ax=plt.subplots(2,3,figsize=(18,5)); az.plot_trace(idata,,!var_names=["m_e","s_e","w_e"], axes=ax.T);[152]:foriinrange(100):plt.plot(x,idata.posterior['f'].values[0][i,:],'k', alpha=0.1)plt.plot(x,y,'.'); plt.plot(x,y_true)[152]:[<matplotlib.lines.Line2D at 0x149f71890>]4

[154]:support=np.linspace(-0.15,0.3,100); ave_mixture=0*support; reps=100forjinrange(reps):chain,draw=np.random.choice([0,1,2,3]),np.random.choice(np.,!linspace(0,999,1000,dtype=int))mixture=stats.norm(idata.posterior['m_e'][chain,draw],idata.posterior['s_e'][chain,draw]).pdf(np.,!array([support]*k).reshape(k,100).T)mixture=(mixture*idata.posterior['w_e'][chain,draw].values).sum(axis=1)plt.plot(support, mixture, c='gray', alpha=0.1); ave_mixture+=mixture/repsplt.plot(support, ave_mixture,'k', lw=2)plt.hist(y-y_true, density=True);5

[155]:fig,ax=plt.subplots(3,4)forcinrange(4):ax[0,c].bar(height=idata.posterior['w_e'].values[c].mean(axis=0), x=np.,!arange(10)); ax[0,c].set_title('w_e'+'chain'+str(c))ax[1,c].bar(height=idata.posterior['m_e'].values[c].mean(axis=0), x=np.,!arange(10)); ax[1,c].set_title('m_e'+'chain'+str(c))ax[2,c].bar(height=idata.posterior['s_e'].values[c].mean(axis=0), x=np.,!arange(10)); ax[2,c].set_title('s_e'+'chain'+str(c))plt.tight_layout()6

[156]:fig,ax=plt.subplots(3,4)forcinrange(4):ax[0,c].bar(height=idata.posterior['w_f'].values[c].mean(axis=0), x=np.,!arange(10)); ax[0,c].set_title('w_f'+'chain'+str(c))ax[1,c].bar(height=idata.posterior['m_f'].values[c].mean(axis=0), x=np.,!arange(10)); ax[1,c].set_title('m_f'+'chain'+str(c))ax[2,c].bar(height=idata.posterior['s_f'].values[c].mean(axis=0), x=np.,!arange(10)); ax[2,c].set_title('s_f'+'chain'+str(c))plt.tight_layout()7

Gaussian Process Regression (15 minutes)Supposeyi=f(xi) +✏ifor✏i⇠N(0,σ2)and smooth functionfGaussian Process Regressionuses aGaussian Process (GP)model for the smooth functionff(x)⇠GP(μ, K) =N(μ=μ(x),K=K(x,x))soy⇠N(f(x), Iσ2)withμ= [μ(x1), ..., μ(xn)]|{z}The "regression"andK=264k(x1, x1). . .k(x1, xn).........k(xn, x1). . .k(xn, xn)375|{z}The "covariance kernel matrix"Random variableswhose dependency structure and instantiation occur over an indexable domain(time, space, etc.)are calledstochastic processes; so, aGPis amultivariate Gaussianrandom variable (MVN)which is astochastic processrealized over indexable domainXdefining dependency structure through a so-calledcovariance kernelKExamplecovariance kernelsare availableherewith the commonexponential quadratic kernelspecification being8

K(x1, x2) =↵2exp✓-||x1-x2||2λ2◆for amplitude↵and length scaleλso the “farther apart” you are on the inexable domainXthe smaller thecovarianceWhat we see of GPs are MVNs (multivariate normal distributions)(8minutes)AGPis a the collection ofrandom variables{yx|x2X}defined over indexable domainXfor which any arbitrary finite subset{x1, x2, ..., xn}⇢X,{yx1, yx2, ...yxn}follow aMVN(furtherdetailed in thisvideoandtextbook) -GPs areinfinite random variablesfor infinteX, slightlygeneralizingMVNs which arefinite random variables- All that’s needed to define aGP(assuming a mean0specification) is acovaraince kerneldefining a symetric and positive-definitecovariance matrix for{yx1, yx2, ..., yxn}as a function of{x1, x2, ..., xn}⇢x[64]:importnumpyasnp;fromscipyimportstats;importmatplotlib.pyplotaspltnp.random.seed(365); sigma2,_lambda=1,1; n= 70x=np.sort(stats.uniform(-5,10).rvs(size=n))[:,np.newaxis]; K=sigma2*np.,!exp(-(x-x.T)**2/_lambda**2); K=(K+K.T)/2plt.figure(figsize=(18,4)); plt.plot(x,stats.,!multivariate_normal(cov=K,allow_singular=True).rvs(1),'.');What we see of GPs are MVNs (multivariate normal distributions)(7minutes)AGPis a the collection ofrandom variables{yx|x2X}defined over indexable domainXfor which any arbitrary finite subset{x1, x2, ..., xn}⇢X,{yx1, yx2, ...yxn}follow aMVN(furtherdetailed in thisvideoandtextbook) -GPs areinfinite random variablesfor infinteX, slightlygeneralizingMVNs which arefinite random variables- All that’s needed to define aGP(assuming a mean0specification) is acovaraince kerneldefining a symetric and positive-definitecovariance matrix for{yx1, yx2, ..., yxn}as a function of{x1, x2, ..., xn}⇢x[73]:9

K=sigma2*np.exp(-(x-x.T)**2/_lambda**2); K=(K+K.T)/2; fig,ax=plt.,!subplots(1,3, figsize=(15,15)); ax[1].imshow(K)ax[0].imshow(K[:10,:10]); ax[2].imshow(K[-10:,-10:]); xi,xj=np.meshgrid(np.,!arange(10,dtype=int),np.arange(10,dtype=int));fori,j,sinzip(xi.flatten(),xj.flatten(), K[:10,:10].round(2).flatten()):ax[0].text(x=i,y=j,s=s, verticalalignment='center',,!horizontalalignment='center')Bayesian GPs (10 minutes)GPs readily admit a hierarchical Bayesian (generative modeling) specification.f(x)⇠GP(μ, K) =N(μ=μ✓(x),K=K⌧(x,x))(1)✓⇠p(✓)⌧⇠p(⌧)(2)x⇠p(X)/1or just taken as given and "fixed"(3)y⇠N(E[y] =f(x),Var[y] =Iσ2)withσ2⇠p(σ2)(4)for an independent homoskedastic variance prior(5)(6)sop(f(x)|-)/N(E[y] =f(x),Var[y] =Iσ2)N(μ=μ✓(x),K=K⌧(x,x))p(✓,⌧,σ2,x)(7)which even has closed analytical posterior forMVN conjugate priors(8)PyMChasmanyGPtutorials, e.g., oncovariance functionsandlatent(versus marginalized)GPimplementations[86]:k,b,x_min,x_max,n= 1,5,-50,50,150# Haining Tandefaleatoric(x):# variablity function for a particular xr=(x-x_min)/(x_max-x_min);return2 *rdefgenerate_data(n):x=(x_max-x_min)*np.random.rand(n)+x_minnoise=np.random.randn(n)*aleatoric(x)10

y=(k*x*(1 +np.sin(x))+b)+noise; x=x[..., np.newaxis]returnx, yPyMCGPs (5 minutes)[96]:x_train, y_train=generate_data(n); x_test, y_test=generate_data(n);import,!matplotlib.gridspecasgridspec; fig=plt.figure(tight_layout=True,,!figsize=(16,4)); gs=gridspec.GridSpec(1,4);ax=fig.add_subplot(gs[0]);ax.,!plot(x_train, y_train,"b.");ax=fig.add_subplot(gs[1:]);ax.plot(x_train,,!y_train,"w.");[131]:importpymcaspm# adapted from Haining Tanwithpm.Model()asgp_model:ita=pm.HalfNormal('ita', sigma=5); iota=pm.Gamma('iota', alpha=4, beta=2)sq_exp=ita** 2 *pm.gp.cov.ExpQuad(input_dim=1, ls=iota)# exponential,!quadratic kernelgp=pm.gp.Marginal(cov_func=sq_exp); sigma=pm.HalfNormal('sigma',,!sigma=5, initval=1)y_=gp.marginal_likelihood('y', X=x_train, y=y_train, sigma=sigma)withgp_model:tr=pm.sample(target_accept=0.95, return_inferencedata=True)Auto-assigning NUTS sampler...Initializing NUTS using jitter+adapt_diag...Multiprocess sampling (4 chains in 4 jobs)NUTS: [ita, iota, sigma]<IPython.core.display.HTML object><IPython.core.display.HTML object>Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 drawstotal) took 731 seconds.11

PyMCGPs (5 minutes)[159]:importarvizasaz; fig,ax=plt.subplots(2,3,figsize=(18,5)); az.plot_trace(tr,,!var_names=['ita','iota','sigma'],axes=ax.T); plt.tight_layout()[161]:Xnew=np.linspace(-75,75,200)withgp_model:predy=gp.conditional("predy", Xnew=Xnew[:, np.newaxis])ppred=pm.sample_posterior_predictive(tr, var_names=['predy'])12

Sampling: [predy]<IPython.core.display.HTML object><IPython.core.display.HTML object>[154]:fig=plt.figure(figsize=(18,5)); ax=plt.gca()pm.gp.util.plot_gp_dist(ax, ppred.posterior_predictive['pred'].values.,!reshape(-1,4000,200)[0,:], Xnew, plot_samples=True, palette='Blues'); ax.,!plot(x, y,'.');[154]:[<matplotlib.lines.Line2D at 0x292bd2d50>]PyMCGPs (5 mintues)[162]:fig=plt.figure(figsize=(18,5)); ax=plt.gca();pm.gp.util.plot_gp_dist(ax, ppred.posterior_predictive['predy'].values.,!reshape(-1,4000,200)[0,:], Xnew, plot_samples=True, palette='Blues'); ax.,!plot(x, y,'.');[143]:Xnew=np.linspace(-50,50,200)withgp_model:pred=gp.conditional("pred", Xnew=Xnew[:, np.newaxis])13

ppred=pm.sample_posterior_predictive(tr, var_names=['pred'])Sampling: [pred]<IPython.core.display.HTML object><IPython.core.display.HTML object>[154]:fig=plt.figure(figsize=(18,5)); ax=plt.gca()pm.gp.util.plot_gp_dist(ax, ppred.posterior_predictive['pred'].values.,!reshape(-1,4000,200)[0,:], Xnew, plot_samples=True, palette='Blues'); ax.,!plot(x, y,'.');[154]:[<matplotlib.lines.Line2D at 0x292bd2d50>]Conditional MVN distributions (are MVN distributions) (15 min-utes)Givenxand observedytheGP posteriorp(f(x⇤)|-)characterizing uncertainty over smoothfunctions passing closely through(x,y)according to the noise inferencep(σ2|-)is aconditionalMVN distributionThis follows sincef(˜x= (x,x⇤))(given all model parameters) isMVNp0B@˜xμ=24μxn⇥1μx⇤m⇥135,⌃=264⌃xn⇥n⌃x,x⇤n⇥m⌃x⇤,x=⌃Tx,x⇤m⇥n⌃x⇤m⇥m3751CA=exp⇣-12(˜x-μ)T⌃-1(˜x-μ)⌘p(2⇡)k|⌃|and theconditional distributionf(x⇤|x)of aMVNis alsoMVNx⇤|x,μ,⌃⇠MVN✓¯μ=μx⇤m⇥1+⌃x⇤,xm⇥n⌃-1xn⇥n(x-μx)n⇥1,⌃=⌃x⇤m⇥m-⌃x⇤,xm⇥n⌃-1xn⇥n⌃x,x⇤n⇥m◆But if⌃x⇤,x⇡0m⇥nthenx⇤is essentially independent ofxso there is no information in the updatedprediction and soGPsamples are just informed by thepriorspecification of theGP14

Variational Inference: Approximate Bayesian Inference (10 minutes)Make Bayesian inference an optimization problem (like MLE inference) ratherthan a sampling problem (via MCMC)KL divergenceprovides anobjective functionfor providing anapproximate Bayesian in-ferencethroughoptimizationminqKL[q(✓)||p(✓|y) ]⌘minφKL[qφ(✓)||p(✓|y) ]Kullback-Leibler (KL) divergenceKullback-Leibler (KL) divergencemeasures a “distance between distributions”remember as notatingKL[q(✓)||p(✓) ]p(✓)is "on the bottom"=Zq(✓)p(✓)q(✓)d✓which is nonnegativeKLdivergencedoesnotsatisfythetriangleequalitysoisnotametricKL[q(✓)||p(✓) ]6KL[q(✓)||r(✓) ] +KL[r(✓)||p(✓) ]Variational Inference: Approximate Bayesian Inference (10 minutes)Theposteriorp(✓|y)isapproximatedbased on aKL divergence objective functionwhereq(✓)is oftenqφ(✓)amultivariate normal distributionwithvariational parametersφ=μ,⌃alocationand constrainedcovariance matrixminqKL[q(✓)||p(✓|y) ]⌘minφKL[qφ(✓)||p(✓|y) ]Instead ofoptimizing KL divergencedirectly though,approximate Bayesian inferenceoftenindirectlyoptimizestheELBO(q) =the Evidence Lower BOundz}|{Eq(✓)[logp(y|✓)]-KL[q(✓)||p(✓) ]logp(y)which findsq(✓)in terms of a (unsurprisingly) balanced contribution from thepriorand thelikelihood, sincea fixedlogp(y)constant=Zlogp(y)q(✓)d✓=Zlogp(✓, y)p(✓|y)q(✓)q(✓)q(✓)d✓=Zlogp(y|✓)p(✓)p(✓|y)q(✓)q(✓)q(✓)d✓=Zlogp(y|✓)q(✓)d✓|{z}Eq(✓)[logp(y|✓)]+KL[q(✓)||p(✓|y) ]-KL[q(✓)||p(✓) ]15

Variational InferenceApproximate Bayesian Inference (5 minutes)Approximation Bayesian posterior inferenceis thus equivalentlyminφKL[qφ(✓)||p(✓|y) ]ormaxφELBO(qφ)orminφ-ELBO(qφ)where ELBO(qφ)equivalence may be derived directly from the desiredBayesian optimizationKL[qφ(✓)||p(✓|y) ] =Zlogq(✓)p(✓|y)q(✓)d✓=Zlogq(✓)p(y)p(y|✓)p(✓)q(✓)d✓= logp(y)fixed+Zlogq(✓)p(✓)q(✓)d✓-Zlogp(y|✓)q(✓)d✓= logp(y)constant+KL[q(✓)||p(✓) ]-Eqφ(✓)[logp(y|✓)]|{z}-ELBO(qφ)Variational InferenceApproximate Bayesian Inference (5 minutes)Approximation Bayesian posterior inferenceis thus equivalentlyminφKL[qφ(✓)||p(✓|y) ]ormaxφELBO(qφ)orminφ-ELBO(qφ)or ELBO(qφ)may be derived withJensen’s inequalitywith expectations and no KL termslogp(y) = logZp(y,✓)q(✓)q(✓)d✓= logZp(y|✓)p(✓)q(✓)q(✓)d✓= logEq(✓)p(y,✓)1q(✓)= logEq(✓)p(y|✓)p(✓)q(✓)≥Eq(✓)logp(y,✓)1q(✓)=Eq(✓)logp(y|✓)p(✓)q(✓)=Eqφ(✓)[logp(y,✓)-logq(✓)]|{z}ELBO(qφ)(now use EM algorithm...)=Eqφ(✓)[logp(y|✓)]-KL[qφ(✓)||p(✓) ]|{z}ELBO(qφ)EM algorithmGeneral Proof (ELBO is a special case) (10 minutes)•TheEM algorithmis typically presented as a methodology for fittingMixture modelsbut of course(as you’ve been seeing)Bayesiansdon’t need theEM algorithmto use (evennonparametric)mixture models16

logp(y|φ) = logp(y|φ)+ logp(✓|y,φ)-logp(✓|y,φ)=logp(y,✓|φ)-logp(✓|y,φ)now integrate out✓=Ep(✓|y,φ(t))[logp(y,✓|φ)]|{z}Q(φ|φ(t))-Ep(✓|y,φ(t))[logp(✓|y,φ)]|{z}H(φ|φ(t))logp(y|φ)-logp(y|φ(t)) =Q(φ|φ(t))-H(φ|φ(t))-Q(φ(t)|φ(t)) +H(φ(t)|φ(t))but sinceH(φ(t)|φ(t))≥H(φ|φ(t))logp(y|φ)-logp(y|φ(t))≥Q(φ|φ(t))-Q(φ(t)|φ(t))so ifQ(φ(t+1)|φ(t))> Q(φ(t)|φ(t))thenlogp(y|φ(t+1))>logp(y|φ(t))whereQ(φ(t+1)|φ(t+1)) = logEp(✓|y,φ(t))[p(y,✓|φ(t+1))]Variational Autoencoders (“Nonlinear PCA”) (5 minutes)•Letqφbe some function which transformsx2IRntov2IRmform < n•whilegγis some function which transforms thatv2IRmback tox02IRnAutoencodersseek to defineqφandgγsuch thatgγ(qφ(x))⇡xviaminqφ,gγ||x-gγ(qφ(x))||22Variational autoencodersapproximate the “posterior”conditional distributionp(v|x)withvariational distributionqφ(x)(v)while takingp(x|v) =gγ(v)(x)based onvariational inferenceminφKL⇥qφ(x)(v)||p(v|x)⇤⌘maxφ,γEqφ(x)(v)[logp(x|v) =gγ(v)(x)]-KL⇥qφ(x)(v)||p(v)⇤•Note though this is not “Bayesian” posterior inference on the parametersφandγVariational Autoencoders (“Nonlinear PCA”) (5 minutes)minqKL⇥qφ(x)(v)||p(v|x)⇤⌘maxφ,γEqφ(x)(v)[logp(x|v) =gγ(v)(x)]-KL⇥qφ(x)(v)||p(v)⇤•Useneural networksφμ(x),φσ(x), andγ(v)parameterizing distributionsqandg–qφ(x)(v) =N(μ=φμ(x), μ=φσ(x))andp(v) =N(μ0,σ0)∗so theKLterm has a known analytical form for normal distributions•Use areparameterizationtrickv=φμ(x) +φσ(x)✏=μx+σx✏,✏⇠N(0,1)so@@φEqφ(x)(v)[loggγ(v)(x)] =E✏[@@φloggγ(v)(x)] =E✏[@@vloggγ(v)(x)@v@φ+@@φloggγ(v)(x)Total derivative]•Use a single samplev0⇠qφ(x)Monte Carlo estimation of@@γE✏[· · ·]=E✏[@@γ· · ·] =@@γloggγ(v0)(x)and@@φ(Eqφ(x)(v)[loggγ(v)(x)]-KL⇥qφ(x)(v0)||p(v0)⇤ )17

to provide estimates forgradient descentonneural networkparametersφandγBayesian Deep Learning (BDL) (10 minutes)•Variation fromDropoutin a fittedneural networksapproximates aGassian process pos-terior distribution. . .•Trajectories ofstochastic gradient decentare “samples” fromvariational inference“ap-proximate posteriors”. . .•Stochastic NN Layers / Bayes by Backpropprovidesvariational inferenceonneuralnetworkparameters. . .The classic⇢(xTw+b)layer for inputx(hereafter omitted for brevity) is a standard part ofneuralnetworkp(y|w)and admitsvariational inferenceappoximation ofp(w|y)withq✓(w)min✓KL[q✓(w)||p(w|y)]⌘max✓Eq✓(w)⇥log(p(y|w)⇥KL after expectationz}|{p(w)÷q✓(w)|{z}f✓(w,y))⇤= max✓=μ,σE✏[logp(y|w)+cz}|{logp(w=μ+✏σ, y)-logq✓(w=μ+✏σ)|{z}with the reparameterization trickwi=μi+✏iσiwith✏i⇠N(0,1)]@@✓E✏[· · ·] =E✏[@@✓f(w, y)]and (total derivative)@@✓f(w, y) =@@wf(w, y)@w@✓+@@✓f(w, y)which is estimated with a single sample Monte Carlo draw that optimizesq✓(w)=N(μ,σ)⇡p(w|y)18