
Gaussian Process Regression with Heteroscedastic Residuals

Chunyi Wang April 2012


Abstract. Standard Gaussian Process (GP) regression models typically assume that the residuals have the same variance across all observations. However, applications with input-dependent (heteroscedastic) noise frequently arise in practice. In this paper, we propose a GP regression model with a latent variable which serves as an additional (unobserved) covariate for the regression. This model addresses the heteroscedasticity issue because it allows the partial derivative of the function with respect to the latent covariate to change with the observed covariates. When the covariance function is chosen appropriately, this model can handle residuals with (a) equal variance and a Gaussian distribution, (b) input-dependent variance and a Gaussian distribution, or (c) input-dependent variance and a non-Gaussian distribution. On synthetic datasets, we compare our model with a regular GP model, which handles only case (a), and with the model proposed by Goldberg et al., which deals only with case (b). Experiments show that when the residuals are Gaussian, our model (with the correct covariance function) is as good as Goldberg et al.'s model, while when the residuals are non-Gaussian, our model is better than the others.

1 Introduction

Gaussian Process (GP) models have become popular in recent years for regression and classification problems, mainly because they are very flexible: with many covariance functions to choose from, a model can achieve different degrees of smoothness or different degrees of additive structure. Regular GP regression models typically assume that the residuals are independent of the input covariates and are i.i.d. Gaussian. In many applications, however, the variance of the residuals actually depends on the inputs, and the residuals are not necessarily Gaussian. In this paper, we present a GP regression model which can deal with input-dependent noise. The model essentially assumes that the heteroscedasticity comes from an unobserved quantity, which is included in the model as a latent variable. We call this model Gaussian Process regression with a Latent Covariate, or GPLC. We give the details of this model as well as the regular GP model (REG) and the model by Goldberg et al. (which we call Gaussian Process regression with a Latent Variance, or GPLV) in Section 2, and then discuss the relationships and equivalencies between these models. We describe the computation methods in the following section, and then present the results of these models on various synthetic datasets in Section 4.

2 The Models

2.1 The Regular GP Regression Model

For a regression problem,

y_i = f(x_i) + ε_i                                                        (1)

where ε_i is an i.i.d. Gaussian noise term. The problem is to find the association f between the covariate (input) x (a p-dimensional vector) and the scalar response y from n observed values (x_i, y_i), i = 1, ..., n (the training set), and to make predictions for y_{n+1}, y_{n+2}, ... corresponding to x_{n+1}, x_{n+2}, ... (the test set). The Bayesian GP treatment of regression puts a Gaussian prior on f, with prior mean zero and a covariance matrix determined by a covariance function, e.g. the squared exponential function:

K(x_i, x_j) = η² exp( −Σ_{k=1}^{p} (x_{ik} − x_{jk})² / ρ_k² )            (2)

Any function that leads to a positive definite covariance matrix can be used as a covariance function. Unless noted otherwise, we will use the squared exponential function throughout this paper. The covariance between y_i and y_j is

Cov(y_i, y_j) = K(x_i, x_j) + Cov(ε_i, ε_j)
             = η² exp( −Σ_{k=1}^{p} (x_{ik} − x_{jk})² / ρ_k² ) + δ_{ij} σ²    (3)

If we know the values of η, ρ and σ (called hyper-parameters), then the predictive distribution of y_{n+1} for a test case x_{n+1}, based on the observed values (x_1, y_1), ..., (x_n, y_n), is Gaussian with mean and variance

E(y_{n+1} | x, y, x_{n+1}) = kᵀ C⁻¹ y                                     (4)
Var(y_{n+1} | x, y, x_{n+1}) = V − kᵀ C⁻¹ k                               (5)

In the equations above, we write x = (x_1, ..., x_n) and y = (y_1, ..., y_n); k is the vector of covariances between y_{n+1} and each y_i, C is the covariance matrix of the observed y, and V is the prior variance of y_{n+1} [= Cov(y_{n+1}, y_{n+1}) by (3)]. Of course, we normally do not know the values of the hyper-parameters. Therefore we also put priors on them (typically Gaussian priors on the logarithms of the hyper-parameters), and then compute (4) and (5) by integrating over the posterior distribution of these hyper-parameters. The posterior distribution is typically obtained by Markov chain Monte Carlo. A detailed discussion of GP regression and classification can be found in Neal (1998) and Rasmussen and Williams (2006).
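As a concrete illustration of (2)-(5), the sketch below computes the predictive mean and variance for a single test case with fixed hyper-parameters. It is not the paper's code; the function names and the toy data at the end are ours, and η, ρ, σ are passed in directly rather than sampled.

```python
import numpy as np

def sq_exp_cov(x1, x2, eta, rho):
    """Squared exponential covariance (2) between two sets of inputs.
    x1: (n1, p), x2: (n2, p); rho: (p,) vector of length scales."""
    d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2 / rho ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2)

def gp_predict(x, y, x_star, eta, rho, sigma):
    """Predictive mean (4) and variance (5) for one test input x_star."""
    n = len(y)
    C = sq_exp_cov(x, x, eta, rho) + sigma ** 2 * np.eye(n)   # covariance of observed y, eq. (3)
    k = sq_exp_cov(x, x_star[None, :], eta, rho).ravel()      # cov(y_i, y_{n+1})
    V = eta ** 2 + sigma ** 2                                 # prior variance of y_{n+1}
    mean = k @ np.linalg.solve(C, y)                          # k^T C^{-1} y
    var = V - k @ np.linalg.solve(C, k)                       # V - k^T C^{-1} k
    return mean, var

# Toy usage with made-up hyper-parameter values
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=(20, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(20)
m, v = gp_predict(x, y, np.array([2.5]), eta=1.0, rho=np.array([1.0]), sigma=0.1)
```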

2.2 A GP Regression Model with a Latent Covariate

We add a latent variable w into the model as an additional (unobserved) input. The regression problem then becomes

y_i = f(x_i, w_i) + ε_i                                                   (6)

We choose a standard Gaussian as the prior for w_i, i = 1, ..., n. Since the squared exponential function is stationary (see Rasmussen and Williams, 2006, p. 82), the individual locations of w do not matter, and the scale of w can be adjusted by the length-scale parameter ρ_{p+1}. To make predictions we need to integrate (4) and (5) over the posterior distribution of the hyper-parameters as well as the latent variables w. Note that to compute k we need w_{n+1}, which is randomly sampled from the prior of w; the predicted mean and variance are therefore also averaged over the prior of w. To see that the variance of f depends on x, we look at the Taylor expansion of f(x, w) at w = 0:

f(x, w) = f(x, 0) + w f_2(x, 0) + (w²/2) f_{22}(x, 0) + ...               (7)

where f_i denotes the (first-order) partial derivative of f with respect to the i-th variable, and f_{22} the corresponding second derivative. For simplicity, if we ignore the higher-order terms, then it follows that

Var[f(x, w)] ≈ 0 + f_2(x, 0)² Var(w) = f_2(x, 0)²                         (8)

which depends on x as long as f_2(x, 0) depends on x (which is usually the case). We illustrate the heteroscedasticity effect of an unobserved covariate in Figure 1. The data are generated from a GP where x is evenly distributed in [0, 5], w is a standard Gaussian covariate, and the hyper-parameters are set to η = 3, ρ_x = 0.8, ρ_w = 3. Suppose we only observe (x, y); the data are clearly heteroscedastic, since the spread of y against x changes as x changes. For instance, the spread of y looks much bigger when x is around 1.5 than when x is near 3.5. We also notice that the distribution of the residuals cannot be Gaussian, as, for instance, we see strong skewness near x = 5. To put this in another perspective, we make 19 plots of this function in the right-hand panel, each corresponding to a fixed w_i (i = 1, 2, ..., 19), with w_i being the 5i-th percentile of the standard Gaussian distribution.

Figure 1: Heteroscedasticity effect of a function with an unobserved covariate. (Left panel: y = f(x, w) against x with w drawn at random; right panel: the same function plotted for fixed values of w, including w = 0.)

We can see that we have a much better idea of what the true function is when x is near 3.5 than when x is near 1.5. These plots essentially show that if an important input quantity is not observed, the function values based only on the observed inputs will appear heteroscedastic. And even if the data are generated from a GP, the noise is not necessarily Gaussian. In fact, it follows from (7) that if the higher-order terms are negligible, then the residual is effectively Gaussian; otherwise, it is non-Gaussian. In other words, our model does not require a Gaussian assumption. We illustrate this in Figure 2, where, at a given x value (here x = 2), the function translates a standard Gaussian covariate w into a non-Gaussian response (in fact, it follows a non-central chi-squared distribution). Note that the density curves of w and y are not drawn to scale, for demonstration purposes. Lastly, as in the regular GP model, we still include a constant-variance Gaussian noise term in the GPLC model, even though the addition of a latent covariate can itself account for the (heteroscedastic) noise. The primary purpose of this is to avoid computational singularity (see Neal, 1997). Also, it is possible that for some particular x, the function f could be flat within a high-probability area of w, which will also create a singularity. This is shown in Figure 3, where a flat function translates w into a response whose density goes up to infinity near 5. Adding the jitter term effectively avoids this as well.
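The setting of Figure 1 is easy to reproduce by simulation. The sketch below is a minimal version under the assumptions stated above (squared exponential covariance, η = 3, ρ_x = 0.8, ρ_w = 3); the jitter level and the small constant-variance noise term are illustrative values of our own.

```python
import numpy as np

def sq_exp_cov(X1, X2, eta, rho):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / rho ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2)

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 5, n)
w = rng.standard_normal(n)                 # latent covariate, never observed
Xw = np.column_stack([x, w])               # full input (x, w)

# Draw f(x, w) from a zero-mean GP prior over the joint inputs
eta, rho = 3.0, np.array([0.8, 3.0])       # rho = (rho_x, rho_w), as in Figure 1
C = sq_exp_cov(Xw, Xw, eta, rho) + 1e-8 * np.eye(n)   # tiny jitter for numerical stability
f = np.linalg.cholesky(C) @ rng.standard_normal(n)
y = f + 0.05 * rng.standard_normal(n)      # small constant-variance noise term (illustrative SD)

# Plotting y against x alone now shows an x-dependent spread (heteroscedasticity).
```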


Figure 2: GPLC allows non-Gaussian residuals

Figure 3: Flat f creates singularity

2.3 A GP Regression Model with a Latent Variance

In this scheme, a main GP models the mean response, and a secondary GP models the variance of the noise, which depends on the input. The regression problem is therefore

y_i = f(x_i) + ε(x_i)                                                     (9)

where z_i = log SD[ε(x_i)] depends on x through

z_i = r(x_i) + J                                                          (10)

Both f(x) and r(x) are given (independent) GP priors with zero mean and covariance functions K_y and K_z, with different hyper-parameters (η_y, ρ_y) and (η_z, ρ_z). J is a Gaussian jitter term (see Neal, 1997), added to avoid computational singularity. To make predictions, we need to integrate (4) and (5) over the posterior distribution of the hyper-parameters as well as the latent values z. Also, to compute k we need the value of z_{n+1}, which is sampled from p(z_{n+1} | z_1, ..., z_n). This model was first introduced by Goldberg et al. (1998). In their toy example, the hyper-parameters are all fixed and only the z values are sampled using MCMC. In this paper, we take a full Bayesian approach, where both the hyper-parameters and the z values are sampled based on the observed data. In addition, we discuss fast computation methods for this model in Section 3.
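A generative sketch of (9) and (10) may make the two-GP structure clearer. Both K_y and K_z are taken to be squared exponential here, and all hyper-parameter values and the jitter level are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sq_exp_cov(X, eta, rho):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 / rho ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2)

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 5, size=(n, 1))

# Secondary GP: latent log noise standard deviations z = r(x) + J
Cz = sq_exp_cov(x, eta=1.0, rho=np.array([2.0])) + 1e-6 * np.eye(n)
z = np.linalg.cholesky(Cz) @ rng.standard_normal(n) \
    + 0.01 * rng.standard_normal(n)          # small Gaussian jitter J (illustrative SD)

# Main GP: mean function f(x)
Cy = sq_exp_cov(x, eta=2.0, rho=np.array([1.0])) + 1e-6 * np.eye(n)
f = np.linalg.cholesky(Cy) @ rng.standard_normal(n)

# Observations with input-dependent noise SD exp(z)
y = f + np.exp(z) * rng.standard_normal(n)
```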

2.4 Relationships between GPLC and other models

We will show in this section that, if the covariance function is carefully chosen, GPLC can be equivalent to a regular GP regression model, or nearly equivalent to a GPLV model. Suppose the underlying function f(x, w) is of the form

f(x, w) = g(x) + c w                                                      (11)

where w ~ N(0, 1). If we observe only x but not w, then we have a regression problem with unknown i.i.d. Gaussian noise N(0, c²). This is clearly equivalent to the regular GP regression (1). Suppose the covariance of g(x) between input cases i and j is given by a covariance function K_1(x_i, x_j); then the covariance of f between input cases i and j will be (assuming g has zero mean)

Cov[f(x_i, w_i), f(x_j, w_j)] = E[(g(x_i) + c w_i)(g(x_j) + c w_j)]
                              = E[g(x_i) g(x_j)] + c² w_i w_j
                              = K_1(x_i, x_j) + c² w_i w_j                (12)

Therefore, if we put a GP prior on f(x, w) with zero mean and covariance function

K(x_i, x_j, w_i, w_j) = K_1(x_i, x_j) + c² w_i w_j                        (13)

then the results given by GPLC will be equivalent to a regular GP regression with covariance function K_1. Similarly, we can use a covariance function of the form

K(x_i, x_j, w_i, w_j) = K_1(x_i, x_j) + w_i w_j K_2(x_i, x_j)             (14)

for GPLC, for an underlying function of the form

f(x, w) = g(x) + w h(x)                                                   (15)

which is almost equivalent to the problem described in (9) and (10), with h(x) = exp[r(x)] being the input-dependent standard deviation of the residuals. The only issue is that (14) cannot guarantee that h(x) ≠ 0, while exp[r(x)] is always greater than zero. Nevertheless, the jitter term takes care of the degenerate case (where the noise has zero variance). To verify this, we generate two random functions from GPs with covariance functions (13) and (14), where we set K_1 to be a squared exponential covariance function and K_2(x, x′) = a x x′ + b (so that the noise standard deviation is linearly associated with x). We check the distributions of y at x = 0.5, x = 2 and x = 3.5. As shown in Figures 4 and 5, all of them appear to be normal. In the first case, the standard deviations of y at the three x locations are all approximately 0.64, while in the second case the standard deviations are 0.13, 0.49 and 1.11, respectively.
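The covariance functions (13) and (14) are straightforward to implement. The sketch below builds the corresponding n × n covariance matrices over the joint inputs (x, w), with K_1 squared exponential and K_2(x, x′) = a x x′ + b as in the verification above; it assumes a one-dimensional observed input, and the function names are ours.

```python
import numpy as np

def k1(X1, X2, eta, rho):
    """Squared exponential covariance over the observed inputs x."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / rho ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2)

def cov_eq13(X, w, eta, rho, c):
    """K(x_i, x_j, w_i, w_j) = K1(x_i, x_j) + c^2 w_i w_j  -- equal-variance case."""
    return k1(X, X, eta, rho) + c ** 2 * np.outer(w, w)

def cov_eq14(X, w, eta, rho, a, b):
    """K(x_i, x_j, w_i, w_j) = K1(x_i, x_j) + w_i w_j K2(x_i, x_j),
    with K2(x, x') = a x x' + b, so the noise SD is linear in x (1-D input assumed)."""
    K2 = a * np.outer(X[:, 0], X[:, 0]) + b
    return k1(X, X, eta, rho) + np.outer(w, w) * K2
```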


Figure 4: GPLC for equal variance Gaussian residuals


Figure 5: GPLC for input-dependent Gaussian residuals

3 Computation

We use Markov chain Monte Carlo to obtain samples of the hyper-parameters and the latent values. An obvious choice of method is the classic Metropolis algorithm (Metropolis et al., 1953). To construct an efficient Markov chain with this method, we have to assign an appropriate proposal standard deviation to each variable so that its acceptance rate is neither too big nor too small. However, since we have many variables (n + p + 3 for GPLC, n + 2p + 2 for GPLV, and p + 3 for REG), it is difficult to tune the chain. We therefore use slice sampling (Neal, 2003) instead. Specifically, we use a univariate step-out slice sampler to update the variables (hyper-parameters or latent variables) one after another. This method also has tuning parameters (the step sizes), but compared to the Metropolis algorithm the slice sampler is much easier to tune, in the sense that less-than-optimal tuning parameter values do not affect its performance as much as they do for the Metropolis sampler. Neither the Metropolis algorithm nor slice sampling works well when the variables are strongly dependent. This is not a problem for the regular GP regression and GPLC. In the case of GPLV, however, the latent variances are clearly correlated, as the variances are input-dependent.
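For reference, a minimal univariate step-out slice sampler in the spirit of Neal (2003) might look like the sketch below; the function name, the `width` step size, the cap on step-out expansions, and the generic `log_post` interface are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def slice_sample_coord(log_post, theta, i, width, rng, max_steps=50):
    """One step-out slice sampling update of coordinate i of theta (modified in place)."""
    x0 = theta[i]
    log_y = log_post(theta) + np.log(rng.uniform())        # log of the slice level
    # Step out: place an interval of size `width` randomly around x0, then expand
    left = x0 - width * rng.uniform()
    right = left + width
    for _ in range(max_steps):                             # expand left end until outside the slice
        theta[i] = left
        if log_post(theta) < log_y:
            break
        left -= width
    for _ in range(max_steps):                             # expand right end until outside the slice
        theta[i] = right
        if log_post(theta) < log_y:
            break
        right += width
    # Shrinkage: sample uniformly from the interval, shrinking it on rejection
    while True:
        x1 = rng.uniform(left, right)
        theta[i] = x1
        if log_post(theta) >= log_y:
            return theta
        if x1 < x0:
            left = x1
        else:
            right = x1
```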

3.1 A Metropolis Strategy for GPLV

Neal (1998) describes a method which uses a proposal distribution that takes the correlation information into account. That is, from the current latent values z, we propose to move to z′ according to

z′ = (1 − ε²)^{1/2} z + ε L u                                             (16)

where ε is a small constant (a tuning parameter), L is the lower-triangular Cholesky factor of C_z, the covariance matrix of the (conditional) prior for z, and u is a vector of independent standard Gaussian variates. We will use this method to develop a new strategy for GPLV. The posterior distribution of the hyper-parameters and latent values [we write θ = (η_y, η_z, ρ_y, ρ_z) for the hyper-parameters] is

p(θ, z | x, y) = p(y | x, z, η_y, ρ_y) p(z | x, η_z, ρ_z) p(θ)
             ∝ |C_y|^{−1/2} exp(−yᵀ C_y^{−1} y / 2) |C_z|^{−1/2} exp(−zᵀ C_z^{−1} z / 2) p(θ)    (17)

where y = (y_1, ..., y_n) are the training responses, C_y is the covariance matrix of y (for the main GP), z = (z_1, ..., z_n) are the latent values, and C_z is the covariance matrix of z (for the secondary GP). We use |C| to denote the determinant of C, and p(θ) to denote the prior density of the hyper-parameters η_y, η_z, ρ_y, ρ_z. Note that ρ_y and ρ_z can actually be vectors rather than scalars, but we do not emphasize this, for notational simplicity. Suppose, then, that we wish to obtain new values θ′, z′ based on the current values θ and z. We can do the following:

1. Update the hyper-parameters associated with the main GP (η_y, ρ_y), one after another (in a systematic or a random order). We only need to recompute the Cholesky decomposition of C_y for each of these updates.

2. Update one of the hyper-parameters associated with the secondary GP (η_z, ρ_z). We only need to recompute the Cholesky decomposition of C_z for this update.

3. Update all the z values simultaneously by the method described in (16), where L is the Cholesky factor of C_z. We only need to recompute the Cholesky decomposition of C_y for this update.

4. Repeat steps 2 and 3 until all of the secondary-GP hyper-parameters have been updated.

Notice that C_z depends only on x, η_z and ρ_z, so a change of z will not result in a change of C_z. Hence it makes sense to do several updates of z [i.e. repeat step 3 m times] before updating another hyper-parameter, as the z values are more difficult to sample than the hyper-parameters. We treat m as a tuning parameter.
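Step 3 of this scheme might be implemented as in the sketch below. Here `log_lik_y(z)` stands for the log of the main-GP term in (17), which is the only factor that changes under proposal (16), since the proposal leaves the N(0, C_z) prior invariant; the names and interface are illustrative.

```python
import numpy as np

def update_z(z, L_z, log_lik_y, eps, rng):
    """One correlated Metropolis update of all latent values z, as in (16).

    L_z       : lower Cholesky factor of C_z (the prior covariance of z)
    log_lik_y : function returning log p(y | x, z, eta_y, rho_y)
    eps       : small tuning constant
    """
    n = len(z)
    z_prop = np.sqrt(1.0 - eps ** 2) * z + eps * (L_z @ rng.standard_normal(n))
    # The proposal leaves the N(0, C_z) prior invariant, so the acceptance ratio
    # reduces to the likelihood ratio of the main GP -- which is why only the
    # Cholesky decomposition of C_y needs recomputing in step 3.
    log_a = log_lik_y(z_prop) - log_lik_y(z)
    if np.log(rng.uniform()) < log_a:
        return z_prop, True
    return z, False
```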

3.2 Comparison of Metropolis with Slice Sampling for GPLV

We compare the method described above with slice sampling for GPLV, using the first training set of dataset U1 from the experiment section. The efficiency of an MCMC method is usually measured by the autocorrelation time (see Neal, 1993) of its sample:

τ = 1 + 2 Σ_{i=1}^{∞} ρ_i                                                 (18)

where ρ_i is the lag-i autocorrelation. To compare two samplers fairly, we have to adjust for the time each sample takes to obtain. The Metropolis method described above is dominated by finding Cholesky decompositions, which take time proportional to n³. In our implementation, each iteration needs to compute (m + 2)(p + 1) Cholesky decompositions, where m is the number of z updates following an update of a hyper-parameter for z. For U1, we set m = 10. For slice sampling, an update of an individual z recomputes the Cholesky factor of C_y by a rank-1 update operation, which takes time proportional to n², so updating all of the z values takes time proportional to n³ as well.

Figure 6: Comparing Metropolis with slice sampling for GPLV. (Sample autocorrelation against lag, under each sampler, for one of the hyper-parameters and for the sum of the latent values.)
However, the constant factor of this operation is much larger than that of computing a full Cholesky decomposition, making the two methods impossible to compare theoretically. We therefore record the actual time needed for each iteration of each method. On average, it takes 12 times as long to obtain a new sample with slice sampling as with the Metropolis method in our implementation. The sample autocorrelation plots for one of the hyper-parameters and for the sum of the latent values are given in Figure 6. We can see that the Metropolis method is faster in both cases, though the difference is more significant for the latent values. According to the autocorrelation times, the Metropolis method is twice as fast as slice sampling at obtaining an effectively independent sample of the hyper-parameter, and about 22 times faster at obtaining an independent latent value.
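For completeness, a simple estimate of the autocorrelation time (18) from a one-dimensional MCMC trace is sketched below. Truncating the sum once the estimated autocorrelations become negligible is a common heuristic, not necessarily the exact procedure used in the paper.

```python
import numpy as np

def autocorr_time(trace, max_lag=None):
    """Estimate tau = 1 + 2 * sum_i rho_i from a 1-D MCMC trace."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 4
    # Empirical autocorrelations rho_0, rho_1, ... (rho_0 = 1)
    acf = np.correlate(x, x, mode='full')[n - 1:] / (x.var() * n)
    tau = 1.0
    for i in range(1, max_lag):
        if acf[i] < 0.05:            # stop once the autocorrelation is negligible
            break
        tau += 2.0 * acf[i]
    return tau
```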

4 Experiments

4.1 Datasets with only one covariate

We generate data from the function

f(x) = [1 + sin(4x)]^{1.1}

where the single input x is uniformly distributed in [0, 1]. We then contaminate the response with noise:

Dataset U1 (Gaussian noise): y = f(x) + ε, where ε is Gaussian with zero mean and Var(ε) = {0.2 + 0.3 exp[−30(x − 0.5)²]}².

Figure 7: Univariate synthetic datasets. (Left: dataset U1; middle: dataset U2; each showing the true function and the observations. Right: x versus the noise standard deviation for U1 and U2.)
Dataset U2 (non-Gaussian noise): y = f(x) + ε, where ε follows a location-scale extreme value distribution (see Gangadharan et al., 2011), with mean adjusted to zero and Var(ε) = {0.2 + 0.3 exp[−30(x − 0.5)²]}².
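The univariate datasets can be generated as in the sketch below. We read "location-scale extreme value distribution" as a Gumbel (type I extreme value) distribution shifted to have mean zero; that reading, and the helper names, are our assumptions.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329   # mean of a standard Gumbel variate

def true_f(x):
    return (1.0 + np.sin(4.0 * x)) ** 1.1

def noise_sd(x):
    return 0.2 + 0.3 * np.exp(-30.0 * (x - 0.5) ** 2)

def make_u1(n, rng):
    """Gaussian noise with input-dependent SD."""
    x = rng.uniform(0.0, 1.0, n)
    return x, true_f(x) + noise_sd(x) * rng.standard_normal(n)

def make_u2(n, rng):
    """Gumbel noise, shifted to mean zero, with the same input-dependent SD."""
    x = rng.uniform(0.0, 1.0, n)
    scale = noise_sd(x) * np.sqrt(6.0) / np.pi          # Gumbel scale giving that SD
    eps = rng.gumbel(loc=0.0, scale=scale) - EULER_GAMMA * scale
    return x, true_f(x) + eps

rng = np.random.default_rng(0)
x_train, y_train = make_u1(100, rng)
```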

4.2 Datasets with multiple covariates

We then test the models on a function with three covariates:

f(x) = [1 + sin(x₁ + 1)]^{0.9} + [1 + sin(1.5x₂ + 1)]^{1.1} − [1 + sin(1.8x₃ − 1)]^{1.5}

where the x_k are independent standard Gaussian covariates. As with the one-covariate datasets, we contaminate the responses with noise:

Dataset M1 (Gaussian noise): y = f(x) + ε

Dataset M2 (non-Gaussian noise): y = f(x) + ε′

where ε is Gaussian noise and ε′ is location-scale extreme value noise, both with zero mean and with input-dependent standard deviation

0.1 + 0.2 exp[−0.2x₁² − 0.3(x₂ − 2)²] + 0.2 exp[−0.3(x₃ + 1)²]
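Under the reconstruction above (the minus signs are inferred from context, having been lost in extraction), the mean function and noise standard deviation of the multivariate datasets would look like the following sketch.

```python
import numpy as np

def f_multi(X):
    """Mean function of the three-covariate datasets; X has shape (n, 3)."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return ((1 + np.sin(x1 + 1)) ** 0.9
            + (1 + np.sin(1.5 * x2 + 1)) ** 1.1
            - (1 + np.sin(1.8 * x3 - 1)) ** 1.5)

def noise_sd_multi(X):
    """Input-dependent noise standard deviation shared by M1 and M2."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return (0.1
            + 0.2 * np.exp(-0.2 * x1 ** 2 - 0.3 * (x2 - 2) ** 2)
            + 0.2 * np.exp(-0.3 * (x3 + 1) ** 2))
```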

4.3 Results

For each dataset, we randomly generate 10 different training sets, each with 100 observations, and a test set with 5000 observations. We obtain posterior samples using the methods described in the previous section, drop the initial 1/4 of the samples as burn-in, and make predictions for the test cases. To evaluate how well each model does in terms of the estimated mean as well as the predictive distribution, we compute the mean squared error (MSE) with respect to the true function values and the negative log-probability density (NLPD) with respect to the observed test responses, and give pairwise comparisons of these two measures in Figure 8. MSE is commonly used to measure the accuracy of the predictive mean, and NLPD measures how close the predictive distribution is to the truth. Looking at Figure 8, it is very clear that both GPLC and GPLV are much better in terms of predictive distributions, as they give better NLPD values than REG on every dataset, with few exceptions. The difference in MSE is less striking, but GPLC and GPLV still generally give better MSE than REG, especially when the noise is non-Gaussian. It is also clear that when the noise is Gaussian, GPLV does better than GPLC; when the noise is non-Gaussian, the winner changes to GPLC.
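For concreteness, MSE and NLPD can be computed from the sampled predictive means and variances as in the sketch below; it treats the predictive distribution for each test case as a mixture of Gaussian components, one per posterior sample, which is one common convention and our assumption about the paper's exact procedure.

```python
import numpy as np

def mse(pred_mean, f_true):
    """Mean squared error of the predictive mean against the true function values."""
    return np.mean((pred_mean - f_true) ** 2)

def nlpd(pred_means, pred_vars, y_test):
    """Average negative log predictive density of the observed test responses.

    pred_means, pred_vars : arrays of shape (S, n_test), one row per posterior
                            sample (Gaussian predictive components)
    y_test                : array of shape (n_test,)
    """
    log_comp = (-0.5 * np.log(2 * np.pi * pred_vars)
                - 0.5 * (y_test[None, :] - pred_means) ** 2 / pred_vars)
    S = pred_means.shape[0]
    log_mix = np.logaddexp.reduce(log_comp, axis=0) - np.log(S)   # log of the mixture density
    return -np.mean(log_mix)
```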


Figure 8: Pairwise comparison of MSE and NLPD

5 Conclusion

We propose a Gaussian Process regression model with a latent covariate for data with input-dependent noise. This model does not assume that the residuals all have equal variance, or that they are all Gaussian. When an appropriate covariance function is chosen, it can also handle special cases such as (a) GP regression with i.i.d. Gaussian noise, or (b) GP regression with heteroscedastic Gaussian noise. Tests on both univariate and moderate-dimensional synthetic heteroscedastic datasets show that this method gives better predictions (more accurate predictive means and more accurate predictive distributions) than the regular GP regression model. Although the GPLV model generally does better when the noise is Gaussian, our method gives better results when the noise is non-Gaussian. For the Gaussian residual case, we expect that our model would give comparable results if we specified the covariance function as in (14). We also extended Goldberg et al.'s method to a fully Bayesian version and discussed fast computation methods for it. Experiments show that a Metropolis-type strategy is faster than a generic slice sampling method.

References
Muraleedharan Gangadharan, Carlos Guedes Soares, and Cláudia Susana Gomes Lucas. Characteristic and moment generating functions of generalised extreme value distribution (GEV). In Linda L. Wright, editor, Sea Level Rise, Coastal Engineering, Shorelines and Tides, pages 269-276. Nova Science Publishers, 2011.

P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems 10, 1998.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087-1092, 1953.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993. Available from http://www.utstat.utoronto.ca/~radford.

R. M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Dept. of Statistics, University of Toronto, 1997. Available from http://www.utstat.utoronto.ca/~radford.

R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, editor, Bayesian Statistics 6, pages 475-501. Oxford University Press, 1998.

R. M. Neal. Slice sampling. Annals of Statistics, 31(3):705-767, 2003.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. ISBN 026218253X.

