1 Introduction
Gaussian processes (GPs) [1] are considered state of the art in (non)linear regression, since they offer a natural way to implement a predictive posterior distribution that provides a predictive target mean together with a relevant measure of uncertainty in the form of a predictive covariance. While standard GPs were initially designed to handle scalar outputs, it is becoming increasingly common to face multitask (MT) or multidimensional output problems. While there are several approaches to formulating multioutput GPs [2], here we focus on a multitask GP (MTGP) formulation that is able to jointly and symmetrically estimate a set of tasks from a single observation. The general MTGP formulation proposed in [3], and reformulated as a linear model of coregionalization (LMC) [4, 5], assumes that the multitask covariance matrix is expressed as the Kronecker product of an inter-task matrix and the input kernel matrix. To avoid the prohibitive cost of estimating a full-rank inter-task matrix when the number of tasks is large, [3] proposes a low-rank approximation, so that the number of parameters to be learnt grows only linearly with the number of tasks, with the drawback of requiring the adjustment of the approximation rank.
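As an illustration of the LMC covariance structure just described, the following sketch builds the multitask covariance as the Kronecker product of a low-rank inter-task matrix and an input kernel matrix. All sizes, the linear kernel choice and the rank value are our own illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, R = 4, 30, 2            # tasks, samples, low-rank order (illustrative)

X = rng.normal(size=(N, 3))
K_x = X @ X.T                 # input kernel matrix (linear kernel for simplicity)

L = rng.normal(size=(T, R))   # low-rank factor of the inter-task matrix
K_t = L @ L.T + 1e-6 * np.eye(T)   # jitter keeps it positive definite

# Multitask covariance over the stacked outputs: Kronecker product (TN x TN)
K_mt = np.kron(K_t, K_x)

# A full-rank inter-task matrix needs T(T+1)/2 free parameters; the factor only T*R
print(K_mt.shape)             # (120, 120)
```
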
To obtain computational savings in the above MTGP formulation, [6] proposes an efficient inversion of the MT covariance matrix by combining properties of the singular value decomposition (SVD) and the Kronecker product, significantly reducing the computational cost. Another interesting approach is proposed in [7], where a noise covariance matrix is introduced to model noise dependencies together with the inter-task relationships. Additionally, several convolutional models [8, 9, 10, 11] have emerged, establishing a more sophisticated formulation that is able to model blurred relationships between tasks by generalizing the MT kernel matrix through a convolution. However, adequate usage requires a careful selection of the convolutional kernel in order to make the integral tractable, and the number of parameters must be limited to balance the model's flexibility against its complexity and avoid overfitting issues. Furthermore, this complex design limits the model's interpretability, since the covariance matrices are not explicitly estimated. Efficient versions of these models [11] select inducing points to reduce the computational cost.
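The Kronecker inversion trick of [6] can be sketched numerically: eigendecomposing each factor separately lets us apply the inverse of the full MT covariance without ever forming it. The dense comparison below is only for validation; sizes and the noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, sigma2 = 3, 20, 0.5

A = rng.normal(size=(N, N)); K_x = A @ A.T + np.eye(N)   # input kernel
B = rng.normal(size=(T, T)); K_t = B @ B.T + np.eye(T)   # inter-task matrix
Y = rng.normal(size=(N, T))

# Eigendecompose each factor separately: O(N^3 + T^3) instead of O(N^3 T^3)
lam_x, U_x = np.linalg.eigh(K_x)
lam_t, U_t = np.linalg.eigh(K_t)

# (K_t kron K_x + sigma2*I)^{-1} vec(Y), exploiting the Kronecker eigenstructure
Yt = U_x.T @ Y @ U_t
S = np.outer(lam_x, lam_t) + sigma2        # eigenvalues of the full matrix
alpha = U_x @ (Yt / S) @ U_t.T

# Validate against the naive dense solve (never needed in practice)
K_big = np.kron(K_t, K_x) + sigma2 * np.eye(N * T)
alpha_naive = np.linalg.solve(K_big, Y.reshape(-1, order='F')).reshape(N, T, order='F')
print(np.allclose(alpha, alpha_naive))     # True
```
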
Inspired by the MTGP probabilistic model proposed in [7], where both the inter-task and noise covariance matrices are modelled, we introduce a novel learning methodology based on a decomposition of the likelihood function that turns the MT learning process into a set of conditional one-output GPs. This methodology, combined with a hierarchical extension of the conditioned GPs, allows us to introduce the following novelties and advantages: (1) We do not need to use a low-rank approximation, avoiding the selection of the rank hyperparameter. (2) We reduce the number of parameters to be learnt in the noise and signal covariance matrices to grow only linearly with the number of tasks, while still recovering the full noise and inter-task matrices. This reduces the model's complexity and, as the experiments show, leads to an improvement in performance. (3) This conditioned model, combined with an efficient implementation, leads to a computational time comparable to that of [6, 7]. (4) Finally, this learning approach can be easily adapted to leverage efficient GP libraries, such as GPyTorch [12], that can run on graphics processing units (GPUs). In this case, the model also admits an embarrassingly parallel implementation over the tasks.
2 Introduction to Multitask Gaussian Processes
Given a set of observations, and a transformation of the inputs into a reproducing kernel Hilbert space [13], the general expression for the MT regression problem of estimating several regressors or tasks results in the model
(1) 
where the model combines a mixing matrix with a term representing the model noise. To complete this probabilistic model, the following inter-task noise distribution and weight prior are assumed
(2) 
where the vectorization operator and the Kronecker product are used. A common covariance matrix models the covariances between the elements of the weights and is shared by all the tasks. In this model, the correlation between the noise elements of different tasks is represented through the noise covariance, and the relationships between tasks are modelled with the inter-task covariance. Once these matrices are learnt, the predictive posterior for a test sample is constructed as:
(3) 
where the cross-kernel vector contains the dot products between the test sample and the training dataset, and the kernel matrix contains the dot products between the training data.
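As a toy illustration of the predictive posterior structure in (3), the following sketch computes the predictive mean for all tasks at a single test point using the dense Kronecker-structured covariance. The kernel choice and all dimensions are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 3, 15
X = rng.normal(size=(N, 2)); x_star = rng.normal(size=(1, 2))

K_x = X @ X.T                      # training kernel matrix (linear kernel)
k_star = (X @ x_star.T).ravel()    # dot products between test sample and training data

Bf = rng.normal(size=(T, T)); K_t = Bf @ Bf.T + np.eye(T)   # inter-task covariance
Bn = rng.normal(size=(T, T)); S_n = Bn @ Bn.T + np.eye(T)   # noise covariance

Y = rng.normal(size=(N, T))
y = Y.reshape(-1, order='F')       # vec(Y), tasks stacked column-major

# Dense predictive mean for all T tasks at x_star
K = np.kron(K_t, K_x) + np.kron(S_n, np.eye(N))
mean_star = np.kron(K_t, k_star[None, :]) @ np.linalg.solve(K, y)
print(mean_star.shape)             # one predictive mean per task
```
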
So far, this formulation extends the models of [3] and [6] and is formally identical to [7], which introduces the noise covariance. The underlying limitation of these works lies in the fact that the number of parameters to be optimized grows quadratically with the number of tasks, and therefore a low-rank approximation is used to model the inter-task and noise matrices. This reduces the number of parameters, but imposes the need to select a suitable value for the rank parameter.
3 Parameter learning through conditional one-output likelihood for MTGPs
Here, we introduce a novel methodology based on a conditional one-output likelihood MTGP (CoolMTGP), where the previous MTGP formulation is decomposed into a set of conditioned one-output GPs. This methodology reduces the number of parameters to be learnt to twice the number of tasks, without assuming any low-rank approximation or requiring the adjustment of additional hyperparameters.
3.1 MT likelihood as a product of conditional one-output distributions
Let us consider a model in which the output corresponding to a given input in each task is given by a linear combination of both the input data and the outputs of the previous tasks, i.e.,
(4) 
where each task has its own noise process and the weight of each factorized task is split into two components: a weight for the input data and a weight for the previous tasks. This model is closely related to the original MTGP described in Section 2 since, given the values of these two components, one can recursively recover the original weights as:
(5) 
We can now apply the chain rule of probability to the original joint MT likelihood to factorize it into a set of conditional probabilities, each associated with a conditional one-output GP:
(6) 
where the likelihood for each of these conditioned GPs is given by:
(7) 
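The factorization in (6) is just the chain rule applied to a joint Gaussian; a quick numerical check (with our own illustrative covariance) confirms that the joint log-density equals the sum of the conditional one-output Gaussian log-densities:

```python
import numpy as np

def gauss_logpdf(y, mu, var):
    # Log-density of a univariate Gaussian
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

rng = np.random.default_rng(3)
T = 4
A = rng.normal(size=(T, T)); Sigma = A @ A.T + np.eye(T)   # joint covariance
y = rng.normal(size=T)

# Joint zero-mean Gaussian log-density
sign, logdet = np.linalg.slogdet(Sigma)
joint = -0.5 * (T * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# Chain rule: p(y) = prod_t p(y_t | y_1..y_{t-1}), each factor Gaussian
chained = 0.0
for t in range(T):
    if t == 0:
        mu, var = 0.0, Sigma[0, 0]
    else:
        w = np.linalg.solve(Sigma[:t, :t], Sigma[t, :t])  # conditional regression weights
        mu = w @ y[:t]
        var = Sigma[t, t] - Sigma[t, :t] @ w              # conditional variance
    chained += gauss_logpdf(y[t], mu, var)

print(np.allclose(joint, chained))   # True
```
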
The prior over the input weight components is defined as:
(8) 
which assumes a common prior for all tasks, scaled by a task-dependent factor.
Figure 1: (a) Global model; (b) hierarchical extension.
This approach assumes that the input weights are latent variables modelled with a prior distribution, whereas the previous task weights are defined as model parameters (see the graphical model in Figure 1(a)); this way, for each task we generate a conditioned one-output likelihood GP (CoolGP) with the corresponding mean and covariance. This guarantees that the model remains Gaussian, allowing us to recover the original MTGP joint likelihood from the set of CoolGP likelihoods, as defined in (7).
3.2 Parameter learning and model inference
In order to optimize the model in Figure 1(a), we need to learn the input prior factors , the noise covariances , the common kernel parameters and, additionally, the weights of the previous tasks . To reduce the number of parameters to be learnt we propose two approaches which are described in the following sections. A detailed derivation of both approaches and their corresponding algorithms is included in the Supplementary Material, alongside a software demo in notebook form to ensure the reproducibility of our results.
3.2.1 A hierarchical approach for CoolMTGP learning
This first method introduces a two-step hierarchical model for each CoolGP. In the first step, we define a prior over the previous task weights,
(9) 
which enables a first inference pass to obtain their MAP estimation as
(10) 
Note that this process is equivalent to training a GP with zero mean and a combined covariance matrix (as depicted in the model in Figure 1(b)); this way it has the formal expression of a standard GP [1], where the input data kernel matrix is rescaled by a task-dependent factor and added to a linear kernel matrix over the previous tasks' outputs.
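A minimal sketch of this first inference pass, under illustrative sizes and parameter values (the names gamma_t, sigma2_t and Y_prev are ours): the MAP estimate follows from a standard zero-mean GP whose covariance is the rescaled input kernel plus a linear kernel on the previous tasks' outputs:

```python
import numpy as np

rng = np.random.default_rng(4)
N, t, gamma_t, sigma2_t = 25, 3, 0.7, 0.1   # t previous tasks (illustrative values)

X = rng.normal(size=(N, 2))
K_x = X @ X.T                               # input kernel matrix
Y_prev = rng.normal(size=(N, t))            # outputs of the already-conditioned tasks
y_t = rng.normal(size=N)                    # current task targets

# Covariance of the hierarchical prior: rescaled input kernel plus a linear
# kernel on the previous tasks' outputs (Figure 1(b) model)
C = gamma_t * K_x + Y_prev @ Y_prev.T

# MAP estimate of the latent function, as in a standard zero-mean GP
f_map = C @ np.linalg.solve(C + sigma2_t * np.eye(N), y_t)
print(f_map.shape)
```
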
In the second step of the hierarchical model, we learn the remaining parameter values (the scale factors, noise variances and kernel parameters) by going back to the original CoolGP (Figure 1(a)) and substituting the previous task weights by their MAP estimation; the model parameters are then learnt by maximizing the joint likelihood (6) while marginalizing the influence of the input weights, i.e.,
(11) 
where the parameters are optimized by gradient descent. Note that the scale and noise parameters are particular to each GP, so they can be optimized independently; only the kernel parameters must be optimized jointly. In addition, each CoolGP is a Gaussian process whose mean and covariance follow the expressions above. Finally, once the model parameters are learnt, we can also recover the MAP estimation of the input weights as:
(12) 
3.2.2 An approximate approach for CoolMTGP learning
In this second approach, instead of applying a two-step learning algorithm that requires the inversion of two kernel matrices, we obtain an approximation by performing joint inference over both weight components. To accomplish this, we directly consider their priors in equations (8) and (9), so that each CoolGP has zero mean and a single combined covariance. Then, to learn the model parameters, we maximize the following marginalized joint likelihood
(13) 
where the same kernel matrix is used to infer both weight components, whose MAP estimations are given by
(14) 
This approximate CoolMTGP approach uses a single kernel matrix to infer the weights and learn the parameters, thus cutting the computational time in half. However, because we are now considering the previous task weights as latent random variables, the conditional likelihoods are no longer Gaussian, and therefore we cannot ensure that the original MTGP formulation can be accurately recovered. For this reason, we call this approach approximate. Nevertheless, as we shall show in the experimental section, the results for both the hierarchical and the approximate approaches are very similar.
3.2.3 Efficient implementation
A naive implementation of these two CoolMTGP versions results in a high computational complexity. However, as we discuss in the Supplementary Material, by using SVD and rank-t update properties we can greatly reduce this computational burden, especially in the common case where the number of samples exceeds the number of tasks. Parallelization of the model is also straightforward, since the learning stage for each CoolGP can be carried out at the same time. Furthermore, in the case of the approximate CoolMTGP, where all model parameters and variables for task t share the same GP model, any standard GP library (efficiently designed for single-output GPs) can be leveraged to implement the model.
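One ingredient of such an efficient implementation can be sketched with the Woodbury identity: once the base kernel matrix has been inverted, the rank-t term coming from the previous tasks' outputs can be absorbed as a cheap low-rank update. Sizes are illustrative; this is our own sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
N, t = 40, 3
A = rng.normal(size=(N, N)); K = A @ A.T + np.eye(N)   # base matrix, inverted once
U = rng.normal(size=(N, t))                            # rank-t term: U U^T

K_inv = np.linalg.inv(K)                               # O(N^3), shared across tasks

# Woodbury identity: (K + U U^T)^{-1} from K^{-1} in O(N^2 t + t^3)
KiU = K_inv @ U
M = np.eye(t) + U.T @ KiU
updated_inv = K_inv - KiU @ np.linalg.solve(M, KiU.T)

print(np.allclose(updated_inv, np.linalg.inv(K + U @ U.T)))   # True
```
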
3.3 Recovering the multitask model
Although the inter-task and noise covariance matrices are not explicit in the conditioned model, they can be recovered from the learnt parameters and the MAP estimations of the weights. The general MTGP formulation considers that the joint MT likelihood, which contains the noise matrix, is given by
(15) 
Considering the factorization given by (6) and the factorized likelihoods of (7), and applying the properties of products of conditional Gaussians (see, e.g., [14]), we can recursively recover the mean and covariance terms, which leads us directly to the MT noise covariance as:
(16) 
where suitable initial conditions for the recursion are assumed.
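The recursive recovery of the noise covariance can be illustrated numerically: the conditional regression weights and conditional variances (what each CoolGP effectively learns about the noise) reassemble the full matrix exactly, via the standard identity Sigma = (I - B)^{-1} D (I - B)^{-T} (our own notation):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 5
A = rng.normal(size=(T, T)); Sigma = A @ A.T + np.eye(T)   # "true" noise covariance

# Per-task conditional regression weights B and conditional variances d
B = np.zeros((T, T)); d = np.zeros(T)
for t in range(T):
    if t == 0:
        d[0] = Sigma[0, 0]
    else:
        w = np.linalg.solve(Sigma[:t, :t], Sigma[t, :t])
        B[t, :t] = w
        d[t] = Sigma[t, t] - Sigma[t, :t] @ w

# Recursive reconstruction: Sigma = (I - B)^{-1} diag(d) (I - B)^{-T}
I_B = np.eye(T) - B
Sigma_rec = np.linalg.solve(I_B, np.diag(d)) @ np.linalg.inv(I_B).T
print(np.allclose(Sigma_rec, Sigma))   # True
```
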
As for the construction of the inter-task covariance, we provide a brief summary here, while the full details can be found in the Supplementary Material. Given the joint covariance prior of the weights, the expression of the inter-task covariance matrix can be obtained from
(17) 
where an auxiliary matrix is used whose t-th row is constructed as the concatenation of the previous task weights followed by zeros; hence, its first row is all zeros.
To obtain the remaining terms, we first recall that the diagonal elements were predefined in (8). We approximate the off-diagonal elements with the sample estimation of the cross-correlation coefficients. Since this computation depends on the latent weight variables, we can carry it out either by generating samples from their posterior distribution and obtaining the values with a Monte Carlo approximation, or by directly considering their MAP estimations. In fact, if we consider the latter approach (either Equation (12) or Equation (14)), the calculation of these terms can be expressed in a more compact form as:
(18) 
where the required MAP estimations are given by (10) in the hierarchical CoolMTGP method, or by (14) in the approximate approach.
One might be concerned that this reconstruction process, and therefore the overall performance of the model, depends on the order in which the tasks are assigned to the CoolGPs. However, experimental results show that the reconstruction of the matrices is consistent for any random permutation of the tasks, confirming that task order has little impact.
4 Experimental Results
4.1 Synthetic benchmarks
We have carried out a synthetic data experiment to compare the performance of all the considered MTGP algorithms against a ground-truth model. Following the general model of Section 2, a data set has been drawn from the corresponding likelihood function, where the inter-task and noise covariance matrices follow a low-rank form. We have created datasets for three scenarios in which these matrices were generated with increasing rank values. In all cases, several iterations were run with random training/test partitions.
We compare both the hierarchical (HCoolMT) and approximate (CoolMT) versions of the proposed model with the standard MTGP (StdMT) of [3] and the extension introduced in [7], which includes a noise matrix (MT). Since these methods require the selection of a rank parameter, we have analyzed three values: one equal to the true rank (the ideal case), a smaller value and, where possible, a larger one. Additionally, a ground-truth model that uses the true inter-task and noise covariance matrices is included, as well as a set of independent GPs. Predictive performance is measured with the mean square error (MSE) averaged over all the tasks.
Figure 2: MSE performance of all the compared methods in the three synthetic scenarios of increasing rank.
Figure 3: Recovered covariance matrices: (a) ground truth, (b) StdMT, (c) MT, (d) CoolMT, (e) HCoolMT.
Figure 2 shows that MT has the highest sensitivity to the choice of the rank parameter, and its performance degrades as the scenario complexity (defined by the rank of the generating matrices) increases. StdMT is more robust with respect to both this selection and the scenario complexity. As expected, both models show their best performances when the selected rank matches the true one, and thus a cross-validation of the rank is paramount for these methods to perform optimally. This sensitivity depends on the number of parameters to be inferred, which grows with the chosen rank for StdMT and MT, while it is only twice the number of tasks for both CoolMTGPs. This allows our model to perform close to the ground truth independently of the scenario complexity.
Experimentally, it can be seen in Figure 3 that, while all models are capable of inferring the inter-task covariance, the noise matrix is not properly inferred by MT when the scenario complexity (rank) is high. Besides, comparing both versions of the CoolMTGP model, the hierarchical approach is unsurprisingly better at estimating the true parameters and reconstructing the noise matrix, leading to a better consistency in its predictions, as already shown in Figure 2.




Figure 4: Covariance matrices recovered for (a) the real task ordering and (b)-(e) four random permutations of the tasks.
Finally, we address the possibility that the model is sensitive to the task ordering used during the training of the CoolGPs. As can be seen in Figure 4, the model is consistent with regard to the order of the tasks.
4.2 Real data benchmarks
In this experiment we use real-world scenarios to compare the CoolMT model's capabilities with those of the StdMT and ConvMT [15] models (we have been unable to include MT in this part of our study due to convergence issues with the available implementation that led to unusable results). To accomplish this, we have made use of the collection of datasets featured in [16], which offers a wide variety in sample size, input dimensions and output tasks. A summary of their main characteristics can be found in the Supplementary Material. In all cases a standard normalization of the data was applied and 10 iterations were run with a random 80%/20% training/test partitioning of the data. In larger datasets (more than 400 samples) we dedicated 300 samples to the training set and 100 to the test set. All models were trained with both a linear kernel and a squared exponential (SE) kernel, but only the results for the best-performing kernel are shown.
Table 1: Test MSE (mean ± standard deviation) for the real-world datasets.
Dataset | Kernel | Ind. GPs | StdMT | ConvMT | CoolMT | HCoolMT
oes10 | Linear | 1.04 ± 0.88 | 0.15 ± 0.07 | 0.14 ± 0.07 | 0.14 ± 0.07 | 0.14 ± 0.07
oes97 | Linear | 1.53 ± 0.58 | 0.21 ± 0.08 | 0.19 ± 0.07 | 0.20 ± 0.08 | 0.20 ± 0.08
scm1d | Linear | 1.78 ± 1.01 | 0.58 ± 0.43 | 0.21 ± 0.05 | 0.22 ± 0.05 | 0.22 ± 0.05
scm20d | Linear | 0.52 ± 0.06 | 0.49 ± 0.06 | 0.47 ± 0.06 | 0.47 ± 0.06 | 0.47 ± 0.06
andro | SE | 0.20 ± 0.11 | 0.20 ± 0.11 | 0.82 ± 0.29 | 0.20 ± 0.11 | 0.21 ± 0.11
enb | SE | 0.04 ± 0.01 | 0.03 ± 0.00 | 0.08 ± 0.02 | 0.02 ± 0.00 | 0.02 ± 0.01
edm | SE | 0.39 ± 0.12 | 0.44 ± 0.16 | 0.49 ± 0.13 | 0.49 ± 0.21 | 0.41 ± 0.13
slump | SE | 0.39 ± 0.08 | 0.50 ± 0.22 | 0.40 ± 0.09 | 0.37 ± 0.07 | 0.37 ± 0.07
atp1d | SE | 0.20 ± 0.07 | 0.99 ± 0.36 | 1.04 ± 0.37 | 0.20 ± 0.07 | 0.20 ± 0.07
atp7d | SE | 0.48 ± 0.19 | 0.97 ± 0.33 | 1.16 ± 0.30 | 0.48 ± 0.17 | 0.47 ± 0.17
As can be seen in Table 1, in the linear cases all models present a similar MSE, whereas in the nonlinear scenarios the CoolMTGPs come out clearly on top. After analysing the values of the SE kernel lengthscale parameter learnt by all the models, it is clear that ConvMT and, in some cases, StdMT are unable to achieve a correct estimation. We believe that this is due to our model's reduced number of parameters to be learnt, which makes its adjustment easier. Once again, both CoolMT approaches lead to similar results, except in the case of the edm dataset, where the hierarchical approach gains an edge.
4.3 Computational performance analysis
In this last section we evaluate the computational performance of some of the methods under study when executed on a CPU and on a GPU. For this purpose, we have selected StdMT and ConvMT, since they are efficiently implemented over PyTorch and TensorFlow, and we have designed a wrapper over the PyTorch GP implementation for the proposed CoolMT approach in order to run it on GPUs. We have measured the run time and MSE performance of each algorithm with a linear kernel for different-sized training partitions of the scm20d dataset, considering only 4 tasks. Computational times are averaged over several iterations. The same number of optimization iterations was used for the StdMT and CoolMT methods, whereas ConvMT required more iterations to obtain accurate results. The experiment was carried out on an Intel Core i9 processor using a single core (3.3 GHz, 98 GB RAM) and a GeForce RTX 2080Ti GPU (2944 CUDA cores, 1.545 GHz, 10.76 GB VRAM).
Figure 5 shows the evolution of the run time and MSE with the training set size. ConvMT and CoolMT show similar MSE, but the computational time of ConvMT grows much faster with the number of data. We conclude that CoolMT presents the best tradeoff between accuracy and computational burden.
Notably, while StdMT's implementation is specific to GPUs, using the optimizers provided by PyTorch, for now CoolMT only uses a wrapper. Despite this, CoolMT achieves comparable performance. Additional improvements can be expected from an implementation tailored for parallelization.
Figure 5: (a) Execution time in CPU; (b) execution time in GPU; (c) MSE evolution.
5 Conclusions
In this paper we have proposed a novel solution for the MTGP problem that, compared to previous formulations, eliminates the need to validate any model hyperparameters and dramatically reduces the number of parameters to be learnt.
Similarly to other existing models, this proposal assumes that an inter-task and a noise covariance exist. The novelty lies in the parameter inference, which is solved through the factorization of the joint MT likelihood into a product of conditional one-output GPs. Once these parameters are learnt, with either a hierarchical or an approximate approach, a recursive algorithm can be used to recover the MT inter-task and noise covariances.
Experimental results show an accurate estimation of the MT inter-task and noise matrices, which translates into an improved error performance. At the same time, we have integrated the model with standard GP toolboxes, showing that it is computationally competitive with the state of the art.
Broader Impact
This article proposes a novel MTGP learning methodology based on a conditional one-output likelihood for MTGPs. This approach assumes full-rank inter-task and noise covariance matrices, without compromising the computational cost.
Our approach may become the reference formulation in the Machine Learning community for MTGP due to its improved performance, interpretability and competitive computational cost, as well as its easy integration with existing standard toolboxes for GPs.
It can be used in a variety of real world problems where it is important to account for the dependencies between tasks and where there is value in providing an interpretation of the observed MT regression process, for example, in medical time series analysis or in genomics. It can also be applied to forecast problems in areas such as SmartGrids or power load prediction.
Our software repository is publicly available and can be used right away in any real world scenario using a GPU or high performance computing facility with standard Python libraries.
Acknowledgments and Disclosure of Funding
We thank Dr. Miguel Lázaro-Gredilla and Gustau Camps-Valls for their thorough review of the paper and fruitful discussions. Partially supported by the National Science Foundation EPSCoR Cooperative Agreement OIA-1757207 and the Spanish MINECO grants TEC2017-83838-R and PID2019-108539RB-C22.
References
 (1) C. E. Rasmussen and C. K. Williams, Gaussian process for machine learning. MIT press, 2006.
 (2) H. Liu, J. Cai, and Y.-S. Ong, “Remarks on multioutput Gaussian process regression,” Knowledge-Based Systems, vol. 144, pp. 102–121, 2018.
 (3) E. V. Bonilla, K. M. Chai, and C. Williams, “Multitask Gaussian process prediction,” in Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, eds.), pp. 153–160, Curran Associates, Inc., 2008.
 (4) A. M. Schmidt and A. E. Gelfand, “A Bayesian coregionalization approach for multivariate pollutant data,” Journal of Geophysical Research: Atmospheres, vol. 108, no. D24, 2003.
 (5) T. R. Fanshawe and P. J. Diggle, “Bivariate geostatistical modelling: a review and an application to spatial variation in radon concentrations,” Environmental and ecological statistics, vol. 19, no. 2, pp. 139–160, 2012.
 (6) O. Stegle, C. Lippert, J. M. Mooij, N. D. Lawrence, and K. Borgwardt, “Efficient inference in matrix-variate Gaussian models with iid observation noise,” in Advances in Neural Information Processing Systems, pp. 630–638, 2011.
 (7) B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle, “It is all in the noise: Efficient multitask Gaussian process inference with structured residuals,” in Advances in Neural Information Processing Systems 26, pp. 1466–1474, 2013.
 (8) H. K. Lee, C. H. Holloman, C. A. Calder, and D. M. Higdon, “Flexible Gaussian processes via convolution,” Duke University, 2002.
 (9) P. Boyle and M. Frean, “Dependent Gaussian processes,” in Advances in Neural Information Processing Systems 17, pp. 217–224, 2005.
 (10) M. Alvarez and N. D. Lawrence, “Sparse convolved Gaussian processes for multioutput regression,” in Advances in Neural Information Processing Systems 21, pp. 57–64, 2009.
 (11) M. A. Álvarez and N. D. Lawrence, “Computationally efficient convolved multiple output Gaussian processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1459–1500, 2011.
 (12) J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson, “GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration,” in Advances in Neural Information Processing Systems, pp. 7576–7586, 2018.
 (13) J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press, 2004.

 (14) C. M. Bishop, Pattern Recognition and Machine Learning, ch. 2: Probability Distributions. Springer, 2006.
 (15) A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” Journal of Machine Learning Research, vol. 18, pp. 1–6, Apr. 2017.
 (16) E. SpyromitrosXioufis, G. Tsoumakas, W. Groves, and I. Vlahavas, “Multitarget regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.