## 1. INTRODUCTION

In many real-world domains, datasets often comprise heterogeneous subsets, which is referred to as feature space heterogeneity (Xiao and Wang, 2019). Feature space heterogeneity is difficult to handle because a relationship between explanatory and response variables observed in some subsets may disappear or be reversed when all subsets are aggregated, as observed in Simpson’s paradox.

Model trees are used to deal with feature space heterogeneity by providing interpretability of the result. The induction procedure of the model tree can be divided into two phases: 1) partitioning the data space with tree growing, and 2) model learning on each terminal or leaf node. In the first phase, the data space is partitioned sequentially according to a splitting criterion. The main issue in this phase is selecting a splitting criterion, stopping rule, or pruning method. Existing split criteria for a model tree can be classified into three approaches. The first approach is based on the mean squared error of the target variable, as in the classification and regression tree (CART, Breiman *et al*., 1984), M5 (Quinlan, 1992), M5’ (Wang and Witten, 1996), and HTL (Torgo, 1997). The second approach for the splitting criterion is to use look-ahead error, as in RETIS (Karalic, 1992). This directly measures the performance of predictive models associated with child nodes assuming that the data space is partitioned using the given split candidate. This splitting criterion requires an exhaustive search for possible values of all variables. The third approach for the splitting criterion is based on the distribution of estimated residuals, as proposed in smoothed and unsmoothed piecewise-polynomial regression trees (SUPPORT, Chaudhuri *et al*., 1994). It fits a linear regression model at the current node and classifies all data points into two classes according to the residuals’ signs. For each numerical variable, the t-test and Levene’s test (Levene, 1960) for the difference in means and variances are performed, and then the variable with the smallest *p*-value is selected as the split variable and the split level is determined by the average of the two class means. Generalized, unbiased interaction detection and estimation (GUIDE, Loh, 2002) and segmented linear regression trees (SLRT, Zheng and Chen, 2019) also adopt this approach.

In addition to the splitting criterion, there are two other important components in the first phase, i.e., stopping rules and pruning methods, which are prevent overfitting of training data by limiting the overall size of the tree. For stopping rules, the minimum number of observations in the node, maximum depth of the tree, or a prespecified threshold value of the splitting criterion function are commonly used. A simple pruning method selects the subtree with the lowest cross-validation error. One popular pruning method is the 1-SE rule of CART, which selects a simpler subtree within one standard error in crossvalidation.

In the second phase of model tree induction, a prediction model is constructed at each terminal node. Here, it is common to apply a linear regression model to the data at each node; however, some model tree algorithms, e.g., HTL (Torgo, 1997) and the mathematical programming tree (Yang *et al*., 2017), use a nonlinear regression model.

In this study, we propose a method in the second phase of regression model tree learning. Two challenging problems arise in these learning approaches: 1) handling an insufficient number of observations for training at some nodes, and 2) consideration of global and local effects. A model tree divides the data space into smaller subsets; therefore, the number of training data at a terminal node may become insufficient to estimate model parameters reliably. In addition, a small number of training data may cause overfitting, which degrades predictive performance on unseen data. The second challenge is related to the reflection of the global effect. There is a trade-off between global and local effects when building a model tree. Reflecting the global effect on a model enforces some local models to share the same effect, which may reduce the performance of local models. Incremental learning is commonly used to effectively consider both global and local effects (Lee and Jun, 2018). Stepwise model tree induction (SMOTI, Malerba *et al*., 2002, 2004) introduces two node types for incremental learning, i.e., an ordinary split node and a regression node that only fits a univariate regression function. The final multivariate regression model is constructed by combining these regression nodes along the path from the root to the terminal node. However, SMOTI may ignore local effects. If one variable is selected at an earlier step, the regression coefficient associated with that variable cannot be modified. The simultaneous threshold interaction modeling algorithm (STIMA, Dusseldorp *et al*., 2010) also deals with the global effect by combining the multiple linear regression shared across all leaf models and locally trained leaf models. However, STIMA has a limitation in that the locally trained leaf model is simply a piecewise constant function.

To tackle two challenging problems, this study suggests the use of multi-task learning in the second phase of a model tree, which is the first attempt in the literature. When model estimation at each node is considered a task, multi-task learning estimates models at all nodes simultaneously rather than estimating them independently. Multitask learning learns related multiple tasks using the shared information among tasks (Argyriou *et al*., 2006). Multitask learning is advantageous when the number of training data is limited because multiple tasks are trained simultaneously; therefore, multi-task learning is expected to prevent overfitting in nodes with a small number of observations. In addition, multi-task learning encourages sharing information of different tasks via regularization; thus, the global effect can be reflected in leaf models.

Section 2 describes the proposed method of split selection when constructing a tree and of model estimation via multi-task learning on terminal nodes of the tree. In Section 3 we show the performance of the proposed method as compared with two baseline methods on one synthetic dataset and seven real datasets. Then, we conclude in Section 4.

## 2. PROPOSED METHOD

Consider dataset *D*={(*x _{i}*,

*y*) :

_{i}*i*=1, 2,…,

*N*, where

*N*is the number of observations. Here ${\text{x}}_{i}=({x}_{1i},\hspace{0.17em}\dots ,\hspace{0.17em}{x}_{di})$ is a

*d*-dimensional input vector, and

*y*is the

_{i}*i*observation of the target variable. The proposed method comprises two components, i.e., split selection and model learning.

^{th}### 2.1 Split Selection

The proposed method uses a hybrid method of the splitting criteria used in GUIDE and SLRT. GUIDE is designed to reduce computational time and avoid variable selection bias by replacing the traditional greedy search method with a two-step method. First, GUIDE selects the split variable, and then finds the best split level for the selected variable. GUIDE fits a linear regression model at the current node and classifies all data points into two classes according to the sign of the estimated residuals. To include the categorical variable and consider pairwise interactions, GUIDE uses the Pearson *χ*^{2} test to verify the relationship between the residuals’ signs and values of the given variable. For each numerical variable, it constructs a 2 × 4 contingency table with the residual sign of the data points as rows and sample quartiles as columns. For each categorical variable, the categories of the variable are substituted for the sample quartiles. Then, GUIDE performs the *χ*^{2} test on this contingency table and selects the split variable with the smallest *p*-value (GUIDE refers to this test as a curvature test). After a split variable is selected, GUIDE exhaustively searches all values of the selected variable and selects the level to minimize the sum of squared errors.

SLRT is a state-of-the-art regression model tree that uses the linear regression function. As mentioned previously, SLRT uses a splitting criterion based on the pattern of estimated residual. To select the split variable and level, SLRT utilizes Kendall’s *τ* rank correlation (Kendall, 1938) between each regressor and the estimated residual. The key idea of SLRT is that a regressor and estimated residual tend to be accordant (discordant) for a positive (negative) relationship if the underlying regression function is linear. Based on this assumption, SLRT selects the split variable and level so as to maximize the criterion function proposed. Zheng and Chen (2019) performed a theoretical analysis to verify that the split selection algorithm consistently identifies genuine split variables and levels. However, this procedure has a large time complexity of *O*(*d*^{2}×*N*^{2}) to find the optimal split variable and level. For example, SLRT took 605 s on a dataset with 200 observations and four variables, while GUIDE required only 3 s.

To reduce computation time, we propose to select the split variable using GUIDE’s method, and then the split level is determined by SLRT’s selection method. By combining GUIDE and SLRT, the proposed method is expected to build a model tree that provides relatively good performance in reduced time. In addition, we adopt two stopping rules, i.e., the minimum number of observations to be specified in a node and the maximum depth of a tree to be specified. For our experiments in Section 3, the minimum size of the training dataset in a leaf node is set to the maximum of the number of dimensions (or variables) of a dataset and one percent of the total observations of the training set. The maximum depth of a tree is set to 2 for the training datasets whose observations are smaller than or equal to 1,000 and to 3 for the datasets whose observations are larger than 1,000.

### 2.2 Model Learning

Rather than estimating a model at each terminal node independently, we suggest the use of multi-task learning to estimate models at all terminal nodes simultaneously. It is important to select a multi-task learning method that can handle two challenging issues appropriately, i.e., small data size at nodes and considerations of both global and local effects. Here, we assume that a model tree has a cluster structure, where a group of some terminal nodes shares similar models. Multi-task learning algorithms can be classified into four categories, i.e., the learning approach, decomposition approach, task clustering approach, and task relation learning approach (Zhang and Yang, 2017). Based on our assumption, a task clustering approach to multi-task learning is suitable for our case.

Among task clustering approaches, methods that assume disjoint cluster structures are excluded because disjoint groups prevent different clusters from sharing parameters. However, realistically, the models of different terminal nodes are likely to partially overlap. In addition, the multi-task learning method for the model tree is required to handle irrelevant (i.e., outlier) tasks to consider the local effect.

Thus, this study employs grouping and overlap for multi-task learning (GO-MTL, Kumar and Daume III, 2012) to train the model tree’s leaf models. One advantage of GO-MTL is that it is robust to outlier tasks. Though a leaf model is not related to other leaf models, it is handled naturally by not sharing bases with other models, which enables the proposed method to handle cases where the local effect is dominant. In addition, GO-MTL provides the relationship between different tasks, and visualization of these relationships provides important insight into understanding the overall structure of the model.

GO-MTL assumes that latent tasks exist, and each task can be expressed by a linear combination of these latent tasks. The multiple target tasks are determined by the latent tasks; thus, GO-MTL can allow tasks in different groups to overlap, which is a unique property compared to other task clustering-based multi-task learning algorithms.

These assumptions can be modeled using the following formulation. Assume *T* tasks (terminal nodes) and dataset ${D}^{(t)}=\{({x}_{i}^{(t)},\hspace{0.17em}{y}_{i}^{(t)}):i=1,\hspace{0.17em}2,\hspace{0.17em}\dots ,\hspace{0.17em}{N}_{t}\}$ for each task *t* = 1, 2, …, *T*. Here, index *t* indicates the *t*-th task. GO-MTL estimates the variable-task weight matrix $W=({w}_{\text{jt}})$, which is a *d*×*T* matrix, through decomposition of *W* = *LS*, as shown in Figure 1. Here, *L* is a ‘variable to latent task’ matrix of size *d*×*k* containing the weight vectors of *k* latent tasks, and *S* is a ‘latent task to task’ matrix of size *k*×*T*. Each column of *S*, *s _{t}*, is a sparsity pattern of task

*t*that determines the number of latent basis tasks related to that task. If two tasks have the same sparsity pattern, these tasks are considered to be from the same group; otherwise, two tasks with sparsity patterns orthogonal to each other are considered to be from different groups.

This assumption allows GO-MTL to handle an outlier task that is irrelevant to other tasks. If one task has nothing to do with other tasks, the task has its own predictive model by not sharing bases with any other task. Then, the objective function of GO-MTL for the regression task is given as follows:

where the first term is the square loss function, ${\Vert \cdot \Vert}_{1}$ is the entry-wise ${\mathcal{l}}_{1}$ norm of the matrix, and ${\Vert \cdot \Vert}_{F}$ is the Frobenius norm of the matrix. Here, parameter *μ* controls the sparsity of *S*, and *λ* is a shrinkage parameter for regularization of the latent tasks’ weight vectors. GOMTL adopts an iterative two-stage optimization strategy to solve *L* and *S* alternatively. Note that all algorithms are implemented in MATLAB (https://github.com/ wOOL/GO_MTL).

## 3. EXPERIMENTS

### 3.1 Experiment with a Synthetic Dataset

To validate how the proposed method handles the problems identified in Introduction, a synthetic dataset was generated as follows, where some terminal nodes share the same models. In addition to the dependency of the models, one node was set to have a smaller number of observations than other nodes. This additive setting allowed us to evaluate the effect of the proposed method on a small sample size of a node.

The observations of response variable *y* were generated using the following regression function:

Herein, each explanatory variable was drawn from an independent uniform distribution ${x}_{1}~U(0,\hspace{0.17em}10),{x}_{2}~U\left(-5,5\right),\hspace{0.17em}{x}_{3}~U\left(0,5\right),\hspace{0.17em}\text{and}\hspace{0.17em}{x}_{4}~U\left(0,15\right)$. The error terms *ε* were generated from a normal distribution with zero mean and standard deviation 2. In summary, 200 observations with four explanatory variables and a single response variable were generated with partitions represented by the tree structure shown in Figure 2. There are four partitions with split variables *x*_{1}, *x*_{2}, and *x*_{4}, but there are three unique regression functions. The second and third terminal nodes in Figure 2 share the same regression function. Here, the split level of *x*_{1} is set to 3; thus, the third terminal node is expected to have a smaller number of training data than other nodes.

The proposed method has three parameters for multitask learning, i.e., the number of latent tasks *k*, and the regularization parameters *μ* and *λ* in Eq. (1). For the synthetic dataset, *k* and *μ* were set to 3 and 10, re-spectively. We compared the performance of the proposed method to that of the following existing methods.

Global model (Global): A global model is trained using all training data without considering a tree structure. The proposed method uses the Frobenius norm for regularization; thus, a ridge regression model was applied to this global model. The shrinkage parameters for the ridge regression were searched from [10^{−3}, 10^{−2}, 10^{−1}, 1, 10] using threefold cross-validation.

Model tree with single task learning (Local-STL): With this method, the input data space is partitioned in the same manner as the proposed method, and then the prediction model is constructed individually using the observations at each terminal node. Again, a ridge regression model was applied to these local models in the manner described previously.

Prediction performance was measured by the root mean square error (RMSE):

where *y _{i}* is the actual response value, ${\widehat{y}}_{i}$ is the predicted value of

*i*test data, and

^{th}*n*is the size of the test dataset. All results were calculated by repeating threefold cross-validation 10 times for the synthetic dataset. To test the significance of the results, we used the Wilcoxon signed-rank test with

*α*= 0.05.

Table 1 shows the experimental results obtained on the synthetic dataset, where the average RMSE value is reported with the standard deviation. Note that the RMSE of “small node” was calculated separately, where the number of observations was the smallest. The best performance is shown in bold, and the statistically significant value (*p*-value ≤ *α*/2) from the Wilcoxon signed-rank test is underlined.

As can be seen, the proposed method obtained significantly lower prediction error than the Global and Local-STL methods. In particularly, the prediction performance of the “small” node was superior compared to the existing methods. Figure 3 shows the constructed model tree and a heatmap of the absolute value of the coefficients for each latent task, which is denoted by *S*. As shown, nodes 2 and 3 share the same latent task. In summary, the proposed method demonstrates better performance when leaf models are related. In addition, even when the local effect is dominant, the proposed method as well as the locally trained model. This means that the proposed method can reflect the global effect without degrading the local effect, which is an important advantage.

### 3.2 Experiments with Real Datasets

The proposed method was also tested on seven re-al-world datasets from the UCI Machine Learning Re-pository (Dua and Graff, 2019) and the StatLib library (http://lib.stat.cmu.edu/datasets/). A summary of the characteristics of each dataset is given in Table 2. These datasets are described in brief as follows.

Auto-mobile: This dataset comprises various characteristics, risk factors, and the relative average loss payment per insured vehicle year of 159 automobiles. It has 15 numerical variables and 10 categorical variables. In this study, the logarithm of “price” was used as the target response variable.

Auto-MPG: This dataset contains information about city-cycle fuel consumption in miles per gallon. The response variable “mpg” was predicted in terms of five numerical variables and three categorical variables.

Boston Housing: This dataset contains median housing values on 506 census tracts around Boston. Here, 12 numerical variables and one categorical variable were used to predict the log-transformed median housing value.

Energy Efficiency: This dataset was obtained from an energy analysis of a building. Here, cooling load was predicted by eight variables regarding the building’s characteristics. The number of observations is 768.

Concrete Strength: With this dataset, we predicted the compressive strength of concrete samples. This dataset contains 1030 observations with eight variables, e.g., cement and blast furnace slag.

Abalone: This dataset contains 4177 observations with eight variables describing the physical measurement of abalone. In our experiments, the number of rings was the target response variable.

Parkinson: This dataset was collected from a telemonitoring device that records speech signals. Here, 5875 voice recordings with 16 biomedical voice measures were used to predict the clinician’s total UPDRS score.

The proposed method has three parameters for multitask learning, i.e., the number of latent tasks *k*, and regularization parameters *μ* and *λ* in Eq. (1). For simplicity, *λ* was set to equal the shrinkage parameter of the Local-STL method, which minimizes the test RMSE in tenfold cross-validation. In addition, various *k* and *μ* values were used (*k* = [2, 4, 8] and *μ* = [10^{−3}, 10^{−2}, 10^{−1}, 1]). The experimental results obtained with various parameter configurations on real datasets are summarized in Table 3. It appears that performance was not quite sensitive to the values of *k* and *μ*, except on the Auto-mobile and Concrete datasets.

With the best configuration for each dataset, the performance of the proposed method is compared against other baseline methods in Table 4, where the best perfor-mance is shown in bold, and the statistically significant value (*p*-value ≤ *α* /2) is underlined. As can be seen, the proposed method outperformed the global method on all datasets. The results also demonstrate that the proposed method outperformed the Local-STL method on all datasets (except for the Auto-mobile dataset). This is indirect evidence that there exists feature space heterogeneity in a dataset, and the proposed method can handle the heterogeneity problem effectively.

To validate the effect of the proposed method on “small nodes,” we calculated the RMSE of small nodes on all datasets (except the Auto-mobile dataset, which has only 159 observations). The results are summarized in Table 5. Here, a small node is defined as the terminal node with fewer than 100 observations. As can be seen, the proposed method obtained significantly lower RMSE of small nodes than the compared methods on the Boston Housing, Concrete, Abalone, and Parkinson datasets. Therefore, we have verified that the proposed method works well in nodes with few observations.

In addition, the Auto-MPG dataset was selected as an example to show how the proposed method detects task relations between terminal nodes. As shown in Figure 4, the proposed model selected *weight* and *acceleration *(*acc*) as the split variables. The estimated sparsity pattern in Figure 4(b) suggests that the proposed method detected the presence of task relations between terminal nodes. As can be seen, one latent task corresponding to the second row of the heatmap contributed to all three leaf models, and the other latent task contributed to the first and third leaf models. In this example, it appears that this shared model structure improved the performance of the proposed method.

Table 6 shows that the proposed method’s total performance improvement over the Local-STL method was greater than 14% on the Auto-MPG dataset. In addition, the RMSE on Node 2 with a small number of training data was less than that of the Local-STL method. It appears that Node 2 is quite different from other nodes.

## 4. CONCLUSION

In this study, we have proposed a learning method for a regression model tree by utilizing multi-task learning. The proposed method employs a hybrid GUIDE and SLRT method for tree growing and employs the GOMTL multi-task learning algorithm to estimate leaf models. The main advantages of the proposed method are that it can deal with the small sample size problem of a terminal node and ease the difficulty of reflecting the global effect across all terminal nodes. In addition, multi-task learning offers an additional advantage relative to the interpretation of the relationship between leaf models.

Experiments on a synthetic dataset and several realworld datasets were performed, and prediction error was compared to that of two baseline methods. The experimental results demonstrate that the proposed method improves prediction performance compared to the existing methods in terms of average RMSE on most datasets. Thus, we conclude that the proposed method effectively handles the small sample size problem and reflect the global effect. In particular, even on a dataset where the local effect is dominant, the proposed method ensured the prediction performance as much as the local models.

To the best of our knowledge, the proposed method is the first regression model tree learning method that utilizes multi-task learning; however, several limitations must be acknowledged. First, various multi-task learning methods were not considered in this study. Although GOMTL demonstrated good performance, other multi-task learning algorithms may be more suitable according to the tree structure. Second, the proposed method was limited to the regression problem; however, it can be extended to classification problems. There are several studies about logistic model trees that learn the global effect in an incremental manner (Landwehr *et al*., 2005; Lee and Jun, 2018). Multi-task learning methods for logistic regression, e.g., GO-MTL, can be applied to logistic model trees in the same manner as the proposed method.