## 1. INTRODUCTION

‘Customer churn’ refers to the loss of clients or customers in a business context. Reducing customer churn is one of the most important issues in customer relationship management, for example in the telecommunications market. In today’s competitive market, the importance of customer retention has grown, since attracting new customers is much more difficult and costs five to six times more than retaining existing ones (Bhattacharya, 1998; Reinartz *et al*., 2004). Thus, companies seek to predict potential churners and target them with marketing.

The goal of churn prediction is to identify potentially churning customers, enabling proactive customer management and targeted marketing. The main task of churn prediction is to classify customers as either churners or non-churners according to the score of a constructed churn prediction model. With low-cost commodity storage and sophisticated data collection technologies, companies can automatically store massive operational data sets in their data repositories for further analysis and strategic decision making. Therefore, by utilizing huge amounts of customer data along with their historical churning behavior, a prediction model for identifying possible future churners can be built.

However, churn prediction often suffers from the curse of dimensionality, since the data normally have a large number of features; in such high-dimensional problems, the performance of the models can degrade considerably.

This study focuses on dealing with large data sets under realistic conditions. To overcome the aforementioned difficulties, the number of features is reduced to an appropriate size using various dimension reduction methods before building churn prediction models. By comparing churn prediction performance, we determine the most suitable dimension reduction methods for consistent and robust classification of the highly imbalanced, overlapping churn data.

The organization of this paper is as follows. Section 2 describes the churn data used in this study and introduces six linear and nonlinear dimension reduction methods which are applied to the data. Section 3 explains the experimental settings and the detailed process of the experiment, and experimental results are provided in Section 4. Lastly, Section 5 provides the conclusions of the paper.

## 2. BACKGROUND

### 2.1. Churn Data

The churn prediction data sets, collected during six months of 2001 by a major telecommunications company, consist of approximately 100,000 samples with 171 variables, which include the following information: first, demographics such as age, location, and the number and ages of children; second, financial information such as credit score and credit card ownership; third, product details such as handset price and handset capabilities; and lastly, phone usage information such as total minutes of use, total number of calls, and average monthly minutes of use over the previous six months. The data were made available to academic researchers by the Teradata Center for CRM at Duke University (Kim, 2006).

The description of the Teradata set defines a churner as a customer who left the company within the 60 days following sampling. The training data oversample churn records to 50%, even though the actual churn rate is approximately 1.8%, because previous attempts to train classifiers with the original data were not successful. The churn rate of the test data, however, remains approximately 1.8%, which makes the churn prediction problem as realistic and challenging as possible.

Figure 1 clearly illustrates the difficulty of predicting churn. It shows a plot of the first two principal components of the training data computed by Principal Component Analysis (PCA). As the figure shows, records along these two features overlap heavily. Because churners and non-churners are hard to separate using the selected features, sophisticated approaches are required to solve the churn prediction problem.

### 2.2. Dimension Reduction Methods

#### 2.2.1. Principal Component Analysis

PCA is a linear technique for dimensionality reduction. It performs dimensionality reduction by embedding the data into a linear subspace of lower dimensionality. This is done by finding a linear basis of reduced dimensionality for the data, along which the amount of variance in the data is maximal.

PCA attempts to find a linear mapping M that maximizes the cost function trace(M^{T}cov(X)M), where cov(X) is the sample covariance matrix of the zero-mean data X. This linear mapping is formed by the d principal eigenvectors of cov(X); therefore, PCA solves the eigenproblem

cov(X)M = λM.

The reduced low-dimensional representation of the data points is computed by mapping them onto the linear basis M, i.e., Y = XM.
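As a minimal sketch (an illustration, not the implementation used in this study), the eigenproblem and the mapping Y = XM can be written in a few lines of NumPy; the data here is synthetic:

```python
import numpy as np

def pca(X, d):
    """Project X (n x D) onto the d principal eigenvectors of cov(X)."""
    Xc = X - X.mean(axis=0)               # zero-mean the data
    C = np.cov(Xc, rowvar=False)          # D x D sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    M = eigvecs[:, ::-1][:, :d]           # d principal eigenvectors
    return Xc @ M, M                      # Y = XM and the mapping M

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y, M = pca(X, 3)
```

Because the mapping M is explicit, out-of-sample extension amounts to centering new data with the training mean and multiplying by M.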

#### 2.2.2. Factor Analysis

Factor analysis (FA) is a linear technique for dimensionality reduction that posits a joint distribution on the data and lower-dimensional latent variables. These latent variables, so-called factors, are assumed to generate each data point. Thus, the factors account for the observed correlation structure of the given data.

With a D-dimensional data vector X and a d-dimensional vector of real-valued factors Z, the factor analysis model can be defined as X = μ + ΛZ + ε, where the columns of the factor loading matrix Λ represent the correlations between the observed variables and the factors. The d factors play the role of the d features of the dimension-reduced data, like the principal components in PCA. The maximum likelihood estimates of μ, Λ, and the covariance Ψ of ε can be found iteratively with the EM algorithm (Ghahramani and Hinton, 1996).
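For illustration, scikit-learn's `FactorAnalysis` fits the loadings by maximum likelihood and returns the factor scores as the reduced features. The data below is synthetic and generated from the model itself (with μ = 0); the sizes are arbitrary:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))                   # latent factors
Lam = rng.normal(size=(3, 10))                  # factor loadings
X = Z @ Lam + 0.1 * rng.normal(size=(500, 10))  # X = mu + Lambda Z + eps

fa = FactorAnalysis(n_components=3)
Y = fa.fit_transform(X)  # 3-dimensional factor scores as reduced features
```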

#### 2.2.3. Locally linear embedding

Locally linear embedding (LLE) is a nonlinear technique for dimensionality reduction that attempts to preserve the local relationships of a given data set. Moreover, the consideration of local properties allows us to find embedding manifolds with non-convex structure. LLE describes the local relation around a D-dimensional data point on some manifold as a linear combination of its k nearest neighbors weighted by *w _{ij}*. It calculates the weights *w _{ij}*, which represent the contribution of the j-th nearest point to the reconstruction of the i-th data point, by minimizing the reconstruction errors

ε(W) = Σ_{i} ||x_{i} − Σ_{j} w_{ij}x_{j}||^{2}.

LLE then tries to find low-dimensional coordinates which preserve the local relationships of the high-dimensional space as much as possible. It finds the low-dimensional data representation Y that minimizes the following cost function,

φ(Y) = Σ_{i} ||y_{i} − Σ_{j} w_{ij}y_{j}||^{2},

where *w _{ij}* is fixed. This cost function can be rephrased as the quadratic form φ(Y) = trace(Y^{T}MY), where the matrix M = (I − W)^{T}(I − W). The resulting constrained optimization problem is equivalent to finding the d+1 smallest eigenvectors of the matrix M.
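In practice these steps need not be coded by hand; as an illustrative sketch with arbitrary data and an assumed neighborhood size, scikit-learn's implementation computes the reconstruction weights and solves the eigenproblem on M internally:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # stand-in for high-dimensional data

# k = 12 neighbors, target dimension d = 2; the weights w_ij and the
# bottom eigenvectors of M = (I - W)^T (I - W) are found internally.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)
```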

#### 2.2.4. Local tangent space alignment

Local tangent space alignment (LTSA) is, like LLE, a nonlinear dimensionality reduction technique. It utilizes information about the local tangent space at each data point. LTSA is based on the assumption that there exist a linear mapping from a high-dimensional data point to its local tangent space and a linear mapping from the corresponding low-dimensional data point to the same local tangent space; global low-dimensional coordinates are then obtained by aligning these local mappings.

LTSA first computes bases for the local tangent space at each data point x_{i} in order to calculate the low-dimensional coordinates. Specifically, it computes d principal components by computing the d largest eigenvectors of the correlation matrix (X_{i} − x̄_{i}e^{T})^{T}(X_{i} − x̄_{i}e^{T}), where *X _{i}* is the set of k nearest neighbors of x_{i}, x̄_{i} is their mean, and e is a vector of ones. The local tangent space *Θ _{i}* consists of the k PCA scores of dimension d corresponding to *X _{i}*. Then, using a linear mapping *L _{i}* from the local tangent space to the low-dimensional representation *y _{i}*, LTSA minimizes a cost function of the form

Σ_{i} ||Y_{i}J_{k} − L_{i}Θ_{i}||^{2},

where *J _{k}* is the centering matrix. This cost function can also be rewritten as the quadratic form φ(Y) = trace(Y^{T}MY), where M is the alignment matrix. After M is computed, the problem is equivalent to calculating the d+1 smallest eigenvectors of M.
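As with LLE, scikit-learn exposes LTSA as a variant of the same estimator; a small sketch with arbitrary data and parameters:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # stand-in for high-dimensional data

# method='ltsa' builds the local tangent space from k = 12 neighbors at
# each point and aligns them; the smallest eigenvectors of the alignment
# matrix M are computed internally.
ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='ltsa')
Y = ltsa.fit_transform(X)
```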

#### 2.2.5. Locality preserving projections

Locality preserving projections (LPP) builds a graph from the neighborhood information of the data set. Using the Laplacian of this graph, it computes a transformation matrix which maps the data points to a subspace. This linear transformation optimally preserves local neighborhood information in a certain sense. The mapping matrix generated by the algorithm may be viewed as a linear discrete approximation to a continuous map that naturally arises from the geometry of the manifold.
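A compact sketch of this construction (the heat-kernel weights and neighborhood size are assumptions, and this is an illustration rather than the implementation used in the study): build a kNN graph, form the Laplacian L = D − W, and solve the generalized eigenproblem X^{T}LXa = λX^{T}DXa for its smallest eigenvectors.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def lpp(X, d, k=10):
    """Locality preserving projections: a linear map A minimizing
    sum_ij w_ij ||A^T x_i - A^T x_j||^2 over a kNN graph."""
    G = kneighbors_graph(X, k, mode='distance').toarray()
    t = np.mean(G[G > 0] ** 2)                    # heat-kernel width
    W = np.where(G > 0, np.exp(-G ** 2 / t), 0.0)
    W = np.maximum(W, W.T)                        # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacian
    # the smallest generalized eigenvectors give the projection directions
    _, vecs = eigh(X.T @ L @ X, X.T @ D @ X)
    A = vecs[:, :d]
    return X @ A, A

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y, A = lpp(X, 2)
```

As with PCA, the explicit matrix A makes parametric out-of-sample extension straightforward.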

#### 2.2.6. Deep auto-encoder

A deep auto-encoder is an unsupervised neural network used for dimensionality reduction and feature extraction. More precisely, auto-encoders are feed-forward neural networks with an odd number of hidden layers and shared weights between the top and bottom layers, trained to predict the input itself, as described in Figure 2. The middle hidden layer has d nodes, the reduced dimension, while the input and output layers have D nodes, the original dimension.

The network is trained to minimize the mean squared error between its inputs and outputs. The middle hidden layer of a trained deep auto-encoder thus gives d-dimensional representations of the input data that preserve as much structure in the input data as possible. The low-dimensional representation of each data point is obtained by feeding it to the input layer and extracting the node values of the middle hidden layer.
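One way to sketch this idea (an illustration only; all layer sizes here are arbitrary, and unlike the linear middle layer described above, `MLPRegressor` applies the sigmoid to every hidden layer, including the middle one) is to train a multilayer perceptron to reproduce its input and then forward-propagate to the middle layer:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.random((500, 20))  # inputs scaled to [0, 1]

# D -> 10 -> 2 -> 10 -> D network trained to reconstruct its input
# by minimizing the mean squared error.
ae = MLPRegressor(hidden_layer_sizes=(10, 2, 10), activation='logistic',
                  max_iter=500, random_state=0)
ae.fit(X, X)  # target = input

# forward-propagate through the first two layers to read the 2-d codes
H = X
for W, b in zip(ae.coefs_[:2], ae.intercepts_[:2]):
    H = logistic(H @ W + b)
```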

## 3. EXPERIMENT SETTINGS

### 3.1. Data Preprocessing

We preprocessed the original data for further analysis, following the process in Lee *et al*. (2011). First, we removed continuous variables with more than 20% missing values. Since the data sets contain a substantial number of input variables, many of which are highly correlated, little valuable information is lost. Second, categorical variables with high missing rates were also eliminated, because each categorical variable generally has only slight predictive power (Rossi *et al*., 1996); thus we kept only 11 categorical variables among the remaining features. Finally, we removed observations with missing values, given the abundance of records in the data set.

After the preprocessing steps, we obtained 123 features consisting of 11 categorical variables and 112 continuous variables. The training data set has 67,181 observations with 32,862 churns with the churn rate of approximately 49%. The test set has 34,986 observations with 619 churns with the churn rate of approximately 1.8%.

### 3.2. Dimension Reduction and Out-of-Sample Extension

In this experiment, we focus on reducing the dimensionality of the data using the various dimension reduction techniques. Among the techniques introduced above, PCA, FA, and LPP are linear methods which perform dimensionality reduction by embedding the data into a linear subspace of lower dimensionality. Besides the mapped data of reduced dimension, they therefore provide linear mapping functions (or mapping matrices) which can be used for out-of-sample extension. The deep auto-encoder is a nonlinear dimension reduction method, but it likewise learns a (nonlinear) mapping between the high-dimensional data and the low-dimensional representation; sigmoid activation functions are generally used except in the middle layer, where a linear activation function is usually employed. While these methods generate mapping functions, the other two nonlinear dimension reduction methods (LLE and LTSA) do not provide mappings, so a parametric out-of-sample extension is not available for them. To solve this problem, we assume that the training data and the test data, generated from the same distribution with similar sample density, have similarly connected neighbor graphs of similar density. Under this assumption, we can apply those methods separately to the training data and the test data. Without the assumption, we may use the nonparametric out-of-sample extension which estimates the transformation from the high-dimensional to the low-dimensional space using the Nyström approximation (Bengio *et al*., 2004), or a sequential manifold learning method (Kim and Lee, 2012).

### 3.3. Experiment Process

With this background, we build churn classifiers as follows. First, we preprocess the data into an appropriate form for each dimension reduction method, for example by scaling each attribute to [0, 1], and apply each dimension reduction method to obtain low-dimensional data with target dimension d. Second, we embed the test data into the lower-dimensional space: for PCA, FA, LPP, and the deep auto-encoder, we use the mapping generated in the first step; otherwise, we use a nonparametric out-of-sample extension method or simply apply the same dimension reduction method to the test data. Then we train classification models, for example a support vector machine (SVM) and logistic regression, using the low-dimensional features as input, and obtain predicted churn scores on the test data set with the trained models. Finally, we measure and compare the performance of the models utilizing the different dimension reduction techniques.

The overall flow of the process is shown in Figure 3.
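Using PCA and logistic regression as stand-ins, the process can be sketched as follows (the data here is synthetic and the parameter choices are illustrative, not those of the study, though the dimensions 123 and 30 match the preprocessed data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 123)); y_train = rng.integers(0, 2, 1000)
X_test = rng.normal(size=(500, 123));   y_test = rng.integers(0, 2, 500)

# 1) scale attributes to [0, 1] and reduce to d = 30 dimensions
scaler = MinMaxScaler().fit(X_train)
pca = PCA(n_components=30).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))

# 2) out-of-sample extension: reuse the fitted mapping on the test data
Z_test = pca.transform(scaler.transform(X_test))

# 3) train a classifier on the reduced features and score the test set
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
churn_scores = clf.predict_proba(Z_test)[:, 1]
```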

### 3.4. Evaluation Metric

To measure model performance, we used an evaluation metric called the ‘hit rate’. The hit rate is a well-known measure in the marketing field that evaluates the predictive power of models numerically (Rosset *et al*., 2001). It is calculated as follows,

hit rate = (1/n) Σ_{i=1}^{n} H_{i},

where *H _{i}* is 1 if the prediction for the i-th sample is correct and 0 otherwise, and n represents the number of samples considered. Namely, the hit rate measures the ratio of correctly predicted churners to churn candidates. In this paper, a hit rate of x% is the hit rate when only the top x% of customers, ranked by estimated churn probability or score, are considered for evaluation. For example, with 10,000 observations, the hit rate at a target point of 10% is the percentage of correctly predicted churners among the 1,000 customers most likely to churn. From the viewpoint of marketing managers, who face budget and time constraints, customers who are highly likely to churn should be the main targets of marketing; it is therefore important to consider the hit rate at target points.
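A minimal sketch of the hit rate at a target point x% (function name and data are illustrative):

```python
import numpy as np

def hit_rate_at(scores, labels, x):
    """Hit rate at target point x%: fraction of actual churners among
    the top x% of customers ranked by predicted churn score."""
    n_top = int(len(scores) * x / 100)
    top = np.argsort(scores)[::-1][:n_top]  # highest scores first
    return labels[top].mean()               # average of H_i over the top

scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
labels = np.array([1,   0,   1,   0,   0  ])
print(hit_rate_at(scores, labels, 40))  # top 2 contain 1 churner -> 0.5
```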

## 4. EXPERIMENTAL RESULT

When we reduce the dimensionality of the data, we must choose the lower dimension d. The higher d is, the better the performance of the classification models (see Section 4.3). However, there is a trade-off, because the running time of model building grows when we use higher-dimensional input data. To choose an appropriate dimensionality, we use Kaiser’s criterion (Kaiser, 1960), which keeps the principal components (PCs) whose corresponding eigenvalues are greater than or equal to 1.0. Figure 4 shows that the number of PCs chosen this way is 27, so 27 would be a proper target dimensionality for PCA. Since our purpose is to compare dimension reduction methods for churn prediction, we use the same lower dimension for every method, and choose 30 as the reduced dimensionality for uniformity.
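On standardized data, Kaiser's criterion amounts to counting the eigenvalues of the correlation matrix that are at least 1.0; a sketch on synthetic correlated data (the study's data yields 27):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))  # correlated features

# PCA on standardized data: explained_variance_ approximates the
# eigenvalues of the correlation matrix.
Xs = StandardScaler().fit_transform(X)
eigvals = PCA().fit(Xs).explained_variance_
d = int((eigvals >= 1.0).sum())  # Kaiser's criterion: keep eigenvalues >= 1
```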

There are several other parameters for each dimension reduction method: for example, the number of nearest neighbors k for local-structure-based algorithms such as LLE, LTSA, and LPP, and the number of layers and nodes for the deep auto-encoder.

Parameters of the classification methods should also be set, for example the kernel function and corresponding kernel parameters for SVM. In this experiment, we use the radial basis function *k(x_{i}, x_{j})* = exp(−v||x_{i} − x_{j}||^{2}), the most popular kernel function for SVM, with the kernel parameter v set to the reciprocal of the number of features and the regularization constant C set to 1 (Chang and Lin, 2011; Hsu *et al*., 2003).

### 4.1. Linear Dimension Reduction Methods

First, we compare the linear dimension reduction methods: PCA, FA, and LPP. The linear techniques have a number of strengths, the most important being their runtime complexity and the availability of out-of-sample extension. PCA and FA are the fastest of the dimension reduction methods considered: even though we deal with almost 70,000 records, they produce results in just a few seconds. LPP, however, takes quite a long time and a large amount of memory, because it looks for the k nearest neighbors of each point and builds a connected graph.

We compare the SVM and logistic regression models with a random prediction model. The random prediction model generates a uniformly random number between 0 and 1 for each test record and, using these random numbers as probabilities, predicts churn if the probability is greater than 0.5. The classification models give higher hit rates than the random prediction model, so we can say that the models are well trained.

### 4.2. Nonlinear Dimension Reduction Methods

Next, we will compare the nonlinear dimension reduction methods, LLE, LTSA and deep auto-encoder.

In this experiment, we assume that the training data and the test data lie on manifolds of similar density. The churn prediction data we used has training and test sets of comparable size, and notwithstanding their different churn rates, both share the characteristic that churn and non-churn records overlap heavily. Thus, our assumption is reasonable.

Under this assumption, when LLE or LTSA builds the graph by connecting the neighbors of each point and generates a local hyperplane for each point, we may suppose that points of the training and test data at the same position will have similar local hyperplanes. Therefore, we can apply LLE or LTSA separately to the training data and the test data. We may also use the nonparametric out-of-sample extension called the Nyström approximation (Bengio *et al*., 2004). In Figure 7, we compare these two out-of-sample extension approaches for LLE with the logistic regression model.

In the figure, ‘SP’ represents the former method, which applies LLE separately to the training data and the test data, whereas ‘N’ represents the latter, which exploits the Nyström approximation. The results for ‘LLE 2 SP’ are zeros. The results support applying LLE or LTSA to the training data and the test data separately when using highly overlapping data such as the churn prediction data.

Figures 8 and 9 show that, when the dimensionality is reduced with the deep auto-encoder, both classification models (SVM and logistic regression) produce the best performance, while LLE and LTSA perform only slightly better than the random prediction model. Thus, we can say that LLE and LTSA are unsuitable for the churn prediction data, not only because of their relatively poor performance but also because of their excessive demands for computation and memory.

### 4.3. Effect of the Number of Dimensions

As mentioned above, we can expect that classification models exploiting more principal components or features would, to some extent, give better performance, since more features carry more information. We compare different lower dimensions for two important dimension reduction methods, PCA and deep auto-encoder, with logistic regression. We choose 2 and 10 as the other dimensions for the following reasons: 1) a dimension of 2 makes the model very simple to train and test, and therefore makes the whole process much faster; 2) when we estimate the intrinsic dimension of the data using MLE (Levina and Bickel, 2004), the result indicates dimension 8, which we round up to 10.
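The Levina-Bickel estimator averages, over all points, the inverse mean log-ratio of nearest-neighbor distances; a sketch with synthetic data and an assumed neighborhood size k = 10:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dim_mle(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension:
    m_hat(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log(T_k / T_j) ]^{-1},
    where T_j is the distance from x to its j-th nearest neighbor,
    averaged over all data points."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dist[:, 1:]                           # drop the self-distance
    logratio = np.log(T[:, -1:] / T[:, :-1])  # log(T_k / T_j), j = 1..k-1
    return float((1.0 / logratio.mean(axis=1)).mean())

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 3))    # 3-dimensional source
X = Z @ rng.normal(size=(3, 20))  # linearly embedded in 20 dimensions
est = intrinsic_dim_mle(X)        # close to 3
```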

For both dimension reduction methods, the hit rate improves as the reduced dimension grows (see Figures 10 and 11). The 2-dimensional representation from PCA performs worse than the random prediction model, which shows that we cannot train a good classification model with 2 PCs. In contrast, the 2-dimensional representation from the deep auto-encoder works nearly as well as the higher-dimensional representations; moreover, it even outperforms the 30-dimensional representations of LLE and LTSA.

### 4.4. Comparison of All Dimension Reduction Methods

Finally, in Figures 12 and 13, we compare all the dimension reduction methods except LLE and LTSA, which gave the worst results. The results indicate that LPP performs best with logistic regression, while PCA performs best with SVM. Conversely, PCA gives the worst performance of the four dimension reduction methods with logistic regression, while LPP performs much worse with SVM. Overall, the deep auto-encoder provides the most consistent performance.

Specifically, deep auto-encoder’s 30% hit rate is the only result that is greater than 0.4 for both types of models.

## 5. CONCLUSION AND DISCUSSION

In this paper, we compared six dimension reduction methods, PCA, FA, LLE, LTSA, LPP, and deep auto-encoder, for classification of the churn prediction data. When we model the prediction of churn with SVM, PCA provides the best hit rate among the techniques; since the churn prediction data is highly correlated, we can expect PCA, which removes the correlation among features, to perform strongly. PCA also has an important advantage in time complexity over the other dimension reduction methods. Meanwhile, the deep auto-encoder shows the overall best performance (see Section 4.4). Because the churn prediction data we used is real-world data, nonlinear dimension reduction methods such as the deep auto-encoder may provide more robust performance across various classification models in many real-world applications. Note that PCA, a linear dimension reduction method, does not provide the best result for logistic regression.

Besides the performance issue, out-of-sample extension should be considered when applying dimension reduction in classification. Dimension reduction methods should be applied to test data before the prediction. PCA, FA, LPP, and deep auto-encoder generate the mapping which enables natural, parametric out-of-sample extension. However, the other two nonlinear dimension reduction methods (LLE and LTSA) do not generate the mapping. These methods only provide the mapped data, thus approximation is required. In this paper, we explained that two kinds of approximation methods can be used because of the highly overlapping characteristic of the data.