1. INTRODUCTION
In recent years, the development of information technology has produced a flood of digitized documents of diverse kinds. Under these circumstances, the need for efficient document management and for filtering out unnecessary information has increased, so that various studies on automatic document classification are actively conducted (Ko, 2012; Nigam et al., 2000; Ogihara et al., 2013; Sokolova and Lapalme, 2009). Common document data can be decomposed into terms by morphological analysis, and a document can then be represented as a vector according to term-frequency information. Once each document is vectorized, various kinds of classifiers based on machine learning can be applied (Dumais et al., 1998; Fan et al., 2008; Jinshu et al., 2005; Liu et al., 2004; Lodhi et al., 2002; Miao et al., 2009; Mikawa et al., 2012; Mikawa and Goto, 2015; Suzuki et al., 2017; Tsuda et al., 2005; Uysal and Gunal, 2014; Yamamoto et al., 2017; Yang and Pedersen, 1997; Yong et al., 2009). For binary classification, which has many application examples, various methods such as the support vector machine (Bishop, 2010; Tong and Koller, 2001; Joachims, 1999) can be applied. Among binary classifiers, FFD (Fast Flux Discriminant) (Chen et al., 2014) is an effective technique for large-scale nonlinear classification with high performance. One advantage of this method is that it realizes the high precision of recent nonlinear models while maintaining the scalability of a linear classifier, the sparseness of features, and interpretability. One factor behind this high accuracy is the ability to construct a classification model that considers interactions between variables.
The learning of FFD is divided into a process that considers interactions among variables and a process that actually performs classification learning. Efficient learning is therefore possible by restricting the interactions considered to the necessary ones; that is, to construct a classification model with appropriate interactions, FFD must identify which interactions between variables matter. However, as the number of variables grows, the computational complexity of enumerating interactions increases rapidly. Especially for document classification problems using many words, it becomes difficult to apply FFD while taking the interactions between many variables into account, because the learning procedure of FFD cannot be realized computationally. To address this problem, we previously showed how to reduce the dimension of the data beforehand by using mutual information in FFD (Okubo et al., 2017). As a result, the dimensionality of the FFD model was reduced, and classification accuracy was improved by considering interactions.
In addition, FFD uses histogram-based kernel smoothing on subspaces containing combinations of variables, which makes it possible to take interactions between variables into account. However, the original FFD must create a subspace for every combination of variables, including pairs with little interaction. In a document classification problem with many words, the number of two-variable combinations needed to create these subspaces therefore grows quadratically with the dimension, and the amount of computation grows accordingly.
Therefore, in this paper, we propose a method for selecting appropriate subspaces containing variable combinations for classification learning with FFD. Specifically, a degree of similarity is first obtained by using the KL divergence (Cover and Thomas, 1991) between variables. The obtained similarity is then used to identify appropriate two-variable subspaces, so that only combinations of variables with a high possibility of interaction are used. As a result, the interactions between variables can be considered with a smaller number of subspaces, and we can expect a reduction in computational cost while maintaining classification accuracy, even for document classification problems using many words. To verify the effectiveness of the proposed method, we evaluate it from the viewpoint of computational complexity through simulation experiments using Japanese newspaper articles.
This paper is organized as follows. Section 2 describes the outline of FFD, its problems, and the dimension-reduction method proposed in our previous work. Section 3 describes the proposed method. Section 4 presents experimental results and discussion. Section 5 gives conclusions.
2. PRELIMINARIES
This paper focuses on FFD (Fast Flux Discriminant), proposed by Chen et al. (2014), a high-performance method for large-scale nonlinear classification. Since FFD has several advantages, it is meaningful to improve this model. In this section, we describe the outline of FFD (Chen et al., 2014) and, as related research, the dimension-reduction method proposed in our previous paper (Okubo et al., 2017).
2.1 Fast Flux Discriminant (FFD)
We assume that a dataset $\mathcal{D} = \{(x_i, y_i)\},\ i = 1, \cdots, N,\ x_i \in \mathbb{R}^D$ is given, where $\mathbb{R}^D$ is the $D$-dimensional Euclidean space and $y_i \in \{-1, +1\}$ is a label. The learning procedure of FFD is roughly divided into three steps. In the first step, the original data space is divided into subspaces based on combinations of variables, and kernel smoothing based on the histogram in each subspace is performed. First, we normalize the given learning data. The minimum value in the $d$-th dimension is defined as

$$\min{}_d = \min_{1 \le i \le N} x_{id},$$
and the maximum value is defined as

$$\max{}_d = \max_{1 \le i \le N} x_{id}.$$
The bin length $l_d$ is given by

$$l_d = \frac{\max{}_d - \min{}_d}{b_d},$$
where $b_d$ is the number of bins for the $d$-th dimension. The grid cell $B(x_{kd})$, normalized on the basis of the above quantities, is obtained as

$$B(x_{kd}) = \left\lfloor \frac{x_{kd} - \min{}_d}{l_d} \right\rfloor.$$
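As a minimal sketch of this normalization and bin assignment (the function and variable names are ours, not the paper's; the last bin is clipped so that the maximum value does not overflow):

```python
import numpy as np

def assign_bins(X, n_bins):
    mins = X.min(axis=0)                 # min_d over the training data
    maxs = X.max(axis=0)                 # max_d over the training data
    widths = (maxs - mins) / n_bins      # bin length l_d
    # Grid-cell index per dimension: floor of the normalized coordinate,
    # clipped so that the maximum value falls into the last bin.
    cells = np.floor((X - mins) / widths).astype(int)
    return np.clip(cells, 0, n_bins - 1)

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
cells = assign_bins(X, n_bins=2)
```

With two bins per dimension, the three samples fall into cells (0, 0), (1, 1), and (1, 1).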
After allocating each training datum to its corresponding grid cell, the probability $p(y \mid x)$ is estimated by counting the proportion of data with different labels in the grid cell $B(x, d)$ by a histogram-based density method (Feige and Mirrokni, 2007; Pele et al., 2013). Density estimation based on the histogram is an efficient way to handle arbitrary distributions without any assumptions about the statistical structure of the underlying data. However, if the number of training data is not sufficient, the histogram estimator suffers from large bias and variance. Furthermore, if we try to model $p(y \mid x)$ directly by density estimation over the whole space, the number of grid cells increases exponentially with the number of dimensions, leading to the curse of dimensionality. As a result, this simple histogram method is not practical for problems on sample spaces of relatively large dimension. To overcome this problem, let us consider expressing $p(y \mid x)$ by a combination of density estimates on low-dimensional subspaces. Here, each subspace is constructed to contain a small number (at most $r$) of features. Essentially, we assume that feature interactions are limited to $r\ (\ll D)$ features even though features can interact with one another. Since $r$ is sufficiently small, the density estimation on an $r$-dimensional subspace can still be computed quickly. Next, $M$ subspaces $A_m\ (1 \le m \le M)$ are generated based on the obtained grid cells $B_d$. In FFD, $r = 2$ is set, and subspaces are generated based on two patterns, i.e., a single variable or a combination of two variables. For this reason, $M = D(D + 1)/2$ subspaces are created. Then, the Gaussian kernel $K^{bin}$ between two grid cells in each $A_m$ is obtained by using the following equation:
where $B = (B_1, \cdots, B_{|A_m|})$ is a grid cell in the subspace specified by the dimensions in $A_m$. Also, $h_d$ is expressed as follows:
where $\sigma_d$ is the standard deviation of the training samples on the $d$-th variable. Next, the conditional probability $p(y \mid B)$ of $y$ given $B$ is obtained by using the obtained kernel $K^{bin}$ as follows:

$$p(y \mid B) = \frac{\sum_{B' \in G_{A_m}} K^{bin}(B, B')\, n_{B'}(y)}{\sum_{B' \in G_{A_m}} K^{bin}(B, B')\, N_{B'}},$$
where $n_{B'}(y)$ is the number of training samples with label $y$ in $B'$, and $N_{B'}$ is the total number of training samples in $B'$. Here, $G_{A_m}$ denotes the set of all grid cells in this subspace. Finally, a new feature $\phi_m(x)$ is obtained from the estimated probability $p(y \mid B)$ by using the following equation:
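The kernel-smoothed histogram estimate of $p(y \mid B)$ described above can be sketched as follows for a one-dimensional subspace. This is a hedged illustration: the Gaussian kernel over bin indices and the bandwidth `h` are simplifying assumptions of ours, not the paper's exact equations.

```python
import numpy as np

def smoothed_label_prob(cells, y, n_bins, h=1.0):
    # Raw histogram counts per grid cell: n_B(y = 1) and N_B.
    pos = np.zeros(n_bins)
    tot = np.zeros(n_bins)
    for c, label in zip(cells, y):
        tot[c] += 1
        if label == 1:
            pos[c] += 1
    # Smooth both counts across neighbouring cells with a Gaussian
    # kernel over bin indices, then take the ratio, as in p(y | B).
    probs = np.zeros(n_bins)
    for b in range(n_bins):
        k = np.exp(-((np.arange(n_bins) - b) ** 2) / h ** 2)
        probs[b] = (k * pos).sum() / (k * tot).sum()
    return probs

# Toy data: cell 0 holds only positive samples, cell 1 only negative.
cells = [0, 0, 1, 1]
y = [1, 1, -1, -1]
probs = smoothed_label_prob(cells, y, n_bins=2)
```

The smoothing pulls both estimates toward each other (here roughly 0.73 and 0.27 instead of 1 and 0), which is the bias-variance trade-off the kernel is meant to provide.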
In the second step, feature selection over the subspaces is performed based on submodular optimization. Specifically, the features obtained in the first step are selected as follows:
where $c_{i,j}$ is the Pearson correlation between $\phi_i$ and $\phi_j$, based on $A_i$ and $A_j$ respectively, and $a_i$ is the training accuracy of the subspace estimation for $G_{A_i}$, which can be computed efficiently by simply counting the samples incorrectly labeled over all grid cells of $G_{A_i}$. $(\alpha, \beta, \mu)$ are all nonnegative hyperparameters. Regarding the terms in equation (9): the first term is the cut function (Fujishige, 2005), which maximizes the similarity coverage of the set $S$ so that $S$ is a good representative of $U$. The second term minimizes pairwise correlation within $S$, reducing redundancy and mitigating problems caused by highly correlated features and collinearity. The third term maximizes the overall accuracy of the histogram estimation of the selected subspaces. The fourth term minimizes the cardinality of $S$, encouraging sparseness. Although maximization of (9) is an NP-hard problem, it is shown that a good approximation bound can be achieved.
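Although the exact objective in Eq. (9) is not reproduced here, the classical greedy strategy for maximizing such submodular set functions can be sketched as follows. The `score` callback is a hypothetical placeholder standing in for the combined coverage, correlation, accuracy, and cardinality terms:

```python
def greedy_select(n_subspaces, score, max_size):
    # Greedily add the subspace with the largest positive marginal gain;
    # for monotone submodular objectives this classical heuristic enjoys
    # a constant-factor approximation guarantee.
    S = []
    while len(S) < max_size:
        gains = [(score(S + [m]) - score(S), m)
                 for m in range(n_subspaces) if m not in S]
        best_gain, best_m = max(gains)
        if best_gain <= 0:
            break
        S.append(best_m)
    return S

# Toy score: per-subspace utility minus a cardinality penalty.
utility = [3.0, 1.0, 2.0]
selected = greedy_select(
    3, lambda S: sum(utility[m] for m in S) - 1.5 * len(S), max_size=3)
```

With the toy score above, subspaces 0 and 2 are worth their cardinality penalty and subspace 1 is not, so the greedy procedure stops after selecting those two.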
In the third step, a linear classifier is learned on the feature values $\phi_m(x)$ of the subspaces selected in the second step. The formulation of this learning procedure is given by
Using equation (10), FFD achieves stronger sparsity than conventional L1 regularization. To see this, assume that the gradient of the objective with respect to some positive $w_m$ is positive. When gradient descent is applied, $w_m$ tends to decrease, but due to the nonnegativity constraint it cannot become less than 0. Therefore, many components tend to become exactly 0, and this property can yield a much sparser solution than L1 regularization alone. The inner product $y'$ of the weight vector $w$ and the transformed data is taken as the predicted value of $y$; the accuracy can then be evaluated by comparing $y'$ with the actual $y$.
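The sparsity mechanism above can be demonstrated with a small projected-gradient sketch. The logistic loss, step sizes, and penalty weight are illustrative choices of ours; the paper's exact objective in Eq. (10) is not reproduced:

```python
import numpy as np

def nonneg_sparse_fit(Phi, y, n_iter=500, lr=0.1, lam=0.1):
    # Logistic loss plus an L1 penalty, with weights kept non-negative
    # by projection after every gradient step.
    w = np.full(Phi.shape[1], 0.5)
    for _ in range(n_iter):
        margins = y * (Phi @ w)
        sig = 1.0 / (1.0 + np.exp(margins))          # sigmoid(-margin)
        grad = -(Phi * (y * sig)[:, None]).mean(axis=0)
        w -= lr * (grad + lam)   # L1 slope is the constant lam for w >= 0
        w = np.maximum(w, 0.0)   # projection onto the constraint w >= 0
    return w

# Feature 0 separates the classes; feature 1 is constant (uninformative).
Phi = np.array([[1.0, 0.5], [1.0, 0.5], [0.0, 0.5], [0.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = nonneg_sparse_fit(Phi, y)
```

After training, the useless weight is driven to exactly 0 by the projection, while the informative weight stays positive, illustrating why the non-negative constraint sharpens sparsity.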
2.2 Related Research
One of the advantages of FFD is the ability to construct a classification model that considers the interactions between variables. However, constructing such a model naively requires considering the interactions between all pairs of variables. Therefore, when the dimension reaches thousands to tens of thousands, the amount of computation becomes prohibitive and FFD cannot be executed; in fact, calculations on the order of $D^2$ or more are necessary. Thus, for document classification problems with many words, the learning procedure of FFD considering the interactions between all variables cannot be achieved computationally. For this reason, in previous research on FFD (Okubo et al., 2017), we proposed a method of reducing the dimension of the data in advance. Specifically, words whose mutual information with the categories is larger than a certain threshold are selected and used for learning. By doing this, we aimed to improve classification accuracy by taking into account the important interactions between the variables contributing to classification, for document classification problems using many words. Let $x_d$ be a word (variable) and $c_k$ be a category. The mutual information $MI(x_d, C)$ (Tsuda et al., 2005) between the word $x_d$ and the categories is defined as

$$MI(x_d, C) = \sum_{k=1}^{K} P(x_d, c_k) \log \frac{P(x_d, c_k)}{P(x_d)\, P(c_k)},$$

where

$P({x}_{d},\hspace{0.33em}\hspace{0.33em}{c}_{k})$ : Percentage of documents that contain the word x_{d} in all documents and belong to the category c_{k}.

$P({x}_{d})$ : Percentage of documents containing the word x_{d} in all documents.

$P({c}_{k})$ : Percentage of documents belonging to the category c_{k} in all documents.
Here, we examine each probability required to obtain the mutual information. First, since binary classification is targeted, we can set $K = 2$, and $P(c_k)$ is 0.5. Next, $P(x_d)$ is estimated as the proportion, among all $N$ data, of documents in which the $d$-th variable takes a value greater than 0. Similarly, $P(x_d, c_k)$ is estimated as the proportion, among all $N$ data, of documents in which the $d$-th variable is greater than 0 and the value of $y$ is $+1$ or $-1$, respectively.
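These proportion-based estimates can be sketched directly in code. This is a naive illustration without smoothing (zero-probability terms are simply skipped), and the function name is ours:

```python
import numpy as np

def mutual_information(x, y):
    # Empirical MI between word presence (x_d > 0) and a binary label,
    # using the proportion-based estimates described above.
    present = x > 0
    p_x = present.mean()
    mi = 0.0
    for label in (1, -1):
        p_c = (y == label).mean()
        p_xc = (present & (y == label)).mean()
        if p_xc > 0 and p_x > 0 and p_c > 0:
            mi += p_xc * np.log(p_xc / (p_x * p_c))
    return mi

# A word appearing only in positive documents vs. one spread evenly
# across both classes.
y = np.array([1, 1, -1, -1])
mi_informative = mutual_information(np.array([3, 1, 0, 0]), y)
mi_uninformative = mutual_information(np.array([1, 0, 1, 0]), y)
```

The class-specific word scores 0.5 log 2 while the evenly spread word scores 0, which is exactly the behaviour the threshold-based word selection relies on.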
To verify the effectiveness of that method, we applied it to the classification of newspaper articles and evaluated it from the viewpoint of classification accuracy, among others. As a result, we found that the method can reduce the dimension while maintaining accuracy. In addition, the classification accuracy when interactions are considered is higher than when they are not. This result shows that a classifier with higher accuracy can be created by the method.
3. PROPOSED METHOD
As shown in Section 2.1, FFD takes account of the important interactions between variables contributing to classification by introducing histogram-based kernel smoothing on subspaces containing combinations of variables; in that case, combinations of two variables are used. However, when creating the subspaces, FFD must cover all variables, including combinations of variables with little interaction. The computational cost therefore grows rapidly as the dimensionality increases: with $D$ variables, on the order of $D(D-1)/2$ calculations are needed just to enumerate the two-variable combinations. Consequently, when considering the interactions between variables in a document classification problem with many words, the amount of calculation becomes excessive if there are many combinations of variables with little interaction.
In this paper, we propose a new method to reduce the computational complexity of the learning phase and make FFD applicable to document classification problems with many words. First, we calculate the similarity between two variables using the KL divergence (Goto and Kobayashi, 2014). Then, using the obtained similarity, the two-variable subspaces are screened, and subspaces are created only from the combinations of variables with a high possibility of interaction. By this procedure, we aim to reduce the computational cost while maintaining classification accuracy, even for document classification problems using many words, by making it possible to consider the important interactions between variables with a smaller number of subspaces.
3.1 Calculation of Similarity between Variables
The KL (Kullback–Leibler) divergence, a loss function representing a pseudo-distance between probability distributions, is used. The KL divergence is given by

$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}, \qquad (12)$$

where

P(i) : The probability of i selected by probability distribution P

Q(i) : The probability of i selected by probability distribution Q
However, since the KL divergence is only a pseudo-distance, it does not give a true metric between probability distributions. For example, in Eq. (12) there is no symmetry: the divergence from $P$ to $Q$ generally differs from that from $Q$ to $P$. Therefore, in this paper, we apply the JS (Jensen–Shannon) divergence, obtained by symmetrizing the KL divergence:

$$JS(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \frac{P + Q}{2}. \qquad (13)$$
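The asymmetry of the KL divergence and the symmetry of the JS divergence can be checked with a short sketch (function names are ours; zero-probability terms of the first argument are skipped):

```python
import numpy as np

def kl_divergence(p, q):
    # D(P || Q) = sum_i P(i) log(P(i) / Q(i)); assumes Q(i) > 0
    # wherever P(i) > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    # Symmetrized KL against the mixture M = (P + Q) / 2.
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5]
q = [0.9, 0.1]
```

Here `kl_divergence(p, q)` and `kl_divergence(q, p)` differ, while `js_divergence` gives the same value in both directions and vanishes for identical distributions.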
3.2 Application to Subspace Reduction in FFD
In this paper, the JS divergence described in the previous section is used as the similarity between variables and as an index of the strength of interaction between them. First, we interpret the probabilities $P(i)$ and $Q(i)$ in Eq. (12) as the probabilities of a certain variable being used in the $i$-th document. Letting $L_a$ be the number of times variable $a$ is used in all documents and $L_a(i)$ the number of times it is used in the $i$-th document, the usage probability $P_a(i)$ of variable $a$ in the $i$-th document is expressed as

$$P_a(i) = \frac{L_a(i)}{L_a}.$$
Next, the JS divergence $J_{a,b}$ of the distributions of variables $a$ and $b$ is obtained by rewriting Eq. (13) as follows, using the $P_a(i)$ obtained above:

$$J_{a,b} = \frac{1}{2} \sum_i P_a(i) \log \frac{P_a(i)}{M_{a,b}(i)} + \frac{1}{2} \sum_i P_b(i) \log \frac{P_b(i)}{M_{a,b}(i)}, \qquad M_{a,b}(i) = \frac{P_a(i) + P_b(i)}{2},$$

where

${P}_{a}(i)$ : Usage probability of variable a in i th document

${P}_{b}(i)$ : Usage probability of variable b in i th document
The JS divergence $J_{a,b}$ obtained in this way is used as the similarity between variables $a$ and $b$. In the same way, we use the JS divergence to find the similarity between every pair of variables. Then, the combinations of variables whose similarity is larger than a certain threshold are selected. If this threshold is too large, the dimension is excessively reduced and important interactions cannot be considered. For this reason, the following experiment was conducted after searching for the optimal value of this threshold.
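The whole selection step of Section 3.2 can be sketched end to end as follows. The word-count layout and the tiny threshold are illustrative assumptions; the experiments below use a much larger threshold tuned on real data:

```python
import numpy as np
from itertools import combinations

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_pairs(counts, threshold):
    # counts[i, a] = occurrences of word a in document i.
    # P_a(i) = L_a(i) / L_a: each column becomes a distribution over docs.
    P = counts / counts.sum(axis=0)
    kept = []
    for a, b in combinations(range(counts.shape[1]), 2):
        if js_divergence(P[:, a], P[:, b]) > threshold:
            kept.append((a, b))
    return kept

# Words 0 and 1 share the same usage pattern; word 2 differs.
counts = np.array([[4, 4, 0],
                   [0, 0, 4],
                   [4, 4, 4]], dtype=float)
pairs = select_pairs(counts, threshold=0.1)
```

Only the pairs involving word 2 survive the threshold, so only those two-variable subspaces would be built in the FFD learning phase.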
4. EXPERIMENT
4.1 Data used for Experiments
In this section, we show the results of numerical experiments using Yomiuri Shimbun newspaper data from 2015. Each document in this data carries a category label from eight categories: politics, economics, sports, society, culture, life, crime, and science. The details of the categories are shown in Table 1. Regarding the learning data, as shown in Table 1, there are nine sets of 150 items per category; these 9 data sets were used for learning, and the average of the respective results was reported.
4.2 Experimental Conditions
In this numerical evaluation, two evaluation indexes are used: classification accuracy and computational time. Here, classification accuracy is defined as the ratio of test data correctly classified by the classifier, and computational time is the sum of the learning time and the verification time on the test data. The conventional FFD (Chen et al., 2014) is used as the comparison method. In order to consider interactions in FFD, dimension reduction using mutual information is performed first, and the proposed method is executed after unifying the number of dimensions to 2000. Experiments are performed as binary classification of one category against the rest (Tsuda et al., 2005) using FFD. The threshold in this experiment is set to a similarity score of 12000.
4.3 Results and Discussion
First, the number of subspaces before and after variable selection is shown in Table 2 below.
The category number in Table 2 indicates the dataset of the binary classification problem in which that category is classified against the other categories. For example, category 1 denotes the problem of separating category 1 from the others. In Table 2, only the subspaces generated by combinations of two variables are shown. From Table 2, it can be seen that the proposed method reduces the number of subspaces compared with the conventional method. Thus, the combinations of variables can indeed be pruned based on the similarity used in the proposed method. The classification accuracy of each method is shown in Table 3.
From Table 3, it can be seen that the accuracy of the proposed method hardly changes compared with the comparative method. From this, it can be considered that the proposed method can eliminate unnecessary subspaces while maintaining accuracy. The likely reason is that, in the unselected subspaces, the variable pairs with small similarity have similar usage distributions; the possibility of interaction between such variables is low, so omitting them has little influence on classification learning. The computational time of each method is shown in Table 4. Furthermore, the breakdown of the computational time when executing the proposed method is shown in Table 5. In Table 5, “Selection” represents the time taken to execute the selection based on the proposed method, and “Original FFD” represents the remaining time, obtained by subtracting the selection time from the whole computational time.
From Tables 4 and 5, it can be seen that the computational time of the proposed method is reduced compared with the conventional method. Furthermore, the total time is reduced even when the time spent on subspace selection by the proposed method is included. The reason is considered to be that the time taken for the histogram estimation accounting for interactions could be reduced, because the number of subspaces to be considered was drastically cut by the proposed method. Together with the results in Table 3, this clarifies that the proposed method reduces the amount of calculation while maintaining classification accuracy. Because only the important combinations with a high possibility of interaction between variables contributing to classification are selected, a classification model that takes account of only the useful interactions can be constructed.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a learning method that takes account of the important interactions between variables in FFD with fewer subspaces. For that purpose, we showed a method of selecting combinations of variables with a high possibility of interaction, using a similarity based on the KL (JS) divergence. As a result, the computational complexity was reduced while maintaining classification accuracy. We also clarified that only the important interactions can be considered efficiently for document classification problems with many words, so that a practical approach for large-scale document classification problems was shown in this paper. The proposed method is considered to be effective for problems in which the dimension of the feature space becomes large, as in document classification.
A future task is to study the effectiveness of interaction modeling in this method by applying it to classification problems other than document data.