ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.18 No.2 pp.245-251
DOI : https://doi.org/10.7232/iems.2019.18.2.245

# Binary Document Classification Based on Fast Flux Discriminant with Similarity Measure on Word Set

Keisuke Okubo, Gendo Kumoi*, Masayuki Goto
Graduate School of Creative Science and Engineering, Waseda University, Tokyo, Japan
School of Creative Science and Engineering, Waseda University, Tokyo, Japan
Corresponding Author, E-mail: moto-aries@ruri.waseda.jp
Received June 28, 2018; Revised May 7, 2019; Accepted May 22, 2019

## ABSTRACT

Fast Flux Discriminant (FFD) is known as a high-performance nonlinear binary classifier that can construct a classification model considering the interactions between variables. To take these interactions into account, FFD introduces histogram-based kernel smoothing over subspaces built from combinations of variables. However, when creating the subspaces, the original FFD must cover all variables, including combinations of variables with low interaction. The resulting disadvantage is that the amount of calculation increases exponentially as the dimension increases. In this study, we calculate the similarity between variables by using the KL divergence and then select, among all candidate subspaces, only those whose pair of variables has a high similarity. Through this method, we try to reduce the amount of calculation while maintaining classification accuracy by using only combinations of variables that are likely to have high interaction. Simulation experiments with Japanese newspaper articles clarify the effectiveness of the proposed method.

## 1. INTRODUCTION

In recent years, enormous numbers of digitized documents of diverse varieties have accumulated due to the development of information technology. Under this situation, the need for efficient document management and for filtering out unnecessary information has increased, so various studies on automatic document classification are actively conducted (Ko, 2012; Nigam et al., 2000; Ogihara et al., 2013; Sokolova and Lapalme, 2009). Common document data can be decomposed into terms by morphological analysis, and a document can be represented by a vector according to term frequency information. Once each document is vectorized, various kinds of classifiers based on machine learning can be applied (Dumais et al., 1998; Fan et al., 2008; Jianhui et al., 2005; Jinshu et al., 2006; Liu et al., 2004; Lodhi et al., 2002; Miao et al., 2009; Mikawa et al., 2012; Mikawa and Goto, 2015; Suzuki et al., 2017; Tsuda et al., 2005; Uysal and Gunal, 2014; Yamamoto et al., 2017; Yang and Pedersen, 1997; Yong et al., 2009). For binary classification, which has many application examples, various methods such as the support vector machine (Bishop, 2010; Tong and Koller, 2001; Joachims, 1999) can be applied. Among binary classifiers, the Fast Flux Discriminant (FFD) (Chen et al., 2014) is an effective technique for large-scale nonlinear classification with high performance. One advantage of this method is that it realizes the high precision of state-of-the-art nonlinear models while maintaining the scalability of a linear classifier, the sparseness of features, and interpretability. One of the factors behind this high accuracy is the ability to construct a classification model that considers interactions between variables.

The learning of FFD is divided into a process that considers interactions among variables and a process that actually performs classification learning. Efficient learning is thus possible by restricting the considered interactions between variables to the necessary ones. That is, to construct a classification model with appropriate interactions, FFD needs to identify which interactions between variables are necessary. As the number of variables grows, however, the computational complexity increases exponentially. Especially for document classification problems using many words, it becomes difficult to apply FFD while taking account of the interactions between many variables, because the learning procedure cannot be realized computationally. To solve this problem, we previously showed how to reduce the dimension of the data beforehand by using mutual information in FFD (Okubo et al., 2017). As a result, the dimensions of the FFD model were reduced, and classification accuracy was improved by considering interactions.

In addition, FFD performs histogram-based kernel smoothing using subspaces containing combinations of variables, which makes it possible to take account of interactions between variables. In the original FFD, however, it is necessary to create a subspace for every combination of variables, including combinations with little interaction. Therefore, in document classification problems with many words, the number of variable combinations needed to create the subspaces, and hence the amount of calculation, increases exponentially with the dimension.

Therefore, in this paper, we propose a method to select appropriate subspaces containing variable combinations for the classification learning of FFD. Specifically, a degree of similarity is first obtained by using the KL divergence (Cover and Thomas, 1991) between variables. Then, the obtained similarity is used to identify appropriate two-variable subspaces so that only combinations of variables with a high possibility of interaction are used. As a result, the interactions between variables can be considered with a smaller number of subspaces, so we can expect a reduction of the calculation cost while maintaining classification accuracy, even for document classification problems using many words. To verify the effectiveness of the proposed method, we evaluate it from the viewpoint of computational complexity through simulation experiments using Japanese newspaper articles.

This paper is organized as follows. Section 2 describes the outline and problems of FFD and the dimension reduction method proposed previously. Section 3 describes the proposed method. Section 4 presents experimental results and discussion. Section 5 gives conclusions.

## 2. PRELIMINARIES

This paper focuses on the Fast Flux Discriminant (FFD) proposed by Chen et al. (2014), a large-scale nonlinear classifier with high performance. Since FFD has several advantages, it is meaningful to improve this model. This section describes the outline of FFD (Chen et al., 2014) and the dimension reduction method proposed in our previous paper (Okubo et al., 2017) as related research.

### 2.1 Fast Flux Discriminant (FFD)

We assume that a dataset $\mathcal{D} = \{(x_i, y_i)\}, i = 1, \cdots, N, x_i \in \mathbb{R}^D$ is given, where $\mathbb{R}^D$ is the $D$-dimensional Euclidean space and $y_i \in \{-1, 1\}$ is a label. The learning procedure of FFD is roughly divided into three steps. In the first step, the original data space is divided into subspaces based on combinations of variables, and kernel smoothing based on a histogram in each subspace is performed. First, we normalize the given learning data. The minimum value in the $d$-th dimension is defined as

$\mathrm{start}_d = \min_{k=1,\cdots,N} x_{kd}$
(1)

and the maximum value is defined as

$\mathrm{end}_d = \max_{k=1,\cdots,N} x_{kd}$
(2)

The bin length $l_d$ is given by:

$l_d = \frac{\mathrm{end}_d - \mathrm{start}_d}{b_d}$
(3)

where $b_d$ is the number of bins for the $d$-th dimension. The grid cell $B(x, d)$ for the normalized variables is obtained as follows:

$B(x, d) = \left\lceil \frac{x_d - \mathrm{start}_d}{l_d} \right\rceil$
(4)
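The normalization and binning of Eqs. (1)-(4) can be sketched as follows. This is a minimal illustration with our own function and variable names; we use floor with 0-based cell indices and add a guard for constant dimensions:

```python
import numpy as np

def grid_cells(X, n_bins):
    """Assign each sample to a per-dimension grid cell, following Eqs. (1)-(4)."""
    start = X.min(axis=0)               # start_d, Eq. (1)
    end = X.max(axis=0)                 # end_d, Eq. (2)
    l = (end - start) / n_bins          # bin length l_d, Eq. (3)
    l[l == 0] = 1.0                     # guard: constant dimensions get one bin
    B = np.floor((X - start) / l).astype(int)   # grid cell B(x, d), cf. Eq. (4)
    return np.clip(B, 0, n_bins - 1)    # the maximum value falls into the last bin

X = np.array([[0.0, 1.0], [0.5, 3.0], [1.0, 5.0]])
print(grid_cells(X, n_bins=2))
```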

After allocating each training datum to the corresponding grid cell, the probability $p(y|x)$ is estimated by counting the proportion of data with different labels in the grid cell $B(x, d)$ by a histogram-based density method (Feige and Mirrokni, 2007; Pele et al., 2013). Density estimation based on the histogram is an efficient way to handle arbitrary distributions without any assumptions on the statistical structure of the underlying data. However, if the number of training data is not sufficient, the histogram estimator suffers from large bias and variance. Furthermore, if we try to model $p(y|x)$ directly by density estimation over the whole space, the number of grid cells increases exponentially with the number of dimensions, leading to the curse of dimensionality. As a result, this simple histogram method is not practical for a problem on a sample space with relatively large dimensions. To overcome this problem, let us consider expressing $p(y|x)$ by a combination of density estimates on low-dimensional subspaces. Here, each subspace is constructed so as to contain a small number (at most $r$) of features. Essentially, we assume that feature interactions are limited to $r (\ll D)$ features even if features can interact with one another. Since $r$ is sufficiently small, the density estimation on each $r$-dimensional subspace can still be calculated at high speed. Next, $M$ low-dimensional subspaces $A_m (1 \le m \le M)$ are generated based on the obtained grid cells $B_d$. In FFD, $r = 2$ is set, and subspaces are generated based on two patterns, i.e., a single variable and a combination of two variables. For this reason, we create $M = D(D + 1)/2$ subspaces. Then, the Gaussian kernel $K_{bin}$ between two grid cells in each $A_m$ is obtained by using the following equation:

$K_{bin}(B, B') = \exp\left(-\sum_{d \in A_m} \frac{(B_d - B'_d)^2 l_d^2}{2 h_d^2}\right)$
(5)

where $B = (B_1, \cdots, B_{|A_m|})$ is a grid cell in the subspace specified by the dimensions in $A_m$. The bandwidth $h_d$ is expressed as follows:

$h_d = 1.06\, \sigma_d N^{-1/5}$
(6)

where $\sigma_d$ is the standard deviation of the training samples on the $d$-th variable. Next, the conditional probability $p(y|B)$ of $y$ given $B$ is obtained by using the kernel $K_{bin}$ as follows:

$p(y|B) = \frac{\sum_{B' \in G_{A_m}} n_{B'}(y) K_{bin}(B, B')}{\sum_{B' \in G_{A_m}} N_{B'} K_{bin}(B, B')}$
(7)

where $n_{B'}(y)$ is the number of training samples with label $y$ in $B'$, $N_{B'}$ is the total number of training samples in $B'$, and $G_{A_m}$ denotes the set of all grid cells in this subspace. Finally, a new feature $\varphi_m(x)$ is obtained from the estimated probability $p(y|B)$ by using the following equation:

$\varphi_m(x) = \ln \frac{p(y=1 | x_{A_m})}{p(y=-1 | x_{A_m})}$
(8)
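A one-dimensional sketch of the kernel-smoothed histogram estimate of Eqs. (5) and (7) may help; the bandwidth h is fixed by hand here rather than computed from Eq. (6), and the function name is ours:

```python
import numpy as np

def smoothed_p_y_given_B(bins, labels, n_bins, h=1.0):
    """Kernel-smoothed histogram estimate of p(y = 1 | B) in a 1-D subspace.

    bins: grid-cell index of each training sample, labels: +1 / -1,
    h: bandwidth h_d (Eq. (6) would set it to 1.06 * sigma_d * N**(-1/5)).
    """
    n_pos = np.bincount(bins[labels == 1], minlength=n_bins)   # n_{B'}(y = 1)
    n_all = np.bincount(bins, minlength=n_bins)                # N_{B'}
    cells = np.arange(n_bins)
    # Gaussian kernel between every pair of grid cells (Eq. (5); l_d absorbed into h)
    K = np.exp(-((cells[:, None] - cells[None, :]) ** 2) / (2 * h ** 2))
    return (K @ n_pos) / (K @ n_all)                           # Eq. (7)

bins = np.array([0, 0, 1, 2, 2])
labels = np.array([1, 1, 1, -1, -1])
p = smoothed_p_y_given_B(bins, labels, n_bins=3, h=0.5)
print(p)
```

The smoothing lets sparsely populated cells borrow counts from neighbouring cells, which is what mitigates the bias of the raw histogram discussed above.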

In the second step, feature selection of the subspace is performed based on submodular optimization. Specifically, the feature obtained in the first step is selected as follows:

$\max_S \sum_{i \in S} \sum_{j \in U \setminus S} c_{i,j} - \alpha \sum_{i,j \in S, i \neq j} c_{i,j} + \beta \sum_{i \in S} a_i - \mu |S|^2$
(9)

where $c_{i,j}$ is the Pearson correlation between $\varphi_i$ and $\varphi_j$ based on $A_i$ and $A_j$, respectively, and $a_i$ is the training accuracy of the subspace estimation for $G_{A_i}$, which can be computed efficiently by simply counting the number of samples incorrectly labeled over all grid cells of $G_{A_i}$. $(\alpha, \beta, \mu)$ are all non-negative hyper-parameters. Regarding the terms in Equation (9), the first term is the cut function (Fujishige, 2005), which maximizes the similarity coverage of the set $S$ so that $S$ is a good representative of $U$. The second term minimizes the pairwise correlation within $S$, reducing redundancy and mitigating highly correlated features and collinearity problems. The third term maximizes the overall accuracy of the histogram estimation of the selected subspaces. The fourth term minimizes the cardinality of $S$ to promote sparseness. Although maximization of (9) is an NP-hard problem, a good approximation bound can be achieved.
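As an illustration, the objective in Eq. (9) can be evaluated and greedily maximized as below. This is a toy sketch: the correlation matrix c, the accuracies a, the hyper-parameter values, and the simple greedy loop are our own choices, not the paper's solver (greedy search is a standard approximation for submodular maximization):

```python
import numpy as np

def selection_objective(S, U, c, a, alpha=0.5, beta=1.0, mu=0.01):
    """Value of the submodular selection objective in Eq. (9) for a set S."""
    S, rest = list(S), list(U - set(S))
    cut = sum(c[i, j] for i in S for j in rest)                 # coverage (cut function)
    red = sum(c[i, j] for i in S for j in S if i != j)          # redundancy within S
    acc = sum(a[i] for i in S)                                  # histogram accuracy
    return cut - alpha * red + beta * acc - mu * len(S) ** 2    # cardinality penalty

c = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])     # features 0 and 1 are highly correlated
a = np.array([0.8, 0.7, 0.9])
U = {0, 1, 2}

S = set()
while True:   # greedy pass: add the feature with the largest positive gain
    gains = {i: selection_objective(S | {i}, U, c, a) - selection_objective(S, U, c, a)
             for i in U - S}
    if not gains or max(gains.values()) <= 0:
        break
    S.add(max(gains, key=gains.get))
print(sorted(S))   # the highly correlated pair is never selected together
```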

In the third step, a linear classifier is learned based on the feature values $\varphi_m(x)$ of the subspaces selected in the second step. The formulation of this learning procedure is given by

$\min_{w \ge 0} \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + \exp\left(-y_i w^\top \varphi(x_i)\right)\right) + \lambda \sum_{m=1}^{M} w_m$
(10)

Using Equation (10), FFD achieves stronger sparsity than conventional L1 regularization. To see this, suppose the gradient of the objective with respect to a positive $w_m$ is positive; gradient descent then keeps decreasing $w_m$, but due to the non-negativity constraint it cannot fall below 0. Therefore, many components become exactly 0, and this property can result in a much sparser solution than L1 regularization alone. The product $y'$ of the weight vector $w$ and the feature vector $\varphi(x_i)$ is taken as the predicted value of $y$, and the accuracy can be evaluated by comparing $y'$ with the actual $y$.
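The non-negatively constrained sparse training of Eq. (10) can be sketched with projected gradient descent. This is our own minimal implementation on synthetic features, not the paper's solver; the clipping at zero is what produces the exact zeros discussed above:

```python
import numpy as np

def train_ffd_weights(Phi, y, lam=0.1, lr=0.1, n_iter=500):
    """Minimize (1/N) sum_i log(1 + exp(-y_i w^T phi_i)) + lam * sum_m w_m, w >= 0."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        margins = y * (Phi @ w)
        # gradient of the logistic loss plus the L1 term (w >= 0, so |w_m| = w_m)
        g = -(Phi * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + lam
        w = np.maximum(w - lr * g, 0.0)   # projection onto the non-negative orthant
    return w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 3))
y = np.sign(Phi[:, 0] + 0.1 * rng.normal(size=200))   # only feature 0 is informative
w = train_ffd_weights(Phi, y)
print(w)   # feature 0 should dominate; uninformative weights are clipped toward 0
```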

### 2.2 Related Research

One of the advantages of FFD is the ability to construct a classification model that considers the interactions between variables. However, to actually construct such a model, the interactions between all pairs of variables must be considered. Therefore, when the dimension reaches thousands to tens of thousands, the amount of computation becomes prohibitive and FFD cannot be executed, because computation on the order of $D^2$ or more is actually necessary. Consequently, for document classification problems with many words, the learning procedure of FFD considering the interactions between all variables cannot be achieved computationally. In a previous study (Okubo et al., 2017), we therefore proposed a method of reducing the dimension of the data in advance. Specifically, words whose mutual information with the categories is larger than a certain threshold are selected and used for learning. By doing this, we aimed to improve the classification accuracy by taking account of the important interactions between the variables contributing to classification in document classification problems using many words. Let $x_d$ be a word and $c_k$ be a category. The mutual information $MI(x_d, C)$ (Tsuda et al., 2005) between the word $x_d$ and the categories is defined as follows:

$MI(x_d, C) = \sum_{k=1}^{K} P(x_d, c_k) \log \frac{P(x_d, c_k)}{P(x_d) P(c_k)}$
(11)

• $P(x_d, c_k)$ : Proportion of documents in the whole collection that contain the word $x_d$ and belong to the category $c_k$.

• $P(x_d)$ : Proportion of documents containing the word $x_d$ among all documents.

• $P(c_k)$ : Proportion of documents belonging to the category $c_k$ among all documents.

Here, we describe how each probability required to obtain the mutual information is estimated. First, since binary classification is targeted, we can set $K = 2$ and $P(c_k) = 0.5$. Next, $P(x_d)$ is estimated as the ratio of the number of data points whose $d$-th variable takes a value greater than 0 among all $N$ data points. Similarly, $P(x_d, c_k)$ is estimated as the ratio of the number of data points whose $d$-th variable takes a value greater than 0 and whose label $y$ is $-1$ or $1$, respectively, among all $N$ data points.
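With these estimates, the screening by Eq. (11) can be sketched as follows (a toy example with hypothetical counts; the function name and the handling of zero probabilities are our own choices):

```python
import numpy as np

def mutual_information(X, y):
    """MI(x_d, C) of Eq. (11) for every word d, using the estimates above.

    X: (N, D) term-frequency matrix; y: labels in {-1, +1}; K = 2, P(c_k) = 0.5.
    """
    N, D = X.shape
    present = X > 0
    P_x = present.mean(axis=0)                   # P(x_d)
    P_c = 0.5                                    # balanced binary classes assumed
    mi = np.zeros(D)
    for ck in (-1, 1):
        P_xc = present[y == ck].sum(axis=0) / N  # P(x_d, c_k)
        with np.errstate(divide="ignore", invalid="ignore"):
            term = P_xc * np.log(P_xc / (P_x * P_c))
        mi += np.nan_to_num(term)                # treat 0 * log 0 as 0
    return mi

X = np.array([[2, 1], [1, 1], [0, 1], [0, 1]])   # word 0 appears only in class +1
y = np.array([1, 1, -1, -1])
mi = mutual_information(X, y)
print(mi)   # the class-specific word 0 scores higher than the uniform word 1
```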

To verify the effectiveness of that method, we applied it to the classification of newspaper articles and evaluated it from the viewpoint of classification accuracy and other measures. As a result, we found that the method can reduce the dimension while maintaining accuracy. In addition, the classification accuracy when the interactions are considered is higher than when they are not. This result shows that a classifier with higher accuracy can be created by the dimension reduction method.

## 3. PROPOSED METHOD

As shown in Section 2.1, FFD takes account of the important interactions between variables contributing to classification by introducing histogram-based kernel smoothing over subspaces containing combinations of variables; in that case, combinations of two variables are used. However, when creating the subspaces, FFD must cover all variables, including combinations of variables with little interaction. Therefore, the computational complexity grows rapidly with the dimensionality: when there are $D$ variables, calculations on the order of $D(D - 1)/2$ are necessary to create all combinations of two variables. Consequently, when considering the interactions between variables in a document classification problem with many words, the amount of calculation becomes excessive if there are many combinations of variables with little interaction.

In this paper, we propose a new method to reduce the computational complexity of the learning phase and to make FFD applicable to document classification problems with many words. First, we calculate the similarity between two variables using the KL divergence (Goto and Kobayashi, 2014). Then, using the obtained similarity, the two-variable subspaces are selected so that subspaces are created only for combinations of variables with a high possibility of interaction. By this procedure, we aim to reduce the computation cost while maintaining classification accuracy, even for document classification problems using many words, by making it possible to consider the important interactions between variables with a smaller number of subspaces.

### 3.1 Calculation of Similarity between Variables

The KL (Kullback-Leibler) divergence, a loss function representing a pseudo-distance between probability distributions, is used. The KL divergence is given by

$D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
(12)

• $P(i)$ : The probability of $i$ under the probability distribution $P$

• $Q(i)$ : The probability of $i$ under the probability distribution $Q$

However, since the KL divergence is only a pseudo-distance, an accurate distance between probability distributions cannot be obtained from it. For example, in Eq. (12), there is no symmetry between the divergence from $P$ to $Q$ and that from $Q$ to $P$. Therefore, in this paper, we apply the JS (Jensen-Shannon) divergence obtained by symmetrizing the KL divergence.

$D_{JS}(P \| Q) = \frac{1}{2} \sum_i P(i) \log \frac{P(i)}{Q(i)} + \frac{1}{2} \sum_i Q(i) \log \frac{Q(i)}{P(i)}$
(13)
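Both quantities can be computed directly; a minimal sketch (strictly positive distributions are assumed, since zero probabilities make the logarithms diverge):

```python
import numpy as np

def kl(p, q):
    """KL divergence of Eq. (12) for strictly positive distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Symmetrized divergence of Eq. (13)."""
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl(p, q), kl(q, p))   # asymmetric: the two directions differ
print(js(p, q), js(q, p))   # symmetric: the same value in both directions
```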

### 3.2 Application to Subspace Reduction in FFD

In this paper, the JS divergence described in the previous section is obtained as the similarity between variables and is used as an index of the strength of the interaction between them. First, we interpret the probabilities $P(i)$ and $Q(i)$ in Eq. (12) as the probability that a certain variable $a$ is used in the $i$-th document. Letting $L_a$ be the number of times variable $a$ is used in all documents and $l_a(i)$ be the number of times it is used in the $i$-th document, the usage probability $P_a(i)$ of variable $a$ in the $i$-th document is expressed as follows.

$P_a(i) = \frac{l_a(i)}{L_a}$
(14)

Next, the JS divergence $J_{a,b}$ between the distributions of variables $a$ and $b$ is obtained by rewriting Eq. (13) with the $P_a(i)$ obtained above.

$J_{a,b} = \frac{1}{2} \sum_i P_a(i) \log \frac{P_a(i)}{P_b(i)} + \frac{1}{2} \sum_i P_b(i) \log \frac{P_b(i)}{P_a(i)}$
(15)

• $P_a(i)$ : Usage probability of variable $a$ in the $i$-th document

• $P_b(i)$ : Usage probability of variable $b$ in the $i$-th document

The JS divergence $J_{a,b}$ obtained in this way is used as the similarity between variables $a$ and $b$. In the same way, we use the JS divergence to find the similarity between every pair of variables. Then, the combinations of variables whose similarity is larger than a certain threshold are selected. If this threshold is too large, the dimension is excessively reduced, and important interactions cannot be considered. For this reason, the following experiment was conducted after searching for the optimal value of this threshold.
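Putting Eqs. (14) and (15) together, the subspace selection can be sketched as below. The counts, the threshold, and the small eps added to avoid log(0) are hypothetical choices of ours; the experiments in Section 4 tune the threshold on real data:

```python
import numpy as np
from itertools import combinations

def usage_prob(counts):
    """P_a(i) of Eq. (14): per-document usage probability of one word."""
    return counts / counts.sum()

def similarity(counts_a, counts_b, eps=1e-12):
    """J_{a,b} of Eq. (15); eps avoids log(0) for documents without the word."""
    p = usage_prob(counts_a) + eps
    q = usage_prob(counts_b) + eps
    return 0.5 * np.sum(p * np.log(p / q)) + 0.5 * np.sum(q * np.log(q / p))

def select_pairs(X, threshold):
    """Keep only the variable pairs whose similarity exceeds the threshold."""
    D = X.shape[1]
    return [(a, b) for a, b in combinations(range(D), 2)
            if similarity(X[:, a], X[:, b]) > threshold]

# toy term-frequency matrix: words 0 and 2 share almost the same usage profile,
# while word 1 is concentrated in a different document
X = np.array([[9.0, 0.0, 8.0],
              [1.0, 0.0, 2.0],
              [0.0, 10.0, 0.0]])
print(select_pairs(X, threshold=1.0))   # only pairs with dissimilar profiles remain
```

Only the subspaces for the retained pairs would then be built in the first step of FFD.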

## 4. EXPERIMENT

### 4.1 Data used for Experiments

In this section, we show the results of numerical experiments using Yomiuri Shimbun data from 2015. Each document in this newspaper article data has a category label from eight categories: politics, economics, sports, society, culture, life, crime, and science. The details of the categories are shown in Table 1. Regarding the learning data, since there are nine sets of 150 articles per category, as shown in Table 1, the nine data sets were used for learning, and the average of the respective results was reported.

### 4.2 Experimental Conditions

In this numerical evaluation, two evaluation indexes are used: the classification accuracy and the computational time. Here, the classification accuracy is defined as the ratio of test data correctly classified by the classifier, and the computational time is taken as the sum of the learning time and the verification time using the test data. The conventional FFD (Chen et al., 2014) is used as the comparison method. To consider interactions in FFD, dimension reduction using mutual information is performed, and the proposed method is executed after unifying the number of dimensions to 2,000. Experiments are performed as one-versus-rest binary classification (Tsuda et al., 2005) using FFD. The threshold value in this experiment is set to a similarity score of 12,000.

### 4.3 Results and Discussion

First, the numbers of subspaces before and after the variable selection are shown in Table 2.

The category numbers in Table 2 indicate the datasets of the binary classification problems in which the category of that number is classified against the other categories. For example, category 1 denotes the dataset of the problem of classifying documents into category 1 versus the others. In Table 2, only the subspaces generated by combinations of two variables are shown. From Table 2, it can be seen that the proposed method reduces the number of subspaces compared with the conventional method. This shows that the combinations of variables can indeed be reduced based on the similarity computed by the proposed method. The classification accuracy of each method is shown in Table 3.

From Table 3, it can be seen that the accuracy of the proposed method hardly changes compared with the comparative method. This suggests that the proposed method can eliminate unnecessary subspaces while maintaining accuracy. A likely reason is that the variable pairs with small similarity in the unselected subspaces have very similar usage distributions; the possibility of interaction between the variables in those subspaces is therefore low, and removing them has little influence on classification learning. The classification computational time of each method is shown in Table 4, and the breakdown of the computational time when executing the proposed method is shown in Table 5. In Table 5, “Selection” represents the time taken to execute the selection based on the proposed method, and “Original FFD” represents the time obtained by subtracting the selection time from the whole computational time.

From Tables 4 and 5, it can be seen that the computational time of the proposed method is reduced compared with the conventional method. Furthermore, the total time is reduced even when the computational time spent on subspace selection by the proposed method is included. The reason is considered to be that the computational time for the histogram estimation that takes account of the interactions could be reduced because the number of subspaces to be considered was drastically cut down by the proposed method. Together with the results in Table 3, this clarifies that the calculation amount could be reduced while maintaining the classification accuracy. Because only important combinations with a high possibility of interaction between variables contributing to classification could be selected among the various sets of variables, it became possible to construct a classification model that takes account of only the effective interactions.

## 5. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a learning method that takes account of the important interactions between variables in FFD with fewer subspaces. For this purpose, we showed a method of selecting combinations of variables with a high possibility of interaction using the similarity obtained from the KL divergence. As a result, the computational complexity was reduced while maintaining the classification accuracy. We also clarified that only the important interactions could be considered, more efficiently, for document classification problems with many words, so a practical way to handle large-scale document classification problems was shown in this paper. The proposed method is considered to be effective for problems in which the dimension of the feature space becomes large, as in document classification.

A future task is to study the effectiveness of interactions in this method by applying it to classification problems other than document data.

## ACKNOWLEDGEMENT

The authors would like to thank the anonymous referees for their useful suggestions. The authors would also like to thank all members of the Goto Laboratory, Waseda University, for their support of our research. A portion of this study was supported by JSPS KAKENHI Grant Numbers 26282090 and 26560167.

## Table

Table 1. Data overview (Yomiuri Shimbun, 2015)

Table 2. Transition of the number of subspaces

Table 3. Transition of each classification accuracy

Table 4. Transition of each computational time (unit: s)

Table 5. Breakdown of the computational time (unit: s)

## REFERENCES

1. Bishop, C. (2010), Pattern Recognition and Machine Learning, Springer, New York.
2. Chen, W. , Chen, Y. , and Weinberger, K. Q. (2014), Fast flux discriminant for large-scale sparse nonlinear classification, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, New York, New York, 621-630.
3. Cover, T. M. and Thomas, J. A. (1991), Elements of Information Theory, John Wiley & Sons, New York.
4. Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998), Inductive learning algorithms and representations for text categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM ’98), Bethesda, Maryland, USA, 148-155.
5. Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. (2008), LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research, 9, 1871-1874.
6. Feige, U. and Mirrokni, V. S. (2007), Maximizing nonmonotone submodular functions, Proceedings of 48th Annual IEEE Symposium on Foundations of Computer Science FOCS, Providence, RI, 461-471.
7. Fujishige, S. (2005), Submodular Functions and Optimization (2nd ed.), Annals of Discrete Mathematics, Elsevier Science, North Holland.
8. Goto, M. and Kobayashi, M. (2014), Introductory pattern recognition and machine learning (in Japanese), Corona Publishing Co., ltd.
9. Jinshu, S. , Bofeng, Z. , and Xin, X. (2006), Advances in machine learning based text categorization, Journal of Software, 17(9), 1848-1859.
10. Jianhui, W. , Hongwei, W. , Zhan, S. , and Yunfa, H. (2005), A simple and efficient algorithm to classify a large scale of texts, Journal of Computer Research and Development, 42(1), 85-93.
11. Joachims, T. (1999), Transductive inference for text classification using support vector machines, ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, 200-209.
12. Ko, Y. (2012), A study of term weighting schemes using class information for text classification, Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (SIGIR’12), Portland, Oregon, USA, 1029-1030.
13. Liu, T., Chen, Z., Zhang, B., Ma, W., and Wu, G. (2004), Improving text classification using local latent semantic indexing, 4th IEEE International Conference on Data Mining (ICDM ’04), Brighton, UK.
14. Lodhi, H. , Saunders, C. , Shawe-Taylor, J. , Cristianini, N. , and Watkins, C. (2002), Text Classification using String Kernels, Journal of Machine Learning Research, 2, 419-444.
15. Miao, D. , Duan, Q. , Zhang, H. , and Jiao, N. (2009), Rough set based hybrid algorithm for text classification, Expert Systems with Applications, 36(5), 9168-9174.
16. Mikawa, K. , Ishida, T. , and Goto, M. (2012), An optimal weighting method in supervised learning of linguistic model for text classification, Industrial Engineering & Management Systems, 11(1), 87-93.
17. Mikawa, K. and Goto, M. (2015), Regularized distance metric learning for document classification and its application, Journal of Japan Industrial Management Association, 66(2E), 190-203.
18. Nigam, K. , McCallum, A. K. , Thrun, S. , and Mitchell, T. (2000), Text classification from labeled and unlabeled documents using EM, Machine Learning, 39(2-3), 103-134.
19. Ogihara, T. , Mikawa, K. , Goto, M. , and Hosoya, G. (2013), Multi-valued document classification based on generalized bradley-terry classifiers utilizing accuracy information, China-USA Business Review, 12(9), 911-917.
20. Okubo, K. , Kumoi, G. , and Goto, M. (2017), A study on dimensional method for bivalent classification using FFD, IPSJ SIG Technical Reports, Sapporo, Hokkaido, 2017-MPS-115(16), 1-2.
21. Pele, O. , Taskar, B. , Globerson, A. , and Werman, M. (2013), The pairwise piecewise-linear embedding for efficient non-linear classification, Proceedings of The 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 205-213.
22. Sokolova, M. and Lapalme, G. (2009), A systematic analysis of performance measures for classification tasks, Information Processing & Management, 45(4), 427-437.
23. Suzuki, L., Mikawa, K., and Goto, M. (2017), Multi-valued classification of text data based on an ECOC approach using a ternary orthogonal table, Industrial Engineering & Management Systems, 16(2), 155-164.
24. Tong, S. and Koller, D. (2001), Support vector machine active learning with applications to text classification, Journal of Machine Learning Research, 2, 45-66.
25. Tsuda, Y. , Yamagishi, H. , Ishida, T. , and Hirasawa, S. (2005), Text classification using feature selection based on mutual information, The 4th Technology Information on Forum, Tokyo, 4(2), 65-66.
26. Uysal, A. K. and Gunal, S. (2014), The impact of preprocessing on text classification, Information Processing & Management, 50(1), 104-112.
27. Yamamoto, Y., Mikawa, K., and Goto, M. (2017), A proposal for classification of document data with unobserved categories considering latent topics, Industrial Engineering & Management Systems, 16(2), 165-174.
28. Yang, Y. and Pedersen, J. O. (1997), A comparative study on feature selection in text categorization, ICML ’97 Proceedings of the Fourteenth International Conference on Machine Learning, San Francisco, CA, 412-420.
29. Yong, Z. , Youwen, L. , and Shixiong, X. (2009), An improved KNN text classification algorithm based on clustering, Journal of Computers, 4(3), 230-237.