ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.20 No.2 pp.148-162
DOI : https://doi.org/10.7232/iems.2021.20.2.148

# A Feature Selection Method for Classifying Highly Similar Text Documents

Jeenyoung Kim, Daiki Min*
Graduate School (Big Data Analytics Program), Ewha Womans University, Seoul, Republic of Korea
School of Business, Ewha Womans University, Republic of Korea
*Corresponding Author, E-mail: dmin@ewha.ac.kr
Received: January 28, 2021; Revised: April 20, 2021; Accepted: May 19, 2021

## ABSTRACT

In the era of big data, the importance of data classification is increasing. However, when it comes to classifying text documents, several obstacles degrade classification performance. These include multi-class documents, high levels of similarity between classes, class size imbalance, a high-dimensional representation space, and a low frequency of unique and discriminative features. To overcome these obstacles and improve classification performance, this paper proposes a novel feature selection method that effectively utilizes both unique and overlapping features. In general, feature selection methods have ignored unique features that occur in only one class because of their low frequency, even though they provide better discriminative power. Conversely, overlapping features, which are found in several classes with high frequency, have also been less preferred because of their low discriminative power. The proposed feature selection method uses these two types of features as complements, aiming to improve overall classification performance for highly similar text documents. Extensive numerical analyses have been conducted on three benchmark datasets with a support vector machine (SVM) classifier. The proposed method is superior to conventional feature selection methods, such as the global feature set, local feature set, discriminative feature set, and information gain, not only for classes with high similarity but also in overall classification performance.

## 1. INTRODUCTION

Today we are in the era of Big Data. Big Data refers to a large variety of structured and unstructured data. Massive amounts of diverse data are produced daily, owing to the development of ICT, the spread of smart devices, the widespread use of social network services (SNS), and the diffusion of internet of things (IoT) devices. According to the global market research firm IDC, the world created roughly 16.1 zettabytes (ZB) of data in 2016 and will produce 163 ZB by 2025, a tenfold increase (Reinsel et al., 2017). In this environment, the effective organization, retrieval, and mining of massive amounts of data are becoming more and more important. Automatic text classification, also known as text categorization, is one of the technologies widely used for these purposes (Chen et al., 2016).

The classification of unstructured text data has attracted growing interest in the field of expert and intelligent systems, because it is fundamental to intelligent systems that extract valuable information from text documents. In the field of expert systems, text classification refers to the process of automatically classifying a text dataset according to a particular standard or category system. It is a key technology for the effective management of knowledge and information (Zhou et al., 2018). Spam filtering is perhaps the most common application of this technology, as such filtering is a necessary component of all client and server mail software (Jadon and Sharma, 2017). Text classification also plays a vital role in the promotion of businesses (Rehman et al., 2018). In product quality management, customer feedback is the most effective and valuable source of information, so capturing helpful reviews about a product from social media and classifying and analyzing them for product quality management is very valuable (Jiang et al., 2017).

In reality, most data are multi-class. Multi-class classification is more complicated and computationally difficult than binary classification, in which data are classified into two classes (Kang et al., 2015). For this reason, approaches that decompose the multi-class classification problem into several binary classification problems have been studied. This decomposition strategy is divided into hierarchical classification and flat classification, depending on whether a hierarchy is used. Flat classification is divided into the one-versus-one (OVO) and one-versus-rest (OVR) methods (Yang et al., 2011; Liu et al., 2016).

Although flat classification has been regarded as a standard for multi-class classification, its classification performance is guaranteed only when independence between classes is assumed. However, in today's data environment, which is composed of hundreds or more classes whose independence is not guaranteed, flat classification is not only inefficient in terms of learning but also yields low classification performance when there are many similarities between classes.

As shown in Table 1, binary-tree based classification, which is a kind of hierarchical classification, can improve classification efficiency because the number of classifiers required for learning and classification is significantly reduced compared to flat classification. Using binary-tree based classification, N-1 binary classifiers need to be trained for an N-class problem, but at most log2(N) binary classifiers need to be consulted to classify a sample (Madzarov et al., 2009; Wu and Zhu, 2015). In contrast, OVO and OVR need N(N-1)/2 and N binary classifiers, respectively, for both training and testing (Chen et al., 2009; Chatterjee et al., 2019). For this reason, hierarchical classification has naturally been emphasized and studied extensively in pattern recognition and image recognition (Silla and Freitas, 2011; Fu et al., 2018; Seo and Shin, 2019).

On the other hand, feature selection in text classification is treated as an important issue (Peng and Liu, 2018) because, in the text domain represented by high-dimensional features (Lin et al., 2013; Taşcı and Güngör, 2013), classification performance is determined by the text representation scheme rather than by the classifier (Leopold and Kindermann, 2002). Feature selection aims to reduce the high-dimensional input space through discriminative feature selection (Guru et al., 2018; Rehman et al., 2017). Thus, effective feature selection is essential for efficient and accurate learning in classification problems (Forman, 2003; Cai et al., 2018).

Commonly used filter feature selection is performed through a statistical process. In feature selection based on frequencies, such as term frequency (TF), document frequency (DF), and term frequency-inverse document frequency (TF-IDF) (Sabbah et al., 2017; Aizawa, 2003; Manochandar and Punniyamoorth, 2018), unique features that occur infrequently in only one class are ignored. On the other hand, feature selection based on discriminative power, such as information gain (Quinlan, 1986), tends to ignore overlapping features with low discriminative power that are distributed among several classes. However, unique and overlapping features can be expected to improve classification performance by complementing each other's disadvantages: low frequency and low discriminative power.

To alleviate this problem, this paper proposes a novel feature selection method for text classification. The method considers both discriminative power and representative power at the same time, in a hierarchical rather than a flat structure, and selects the unique and overlapping features that are ignored by conventional methods. The purpose of this study is to improve overall classification performance by improving the classification performance of highly similar classes using the abovementioned unique and overlapping features, under conditions such as a high-dimensional input space, multi-class problems, high levels of class similarity and unbalanced class sizes.

To do this, we perform the following two steps. The first step, hierarchy generation, builds a binary tree based on the similarity of the classes. The second step, feature selection, selects unique and overlapping features from the nodes of the hierarchy. All documents are transformed into feature vectors using the selected features, and these vectors form the input data for the SVM. The SVM is then used to evaluate the effectiveness of the proposed feature selection method.
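The document-to-vector step above can be sketched as follows. This is a minimal illustration with hypothetical tokens and feature names; the SVM training itself is omitted and would typically rely on a library implementation.

```python
from collections import Counter

def to_feature_vectors(docs, selected_features):
    """Map each tokenized document onto a count vector over the
    selected feature set; terms outside the set are simply dropped."""
    index = {f: i for i, f in enumerate(selected_features)}
    vectors = []
    for tokens in docs:
        vec = [0] * len(index)
        for term, count in Counter(tokens).items():
            if term in index:
                vec[index[term]] = count
        vectors.append(vec)
    return vectors

# Two toy documents restricted to a three-term feature set.
docs = [["rain", "water", "tank", "rain"], ["urban", "water", "reuse"]]
features = ["rain", "water", "urban"]
print(to_feature_vectors(docs, features))  # [[2, 1, 0], [0, 1, 1]]
```

The resulting count vectors are what would be handed to the classifier as training input.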

The main contributions of this paper are the following. First, we classify documents hierarchically considering the similarity among classes, which is a major factor disturbing classification performance, and focus on improving the classification performance of highly similar classes. Second, with respect to feature selection, previous studies focused on finding features with higher discriminative power among features whose frequency is above a cutoff threshold in order to improve classification accuracy. This study, however, focuses on finding overlapping features with high discriminative power, building on unique and discriminative features that are low in frequency but unique to each class. Third, in a previous study on hierarchical classification (Du et al., 2018), each classifier used a static method that learned the same features. This study, however, uses a dynamic method that learns different features at each level. Finally, we demonstrate the applicability of the proposed method using abstracts of academic papers on rainwater harvesting published up to 2017 as experimental data. In addition, we test the proposed method with the 20 Newsgroups, which is an open dataset for document classification.

## 2. RELATED WORKS

### 2.1 Hierarchical Classification

Classification using machine learning techniques consists of inducing a function f(x) from a dataset composed of pairs (x, y), where x is a set of inputs and y ∈ {1, …, k} is the corresponding class. When k > 2, the classification problem is referred to as a multi-class problem (Lorena et al., 2008). The two methods for solving multi-class classification problems are the flat and hierarchical classification methods. Several studies have shown that hierarchical classification is superior to flat classification in terms of efficiency and classification performance in multi-class classification (Tegegnie et al., 2017). There are two ways of performing a hierarchical classification: using a predetermined hierarchy and generating a new hierarchy. In the latter case, a variety of clustering algorithms are used for hierarchy generation.

Madzarov et al. (2009) constructed a binary decision tree using a clustering algorithm based on kernel-space distance instead of input-space distance, for the sake of consistency between the clustering model and the SVM. Duan and Keerthi (2005) generated a hierarchy using a k-means algorithm based on similarities between documents. They also applied a relaxation strategy that delayed decisions until classification was certain, in order to mitigate the blocking problem (Sun et al., 2004) that arises in hierarchical classification. Unlike previous studies that used clustering algorithms to generate hierarchies based on distances, Silva-Palacios et al. (2018) proposed hierarchically decomposing the original multi-class problem by constructing a tree-like hierarchy of specialized classifiers, deriving a similarity matrix from the misclassification rates given by the first learned classifier.

Beyan and Fisher (2015), meanwhile, proposed a new hierarchical decomposition method for imbalanced datasets. The hierarchy is constructed using the similarity of labeled data subsets at each level, with different levels being built from different data and feature subsets. Affinity propagation clustering (Guan et al., 2010) is used to partition the data, while outlier detection is utilized to detect minority-class samples. Naik and Rangwala (2019) proposed a method for creating a data-driven, extensible hierarchy to compensate for the lack of coherence, which is a problem of hierarchies defined by domain experts. Their method modifies and extends existing expert-defined hierarchies by repeating clustering based on the average similarity of documents in class pairs.

### 2.2 Feature Selection

Feature selection provides an effective way to address the high dimensionality of text data by removing irrelevant and redundant features, which can reduce computation time, improve learning accuracy, and facilitate a better understanding of the learning model or the data (Cai et al., 2018). Feature selection methods are divided into filter, wrapper, and embedded methods (Bharti and Singh, 2015). The filter method performs statistical analysis without using a classification algorithm to select discriminative features, while the wrapper and embedded methods use a learning algorithm (Kohavi and John, 1997). Although the wrapper and embedded methods can achieve higher accuracy than the filter method, both increase computation cost, and the resulting feature set is biased toward the learning algorithm that was used.

Since the filter method selects features considering only the characteristics of the documents, it can obtain a feature set more quickly than the wrapper and embedded methods, and one not biased toward a learning algorithm (Bharti and Singh, 2015). For this reason, the filter method is widely used to reduce dimensionality (Rehman et al., 2017). Filter-based methods are divided into global feature selection, where each feature is assigned one score, and local feature selection, where each feature is assigned multiple scores based on the classes.

As an example of global feature selection, Bharti and Singh (2015) used term variance (TV) (Liu et al., 2005) and document frequency (DF) as feature evaluation methods. They proposed a method of reducing dimensionality without loss of information via feature selection, choosing the top-N ranked and intersecting features of TV and DF, and feature extraction using principal component analysis (PCA). As an example of local feature selection, Guru et al. (2018) proposed a method for generating feature sets by unifying the top-N ranked features in each class by chi-square (Chen and Chen, 2011), mutual information (Perrin and Petry, 1999), bi-normal separation (Forman, 2003), and discriminative feature selection (Zong et al., 2015). They argued that doing so automatically eliminates redundant features.

Features in small classes or low-frequency features may be only partially selected or missed entirely in global feature selection (Agnihotri et al., 2017). To overcome this problem, Uysal (2016) proposed a method using both global and local feature selection. The odds ratio, one of the local feature evaluation methods, was used to assign class membership and a positive or negative value to features, while a global feature evaluation method was used to derive discriminative power. Classes are represented by selecting the same number of negative and positive features for each class, without regard to class size. However, Uysal's method has a problem when class sizes are imbalanced: important features may be excluded if the same number of features is selected for a class with a large number of features and a varied feature distribution. In order to solve this problem, Agnihotri et al. (2017) proposed selecting a varying number of features that also considers the term distribution of each class, extending Uysal's method.

As mentioned above, most studies on feature selection focus on evaluating the discriminative power of features without considering the structural characteristics of each class. In addition, most evaluation methods are restricted to features above a reference frequency, so discriminative features below the threshold are sometimes excluded, and they tend to ignore overlapping features with low discriminative power that appear in several classes. There is a limit to classifying classes with high levels of similarity through these existing methods. In order to overcome these limitations, this study focuses on identifying the structure through the rates of discriminative and overlapping features in each class, and on mitigating the factors that disturb classification performance within that structure.

## 3. THE PROPOSED METHOD

This section describes the proposed feature selection method followed by the hierarchy generation procedure. The major notations are shown in Table 2.

### 3.1 Hierarchy Construction

The hierarchy construction procedure starts with a single node that contains all classes. Because we attempt to build a binary tree, the node is divided into two sub-nodes by applying k-means clustering. We use the term-document matrix of each class to compute the cosine similarity between two classes (Jadon and Sharma, 2017; Rehman et al., 2017), which is less sensitive to the curse of dimensionality, and then perform the k-means computations.

The hierarchy generation procedure iteratively divides a subsequent node k into two sub-clusters, $C_1^k$ and $C_2^k$. At each iteration, we define $C_1^k$ and $C_2^k$ by applying k-means clustering, which assigns classes to either $C_1^k$ or $C_2^k$ based on similarity. We repeat the procedure until each node contains only a single class, which means that $C_1^k$ and $C_2^k$ are equivalent if node k is a leaf.
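The recursive bisection above can be sketched in Python as follows, assuming each class is summarized by a single term-frequency vector. The centroid initialization (seeding with the first two classes) and the fixed iteration count are assumptions not specified here; a production version would use a library k-means.

```python
import math

def _cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def _mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(classes, vectors, iters=20):
    # 2-means on class vectors, assigning by cosine similarity.
    centroids = [list(vectors[classes[0]]), list(vectors[classes[1]])]
    assign = {}
    for _ in range(iters):
        assign = {c: 0 if _cosine(vectors[c], centroids[0]) >=
                          _cosine(vectors[c], centroids[1]) else 1
                  for c in classes}
        groups = [[vectors[c] for c in classes if assign[c] == i] for i in (0, 1)]
        if all(groups):
            centroids = [_mean(g) for g in groups]
    return ([c for c in classes if assign[c] == 0],
            [c for c in classes if assign[c] == 1])

def build_tree(classes, vectors):
    # Recursively bisect the class set until every leaf is one class.
    if len(classes) == 1:
        return classes[0]
    left, right = two_means(classes, vectors)
    if not left or not right:            # degenerate split: halve instead
        mid = len(classes) // 2
        left, right = classes[:mid], classes[mid:]
    return (build_tree(left, vectors), build_tree(right, vectors))

# Four hypothetical classes: A/B share vocabulary, as do C/D.
print(build_tree(["A", "C", "B", "D"],
                 {"A": [5, 5, 0, 0], "B": [4, 6, 0, 0],
                  "C": [0, 0, 5, 5], "D": [0, 1, 6, 4]}))
```

With these toy vectors the tree separates the two similar pairs first, mirroring the behavior described for RH and 5NG later in the paper.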

### 3.2 Feature Selection

We define three types of feature sets (i.e., unique features, discriminative features and overlapping features) at each node of the binary tree constructed in Section 3.1. Among the features with a high weight in a class, features overlapping with other classes may have lower discriminative power than unique features that occur in only one class. On the other hand, unique and discriminative features have high discriminative power but low frequency.

This paper uses these three types of features as complements to reduce dimensionality and overcome the curse of dimensionality. We propose the discriminative and overlapping feature selection (DOFS) method, a hybrid approach that combines term-weighting and word-embedding methods in the process of selecting features. In particular, we design a term-weighting formula for overlapping features, and the high-ranked features are embedded in learning. First, we provide the definitions of the three types of features.

Definition 1. (Unique feature; Uf) A unique feature appears in only one class at node k, as defined in Equation (1). For convenience, hereafter we drop the index k and consider that C1 and C2 are interchangeable.

$U f_i = \{\, t \mid t \in c_i \ \text{and} \ t \notin c_j, \ \forall j \neq i \,\}$
(1)

Definition 2. (Discriminative feature; Df) A discriminative feature occurs in only one of the two child nodes but must appear in all classes belonging to that child node. The discriminative feature set at node k is represented as Equation (2).

$D f = \{\, t \mid t \in c \ \forall c \in C_1, \ t \notin C_2 \,\} \cup \{\, t \mid t \in c \ \forall c \in C_2, \ t \notin C_1 \,\}$
(2)

If a node is composed of a single class (i.e., a leaf node), the discriminative features simply become the unique features. The discriminative features of nodes consisting of multiple classes overlap throughout the classes in one of the child nodes C1 or C2.

Definition 3. (Overlapping feature; Of) An overlapping feature appears in both C1 and C2 and is defined as Equation (3).

$O f = \{\, t \mid t \in C_1 \ \text{and} \ t \in C_2 \,\}$
(3)
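Under the simplifying assumption that each class is represented by the set of features observed in it (the dict-of-sets representation below is an illustrative choice, not the paper's data structure), Definitions 1-3 can be computed as follows; the usage in the test mirrors the toy node of Figure 1.

```python
def feature_types(child1, child2):
    """child1, child2: dicts mapping class name -> set of features
    observed in that class. Returns (Uf, Df, Of) per Definitions 1-3."""
    all_classes = {**child1, **child2}
    vocab1 = set().union(*child1.values())
    vocab2 = set().union(*child2.values())
    # Unique: appears in exactly one class of the node.
    uf = {t for feats in all_classes.values() for t in feats
          if sum(t in f for f in all_classes.values()) == 1}
    # Discriminative: in every class of one child, absent from the other.
    df = ({t for t in vocab1 if all(t in f for f in child1.values())} - vocab2) \
       | ({t for t in vocab2 if all(t in f for f in child2.values())} - vocab1)
    # Overlapping: present in both children.
    of = vocab1 & vocab2
    return uf, df, of
```

For single-class children, Df reduces to the union of the unique feature sets, exactly as Definition 2's leaf-node remark states.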

If all overlapping features at each node are used for classification, the effect of dimension reduction is weakened. Thus, we calculate the degree of Discrimination (doD) in order to select some of them. $p(t \mid C_1)$ is the probability of observing feature t within node $C_1$, which is the degree of representation of t in node $C_1$. A feature with a high degree of representation in both child nodes has low discriminative power between them. On the other hand, $p(C_1 \mid t)$ is the probability of node $C_1$ given the presence of feature t, which is the degree of contribution of feature t to node $C_1$. If feature t appears only in $C_1$ or only in $C_2$, both the degree of contribution and the discriminative ability are high, but the degree of representation becomes low when the frequency is low. Taking these properties into consideration, the doD of each feature t in node $C_1$ (or $C_2$) is calculated by considering both the degree of representation and the degree of contribution to the node, according to Equation (4).

$doD(t, C_i) = p(t \mid C_i) \times p(C_i \mid t), \quad i = 1, 2$
(4)

Since the hierarchy is a binary tree, an overlapping feature t ∈ Of has two doD values (i.e., $doD(t, C_1)$ and $doD(t, C_2)$). However, in order to evaluate the discriminative power of the feature, the two doDs must be reduced to a single value. In this study, we use the absolute difference of the two doDs, as in Equation (5), in order to rank features with extreme contrast between the two doDs higher.

$doD(t) = \left| doD(t, C_1) - doD(t, C_2) \right|$
(5)
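A small sketch of Equations (4) and (5) follows, where $p(t \mid C)$ and $p(C \mid t)$ are estimated as simple frequency ratios; this estimation choice is an assumption, as the exact probability estimates are not spelled out above.

```python
def doD(count_in_c1, total_c1, count_in_c2, total_c2):
    """degree of Discrimination for an overlapping feature t.
    Counts are raw term occurrences; probabilities are estimated
    as frequency ratios (an assumption)."""
    p_t_c1 = count_in_c1 / total_c1          # p(t | C1): representation
    p_t_c2 = count_in_c2 / total_c2
    seen = count_in_c1 + count_in_c2
    p_c1_t = count_in_c1 / seen              # p(C1 | t): contribution
    p_c2_t = count_in_c2 / seen
    d1 = p_t_c1 * p_c1_t                     # doD(t, C1), Equation (4)
    d2 = p_t_c2 * p_c2_t
    return abs(d1 - d2)                      # doD(t), Equation (5)

# A feature seen 3 times in C1 (10 tokens) and once in C2 (10 tokens):
print(doD(3, 10, 1, 10))  # about 0.2, up to floating-point error
```

A feature concentrated in one child node thus gets a large doD, while a feature spread evenly across both children scores near zero.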

The final feature set for classification consists of a combination of unique features (Uf), discriminative features (Df) and overlapping features (Of). We use all unique and discriminative features, but only a subset of the overlapping features, in order to reduce dimensionality. To determine how many overlapping features should be used, classification performance is monitored while adding overlapping features one by one in descending order of doD(t). The number of overlapping features used for learning is the number at which classification performance reaches its maximum. Thus, the number of overlapping features added at each node k can differ.
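This stopping rule can be sketched as a greedy search. The `evaluate` callback (e.g. the F1 of a classifier retrained on the candidate feature set) is an assumed interface, not part of the paper's formulation.

```python
def select_overlapping(of_by_dod, base_features, evaluate):
    """Add overlapping features one by one in descending doD order and
    keep the prefix that maximizes the classification score.
    of_by_dod: overlapping features already sorted by doD, descending.
    base_features: the unique and discriminative features, always kept.
    evaluate: callable mapping a feature list to a score (assumed)."""
    best_score = evaluate(base_features)
    best_k = 0
    current = list(base_features)
    for k, feat in enumerate(of_by_dod, start=1):
        current.append(feat)
        score = evaluate(current)
        if score > best_score:
            best_score, best_k = score, k
    return base_features + of_by_dod[:best_k], best_score
```

Because `evaluate` is called once per candidate feature, in practice one would cache trained models or evaluate on a validation split to keep this affordable.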

Figure 1 illustrates an example of a binary tree at node k, which has two sub-clusters $C_1^k$ and $C_2^k$. $C_1^k$ and $C_2^k$ contain two (i.e., t1 and t2) and four (i.e., t2, t3, t4 and t5) features, respectively. Based upon this example, we show how to obtain the discriminative power of a feature as given in Equation (5).

According to Equation (1), $Uf_1 = \{t_1\}$ and $Uf_2 = \{t_3, t_4, t_5\}$. The definitions of Df and Of provide $Df = \{t_1, t_3, t_4, t_5\}$ and $Of = \{t_2\}$. After identifying the features, we calculate the conditional probabilities between features and classes, $p(t \mid C)$ and $p(C \mid t)$. The next step is to obtain the doDs (i.e., degrees of Discrimination) defined in Equation (4) and Equation (5). Table 3 summarizes the conditional probabilities and doDs for each feature t. It is noteworthy that doD(t) is proposed for evaluating overlapping features, and thus only $doD(t_2)$ is available in Table 3.

### 3.3 Classification Evaluation

In general, the document classification process includes several stages, such as data collection, data pre-processing and labelling, feature construction, feature selection and classifier design. The main purpose of this study is to propose not a method for choosing the best classifier but a method for constructing and selecting good features. The proposed feature selection and embedding procedure described in Section 3.2 is applicable independent of the classifier, although performance may vary among classifiers. Naive Bayes, Decision Tree, and non-recursive neural networks are typical classifiers with a short training time but lower performance than linear classifiers (Do and Ng, 2005). In addition, since classification problems that deal with high dimensions, such as text, tend to be linearly separable (Agnihotri et al., 2017), the linear Support Vector Machine is used as the classifier in this paper.

Macro-F1 and micro-F1 are well-known classification performance metrics. In macro-averaging, the F-measure is computed for each class within the dataset and then averaged over all classes. In this way, equal weight is assigned to each class regardless of class frequency (Uysal and Gunal, 2012). The computation of Macro-F1 can be formulated as Equation (6) when there are K classes.

$\text{Macro-}F_1 = \frac{\sum_{k=1}^{K} F_k}{K}, \quad F_k = \frac{2 \times P_k \times R_k}{P_k + R_k}$
(6)

where the pair $(P_k, R_k)$ corresponds to the precision and recall values of class k, respectively.

On the other hand, in micro-averaging, the F-measure is computed globally without class discrimination. Hence, all classification decisions in the entire test dataset are considered. In case the classes in a collection are imbalanced, large classes dominate small ones in micro-averaging. The computation of Micro-F1 can be formulated as Equation (7).

$\text{Micro-}F_1 = \frac{2 \times P \times R}{P + R}$
(7)

where the pair (P, R) corresponds to the precision and recall values over all classification decisions within the entire dataset, not individual classes (Uysal and Gunal, 2014).
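The two metrics of Equations (6) and (7) can be computed directly from the predicted labels, as the sketch below shows. For single-label classification, micro-precision and micro-recall both equal accuracy, so micro-F1 reduces to accuracy.

```python
def macro_micro_f1(y_true, y_pred):
    """Macro-F1 (Equation 6) averages per-class F1 with equal class
    weight; micro-F1 (Equation 7) pools every decision, so for
    single-label data it equals plain accuracy."""
    classes = sorted(set(y_true))
    f_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(f_scores) / len(classes)
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return macro, micro

# Hypothetical labels: class A has F1 = 2/3, class B has F1 = 0.8.
macro, micro = macro_micro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Library implementations (e.g. scikit-learn's `f1_score` with `average="macro"` or `average="micro"`) would normally be used instead of this hand-rolled version.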

Finally, we measure Macro-F1, Micro-F1, precision, and recall by conducting 10-fold cross-validation. The procedure assigns 70% of the entire dataset to the training set and the remaining 30% to the test set. The classification model is trained and validated on the training set, and we evaluate the aforementioned performance measures on the test set.

## 4. EXPERIMENT

Figure 2 illustrates the main flow of the proposed feature selection method. The first phase is to collect data, and the data preprocessing phase follows. Discriminative and overlapping features are organized based upon the preprocessed data. We then generate a binary tree, which defines a hierarchy of multiple classes. The next phase is to train classifiers on the binary tree with the proposed dynamic feature selection method (see Section 4.3). Finally, numerical experiments are conducted to evaluate the proposed method.

### 4.1 Data Description

Three datasets were used to evaluate the performance of the proposed model. The first dataset comprises abstracts of academic papers on rainwater harvesting (RH), a technology intended to facilitate adaptation to climate change. The second dataset is the 5Newsgroup (5NG), which consists of the five computer-related newsgroups from the 20Newsgroup. The third dataset is the 20Newsgroup (20NG), one of the international standard datasets for text categorization, text mining and information retrieval research (Zhou et al., 2018). To satisfy the experimental conditions, we randomly sampled the 20 newsgroups into an imbalanced dataset. Table 4 summarizes these datasets. It should be mentioned that discriminative features are temporarily used for defining child nodes in the process of generating the hierarchy, and thus Table 4 only includes unique and overlapping features.

RH includes abstracts of 300 papers published up to 2017 and consists of four classes, which are related to rainwater harvesting feasibility and location, urban use, rural use, and water quality. The largest class, Class A, is two times larger than the smallest class, Class D, and the distribution of documents is imbalanced, as is the case with 20NG. 5NG resembles a uniform distribution. In the pre-processing step, we removed numbers, punctuation, all stop words, and rare words that occur fewer than four or five times. Meanwhile, letters are converted to lower case and words are lemmatized. Furthermore, to remove non-English words, we used the Grady Augmented dictionary in the qdapDictionaries package of the R software (Rinker, 2013). For term weighting, we used DF weighting, which is a simple metric independent of the class labels. It counts the number of documents in which a term appears. Despite its simplicity, its performance is known to be similar to that of IG and CHI if the keyword number is not too low (Taşcı and Güngör, 2013; Zhou et al., 2016). Finally, RH, 5NG and 20NG are represented using the document frequency method with 2,155, 8,606, and 17,774 features, respectively.

In RH, unique features that exist only in one class take a higher proportion than the overlapping features that appear in several classes, whereas 5NG and 20NG are the opposite. In RH, class B has the highest proportion of overlapping features (77.1%) and class C has the highest proportion of unique features (33.3%). In 5NG, the overlapping features and the unique features are most numerous in classes C and A, respectively. RH, 5NG and 20NG have average DFs of 5.4, 15.9 and 23.3, respectively, and over 99% of their unique features do not reach the average DF, as shown in Table 6. In the experiment, we exclude features whose frequency is one in order to reduce the computational burden. Hence, the number of unique features used in the experiment may be less than the number of discriminative features in leaf nodes.

In order to evaluate the performance of the proposed method, the conventional methods, namely global feature selection, local feature selection, unique feature selection, and information gain feature selection, are constructed as follows.

• Based on document frequency, the top 20% of features are selected according to the Pareto principle from features whose document frequencies are greater than 3 in RH, 4 in 5NG and 5 in 20NG; this set is called the global feature set (GFS). The GFS of RH contains 252 features, whose document frequency is 12 or higher. Of these, only three unique features are included. The GFSs of 5NG and 20NG consist of 769 and 1,582 features, respectively, and their document frequencies are 39 and 58 or more, respectively.

• The local feature set (LFS) includes the top 15% of features in each class based on the document frequency criterion. We apply 15% instead of 20% to match the number of features with GFS.

• The unique feature set (UFS) contains no overlapping features, and we select the same number of unique features as GFS.

• The information gain feature set (IGFS) selects the same number of features as GFS using information gain, which is a typical global feature selection method.
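As an illustration, the GFS-style baseline described in the first bullet can be sketched as follows; the alphabetical tie-breaking and the exact handling of the threshold are assumptions not fixed by the description above.

```python
from collections import Counter

def global_feature_set(docs, min_df, top_ratio=0.2):
    """Sketch of the GFS baseline: rank terms by document frequency,
    drop those at or below min_df, keep the top share (Pareto-style
    20% by default). Tie handling is an assumption."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count documents, not occurrences
    eligible = [(t, f) for t, f in df.items() if f > min_df]
    eligible.sort(key=lambda x: (-x[1], x[0]))   # by DF desc, then name
    k = max(1, int(len(eligible) * top_ratio))
    return [t for t, _ in eligible[:k]]
```

Because ranking uses document frequency alone, low-frequency unique features are naturally pushed out of the selected set, which is exactly the weakness the proposed DOFS method targets.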

In GFS and LFS, the proportion of unique features is significantly lower than that of overlapping features. This is because GFS and LFS select features based on frequency and thus ignore unique features, which have relatively low frequency. On the other hand, IGFS contains a relatively higher proportion of unique features.

### 4.2 Binary-Tree Generation

Table 8 shows the cosine similarity of the class pairs used to generate the binary tree. In RH, the similarity of the class A-B pair is highest at 0.86, and the similarity of the class C-D pair, which has a high proportion of unique features, is lowest at 0.42. In particular, pairs involving class D show relatively low similarity compared to the other pairs, which can be attributed to the heterogeneous nature of class D. In 5NG, all class pairs show a similarity of 0.7 or more. Following the procedure described in Section 3.1, the k-means operation based on cosine similarity generated the binary trees shown in Figure 3 and Figure 4. In RH, class D, with the most heterogeneous properties, is decomposed at the first level, and classes A and B, with the highest similarity, are decomposed at the end. In 5NG, classes C and D, with the highest similarity, are decomposed at the first level, and classes A and E are decomposed at the end.

### 4.3 Feature Selection and Training

The final feature set for classifier learning is built by adding overlapping features to the discriminative features of the nodes at each level. We used a dynamic feature selection method, which means that the number and composition of features used for learning differ at each level. Overlapping features are added in descending order of the doD obtained from Equation (5), and the number added is the one at which maximum classification performance is obtained.

As shown in Table 9, the number of features maximizing the classification performance (i.e., F1 value) of each class in RH is 343 at level 1, 377 at level 2, and 92 at level 3. At levels 1 and 2, the proposed method used more features than GFS, but fewer features were used at level 3. 5NG used 570 features at level 1, 573 and 674 features at level 2, and 753 features at level 3. At all levels, fewer features were used than in GFS. Without considering the so-called blocking or error propagation problem (Hernández et al., 2014), wherein high-level misclassifications are transmitted to lower levels, the proposed method outperforms IGFS and achieves overall performance of 0.792, 0.895 and 0.900 for RH, 5NG and 20NG, respectively.

Figures 5 and 6 show the changes in classification performance as the number of overlapping features varies. As shown in Figure 5, which presents the performance on RH, classes A, C, and D show stable performance improvements as the number of overlapping features increases. However, class B experiences a decrease in classification performance when too many overlapping features are added. This suggests that adding overlapping features plays a negative role for class B, which has a relatively high proportion of overlapping features. For 5NG, performance improves dramatically as the number of overlapping features increases and reaches a stable value once 50 or more overlapping features are added (see Figure 6).

### 4.5 Results

The performance evaluation of the proposed method (DOFS) consists of two aspects: the performance improvement of hierarchical classification and that of dynamic feature selection. The former can be confirmed by comparing the performances of conventional methods (GFS, LFS, UFS, IGFS) before and after applying the binary-tree. The latter can be verified by comparing the results of DOFS and conventional methods after the binary- tree application.

The following three conclusions are drawn from the results of the existing methods summarized in Tables 10 and 11. First, flat classification achieves higher classification performance than hierarchical classification without dynamic feature selection. This means that hierarchical classification may be efficient in terms of the number of classifiers used, but it does not guarantee better classification performance. The use of a static feature selection method helps explain this observation: there is a limit to the performance improvement achievable with a method that uses the same features while ignoring the characteristics of each level. Therefore, to improve classification performance, dynamic feature selection considering the characteristics of each class is required.

Second, a feature set that mixes overlapping and unique features produces better classification performance than a single-type feature set. For example, in 5NG, the classification performances of GFS and UFS, which are composed only of overlapping features or only of unique features, are 0.763 and 0.395, respectively, both lower than the 0.796 of IGFS, which contains both overlapping and unique features. This raises the necessity of considering high frequency and uniqueness simultaneously, rather than only one of them, when selecting features.

Third, there is no positive correlation between the number of features and classification performance: classification performance is sensitive to the number of features, but adding more features does not always guarantee better performance. For example, LFS, with a relatively large number of features, showed lower classification performance than IGFS. In addition, the classification performances of GFS, UFS, and IGFS differ even when they use the same number of features. These results suggest that classification performance is correlated not only with the number of features but also with class structure, such as the composition ratio of unique and overlapping features. This is in line with our second conclusion.

Comparison of the conventional methods and the proposed method leads to the following conclusions. For RH and 20NG, DOFS outperforms the conventional methods even when the same hierarchical classification is applied, as shown in the H and DOFS columns of Tables 10 and 11. This is regarded as an effect of dynamic feature selection, in which features are selected by considering the characteristics of each level of the binary tree. In particular, the average classification performance for Class B, which has a relatively high proportion of overlapping features, falls below 0.5 under the conventional methods. However, the dynamic feature selection method improves the classification performance of Class B to 0.615, which has a positive effect on overall classification performance and yields the highest F1 value. The same result can be seen in the talk.religion.misc class of 20NG. In 5NG, the performance of DOFS is higher than that of GFS, LFS, and UFS but slightly lower than that of IGFS. However, DOFS has an advantage in terms of efficiency, as it uses about 14% fewer features than IGFS.

## 5. CONCLUSIONS

In this study, a binary tree is constructed to improve the classification performance of highly similar classes, and a dynamic feature selection method based on the characteristics of each level is proposed. Extensive experiments have been conducted on three benchmark datasets using four different feature selection methods with a support vector machine. The proposed method showed classification performance superior to the conventional methods, not only for classes with high similarity but also overall. We also demonstrated the efficiency of the classification by using fewer features than the conventional methods.

There are several major findings in this research as follows.

• From the viewpoint of the classification method, flat classification has better performance than hierarchical classification without dynamic feature selection for the data used in the experiment, as shown in Tables 10 and 11, which compare the classification performance.

• The dynamic feature selection method is more effective than the static method for improving classification performance in a hierarchical classification scheme. In Tables 10 and 11, the proposed dynamic feature selection method (i.e., DOFS) has better performance than the other methods that use a hierarchy (see the H columns in Tables 10 and 11).

• When selecting features, considering both frequency and discriminative power is more effective than considering just one of them in terms of improving classification performance. Tables 10 and 11 show that the proposed DOFS method, which considers both frequency and discriminative power, outperforms GFS and UFS, which select features based only on frequency and only on discriminative power, respectively.

• Classification performance is correlated not only with the number of features, but also with the class structure, such as the composition ratio of unique and overlapping features and the class size. Figure 6 shows that adding more features improves the classification performance, but the performance varies across classes, as shown in Tables 10 and 11. For example, applying DOFS to the RH dataset provides a better Macro-F1 score for Class C or D than for Class A or B, because more unique features are used to classify Classes C and D.

These findings suggest some interesting directions for future research. First, this study determines the best number of features by observing the changes in classification performance as the number of features increases. However, the numerical analysis shows that the relationship between the number of features and classification performance is not necessarily a positive correlation. For future research, it would be desirable to develop a method that finds an optimal number of features by dividing the features into several blocks according to their degree of uniqueness or overlap and investigating the relations among these blocks.

Second, the focus of this paper is on finding good features, and we simply considered a linear SVM as the classifier. As described in Section 3.3, several other classifiers, such as Naive Bayes, decision trees, and non-recursive neural networks, have been widely used. Therefore, this work should be extended by investigating which classifier best fits the proposed feature selection method. In this sense, it is worth considering the use of deep neural networks (DNNs) and/or contextual representations of text documents (e.g., Word2Vec). Our preliminary tests showed that a DNN provides no significant improvement in classification performance, but a growing body of literature has recently suggested DNN-based methods (Zhang et al., 2017).
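A linear SVM text classifier of the kind used here can be sketched with scikit-learn; the toy corpus, labels, and query document below are illustrative placeholders, not the paper's benchmark data or its exact feature weighting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical two-class toy corpus standing in for the benchmark datasets
docs = ["rain tank storage harvest", "tank filter storage pump",
        "court ruling law appeal", "judge law appeal verdict"]
labels = ["water", "water", "legal", "legal"]

# TF-IDF representation followed by a linear SVM, as in the experiments
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

pred = clf.predict(["storage pump rain"])
```

In the actual pipeline, the vectorizer's vocabulary would be restricted to the feature set selected at each node of the binary tree before fitting.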

Lastly, this paper used cosine similarity and formed the hierarchy in a top-down fashion. Although this method performs efficiently, it would be worthwhile to investigate better methods for generating the hierarchy.

## ACKNOWLEDGEMENT

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2017S1A5A2A03067552).

## Figure

Figure 1. An illustrative example of a binary tree.

Figure 2. Research flow.

Figure 3. The binary-tree of rainwater harvesting.

Figure 4. The binary-tree of 5newsgroup.

Figure 5. The classification performance of rainwater harvesting.

Figure 6. The classification performance of 5newsgroup.

## Table

Table 1. Number of classifiers

Table 2. Notation summary

Table 3. An example of feature selection

Table 4. Summary of datasets

Table 5. Feature distribution in each class

Table 6. Document frequency of unique features

Table 7. Feature distribution and classification result according to feature selection

Table 8. Cosine similarity in each class pair

Table 9. Number of features in each level

Table 10. Classification performance based on binary-tree: RH and 5NG

Table 11. Classification performance based on binary-tree: 20NG

## REFERENCES

1. Agnihotri, D., Verma, K., and Tripathi, P. (2017), Variable global feature selection scheme for automatic classification of text documents, Expert Systems with Applications, 81, 268-281.
2. Aizawa, A. (2003), An information-theoretic perspective of tf–idf measures, Information Processing & Management, 39(1), 45-65.
3. Beyan, C. and Fisher, R. (2015), Classifying imbalanced data sets using similarity based hierarchical decomposition, Pattern Recognition, 48(5), 1653-1672.
4. Bharti, K. K. and Singh, P. K. (2015), Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering, Expert Systems with Applications, 42(6), 3105-3114.
5. Cai, J., Luo, J., Wang, S., and Yang, S. (2018), Feature selection in machine learning: A new perspective, Neurocomputing, 300, 70-79.
6. Chatterjee, S., Jose, P. G., and Datta, D. (2019), Text classification using SVM enhanced by multithreading and CUDA, International Journal of Modern Education & Computer Science, 11(1), 11-23.
7. Chen, J., Wang, C., and Wang, R. (2009), Adaptive binary tree for fast SVM multiclass classification, Neurocomputing, 72(13-15), 3370-3375.
8. Chen, K., Zhang, Z., Long, J., and Zhang, H. (2016), Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications, 66, 245-260.
9. Chen, Y. T. and Chen, M. C. (2011), Using chi-square statistics to measure similarities for text categorization, Expert Systems with Applications, 38(4), 3085-3090.
10. Do, C. B. and Ng, A. Y. (2005), Transfer learning for text classification, Advances in Neural Information Processing Systems, 18, 299-306.
11. Du, Y., Liu, J., Ke, W., and Gong, X. (2018), Hierarchy construction and text classification based on the relaxation strategy and least information model, Expert Systems with Applications, 100, 157-164.
12. Duan, K. B. and Keerthi, S. S. (2005), Which is the best multiclass SVM method? An empirical study, International Workshop on Multiple Classifier Systems, Springer, Berlin, Heidelberg, 278-285.
13. Forman, G. (2003), An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, 3, 1289-1305.
14. Fu, R., Li, B., Gao, Y., and Wang, P. (2018), CNN with coarse-to-fine layer for hierarchical classification, IET Computer Vision, 12(6), 892-899.
15. Guan, R., Shi, X., Marchese, M., Yang, C., and Liang, Y. (2010), Text clustering with seeds affinity propagation, IEEE Transactions on Knowledge and Data Engineering, 23(4), 627-637.
16. Guru, D. S., Suhil, M., Raju, L. N., and Kumar, N. V. (2018), An alternative framework for univariate filter based feature selection for text categorization, Pattern Recognition Letters, 103, 23-31.
17. Hernández, J., Sucar, L. E., and Morales, E. F. (2014), Multidimensional hierarchical classification, Expert Systems with Applications, 41(17), 7671-7677.
18. Jadon, E. and Sharma, R. (2017), Data mining: Document classification using Naive Bayes classifier, International Journal of Computer Applications, 167(6), 13-16.
19. Jiang, C., Liu, Y., Ding, Y., Liang, K., and Duan, R. (2017), Capturing helpful reviews from social media for product quality improvement: A multi-class classification approach, International Journal of Production Research, 55(12), 3528-3541.
20. Kang, S., Cho, S., and Kang, P. (2015), Constructing a multi-class classifier using one-against-one approach with different binary classifiers, Neurocomputing, 149, 677-682.
21. Kohavi, R. and John, G. H. (1997), Wrappers for feature subset selection, Artificial Intelligence, 97(1-2), 273-324.
22. Leopold, E. and Kindermann, J. (2002), Text categorization with support vector machines: How to represent texts in input space?, Machine Learning, 46(1), 423-444.
23. Lin, Y. S., Jiang, J. Y., and Lee, S. J. (2013), A similarity measure for text classification and clustering, IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575-1590.
24. Liu, K. H., Zeng, Z. H., and Ng, V. T. Y. (2016), A hierarchical ensemble of ECOC for cancer classification based on multi-class microarray data, Information Sciences, 349-350, 102-118.
25. Liu, L., Kang, J., Yu, J., and Wang, Z. (2005), A comparative study on unsupervised feature selection methods for text clustering, In 2005 International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, China, 597-601.
26. Lorena, A. C., De Carvalho, A. C., and Gama, J. M. (2008), A review on the combination of binary classifiers in multiclass problems, Artificial Intelligence Review, 30(1-4), 19-37.
27. Madzarov, G., Gjorgjevikj, D., and Chorbev, I. (2009), A multi-class SVM classifier utilizing binary decision tree, Informatica, 33(2), 233-241.
28. Manochandar, S. and Punniyamoorthy, M. (2018), Scaling feature selection method for enhancing the classification performance of Support Vector Machines in text mining, Computers & Industrial Engineering, 124, 139-156.
29. Naik, A. and Rangwala, H. (2019), Improving large-scale hierarchical classification by rewiring: A data-driven filter based approach, Journal of Intelligent Information Systems, 52(1), 141-164.
30. Peng, L. and Liu, Y. (2018), Feature selection and overlapping clustering-based multilabel classification model, Mathematical Problems in Engineering, 2018, 1-12.
31. Perrin, P. and Petry, F. (1999), An information-theoretic based model for large-scale contextual text processing, Information Sciences, 116(2-4), 229-252.
32. Quinlan, J. R. (1986), Induction of decision trees, Machine Learning, 1(1), 81-106.
33. Rehman, A., Javed, K., and Babri, H. A. (2017), Feature selection based on a normalized difference measure for text classification, Information Processing & Management, 53(2), 473-489.
34. Rehman, A., Javed, K., Babri, H. A., and Asim, M. N. (2018), Selection of the most relevant terms based on a max-min ratio metric for text classification, Expert Systems with Applications, 114, 78-96.
35. Reinsel, D., Gantz, J., and Rydning, J. (2017), Data age 2025: The evolution of data to life-critical, Don’t Focus on Big Data, 2.
36. Rinker, T. W. (2013), qdapDictionaries: Dictionaries to Accompany the qdap Package, University at Buffalo, Buffalo, NY, USA.
37. Sabbah, T., Selamat, A., Selamat, M. H., Al-Anzi, F. S., Viedma, E. H., Krejcar, O., and Fujita, H. (2017), Modified frequency-based term weighting schemes for text classification, Applied Soft Computing, 58, 193-206.
38. Seo, Y. and Shin, K. S. (2019), Hierarchical convolutional neural networks for fashion image classification, Expert Systems with Applications, 116, 328-339.
39. Silla, C. N. and Freitas, A. A. (2011), A survey of hierarchical classification across different application domains, Data Mining and Knowledge Discovery, 22(1), 31-72.
40. Silva-Palacios, D., Ferri, C., and Ramírez-Quintana, M. J. (2018), Probabilistic class hierarchies for multiclass classification, Journal of Computational Science, 26, 254-263.
41. Sun, A., Lim, E. P., Ng, W. K., and Srivastava, J. (2004), Blocking reduction strategies in hierarchical text classification, IEEE Transactions on Knowledge and Data Engineering, 16(10), 1305-1308.
42. Taşcı, Ş. and Güngör, T. (2013), Comparison of text feature selection policies and using an adaptive framework, Expert Systems with Applications, 40(12), 4871-4886.
43. Tegegnie, A. K., Tarekegn, A. N., and Alemu, T. A. (2017), A comparative study of flat and hierarchical classification for Amharic news text using SVM, International Journal of Information Engineering & Electronic Business, 9(3), 36-42.
44. Uysal, A. K. (2016), An improved global feature selection scheme for text classification, Expert Systems with Applications, 43, 82-92.
45. Uysal, A. K. and Gunal, S. (2012), A novel probabilistic feature selection method for text classification, Knowledge-Based Systems, 36, 226-235.
46. Uysal, A. K. and Gunal, S. (2014), Text classification using genetic algorithm oriented latent semantic features, Expert Systems with Applications, 41(13), 5938-5947.
47. Wu, C., Liu, F., and Zhu, B. (2015), Control chart pattern recognition using an integrated model based on binary-tree support vector machine, International Journal of Production Research, 53(7), 2026-2040.
48. Yang, W. Y., Lu, B. L., and Kwok, J. T. (2011), Incorporating cellular sorting structure for better prediction of protein subcellular locations, Journal of Experimental & Theoretical Artificial Intelligence, 23(1), 79-95.
49. Zhang, C., Liu, C., Zhang, X., and Almpanidis, G. (2017), An up-to-date comparison of state-of-the-art classification algorithms, Expert Systems with Applications, 82, 128-150.
50. Zhou, H., Guo, J., Wang, Y., and Zhao, M. (2016), A feature selection approach based on interclass and intraclass relative contributions of terms, Computational Intelligence and Neuroscience,
51. Zhou, H., Zhang, Y., Liu, H., and Zhang, Y. (2018), Feature selection based on term frequency reordering of document level, IEEE Access, 6, 51655-51668.
52. Zong, W., Wu, F., Chu, L. K., and Sculli, D. (2015), A discriminative and semantic feature selection method for text categorization, International Journal of Production Economics, 165, 215-222.