• Editorial Board +
• For Contributors +
• Journal Search +
Journal Search Engine
ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.16 No.2 pp.155-164
DOI : https://doi.org/10.7232/iems.2017.16.2.155

# Multi-Valued Classification of Text Data Based on an ECOC Approach Using a Ternary Orthogonal Table

Leona Suzuki*, Kenta Mikawa, Masayuki Goto
Graduate School of Creative Science and Engineering, Waseda University, JAPAN
Department of Information Science, Shonan Institute of Technology, Kanagawa, JAPAN
School of Creative Science and Engineering, Waseda University, JAPAN
Corresponding Author, davinci.1031@suou.waseda.jp
March 27, 2016 October 7, 2016 November 14, 2016

## ABSTRACT

Because of the advancements in information technology, a large number of document data has been accumulated on various databases and automatic multi-valued classification becomes highly relevant. This paper focuses on a multivalued classification technique that is based on Error-Correcting Output Codes (ECOC) and which combines several binary classifiers. When predicting the category of a new document data, the outputs of the binary classifiers are combined to produce a predicted value. It is a known problem that if two category sets have an imbalanced amount of training data, the prediction accuracy of a binary classifier is low. To solve this problem, a previous study proposed to employ the Reed-Muller (RM) codes in the context an ECOC approach for resolving the imbalance in the cardinality of the training data sets. However, RM codes can equalize the amount of between training data of two category sets only for a specific number of categories. We want to provide a method that can be employed for a multi-valued classification with an arbitrary number of categories. In this paper, we propose a new configuration method combining binary classifiers with categories, which are not used for classification. This method allows us to reduce the amount of training data for each binary classifier while improving the balance of the training data between two category sets for each binary classifier. As a result, the computational complexity can be decreased. We verify the effectiveness of our proposed method by conducting a document classification experiment.

## 1.INTRODUCTION

Because of the advancements in information technology, a large number of document data from different levels of business processes has been accumulated on various databases. For instance, industrial management deals with various kinds of digital document data and analyzing the textual data can provide new insights for management purposes. Because the amount of digital text data is large, automatic multi-valued classification is important. Multi-valued classification is a classification problem with more than three category labels. Usually, the documents are classified into multi-valued categories, and, therefore, multi-valued classification is need for document classification. There are two approaches for multi-valued classification problems. One approach is to constructs a unique multi-valued classifier, and the second approach produces a combination of binary classifiers (Dietterich and Bakiri, 1995; Huang et al., 2006; Ikeda, 2010). We focus on the multi-valued classification based on Error-Correcting Output Codes (ECOC) which is an effective method combining several binary classifiers (Dietterich and Bakiri, 1995). The ECOC approach uses a numerical table called a code table whose rows represent categories and the columns represent configurations of each binary classifier. A set of categories used for binary classification is called a category set. When predicting the category of new text data, the outputs of all binary classifiers are combined to acquire a predicted value.

There are two approaches for constructing a code table. The first approach is the adaptive generation method (Escalera et al., 2008; Pujol et al., 2006; Zhang, 2015; Zhong and Cheriet, 2013) and the second approach is the non-adaptive method (Ogihara et al., 2013; Oyama et al., 2008). The former approach changes the code table while learning the training data. The latter constructs the code table and fixes before starting the training phase, which means that the configuration of all binary classifiers can be obtained before the training data is given. Therefore, this method has the advantage of allowing parallel computation for each binary classifier. In this study, we focus on the code table construction of the latter approach.

However, if the amount of training data is imbalanced between two category sets, the prediction accuracy of the binary classifier is low. A previous study proposed an approach based on the Reed-Muller (RM) code for resolving the imbalance between the numbers of training data in the context of ECOC approach (Ogihara et al., 2013). However, this approach can equalize the amount of training data of two category sets only for a specific number of categories. We want to provide a method that can be employed for a multivalued classification with an arbitrary number of categories.

In this paper, we propose a new configuration method that combines binary classifiers with several categories, which are not used for classification. This method enables us to improve the balance of the training data between two categories to enhance the classification accuracy. As a result, the amount of training data learned for each binary classifier can be reduced, and, the computational complexity can be decreased. We verify the effectiveness of our proposed method by conducting a document classification experiment.

## 2.PRELIMINARIES

### 2.1.Classification using RVM

Relevance Vector Machine (RVM) is a method based on the probabilistic approach for solving regression and classification (Tipping, 2001). For classification, the RVM method determines the probability of belonging to a certain category as output and has a relatively high prediction accuracy. Let x be a feature vector in the defined feature space and c∈{c1, c2} be a binary category label. A set of N training document samples is denoted by ${ ( x n , t n ) } n = 1 N$, where tn ∈{c1, c2}. The probability of the category label ck (k =1, 2) conditioned on x is expressed by using logistic regression as follows.(1)(2)

$p ( c = c k | x ) = 1 1 + exp ( − f ( x ) )$
(1)

$f ( x ) = ∑ i = 1 , t i = c k N a i K ( x , x i )$
(2)

where $a i ∼ N ( 0 , γ i − 1 )$ and $γ 1 − 1 , γ 2 − 1 , ⋯ , γ N − 1$ are obtained by maximizing a posterior probability of $γ = ( γ 1 − 1 , γ 2 − 1 , ⋯ , γ N − 1 )$, which is a hyper-parameter. Moreover K(⋅,⋅) is a kernel function which calculates the inner product of two input data mappings to a higher dimensional space, and ai represents a weight parameter. By maximizing the posterior probability, almost all $γ i − 1$ become 0. We call a vector xi with a non-zero γi value a relevance vector (RV). The function f (x) is decided by using the relevance vectors (RVs). The classification accuracy of the RVM method is high, and it possesses other desirable properties. On the other hand, the computational complexity of the RVM method for learning a set of training data is also high. If we apply the RVM method to a model with M basis functions, the computational complexity of evaluating the inverse matrix of size M reaches O(M3) . Let K be the number of categories. For a multi-valued classification using a unique RVM classifier, we have the disadvantage that the computational learning complexity is K3 times higher than that of the binary RVM method.

### 2.2.ECOC Approach

In this paper, we focus on the multi-valued classification problem with more than three category labels. Let K be the number of categories and, C ={c1, c2, …, cK} the set of category labels. The ECOC approach is a multivalued classification method that applies the error correcting technique from code theory to automatic classification. This approach estimates the category ck (k =1,…, K) of the new unlabeled data by combining the outputs of all binary classifiers. Each binary classifier sorts the data into a category set based on a code table which is configured with the two binary digits 0 and 1. We denote a code table by W, where W is a K×R matrix and the number of binary classifiers is denoted by R. The column vectors of W represent the configurations of the binary classifiers, which defines the classification rule between the set of categories corresponding to the digit 1 in the column vector of W and the set of categories corresponding to the digit 0 in the same vector. Thereby, when we reverse the binary digits in the two binary classifiers, we obtain identical classifiers. Moreover, the vector of the i -th row is called the codeword of the category ci and is denoted by $W c i$.

In addition to the conventional binary code table, which uses only the binary digits 0 and 1, there is also a ternary code table, which allows categories that are not used for classification (Escalera et al., 2010), and are represented by an asterisk symbol ‘∗’. If $W c i ⋅ r$ is ∗, then the training data of category ci is not used for learning the binary classifier r . Hence, when a ternary code table with entries 0, 1, and * is used, the amount of training data and the computational complexity decrease compared to using a binary code table, because some categories which are not used for the binary classification.

As mentioned above, we focus on the RVM method for binary classification. Let Gr be the output of the r -th binary classifier (r =1,…, R) and $W c i ⋅ r$ be the r -th element of $W c i$ . The classification criterion is defined as follows as(3)

$c ^ = arg max c i ∏ r = 1 R G r W c i ⋅ r ( 1 − G r ) 1 − W c i ⋅ r$
(3)

The classification accuracy of the ECOC approach is high if the Hamming distances between the codewords are large. The Hamming distance dH (x, y) between x and y is given by(4)(5)

$d H ( x , y ) = ∑ i = 1 d δ ( x i , y i )$
(4)

$δ ( x i , y i ) = { 0 , x i = y i 1 , x i ≠ y i$
(5)

This condition is true because a codeword is a vector corresponding to a category on the R-dimensional space, and the Hamming distance is a measure of the distance between two vectors. Thus, two categories with a large Hamming distance are more easily distinguishable. Therefore, when we increase the number of classifiers, and the classification accuracy rises. However, the computational complexity increases at the same time.

## 3.CONVENTIONAL METHODS

In this section, we describe the conventional code table generation methods proposed in past studies. As we mentioned above, we can classify these methods into two main categories, namely adaptive methods and nonadaptive methods. Adaptive methods change the code table while adapting the training data. Non-adaptive methods construct the code table and fix it before starting the training phase. Section 3.1 describes the adaptive methods and Section 3.2 describes the non-adaptive methods.

There are many adaptive methods. These methods construct the code tables without considering the Hamming distances between the codewords. Crammer and Singer (2002) proposed a method that searches a code table with a small norm and minimizing empirical loss. Another type of adaptive methods is based on tree structure. Pujol et al. (2006) proposed Discriminant ECOC, a method based on discriminant tree structures. This method constructs a hierarchical code table that maximizes a discriminative criterion based on mutual information (Cover and Thomas, 2012). Other ECOC methods also use this discriminant tree structure (Escalera et al., 2006, 2007; Pujol et al., 2008; Xue et al., 2015). Moreover, some methods derive sub-categories from original set of categories and use the sub-category information to simplify the discrimination of the binary classifiers (Bouzas et al., 2010; Pujol et al., 2008; Zhang, 2015). There are many adaptive methods proposed in the literature (Escalera et al., 2009a; Escalera et al., 2011), but they change the code table while adapting the training data. Therefore, these adaptive methods cannot compute the binary classifiers in parallel and the computational complexity of learning is high.

In this paper, we focus on non-adaptive generation methods of code tables because they allow the parallel computation of the binary classifiers. In the following, we present the standard non-adaptive generation methods of code tables.

#### 3.2.1.One-vs.-Rest and One-vs.-One

The one-vs.-rest methods and the one-vs.-one methods are the standard non-adaptive methods. The one-vs.- rest methods learn the K binary classifiers. Each of the K classifiers sorts the input data into a single category or the remaining categories. The one-vs.-one use classifiers, which sort all conceivable pairs of categories. These methods are simple but effective (Rifkin and Klautau, 2004).

#### 3.2.2.Random Code

When generating random codes, the sorting into the category sets is determined randomly under certain rules. There is no guarantee that random code is of high performance because it is based on a random strategy. There fore, code is used for benchmark methods and the initial matrix of the code tables’ generation method by iterative processing (Chmielnicki, 2015).

There are two types of random code, namely dense random codes and sparse random codes (Allwein et al., 2000). When generating dense random code, you start with a large number of binary random code tables whose elements 0 and 1 were chosen randomly. The code tables have ⎡⎢10log2K ⎤⎥ columns and ⎡⎢ x⎤⎥ is the smallest integer not smaller than x . You then need to choose the code table with the highest minimum Hamming distance between the codewords. Generating of sparse random code is the same as generating dense random code except that the code tables’ randomly chosen elements can be 0, 1, or ∗. The probability of choosing a ∗ is 0.5 and the probability of choosing a 0 or 1 is 0.25. The code tables have ⎡⎢15log2K ⎤⎥ binary classifiers.

#### 3.2.3.Exhaustive Code

The exhaustive code is a binary code table that contains all conceivable binary classifiers (Dietterich and Bakiri, 1995). The configuration method of the exhaustive code is as follows.

• (1) All elements of $W c i$ are set to $W c i$ =1.

• (2) $W c 2$ consists of 2K −2 zeros, followed by 2K−2 −1 ones.

• (3) $W c 3$ consists of 2K −3 zeros, followed by 2K −3 ones, followed by 2K −3 zeros, and, followed by 2K−3 −1 ones.

• (4) Similarly to the above, $W c k$ contains alternating runs of 2Kk zeros and ones.

Table 1 is an example of the exhaustive code when K = 4. The number of binary classifiers R is 24−1 −1= 7. Since the number of classifiers is the highest in all binary code tables, the classification performance is relatively high for a given category number K = 4 However, the computational cost is the highest of the ECOC approach because the number of classifiers is large.

#### 3.2.4.BCH Code

In the field of coding theory, the BCH code is known as an effective code (Cover and Thomas, 2012). It is created based on error correction capability and code length. Here, v, κ, τ denote the codeword length, the number of information bits and the number of error correction bits. The corresponding BCH code is denoted by (v, κ, τ). When the BCH code is used for the ECOC approach, ν and 2τ +1 correspond to R and the minimum Hamming distance among the codewords. Therefore, the minimum Hamming distance can be chosen freely in the BCH code.

As mentioned above, in the ECOC approach, two categories with a large Hamming distance are more easily distinguishable. Therefore, the BCH code is suitable for creating a code table because the minimum Hamming distance of the codewords can be chosen freely. Table 2 shows an example of (15, 5, 3) BCH code in case of K = 8

#### 3.2.5.Reed-Muller Code

According to the field of coding theory, the RM (Reed-Muller) code possesses a beautiful structure. It is also suitable for the ECOC approach (Cover & Thomas, 2012). An example of an RM code in the case of K = 8 is shown in Table 3.

To apply the RM code, the number of categories must be a power of 2, and the number of digits 0 and digits 1 in each column of the code table must be equal to K/2. Therefore, if two categories have the same amount of training data, then the amount of training data for each category set is the same. In addition, the RM code has two more properties. First, the Hamming distance between the codewords representing each category is relatively large. Second the Hamming distance between each classifier is equal. Therefore, when the number of catego- ries is a power of 2, the ECOC approach with the RM code can combine classifiers with a high classification accuracy.

## 4.PROPOSED METHOD

In this paper, we propose a new configuration method that combines binary classifiers with several categories that are not used for classification. After outlining our proposal, we present the details of the proposed algorithm for constructing the table.

### 4.1.The Outline of the Proposed Method

When the RM code is used for the ECOC approach, the K categories must be equally divided into two category sets. Therefore, if K is an odd number, then this approach cannot be applied. In most cases, we can solve this problem by dividing all categories into three category sets and equalizing the amount of training data between two of the three category sets. Ternary code tables are configured with three category sets. Therefore, we propose the ECOC method based on a ternary code table whose configuration represents the classification between two category sets with the same amount of training data. The remaining category set is not used in the classification.

### 4.2.The Exploratory Configuration Method of Ternary Code Tables

In this paper, we focus on equally dividing the categories into three category sets and on the classifying method to construct a code table such that two out of three category sets are classified. A code table dividing all categories equally into three category sets can be created from a numerical table with an equal amount of the three elements in each column. For example, the ternary orthogonal table used in the experimental design satisfies this condition. An orthogonal table, which is also called an orthogonal array, is often used in design experiments. The table is arranged that all ordered tuples of the symbols appear equally often. A ternary orthogonal table is an orthogonal table with the symbols 0, 1, and 2. Table 4 gives an example of a ternary orthogonal table. the ordered tuple Means the combination of the symbol for a set of two columns. For example, the ordered tuple on the first row is (0, 10), and on the second row, it is (0, 1) for the set of first and second columns. All ordered tuples are given by (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2) and every tuples appears just once for every combination of two columns. Because every tuple appears equally often in an orthogonal table, a symbol appears equally often in any column.

To raise the classification accuracy by increasing the number of classifiers in the proposed method, let the ternary orthogonal table be the initial matrix. By adding column vectors whose three elements are equal sequentially to the initial ternary matrix, we can expect to acquire the desirable ternary code table. By adding column vectors like this, the number of classifiers increases and the classification accuracy can be improved. If the arbitrary column vectors are added to the initial code table directory, two problems occur. First one is that the minimum Hamming distances between any two codewords may become relatively small because an arbitrary addition does not take into account the Hamming distance between categories. Second a classifier similar to the classifiers in the original code table may be added.

To solving first problem, we focus on the two categories that are most similar, i.e., the pair of categories whose Hamming distance is the smallest. The smaller the Hamming distance, the smaller the difference between the categories. Therefore, we choose the vector that makes the minimum Hamming distance larger. For solving the second problem, the similarities between the classifiers are calculated based on the Hamming distance to explore a dissimilar classifier from any one of the other classifiers. If the Hamming distance between two classifiers is 0 or K, the classifiers are identical binary classifiers (i.e., the classification rule must be the same when using those classifiers) because the two binary classifiers with 0 and 1 reversed become identical to each other. This means that the two classifiers whose Hamming distance is close to 0 or K have a high similarity. Therefore, in this study, the similarity is defined as follows. If the Hamming distance between the column vectors is K / 2 or more, the similarity is defined by the Hamming distance. Otherwise, the similarity is defined by K-Hamming distance. That is, the closer the Hamming distance between two classifiers is to K / 2, the less similar the classifiers. Based on this similarity measure, we choose the column vector with the smallest similarity and the minimum number of similar classifiers.

In the next step, an obtained ternary numerical table whose elements are 0, 1, or 2 is converted to the ternary code table whose elements are 0, 1, or ∗. The method of conversion is that one of the elements 0, 1, or 2 is converted to ∗, and the other two elements are converted to 0 and 1. The conversion has three patterns. In the proposed method, all patterns are implemented, and the three code tables that are generated by converting are combined into one code table.

### 4.3.The Algorithm for Constructing the Code Table

In this section, we describe the details of the proposed algorithm for constructing the code table. First, we create a ternary orthogonal table. Ternary orthogonal tables with K or more rows are prepared. From these tables, we select the one with the minimum number of rows. Let KH be the number of rows in the selected orthogonal table. Next, we add the column vector to the selected orthogonal table according to two criteria. The first criterion is to enlarge the minimum Hamming distance between the rows. The second criterion is to minimize the maximum similarity and the number of classifiers with maximum similarity. The added vector is a K dimensional ternary vectors with KH / 3 elements of symbol of 0, 1, and 2. After adding the vector until a given threshold, the obtained ternary numerical table with elements are 0, 1, and 2 is converted to a ternary code table with elements 0, 1, and ∗. In the following algorithm, α is the threshold for adding columns and, dH (x, y) is the Hamming distance between x and y.

• Step 1) Let H be a ternary orthogonal table with KH rows and NH columns, hi (i =1,…, NH ) the column vector of H and wk (k =1,…, KH ) the row vector of H.

• Step 2) The column candidates are all ternary column vectors which are neither identical nor complementary to each other and have an equal number of three elements, namely KH / 3. Let $h ′ j ( h ′ j , 1 , ⋯ , h ′ j , K H ) ( j = 1 , ⋯ , N )$ be the jth column of the candidates, where N is the number of the column candidates.

• Step 3) The similarity between hi and h'j is calculated as follows.(6)

$d ′ H ( h i , h ′ j ) = { d H ( h i , h ′ j ) , if d H ( h i , h ′ j ) ≥ K K − d H ( h i , h ′ j ) , otherwise$
(6)

• Step 4) Let M (h'j) be the maximum similarity between hi and h'j , and let Mmin be the minimum of M (h'j). For each j, M (h'j) and Mmin are calculated as follows.(7)(8)

$M min = min j M ( h ′ j )$
(7)

$M ( h ′ j ) = max i d ′ H ( h i , h ′ j )$
(8)

If Mminα, go to step 5, otherwise go to step 9.

• Step 5) Let NM (j) be the number of the classifier which is most similar and $N min M$ be a minimum of NM (j). NM (j) and $N min M$ are calculated as follows.(9)(10)

$N min M = min j ∈ M ( h j i ) = M min N M ( j )$
(9)

$N M ( j ) = | { i ∈ M ( h ′ j ) = d ′ H ( h i , h ′ j ) } |$
(10)

where |A| is the number of element A.

• Step 6) For all k1 and k2 , the Hamming distance between k1 and k2 –th row $d H ( w k 1 , w k 2 )$ are calculated.

• Step 7) Let ($( k 1 min , k 2 min )$) be the pair of categories whose Hamming distance of codewords is minimal. $( k 1 min , k 2 min )$ is calculated as follows.(11)

$( k 1 min , k 2 min ) = arg min k 1 , k 2 d H ( w k 1 , w k 2 )$
(11)

• Step 8) NH = NH + 1. For all j, h'j satisfying $M ( h ′ j ) = M min$ , $N M ( j ) = N min M$ and $h j , k 1 min ≠ h ′ j , k 2 min$ is added to H. If there are more than one candidate, the column is added randomly from these candidates.

• Step 9) For matrix H with elements are 0, 1, and 2, each of the elements is converted in the following three ways: {0→0, 1→1, 2→*}, {0→1, 1→*, 2→0}, {0→*, 1→0, 2→1}. Then, the three matrices are combined into one matrix. The combined matrix is the code table of the proposed algorithm.

Table 5 shows an example of a code table which is created by the proposed method. Each column has an equal amount of elements 0 and 1. In addition, the training data gets smaller in each classifier. Therefore, the computational complexity can be decreased in each clas- sifier. If K < KH , the code table can be derived by selecting K rows from the code table created with the above described algorithm. The K rows can be selected in different ways, e.g., randomly of by deleting codewords based on a criterion a minimum Hamming distance.

An analyst who applies the proposed method should choose the threshold α by considering the situation. If the time complexity of the learning phase should not be large or if a large amount of training data is used for learning the binary classifiers, then the threshold α should be small. If we want to increase the classification accuracy even if the time complexity of the learning phase is larger, the threshold α can be set high. However, when high threshold α is used, binary classifiers tend to be added that are highly similar to others. Therefore, it can cause to decrease the classification accuracy. Usually, it is better to set a small threshold α. Because the threshold α means a yardstick to decide how many classifiers are added, it is easy to understand the meaning of the threshold instinctively. If we have to set the number of added classifiers directly, this setting can affect the classification performance. Using the threshold α, the number of added classifiers is decided automatically by checking the similarities between the classifiers. On our previous experiments with several settings, the setting of the threshold α does not strongly affect the performance of the classification. However, the number of classifiers decides the computational complexity in the learning phase. An analyst can decide the threshold α by considering both, classification accuracy and computational complexity in practice.

Escalera et al. (2009b) pointed out that it is better to use a simple Hamming distance for evaluating the distances between codewords of ternary code tables because the simple Hamming distances do not consider the differences in the number of ∗ among the codewords. However, the algorithm can solve this problem in step 9 by converting the three elements in three ways. Therefore, using the simple Hamming distance in the proposed algorithm does not affect the above problem.

## 5.EXPERIMENTS

### 5.1.Experimental Conditions

In order to verify the performance of the proposed method, we conducted a simulation experiment using Japanese newspaper articles. We used the Mainichi Newspapers published in 2010 with nine categories. All articles belong to only one category. For each category, we use 100 and 200 articles as the training data and 100 articles are used as random test data at. As evaluation criteria, we employ the accuracy rate and the computational complexity. The computational complexity is calculated for both, normal processing and parallel processing. These evaluation experiments are repeated five times and the average of the five times is used for evaluation. The comparison approaches are one-vs.-rest, one-vs.one, the (15, 5, 3) BCH code, the Exhaustive code, the RM code, and the method using only the ternary orthogonal table.

The proposed approach is the code table for which we added the column vectors to the ternary orthogonal table. Because the number of rows of the (15, 5, 3) BCH code is 32, it is necessary to narrow down the rows to the number of categories. For this experiment, we repeated the random selection of nine rows ten times and used the mean. Regarding the RM code, we randomly chose nine row vectors from the RM code of 16 categories, because the RM code cannot be applied directly to the nine categories. The result of the RM code was the average ten repetitions. The column vectors are added randomly in the proposed approach if there are candidates of added columns. Hence, the result of the proposed approach was the average over ten repetitions. The threshold α is set to 6 from the viewpoint of easiness of comparison with previous methods.

### 5.2.The Result of Experiment and Discussion

The number of classifiers is shown in Table 6. Table 7 and Table 8 show the result of using 100 and 200 articles as training data, respectively. Figure 1 shows the relation between the number of added classifiers and the accuracy rate in learning 100 articles.

Table 7 and Table 8 show that the exhaustive code has the highest accuracy rate when the number of training data is 100 and that the exhaustive code and the proposed method have the highest accuracy rate when the number of the training data is 200 in each category. The computational times of the normal process are in proportion to the number of classifiers. In the parallel process, the computational times of the ternary code tables are less than that of the binary code tables. Because the one-vs.-rest and one-vs.-one methods are simple, the computational complexity of these methods is relatively low, but the classification accuracy is also low. Especially, the one-vs.-rest method is inferior to the to the orthogonal method on precision and computational complexity. This is because the orthogonal table provides a code table, which suppresses the imbalance of the training data between the categories, and this property leads to good classification accuracy. At the same time, the computational complexity is low because a ternary orthogonal table is used for code table construction and a part of the training data is not used for learning of each binary classifier.

The performance of the BCH code is almost equivalent to that of the RM code. In this experiment, the number of classifiers of the method based on the BCH code is identical with that of the RM code. In coding theory, it is well known that the structure of the RM code is beautiful, but the BCH code has a high ability of superior error correction. In the ECOC method, the classification accuracy of each binary classifier can be the same and is affected by the balance of the training data between two category sets. The rate of the training data between two category sets for a binary classifier of the RM code is near 1:1, so that the classification performance of the RM code comes to the same level as the classification performance of the BCH code.

The proposed method was able to achieve a high accuracy and decrease the computational time. The reason why the proposed method has a lower computational time, especially in the case of parallel processing, is that the proposed method needs less training data for each classifier by using a ternary code table. From Figure 1, Table 7 and Table 8, we can conclude that the computational time of the proposed method can become lower than that of the RM code while keeping the accuracy rate of the proposed method and the RM code by changing the number of added classifiers.

In the following, we focus on the accuracy rates of each binary classifier. Table 9 and Table 10 show the maximum, the minimum, the average and the standard deviation of the accuracy rate in each binary classifier. In Tables 9 and Table 10, max, min, ave and SD stand for the accuracy rate of maximum, minimum, average and standard deviation respectively.

Table 9 and Table 10 show that the proposed method has a high average and a small variation. The exhaustive code has a high average and a high variation. The RM code has a low average and a small variation. The proposed method used a ternary code table while the other methods used a binary code table. The binary classifiers of the proposed method deal with sub-problems that are divided from the binary classifications with nine to six categories. The classifications with six categories are easier than the classifications with nine categories. Therefore, dividing the problem into sub-problems is one of the reasons why the accuracy rates of the binary classifiers be- come high.

The exhaustive code includes binary classifiers which are a small number of categories -vs-. many categories like 1-vs.-rest. In the category set with the small number of categories, the data of a different property does not mix. This provides a relatively high accuracy. On the other hand, in the exhaustive code, we also use the binary classifiers with two category sets that have various categories like four-vs-five classifiers. The accuracy of each classifier becomes relatively low. As mentioned above, each binary classifier has a ratio of the number of categories between the category sets. The difference of the ratio among the binary classifiers causes different accuracy rates of the binary classifiers and a high variance of the accuracy rates in exhaustive code. All, all binary classifiers of the proposed method are 3-vs.-3 classifiers. The ratios of the number of categories do not differ among all binary classifiers. Therefore, in the proposed method, the variance among the accuracy rates of the binary classifiers becomes small.

In this study, we use the equation (3) as a classification criterion. In this equation, it is assumed that the outputs of all binary classifiers can be treated equally. In addition, the variance among the accuracy rates of the binary classifiers of the proposed method is small. Therefore, the high accuracy rates of each binary classifiers and the small variance are the reasons why the overall accuracy rate of the proposed method is equivalent to the one of the exhaustive code, even though the number of binary classifiers is smaller than the number of the exhaustive code.

## 6.CONCLUSION AND FUTURE WORKS

In this paper, we focus on the classification of digital document data based on the ECOC approach. We proposed a new algorithm for constructing a code table, which can equalize the number of training data between category sets which the RM code cannot equalize. In addition, the proposed method can reduce the variance among the accuracy rates of binary classifiers and allows the analysts to change the number of binary classifiers freely. The experiments show the effectiveness of our proposed method concerning accuracy and computational time. Future work will expand the number of categories and apply the orthogonal table to other than ternary tables.

## ACKNOWLEDGMENTS

The authors would like to express their gratitude to Prof. S. Hirasawa, Mr. Gendo Kumoi, Dr. Haruka Yamashita, and all members of Goto Laboratory, Waseda University for their helpful comments on this research. A part of this study was supported by JSPS KAKENHI Grant Numbers 26282090 and 26560167.

## Figure

Relation between the number of added classifiers and the accuracy rate.

## Table

Exhaustive code (K = 4)

(15, 5, 3) BCH code (K = 8)

RM code (K = 8)

Example of an orthogonal table

Example of the proposed method’s code table

Numbers of classifiers

Result of training data 100

Result of training data 200

Accuracy rate of each classifier in training data 100

Accuracy rate of each classifier in training data 200

## REFERENCES

1. Allwein E L , Schapire R E , Singer Y (2000) Reducing multiclass to binary: A unifying approach for margin classifiers , Journal of Machine Learning Research, Vol.1 ; pp.113-141
2. Bouzas D , Arvanitopoulos N , Tefas A (2010) Optimizing subclass discriminant error correcting output codes using particle swarm optimization , The 2010 International Joint Conference on Neural Networks, ; pp.1-7
3. Chmielnicki W (2015) Creating effective error correcting output codes for multiclass classification , In Hybrid Artificial Intelligent Systems, ; pp.502-514
4. Cover T M , Thomas J A (2012) Elements of Information Theory, John Wiley & Sons,
5. Crammer K , Singer Y (2002) On the learnability and design of output codes for multiclass problems , Machine Learning, Vol.47 (2) ; pp.201-233
6. Dietterich T G , Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes , Journal of Artificial Intelligence Research, Vol.2 ; pp.263-286
7. Escalera S , Masip D , Puertas E , Radeva P , Pujol O (2011) Online error correcting output codes , Pattern Recognition Letters, Vol.32 (3) ; pp.458-467
8. Escalera S , Pujol O , Radeva P (2010) On the decoding process in ternary error-correcting output codes , IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.32 ; pp.120-134
9. Escalera S , Pujol O , Radeva P (2006) ECOCONE: A novel coding and decoding strategy , Proceedingsof 18th International Conference on Pattern Recognition, Vol.3 ; pp.578-581
10. Escalera S , Pujol O , Radeva P (2007) Boosted landmarks of contextual descriptors and forest- ECOC: A novel framework to detect and classify objects in cluttered scenes , Pattern Recognition Letters, Vol.28 ; pp.1759-1768
11. Escalera S , Pujol O , Radeva P (2009a) Recoding error-correcting output codes , In Multiple ClassifierSystems, Springer, ; pp.11-21
12. Escalera S , Pujol O , Radeva P (2009b) Separability of ternary codes for sparse designs of errorcorrecting output codes , Pattern Recognition Letters , Vol.30 (3) ; pp.285-297
13. Escalera S , Tax D M , Pujol O , Radeva P , Duin R P (2008) Subclass problem-dependent design for error-correcting output codes , IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.30 (6) ; pp.1041-1054
14. Huang T K , Weng R C , Lin C J (2006) Generalized bradley-terry models and multi-class probability estimates , Journal of Machine Learning Research, Vol.7 ; pp.85-115
15. Ikeda S (2010) Combining binary machines for multiclass: Statistical model & parameter estimation , Proceedings of the Institute of Statistical Mathematics, Vol.58 ; pp.157-166
16. Ogihara T , Mikawa K , Goto M (2013) Multivalued classification of text data based on ECOC approach considering parallel processing , Proceedings of The 14th Asia Pacific Industrial Engineering and Management Systems Conference APIEMS 2013,
17. Oyama Y , Takenouchi T , Ishii S (2008) A hierarchical multi-class classification method based on error-correcting output coding , Journal of the Institute of Electronics, Information and Communication Engineers, Vol.197 ; pp.337-342
18. Pujol O , Escalera S , Radeva P (2008) An incremental node embedding technique for error correcting output codes , Pattern Recognition, Vol.41 ; pp.713-725
19. Pujol O , Radeva P , Vitria J (2006) Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes , IEEETransactions on Pattern Analysis and Machine Intelligence, Vol.28 ; pp.1007-1012
20. Rifkin R , Klautau A (2004) In defense of one-vsall classification , Journal of Machine Learning Research, Vol.5 ; pp.101-141
21. Tipping M E (2001) Sparse bayesian learning and the relevance vector machine , Journal of Machine Learning Research, Vol.1 ; pp.211-244
22. Xue A , Wang X , Song Y , Lei L (2015) Discriminant error correcting output codes based on spectral clustering , Pattern Analysis and Applications, ; pp.1-19
23. Zhang X L (2015) Heuristic ternary error-correcting output codes via weight optimization and layered clustering-based, Approach , IEEE Transactions on Cybernetics, Vol.45 ; pp.289-301
24. Zhong G , Cheriet M (2013) Adaptive errorcorrecting output codes , Proceedings of The Twenty-Third international joint conference on Artificial Intelligence, ; pp.1932-1938AAAI Press