1. INTRODUCTION
Automatic classification (clustering) divides a set of objects into subsets (groups) so that, according to some criterion, objects from the same subset are more similar to each other than objects from different subsets. The partition of the objects into groups depends on the general characteristics of the objects and on the method by which they are divided.
Let us consider a practical example. To exclude the emergence of unreliable products intended for installation in highly reliable equipment (for example, onboard equipment of a spacecraft with a long period of active existence), the entire component base passes through specialized test centers (Kazakovtsev et al., 2017; Kazakovtsev et al., 2020). These centers perform full incoming inspection of all products, diagnostic non-destructive testing and selective destructive physical analysis. The detection of homogeneous production batches within shipped prefabricated batches is one of the most important stages of such testing (Kazakovtsev et al., 2017). Without identifying the homogeneous batches made under identical production conditions, we cannot extend the results of the destructive physical analysis to the entire shipped production batch.
One of the simplest and most popular automatic classification models is the Minimum Sum-of-Squared Errors Clustering (MSSC) model, also called k-means (Gribel and Vidal, 2019). The goal is to find k points (centers or centroids) X_{1}, …, X_{k} in a d-dimensional space so that the sum of the squared distances from known points (data vectors) A_{1}, …, A_{N} to the nearest of the found centers reaches its minimum (MacQueen, 1967):

$F({X}_{1},\mathrm{...},{X}_{k})={\sum }_{i=1}^{N}{\mathrm{min}}_{j=\overline{1,k}}{‖{A}_{i}-{X}_{j}‖}^{2}\to \mathrm{min}.$ (1)
Here, k is the number of groups to be found and N is the number of objects. The k-means algorithm requires k to be pre-specified. Moreover, the algorithm requires initial centers to be specified, and its result depends on this choice. In addition, the distance function (metric) strongly influences the obtained result (the division into subsets). A group (cluster) is the subset of data vectors for which a specific center is the nearest one.
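As an illustration, objective (1) can be evaluated directly. The sketch below (Python/NumPy, not part of the original study; the data values are toy placeholders) computes the sum of squared Euclidean distances from each data vector to its nearest center.

```python
import numpy as np

def mssc_objective(points, centers):
    # Pairwise squared Euclidean distances, shape (N, k)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep the distance to its nearest center, then sum
    return d2.min(axis=1).sum()

# Toy example: two tight groups with centers placed at their means
A = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
X = np.array([[0.0, 0.5], [10.0, 10.5]])
print(mssc_objective(A, X))  # -> 1.0 (four points, each at squared distance 0.25)
```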
The first genetic algorithm for solving a similar problem (the discrete p-median problem) was proposed by Hosage and Goodchild (1986). Its improved version (Bozkaya et al., 2002) gives fairly accurate results; however, its convergence rate is not high. In Alp et al. (2003), Alp, Erkut and Drezner presented a faster simple genetic algorithm with a special “greedy” recombination (crossover) procedure, which also gives accurate results. These algorithms solve discrete problems. Subsequently, algorithms with a greedy agglomerative procedure were adapted to solve the continuous MSSC problem (Neema et al., 2011). In such algorithms (Maulik and Bandyopadhyay, 2000), the authors encode solutions (chromosomes) as sets of centers represented by their coordinates (vectors of real numbers) in a multidimensional space. A very limited set of genetic operators has been proposed for such algorithms with real coding.
Analysis of the modern literature (Rand, 1971; De Maesschalck et al., 2000; McLachlan, 1999; Xing et al., 2003) shows that the existing solutions in the field of automatic classification of objects in a multidimensional feature space provide either high accuracy, or stability of the result over multiple runs of a randomized algorithm, or a high convergence speed, but not all of these qualities at the same time. Algorithms for the MSSC and k-median problems have been developed only for the most common distance measures (Euclidean, Manhattan). However, taking into account the feature space of a specific practical problem, a correct choice of the distance measure can increase the accuracy of the automatic classification. In this research, we use the Rand Index (RI) (Rand, 1971) as a measure of accuracy related to the model adequacy, and the achieved objective function value (1) as a measure of the clustering accuracy related to the algorithm efficiency.
It is extremely difficult to improve the automatic classification result under increased requirements for accuracy and stability using known algorithms without a significant increase in time costs. When solving practical automatic classification problems with multidimensional data, for example, the problem of identifying homogeneous batches of industrial products, both the adequacy of the models and the accuracy of the automatic classification result raise many questions. However, it is still possible to develop algorithms that further improve the result within a selected model such as the MSSC model.
In a multidimensional feature space, there is often a correlation between individual features and groups of features. These correlation dependences can be exploited by moving from a search in a space with the Euclidean or rectangular (Manhattan) metric to a search in a space with the Mahalanobis metric (De Maesschalck et al., 2000; McLachlan, 1999; Xing et al., 2003). The squared Mahalanobis distance D_{M} is defined as follows:

${D}_{M}(X)={(X-\mu )}^{T}{C}^{-1}(X-\mu )$ (2)
where X is a vector of measured feature values, μ is a vector of averaged values (the cluster center), and C is the covariance matrix.
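For concreteness, equation (2) can be sketched as follows (Python/NumPy; the vectors and matrices here are illustrative values only, not data from the study).

```python
import numpy as np

def mahalanobis_sq(x, mu, C):
    # Squared Mahalanobis distance (X - mu)^T C^{-1} (X - mu);
    # solve() is used instead of an explicit inverse for numerical stability.
    diff = x - mu
    return float(diff @ np.linalg.solve(C, diff))

x, mu = np.array([3.0, 4.0]), np.zeros(2)
# With C = I the measure reduces to the squared Euclidean distance
print(mahalanobis_sq(x, mu, np.eye(2)))              # -> 25.0
# A larger variance of the first feature shrinks its contribution
print(mahalanobis_sq(x, mu, np.diag([4.0, 1.0])))    # -> 18.25
```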
The purpose of this research is to improve the accuracy and stability of the result in solving problems of automatic classification. The idea is to use the Mahalanobis distance measure with the averaged estimate of the covariance matrix in the MSSC clustering problem to reduce the number of the automatic classification errors in comparison with the other known clustering optimization models, as well as to improve the accuracy and the stability of the solution in terms of the achieved objective function value (1) for a fixed execution time.
Data. In this study, we used data of quality control tests of shipped batches of electronic components (Orlov and Fedosov, 2016) intended for installation in highly reliable equipment. The data were obtained from a specialized test center. Each data vector (data point) is a set of measured product parameters. The original batch of products consists of several homogeneous batches, in accordance with the manufacturer’s markings, which are often absent in practical tasks. The total number of products is up to several thousand. In each batch, each product is described by 205 measured parameters.
Model and algorithm with the Mahalanobis distance. The results of computational experiments on automatic classification of industrial products with the k-medoid and MSSC models, in which the Mahalanobis metric is applied, show an increase in clustering accuracy for automatic classification into 2 to 6 clusters on datasets with a small number of objects and informative features.
We propose to calculate an averaged estimate of the covariance matrix over homogeneous batches of products (based on pre-labeled data) and to use it instead of the covariance matrix in equation (2). Such an averaged matrix can be calculated from a training sample (pre-labeled data):

$C=\frac{1}{n}{\sum }_{j}{n}_{j}{C}_{j}$ (3)
where n_{j} is the number of objects in the j-th batch, n is the total number of objects in the training set, and C_{j} are the covariance matrices of the separate batches in the training set.
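The averaged estimate (3) can be sketched as below (Python/NumPy; the batch data here are synthetic placeholders).

```python
import numpy as np

def averaged_covariance(batches):
    # Eq. (3): C = (1/n) * sum_j n_j * C_j, where n_j = |batch j| and n = sum_j n_j
    n = sum(len(b) for b in batches)
    d = batches[0].shape[1]
    C = np.zeros((d, d))
    for b in batches:
        C += len(b) * np.cov(b, rowvar=False)
    return C / n

rng = np.random.default_rng(0)
batches = [rng.normal(size=(50, 3)), rng.normal(size=(80, 3))]
C = averaged_covariance(batches)
print(C.shape)  # -> (3, 3); the estimate is symmetric by construction
```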
Let us denote model (1) with distances (2) and the covariance matrix C calculated according to (3) as the Mahalanobis Minimum Sum-of-Squared Errors Clustering (MMSSC) model.
We propose an algorithm for automatic classification of objects based on the MMSSC model with the adjustment of the Mahalanobis distance measure parameter (covariance matrix) based on the training sample:
Algorithm 1. Solving the MMSSC clustering problem with covariance matrix training

Step 1. Solve the MSSC problem with Euclidean distances and obtain k clusters (here, k is an estimate of the number of homogeneous groups, given in advance);

Step 2. For each cluster, calculate its center μ_{i}. For both the Mahalanobis and the Euclidean distance, the center coordinates are calculated as the averaged object coordinates in the i-th cluster, i = 1, …, k:

${\mu }_{i}=\frac{1}{m}{\sum }_{j=1}^{m}{X}_{ij}$

where m is the number of objects in the cluster and X_{ij} are the values of the i-th feature (j = 1, …, m);

Step 3. Calculate the averaged covariance matrix (3). If the average estimate of the covariance matrix is degenerate, go to Step 4, otherwise go to Step 5.

Step 4. Increase the number of clusters (k ← k + 1) and repeat Steps 1 and 2. Form new clusters with the use of Euclidean distances:

$D(X,\mu )=\sqrt{{\sum }_{i=1}^{n}{({x}_{i}-{\mu }_{i})}^{2}}$

where n is the number of features. Go to Step 3.

Step 5. Match each point to the nearest centroid using the square of the Mahalanobis distance (2) with the averaged estimate of the covariance matrix C (3) to form new clusters.

Step 6. Repeat from Step 2 until all the clusters stay unchanged.
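The core of Steps 2, 5 and 6 (alternating Mahalanobis assignment and center recomputation with a fixed averaged matrix C) can be sketched as follows. This is a simplified illustration assuming C is already trained and non-degenerate; it omits Steps 3 and 4, and the random initialization is an assumption.

```python
import numpy as np

def mmssc_assignments(points, k, C, max_iter=100, seed=0):
    # Steps 2, 5, 6 of Algorithm 1: k-means-style iterations in which points
    # are matched to the nearest centroid by squared Mahalanobis distance (2).
    C_inv = np.linalg.inv(C)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        diff = points[:, None, :] - centers[None, :, :]        # shape (N, k, d)
        d2 = np.einsum('nkd,de,nke->nk', diff, C_inv, diff)    # squared Mahalanobis
        labels = d2.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                  # Step 6: unchanged
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic groups; with C = I this reduces to plain k-means
rng = np.random.default_rng(1)
A = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(10.0, 0.1, (10, 2))])
labels, centers = mmssc_assignments(A, 2, np.eye(2))
```

On such well-separated groups the iterations recover the two batches regardless of which data points are drawn as initial centers.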
In our research, we performed three series of experiments with industrial product test data.
1^{st} series. The training sample coincides with the working sample for which the clustering was carried out.

2^{nd} series. The training and working samples are not the same. In practice, a test center can use retrospective data on deliveries and testing of products of the same type as a training sample.

3^{rd} series. The training and working samples also do not coincide, and the results of automatic classification of products (results of the k-means algorithm in multi-start mode with Euclidean distances) were used as the training sample.
In each of the experiment series, for each working sample, we ran the k-means algorithm 30 times with each of the five clustering models.
The M1 model is the MSSC model with Mahalanobis distances, where the covariance matrix is calculated over the whole training sample.

The MC model is the MSSC model with distances (2) where C is the correlation matrix.

The M2 model is the MSSC model with Mahalanobis distances and the averaged covariance matrix (i.e. the new MMSSC model).

The MR model is the MSSC model with Manhattan distances.

The ME model is the MSSC model with Euclidean distances.
For each model, the minimum (Min), maximum (Max), mean (Average) and standard deviation (Std.Dev) values of the Rand index (RI) and of the objective function, as well as the coefficient of variation (V) and the range (R) of the objective function values, are shown in Table 1. The following data were used for the computational experiments. We composed a mixed batch from 7 homogeneous production batches (pre-labeled data): Batch 1 contains 71 objects, Batch 2 contains 116 objects, Batch 3 contains 1867 objects, Batch 4 contains 1250 objects, Batch 5 contains 146 objects, Batch 6 contains 113 objects, and Batch 7 contains 424 objects.
It was found that the new M2 model with the averaged estimate of the covariance matrix shows the best accuracy by the Rand index (RI) among the presented models in almost all series of experiments, and in all cases it exceeds the ME model, where the Euclidean distance is used. The experiments also showed that in most cases the coefficient of variation of the objective function values is higher for the ME model, and that the range of the objective function values is the highest for the M2 model, where the Mahalanobis distances are used with the averaged covariance matrix. Therefore, to obtain consistently good values of the objective function (1), multiple runs of the k-means algorithm or the use of other algorithms based on the MSSC model (for example, j-means (Shkaberina et al., 2020) or greedy heuristic algorithms (Kazakovtsev et al., 2020)) are required.
New genetic algorithm for the MMSSC problem. With known parameters of the Mahalanobis distance, we can solve the MMSSC problem without a training sample using various optimization algorithms, which differ significantly in the achieved value of the objective function (1). Our new algorithm improves the accuracy of solving the MMSSC clustering problem and the stability of the result within a limited execution time. In this section, by the accuracy of the algorithm we mean exclusively the achieved value of the objective function, without taking into account indicators of model adequacy or the correspondence of the algorithm results to the real separation of objects, when known.
For genetic algorithms (GAs) solving the MSSC clustering (k-means) problem with real coding of solutions, a very limited set of genetic operators is known, including mutation operators. For example, in (Maulik and Bandyopadhyay, 2000), the authors encode solutions (chromosomes) in their GA as sets of centroids represented by their coordinates (vectors of real numbers) in a multidimensional space. Every chromosome undergoes mutation with a fixed probability ω. This mutation procedure is given below as Algorithm 2.
In our work, we replaced this mutation procedure of the k-means genetic algorithm with single-point crossover by the following new mutation procedure (Algorithm 3).
This procedure is used with probability equal to 1, after each crossover operator.
Results of running the GA with the uniform random mutation (Algorithm 2, mutation probability 0.01) and its version with Algorithm 3 as the mutation operator are shown in Figure 1. The new mutation procedure is fast and effective in comparison with the uniform random mutation, and shows a high convergence rate of the objective function.
Algorithm 2. Uniform random mutation procedure in the GA for the MMSSC clustering problem

STEP 1. Generate a random number b ∈ (0, 1] with the uniform distribution;

STEP 2. IF b < ω THEN the chromosome mutates. If the position of the current centroid is $\upsilon$, after mutation it becomes:

$\upsilon \leftarrow \upsilon \pm 2b\upsilon$ if $\upsilon \ne 0$, and $\upsilon \leftarrow \upsilon \pm 2b$ if $\upsilon =0$.

Signs «+» and «−» occur with the same probability. The coordinates of the centroid are thus shifted randomly.
Algorithm 3. New mutation procedure in the GA for the MMSSC problem

STEP 1. Generate a random initial solution S = {X_{1}, …, X_{k}};

STEP 2. Apply the k-means algorithm to S to obtain a local optimum S’;

STEP 3. Apply the simple one-point crossover procedure to the mutated solution S’ from the population and S to obtain a new solution S’’;

STEP 4. Apply the k-means algorithm to S’’ to obtain a local optimum S’’;

STEP 5. IF F(S’’) < F(S’) THEN S’ ← S’’.
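The steps above can be sketched as follows (Python/NumPy). Since the original description is ambiguous about which solution the crossover partner is, this sketch makes the assumption that the population member being mutated is crossed with a freshly generated local optimum, and the result is accepted only if it improves objective (1); `lloyd` is a plain k-means stand-in for the local-improvement step.

```python
import numpy as np

def lloyd(points, centers, iters=50):
    # Plain k-means (Lloyd) iterations used as the local-improvement step
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        centers = np.array([points[lab == j].mean(axis=0) if np.any(lab == j)
                            else centers[j] for j in range(len(centers))])
    return centers

def F(points, centers):
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def mutate(points, S_pop, rng):
    k = len(S_pop)
    S = points[rng.choice(len(points), size=k, replace=False)]  # STEP 1: random solution
    S1 = lloyd(points, S)                                       # STEP 2: local optimum
    cut = rng.integers(1, k)                                    # STEP 3: one-point crossover
    S2 = lloyd(points, np.vstack([S_pop[:cut], S1[cut:]]))      # STEP 4: improve result
    return S2 if F(points, S2) < F(points, S_pop) else S_pop    # STEP 5: accept if better

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(c, 0.2, (15, 2)) for c in (0.0, 5.0, 10.0)])
S_pop = data[rng.choice(len(data), size=3, replace=False)]
S_new = mutate(data, S_pop, rng)
```

By construction the returned solution never has a worse objective value than the input one.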
Greedy genetic algorithms and many other evolutionary algorithms for the MSSC clustering problem dispense with the mutation operator. The idea of the greedy agglomerative heuristic procedure is to combine two known solutions into one invalid solution with an excess number of centroids, and then successively reduce the number of centroids. At each iteration, the centroid whose removal gives the smallest increase in the objective function value (1) is removed.
Algorithm 4. Basic Greedy Agglomerative Heuristic Procedure

Required: initial number of clusters K, needed number of clusters k, k < K, initial solution $S=\{{X}_{1},\mathrm{...},{X}_{K}\}$, $\left|S\right|=K$.

STEP 1. Improve the initial solution with the k-means algorithm.

WHILE K > k:

STEP 2. FOR each ${i}^{\prime }\in \left\{\overline{1,K}\right\}$: calculate ${F}_{{i}^{\prime }}^{\prime }\leftarrow F(S\backslash \{{X}_{{i}^{\prime }}\})$, the objective function value (1) after the tentative removal of center ${X}_{{i}^{\prime }}$. END FOR.

STEP 3. Select a subset S_{elim} of n_{elim} centers, ${S}_{elim}\subset S$, $\left|{S}_{elim}\right|={n}_{elim}$, with the minimal values of the corresponding variables ${F}_{{i}^{\prime }}^{\prime }$, where ${n}_{elim}=\mathrm{max}\{1,0.2\cdot (\left|S\right|-k)\}$.

STEP 4. Obtain the new solution $S\leftarrow S\backslash {S}_{elim}$, $K\leftarrow \left|S\right|$. Improve the solution with the k-means algorithm.

END WHILE
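A compact sketch of Algorithm 4 follows (Python/NumPy). As a simplification, the k-means refinement of STEPs 1 and 4 is omitted, so only the greedy elimination loop is shown.

```python
import numpy as np

def F(points, centers):
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def greedy_agglomerative(points, S, k):
    # Shrink an oversized center set S (|S| = K > k) down to k centers,
    # deleting at each pass the centers whose removal hurts objective (1) least.
    S = [np.asarray(x) for x in S]
    while len(S) > k:
        # F'_i: objective after tentatively removing center i
        incr = [F(points, np.array(S[:i] + S[i + 1:])) for i in range(len(S))]
        n_elim = max(1, int(0.2 * (len(S) - k)))
        for i in sorted(np.argsort(incr)[:n_elim].tolist(), reverse=True):
            del S[i]
        # (a k-means refinement step would follow here in the full algorithm)
    return np.array(S)

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 5.0, 10.0)])
S0 = data[rng.choice(len(data), size=7, replace=False)]  # K = 7 initial centers
S = greedy_agglomerative(data, S0, 3)                     # reduce to k = 3
```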
In addition, the initial solution S can be obtained by combining two known solutions. Algorithms 5 and 6 modify the initial solution with a second known solution. Greedy Procedure 1 supplements the first set, in turn, with each single element of the second set. Greedy Procedure 2 combines both sets at once.
Algorithm 5. Greedy Procedure 1 with partial merge

Required: two sets of cluster centers ${S}^{\prime }=\{{X}_{1}^{\prime },\mathrm{...},{X}_{K}^{\prime }\}$ and ${S}^{\prime \prime }=\{{X}_{1}^{\prime \prime },\mathrm{...},{X}_{K}^{\prime \prime }\}$.

FOR each ${i}^{\prime }\in \left\{\overline{1,K}\right\}$:

STEP 1. Merge ${S}^{\prime }$ and a single item of the set ${S}^{\prime \prime }$: $S\leftarrow {S}^{\prime }\cup \{{X}_{{i}^{\prime }}^{\prime \prime }\}$.

STEP 2. Run Algorithm 4 with the initial solution S and save the obtained result.

END FOR

STEP 3. Return the best of the solutions saved in STEP 2.
Algorithm 6. Greedy Procedure 2 with full merge of sets

Required: two sets of cluster centers ${S}^{\prime }=\{{X}_{1}^{\prime },\mathrm{...},{X}_{K}^{\prime }\}$ and ${S}^{\prime \prime }=\{{X}_{1}^{\prime \prime },\mathrm{...},{X}_{K}^{\prime \prime }\}$.

Step 1. Combine the two sets of cluster centers: $S\leftarrow {S}^{\prime }\cup {S}^{\prime \prime }$.

Step 2. Run Algorithm 4 with the initial solution S.
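The two merge strategies can be sketched together (Python/NumPy). Here `shrink` is a minimal one-center-at-a-time stand-in for Algorithm 4, an assumption made to keep the example short.

```python
import numpy as np

def F(points, centers):
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def shrink(points, S, k):
    # Minimal stand-in for Algorithm 4: drop the cheapest center one at a time
    S = list(S)
    while len(S) > k:
        incr = [F(points, np.array(S[:i] + S[i + 1:])) for i in range(len(S))]
        del S[int(np.argmin(incr))]
    return np.array(S)

def greedy_partial_merge(points, S1, S2, k):
    # Algorithm 5: try S' plus each single center of S'', keep the best outcome
    best, best_f = None, np.inf
    for x in S2:
        cand = shrink(points, np.vstack([S1, x[None, :]]), k)
        f = F(points, cand)
        if f < best_f:
            best, best_f = cand, f
    return best

def greedy_full_merge(points, S1, S2, k):
    # Algorithm 6: union both center sets at once, then shrink back to k
    return shrink(points, np.vstack([S1, S2]), k)

rng = np.random.default_rng(4)
data = np.vstack([rng.normal(c, 0.2, (15, 2)) for c in (0.0, 6.0)])
S1 = data[rng.choice(len(data), size=2, replace=False)]
S2 = data[rng.choice(len(data), size=2, replace=False)]
```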
The basic genetic algorithm for the k-means problem is described as follows:
Algorithm 7. GA with real-number alphabet for the MSSC problem

Required: Initial population size N_{POP}

STEP 1. Generate ${N}_{POP}$ initial solutions ${S}_{1},\mathrm{...},{S}_{{N}_{POP}}$, where $\left|{S}_{i}\right|=k$ and each ${S}_{i}$ is a randomly selected subset of the set of data vectors. Improve each initial solution with the k-means algorithm and store the corresponding obtained MSSC values (1) in variables ${f}_{k}\leftarrow F({S}_{k}),\hspace{0.17em}k=\overline{1,{N}_{POP}}$.

LOOP

STEP 2. IF the stop condition is satisfied, THEN STOP. Return the solution ${S}_{{i}^{*}},{i}^{*}\in \left\{\overline{1,{N}_{POP}}\right\}$ with the minimal value ${f}_{{i}^{*}}$.

STEP 3. Randomly choose 2 indices ${k}_{1},{k}_{2}\in \left\{\overline{1,{N}_{POP}}\right\}$, ${k}_{1}\ne {k}_{2}$.

STEP 4. Run the crossover procedure: ${S}_{c}\leftarrow \text{Crossover}({S}_{{k}_{1}},{S}_{{k}_{2}})$.

STEP 5. Run the mutation procedure: ${S}_{c}\leftarrow \text{Mutation}({S}_{c})$.

STEP 6. Run the chosen selection procedure to change the population set.

END LOOP
At STEP 6, the following algorithm is proposed.
Algorithm 8. Tournament selection

STEP 1. Randomly select 2 indices ${k}_{4},{k}_{5}\in \left\{\overline{1,{N}_{POP}}\right\},{k}_{4}\ne {k}_{5}$.

STEP 2. IF ${f}_{{k}_{4}}>{f}_{{k}_{5}}$ THEN ${S}_{{k}_{4}}\leftarrow {S}_{c},{f}_{{k}_{4}}\leftarrow F({S}_{c})$, ELSE ${S}_{{k}_{5}}\leftarrow {S}_{c},{f}_{{k}_{5}}\leftarrow F({S}_{c})$.
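Algorithm 8 amounts to replacing the worse of two randomly drawn population members with the child solution; a sketch (Python, with toy stand-in solutions and fitness values, lower being better):

```python
import numpy as np

def tournament_replace(pop, fitness, child, child_f, rng):
    # Draw two distinct indices; the member with the larger (worse) objective
    # value is replaced by the child, as in Algorithm 8.
    k4, k5 = rng.choice(len(pop), size=2, replace=False)
    loser = k4 if fitness[k4] > fitness[k5] else k5
    pop[loser], fitness[loser] = child, child_f
    return loser

pop = ['a', 'b', 'c', 'd']       # stand-ins for solutions S_1..S_4
fitness = [3.0, 1.0, 4.0, 2.0]   # objective values f_1..f_4
rng = np.random.default_rng(5)
i = tournament_replace(pop, fitness, 'child', 0.5, rng)
```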
A GA with a greedy heuristic for p-median and MSSC clustering problems can be described as follows.
Algorithm 9. GA with greedy heuristic for p-median and MSSC clustering problems (modifications GMSSC-FULL, GMSSC-ONE and GMSSC-MIX)*

Required: Population size N_{POP}.

STEP 1. Set ${N}_{iter}\leftarrow 0$. Choose a set of initial solutions $\{{S}_{1},\mathrm{...},{S}_{{N}_{POP}}\}$, where $\left|{S}_{i}\right|=k$. Improve each initial solution with the k-means algorithm and save the corresponding obtained values of the objective function (1) in variables ${f}_{k}\leftarrow F({S}_{k}),\hspace{0.17em}k=\overline{1,{N}_{POP}}$. In this work, the initial population size is ${N}_{POP}=5$.

LOOP

STEP 2. IF the stop condition is satisfied, THEN STOP and return the solution ${S}_{{i}^{*}},{i}^{*}\in \left\{\overline{1,{N}_{POP}}\right\}$ with the minimal value ${f}_{{i}^{*}}$, ELSE adjust the population size as follows: ${N}_{iter}\leftarrow {N}_{iter}+1$; ${N}_{POP}\leftarrow \mathrm{max}\{{N}_{POP},[\sqrt{1+{N}_{iter}}]\}$; IF ${N}_{POP}$ was changed, THEN generate a new individual as described in STEP 1.

STEP 3. Randomly select 2 indices ${k}_{1},{k}_{2}\in \left\{\overline{1,{N}_{POP}}\right\}$.

STEP 4. Run Algorithm 5 (for GMSSC-ONE) or Algorithm 6 (for GMSSC-FULL) with solutions ${S}_{{k}_{1}}$ and ${S}_{{k}_{2}}$. For GMSSC-MIX, Algorithm 5 or Algorithm 6 is chosen at random with equal probability. Obtain a new solution ${S}_{c}$.

STEP 5. ${S}_{c}\leftarrow \text{Mutation}({S}_{c})$; by default, the mutation procedure is not used.

STEP 6. Run the selection procedure (Algorithm 8).

END LOOP

*GMSSC-ONE: GA with the greedy partial-merge heuristic; GMSSC-FULL: GA with the greedy full-merge heuristic; GMSSC-MIX: random choice between Algorithms 5 and 6.
This algorithm uses a dynamically growing population. In our new version of STEP 5, a crossover-like mutation operator is proposed.
Algorithm 10. New mutation operator for STEP 5 of Algorithm 9 (modifications GMSSC-FULL-MUT, GMSSC-ONE-MUT and GMSSC-MIX-MUT)

STEP 1. Run the k-means algorithm on a randomly selected initial solution to obtain a solution ${S}^{\prime \prime }$.

STEP 2. Run Algorithm 5 (for GMSSC-ONE) or Algorithm 6 (for GMSSC-FULL) with solutions ${S}_{c}$ and ${S}^{\prime \prime }$. Obtain a new solution ${S}_{c}^{\prime }$.

STEP 3. IF $F({S}_{c}^{\prime })<F({S}_{c})$, THEN ${S}_{c}\leftarrow {S}_{c}^{\prime }$.
In our experiments (Table 2), the new modifications of three genetic algorithms (GMSSC-FULL-MUT, GMSSC-ONE-MUT and GMSSC-MIX-MUT) were compared with the known j-means and k-means algorithms (in multi-start mode), the genetic algorithms without mutation (GMSSC-FULL, GMSSC-ONE and GMSSC-MIX), automatic classification algorithms for the MMSSC clustering problem with a combined application of search algorithms with alternating randomized neighborhoods formed by greedy agglomerative heuristic procedures (k-GHVNS1, k-GHVNS2, k-GHVNS3), and algorithms for the j-means problem (j-means-GHVNS1, j-means-GHVNS2). For all data sets, 30 attempts were made to run each of the algorithms. For each algorithm, the minimum (Min), maximum (Max), mean (Average) and standard deviation (Std.Dev.) of the objective function were calculated.
The best values of the new algorithms (*) are given in bold font, the best values of the known algorithms are given in italics, and the best achieved values of the objective function are underlined (Table 2). To confirm the statistical significance of the advantages (↑⇑) and disadvantages (↓⇓) of the new algorithms over the known ones, the Mann–Whitney U-test (↑↓↕) and the t-test (⇑⇓⇕) were used.
Our experiments show that the genetic algorithms with the greedy agglomerative crossover operator, built in accordance with the new idea of the mutation procedure, outperform the genetic algorithms without mutation by the obtained objective function value for the MMSSC clustering problem.
2. CONCLUSION
The proposed new MMSSC model for automatic classification of samples of industrial products and the new algorithm based on the MMSSC optimization model with Mahalanobis distances and a trained covariance matrix enable us to reduce the proportion of errors (increase the Rand index) when identifying homogeneous production batches of industrial products. The presented genetic algorithm for the MMSSC clustering problem, which uses a greedy heuristic procedure in several genetic operators, demonstrates more accurate and stable results (achieved values of the MMSSC objective function) within a fixed execution time in comparison with the other algorithms.