ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.19 No.2 pp.476-483
DOI : https://doi.org/10.7232/iems.2020.19.2.476

# Analysis of Purchase History Data Based on a New Latent Class Model for RFM Analysis

Qian Zhang, Haruka Yamashita*, Kenta Mikawa, Masayuki Goto
Graduate School of Creative Science and Engineering, Waseda University, Tokyo, Japan
Department of Information and Communication, Sophia University, Tokyo, Japan
Department of Information Science, Shonan Institute of Technology, Kanagawa, Japan
Department of Creative Science and Engineering, Waseda University, Tokyo, Japan
*Corresponding Author, E-mail: h-yamashita-1g8@sophia.ac.jp
September 20, 2017 May 15, 2020 May 20, 2020

## ABSTRACT

Recently, the development of information technology has made it easier to use various kinds of customer information (e.g., customers’ purchase history). In the marketing field in particular, many companies employ customer segmentation to customize their services, which helps to increase customer loyalty and to maintain high customer retention. One of the well-known approaches to customer analysis based on purchase history data is RFM analysis. RFM analysis is usually used to segment customers into several groups by using three variables: how long it has been since their last purchase, how many times they purchased, and how much they spent. However, the conventional RFM analysis does not assume a generative model. Therefore, several problems occur when it is applied to an actual data set to score each of the R, F, and M indices. The main problem is that the analyst must arbitrarily decide the thresholds for the RFM scores. On the other hand, in the field of machine learning, probabilistic latent semantic analysis is widely used for soft clustering. The latent class model enables us to cluster customers into latent classes and to calculate the assignment probabilities of each customer to each latent class. In this paper, we propose a new latent class model for RFM analysis based on purchase history data. The proposed model makes it possible to decide the RFM scoring and segment customers automatically, and the soft clustering approach helps the interpretation of the result. Furthermore, the proposed model takes into account the generative model of the RFM scores. The analysis of actual data shows that it is possible to extract latent classes that express the statistical characteristics of the data well. Given a generative model estimated from the given data, it is also possible to predict future purchase behaviors of customers or to generate virtual data for simulation analysis and make decisions based on the result. We verify the effectiveness of our model by analyzing real purchase history data from a major Japanese retail company.

## 1. INTRODUCTION

It has recently become easier to use various kinds of customer information, such as customers’ purchase history and ID card information, due to the development of information technology. Market segmentation using a wide variety of data, e.g., customers’ purchase history, is especially useful in the marketing field for devising strategies to improve a business’s performance (Beane and Ennis, 1987). Many companies employ customer segmentation to develop individual promotions for each customer segment.

It is generally said that acquiring a new customer costs about five times more than keeping an existing one, and that the majority of a company’s business earnings comes from existing customers. Hence, customer segmentation for service customization can lead to an increase in customer loyalty and can ensure high customer retention at the lowest cost (Khajvand et al., 2011; Wei et al., 2012). The traditional RFM variables can be applied to identify a target group of customers for direct mail and can be used in a functional relationship to model direct mail response (Bult and Wansbeek, 1995). Chang and Tsai (2011) proposed the group RFM analysis, which takes into account the characteristics of purchased products so that the calculated RFM values for customers are strongly related to their purchased items. Hu and Yeh (2014) developed an RFM analysis algorithm to discover a complete set of RFM patterns that can approximate the set of RFM customer patterns without customer identification information.

As mentioned above, RFM analysis has been applied to many problems, and in many cases the scores are analyzed with a clustering approach. However, since the conventional RFM analysis does not assume a generative model, there are three problems: (i) the analyst must arbitrarily decide the thresholds for the RFM scores, (ii) the distribution of the RFM scores is not considered when deciding the clusters, and (iii) the result of a hard clustering approach may be difficult to interpret. For example, when four thresholds are set to divide each of the three variables into five levels, there is no rule for choosing them; therefore, the analysis result may be arbitrary. Also, when the cluster of each customer is decided based on the RFM scores, the clustering is performed separately from the RFM scoring; therefore, a unified approach that decides the RFM scores and the clusters together is required. Furthermore, conventional RFM analysis generally applies hard clustering, which makes interpretation difficult, especially for samples that seem to have the features of several clusters.

On the other hand, in the field of machine learning, probabilistic latent semantic analysis (PLSA) is widely used for soft clustering problems (Hofmann and Puzicha, 1999; Hofmann, 1999). The PLSA is a powerful statistical technique for analyzing co-occurrence data, which was originally used in information retrieval and related areas. Recently, it has been used to predict customers’ purchase behavior based on latent user preferences. The PLSA is a probabilistic model that introduces a latent variable representing latent classes of user preferences for product items. Therefore, it enables us to cluster customers into latent classes and to calculate each customer’s assignment probabilities to each latent class. However, estimating the parameters of the assumed model requires considerable time and computation when the model is applied directly to big data, such as users’ purchase history data for all product items. If the concept of RFM analysis and the merits of latent class models are combined, a soft clustering methodology based on RFM analysis should be useful for customer segmentation in practice. The idea of integrating PLSA and group RFM analysis was discussed by Apichottanakula et al. (2013) to segment customers using sparse purchase history data. They applied PLSA to the sales data for each pork processor product, analyzed each item with RFM analysis, and finally integrated the two analysis results for interpretation. The results of their analysis are useful for companies in the pork product industry. However, a statistical model for RFM analysis was not directly constructed, because the two methods (i.e., the PLSA approach and RFM analysis) were applied separately and only their results were combined.

In this paper, we propose a new latent class model for RFM analysis based on purchase history data. The proposed model, which is a generative model of the RFM scores, enables us to construct customer segments based on the RFM variables from a statistical viewpoint. The EM algorithm is applied to the proposed latent class model to estimate suitable parameters. Hence, the proposed model makes it possible to decide the RFM scoring and segment customers at the same time, and the soft clustering approach helps the interpretation of the result. The clusters of the proposed model are constructed by taking into account the distribution of the RFM scores, because the proposed model is a probabilistic model learned from a training data set. In other words, the proposed analysis enables a more adequate analysis of customers’ purchasing data than conventional approaches and is expected to provide stronger support for marketing decisions. The main scientific contribution of this research is a generative model for RFM analysis, a well-known technique in the marketing field. A generative model enables us to investigate principles and laws in a logical way. Given a generative model estimated from the given data, it is also possible to predict future purchase behaviors of customers or to generate virtual data for a simulation analysis and make decisions based on the result. The effectiveness of the proposed model was clarified through the analysis of actual data: the model can generate latent classes that better represent the statistical properties of the data. We demonstrate the analysis using real data from a major Japanese retail company to verify the effectiveness of the proposed method. This data was provided by the 2015 Data Analyzing Competition, held by the Joint Association Study Group of Management Science in Japan. This case study is expected to be helpful for practitioners.

## 2. PRELIMINARIES

### 2.1 RFM Analysis

RFM analysis is a marketing technique for segmenting customers into several groups based on recency, frequency, and monetary values. These variables capture how long it has been since the customer’s last purchase (recency), how many times the customer purchased (frequency), and how much the customer spent (monetary). This method is widely used to conduct appropriate customer segmentation for personalization services (Birant, 2011) and to identify customers who are more likely to respond to promotions (McCarty and Hastak, 2007).

For RFM analysis, analysts must decide the scores of the RFM variables in advance and then calculate the scores of the customers. An example of the scores and thresholds is illustrated in Table 1. By using Table 1, the score of each customer can be calculated (e.g., if a customer’s last purchase was within a week, the customer’s purchase frequency is more than 40, and the customer has spent more than 100,000 yen, then all three scores for this customer are 5). If the customer purchase history dataset is as in the example in Table 2, then the calculated value-scores shown in Table 3 are obtained.

RFM analysis assigns value-scores to each customer based on their past behaviors. By using Table 1 as explained above, one of 125 different score combinations (5×5×5) is assigned to each customer. Therefore, a customer’s score can range from (5, 5, 5) as the highest to (1, 1, 1) as the lowest. The best customer score is (5, 5, 5), and the customers with this score have purchased most recently, most frequently, and have spent the most money. Conversely, the worst customer value-score is (1, 1, 1), and the customers with this score have purchased least recently, least frequently, and have spent the least amount of money. Typical customer groups can be constructed using this method.
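The scoring rule described above can be sketched in a few lines of Python. The cut points below are hypothetical: they are chosen only so that the example in the text (last purchase within a week, more than 40 purchases, more than 100,000 yen) maps to (5, 5, 5), and they are not the values of Table 1. The names `R_CUTS`, `F_CUTS`, `M_CUTS`, and `rfm_score` are introduced here for illustration.

```python
from bisect import bisect_left

# Hypothetical cut points (NOT the values of Table 1): four ascending
# thresholds per variable divide its range into five levels.
R_CUTS = [7, 30, 90, 180]                    # days since last purchase
F_CUTS = [5, 10, 20, 40]                     # number of purchases
M_CUTS = [10_000, 30_000, 50_000, 100_000]   # total spend in yen


def rfm_score(recency_days, frequency, monetary):
    """Map raw R, F, M values to a score triple, each score in 1..5."""
    # Recency: a smaller value is better, so the level is reversed.
    r = 5 - bisect_left(R_CUTS, recency_days)
    # Frequency and monetary: a larger value is better.
    f = 1 + bisect_left(F_CUTS, frequency)
    m = 1 + bisect_left(M_CUTS, monetary)
    return (r, f, m)


best = rfm_score(recency_days=3, frequency=45, monetary=120_000)
worst = rfm_score(recency_days=200, frequency=1, monetary=500)
```

Moving any single cut point changes the scores of the customers near it, which is the source of the arbitrariness discussed in the issues that follow.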

However, the customer segments depend on the thresholds used to delimit the RFM variables, and analysts must decide those thresholds arbitrarily. In conventional RFM approaches, there are three main issues:

• (i) To conduct RFM analysis, analysts have to decide the thresholds for the scores of R, F, and M to make clusters; however, the analysis result may be arbitrary, because the result changes with different thresholds.

• (ii) When analysts decide the cluster of each customer based on the RFM scores, the distribution of the RFM scores is not considered.

• (iii) When analysts make clusters by using the conventional RFM analysis, the hard clustering approach is generally applied; however, interpretation based on hard clustering is difficult, especially for samples that seem to have the features of several clusters.

In this study, we try to solve these issues with our proposed model. The proposed analysis enables a more adequate analysis of customers’ purchasing data than conventional approaches and is expected to provide stronger support for marketing decisions.

### 2.2 Probabilistic Latent Semantic Analysis (PLSA)

Probabilistic latent semantic analysis (PLSA) is one of the topic models, and it was initially used for text-based applications, such as information retrieval or text clustering. This model is a probabilistic latent class model, and it assumes latent classes between users (customers) who have similar preferences and product items that have a similar purchase tendency. This model additionally assumes that the users and the product items belong to each latent class stochastically; that is, it allows them to belong to several different latent classes. The diversity of the user preferences and the tendency of product items are represented based on this assumption. Here, let $u_r\ (r = 1, \ldots, m)$ be the users, $a_j\ (j = 1, \ldots, n)$ be the product items, and $z_k\ (k = 1, \ldots, K)$ be the latent classes. The graphical model of PLSA is shown in Figure 1.

The co-occurrence event of the user $u_r$ and the product item $a_j$ can be modeled by the probabilities $P(z_k)$ and the conditional probabilities $P(u_r \mid z_k)$ and $P(a_j \mid z_k)$ in the PLSA. The probabilistic model is formulated by the following equation:

$P(u_r, a_j) = \sum_{k=1}^{K} P(z_k) P(u_r \mid z_k) P(a_j \mid z_k),$
(1)

where $P(z_k)$ satisfies $\sum_{k=1}^{K} P(z_k) = 1$.
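As a small numerical illustration of Equation (1), the following sketch builds the joint probability table of users and items from randomly drawn PLSA parameters. The sizes and the random parameters are toy assumptions used only to show the computation; they do not come from any data set in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 3, 2  # toy numbers of users, items, and latent classes

# Randomly drawn parameters (illustrative only).
p_z = rng.dirichlet(np.ones(K))                   # P(z_k), shape (K,)
p_u_given_z = rng.dirichlet(np.ones(m), size=K)   # row k holds P(u_r | z_k), shape (K, m)
p_a_given_z = rng.dirichlet(np.ones(n), size=K)   # row k holds P(a_j | z_k), shape (K, n)

# Equation (1): P(u_r, a_j) = sum_k P(z_k) P(u_r | z_k) P(a_j | z_k)
joint = np.einsum("k,kr,kj->rj", p_z, p_u_given_z, p_a_given_z)
```

Because each factor is a proper probability distribution, the entries of `joint` are non-negative and sum to one over all user-item pairs.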

### 2.3 Related Works

RFM analysis is one of the most well-known and effective marketing tools. In recent years, several clustering methods using RFM variables have been reported. For example, Tsai and Chiu (2004) used a designated RFM model to analyze the relative profitability of each customer cluster, where the clusters are made by combining a clustering algorithm with a purchase-based similarity measure. They demonstrated, through a practical marketing implementation, the effectiveness of their proposed method, including the RFM profitability analysis. Liu and Shih (2005) proposed an approach that applied two hybrid methods, a weighted RFM-based method and a preference-based collaborative filtering method, and found recurring patterns. Niyagas et al. (2006) combined an association rule-mining technique and the RFM analysis method to analyze historical e-banking usage data for a bank in Thailand. Cheng and Chen (2009) utilized the RFM model to yield quantitative values as input attributes, applied the k-means algorithm to cluster customer value, and employed rough sets to find classification rules. Wang (2010) used RFM analysis to validate a proposed method based on a hybrid approach that incorporates kernel-induced fuzzy clustering techniques. Hosseini et al. (2010) proposed a method based on an expanded RFM model by joining the weighted RFM-based method to the k-means algorithm, applied in DM with the optimum k according to the Davies-Bouldin index. Wei et al. (2012) extended the RFM model to the LRFM (length, recency, frequency, and monetary) model to segment the patients of a children’s dental clinic in Taiwan, adopting the self-organizing maps (SOM) technique to make clusters.

In this paper, we apply the latent class model to RFM variables to cluster customers. The latent class model enables soft clustering of customers without deciding any thresholds for RFM scores. With this proposed model, we can expect to solve the three issues shown in subsection 2.1.

The latent class model is one of the effective probabilistic models in marketing. The effectiveness of the latent class model, which is a discrete type of latent variable model, has been widely recognized in the research literature (Green et al., 1976; Swait and Adamowicz, 2001; Bhatnagar and Ghose, 2004; Train, 2009). Latent customer clusters can be modeled by introducing a latent class model, and this assumption is consistent with marketing models (Train, 2009). This paper attempts to combine latent class segmentation analysis and RFM analysis. Note that there is a study that integrated PLSA and RFM analysis (Apichottanakula et al., 2013) to segment customers for pork processor products, as stated in the Introduction. That study applied PLSA to the sales data for each pork processor product, analyzed each item with RFM analysis, and finally integrated the two analysis results for interpretation; that is, the two methods (i.e., the PLSA approach and RFM analysis) were applied separately and their results were combined. In contrast, this paper constructs a latent class model for RFM analysis to decide the RFM scoring statistically. This is a clear difference between the two studies.

## 3. PROPOSED METHOD

The PLSA assumes a latent class between users and product items. In this study, we propose a new latent class model for RFM analysis, a modification of the conventional PLSA, that uses the feature values of the R, F, and M variables simultaneously to represent customer purchasing behaviors for customer segmentation. A method based on the EM algorithm is used to estimate the parameters from the purchase history.

### 3.1 Proposed Method

In this section, we propose a new latent class model as follows.

We denote the R, F, and M variables as $x_{ni}$ $(i \in \{1, \ldots, N\},\ n \in \{1, 2, 3\})$, where $n = 1$ represents the R variable, $n = 2$ represents the F variable, $n = 3$ represents the M variable, and $N$ is the number of customers. The purchase behavior of customer $i$ is then denoted by the vector $(x_{1i}, x_{2i}, x_{3i})$. The proposed probabilistic model is formulated by Equation (2):

$P(x_{1i}, x_{2i}, x_{3i}) = \sum_{k=1}^{K} P(z_k) P(x_{1i} \mid z_k) P(x_{2i} \mid z_k) P(x_{3i} \mid z_k)$
(2)

The graphical model of our proposed method is shown in Figure 2.

This model is described by the conditional probabilities $P(x_{ni} \mid z_k)$ of the three variables in RFM analysis. That is, the model assumes that the three variables are conditionally independent of one another given a latent class. Denoting the mean as $\mu_{nk}$ and the variance as $\sigma_{nk}^2$, the probabilities $P(x_{ni} \mid z_k)$ can be represented by the following equation:

$P(x_{ni} \mid z_k) = \frac{1}{\sqrt{2 \pi \sigma_{nk}^2}} \exp \left[ - \frac{(x_{ni} - \mu_{nk})^2}{2 \sigma_{nk}^2} \right]$
(3)

The mean $\mu_{nk}$ and the variance $\sigma_{nk}^2$ depend on $z_k$ and have the subscript $k$ because these conditional probabilities are conditioned on the latent class $z_k$.
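To make Equations (2) and (3) concrete, the following sketch evaluates the mixture density for one customer. The parameter values are hypothetical and chosen only for illustration; the function names `gaussian_pdf` and `mixture_density` are introduced here and are not part of the original study.

```python
import numpy as np


def gaussian_pdf(x, mu, var):
    """Normal density of Equation (3), evaluated element-wise."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def mixture_density(x, p_z, mu, var):
    """Equation (2) for one customer.

    x has shape (3,) holding (R, F, M); p_z has shape (K,);
    mu and var have shape (K, 3), one row per latent class.
    """
    # Product over the three variables: conditional independence given z_k.
    per_class = np.prod(gaussian_pdf(x[None, :], mu, var), axis=1)  # shape (K,)
    # Weighted sum over the latent classes.
    return float(np.dot(p_z, per_class))


# Hypothetical parameters for K = 2 latent classes (illustrative values only).
p_z = np.array([0.6, 0.4])
mu = np.array([[10.0, 30.0, 80.0], [120.0, 3.0, 5.0]])
var = np.array([[9.0, 25.0, 100.0], [400.0, 1.0, 4.0]])
density = mixture_density(np.array([12.0, 28.0, 75.0]), p_z, mu, var)
```

A customer close to the first class mean, as in this example, receives almost all of its density from the first term of the sum.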

### 3.2 The Method of Parameter Estimation

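These parameters can be estimated from the purchase history data by maximizing the log-likelihood in Equation (4).

$LL = \sum_{i=1}^{N} \log P(x_{1i}, x_{2i}, x_{3i}) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} P(z_k) P(x_{1i} \mid z_k) P(x_{2i} \mid z_k) P(x_{3i} \mid z_k)$
(4)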

Note that $P(x_{1i}, x_{2i}, x_{3i})$ in Equation (4) includes the latent variable $z_k$, which cannot be observed. Because the estimators of the model parameters cannot be expressed analytically in the presence of the latent variable $z_k$, it is necessary to employ an iterative procedure such as the EM algorithm. The EM algorithm estimates the parameters based on the maximum likelihood principle by using an iterative procedure with only the observed data. This algorithm contains two steps: the expectation step (E-step) and the maximization step (M-step). The E-step calculates the conditional expectation of the log-likelihood from the observed data under the current parameters. The M-step maximizes the conditional expectation of the log-likelihood calculated in the previous E-step. By iterating these two steps, the log-likelihood converges to a local maximum, and the estimated parameters are obtained. Each step of this algorithm for the proposed model is formulated by the following equations:

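[The E-step]

$P(z_k \mid x_{1i}, x_{2i}, x_{3i}) = \frac{P(z_k) P(x_{1i} \mid z_k) P(x_{2i} \mid z_k) P(x_{3i} \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'}) P(x_{1i} \mid z_{k'}) P(x_{2i} \mid z_{k'}) P(x_{3i} \mid z_{k'})}$
(5)

The probabilities $P(z_k \mid x_{1i}, x_{2i}, x_{3i})$ are updated after each parameter has been estimated in the M-step.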

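[The M-step]

$P(z_k) = \frac{1}{N} \sum_{i=1}^{N} P(z_k \mid x_{1i}, x_{2i}, x_{3i})$
(6)

$\mu_{nk} = \frac{\sum_{i=1}^{N} x_{ni} P(z_k \mid x_{1i}, x_{2i}, x_{3i})}{\sum_{i=1}^{N} P(z_k \mid x_{1i}, x_{2i}, x_{3i})}$
(7)

$\sigma_{nk}^2 = \frac{\sum_{i=1}^{N} (x_{ni} - \mu_{nk})^2 P(z_k \mid x_{1i}, x_{2i}, x_{3i})}{\sum_{i=1}^{N} P(z_k \mid x_{1i}, x_{2i}, x_{3i})}$
(8)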

The EM algorithm is stopped when the log-likelihood function of equation (4) converges.
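The estimation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the initialization scheme, the small variance floor, the guard against empty classes, the tolerance, and the synthetic two-cluster data are all choices made here for the example, and the function name `fit_rfm_latent_class` is hypothetical.

```python
import numpy as np


def fit_rfm_latent_class(X, K, n_iter=200, tol=1e-6, seed=0):
    """Fit a K-class mixture of conditionally independent Gaussians by EM.

    X is an (N, 3) array of raw R, F, M values. Returns P(z_k), the means,
    the variances, the posterior class probabilities, and the log-likelihood.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization (an assumption of this sketch): uniform class weights,
    # means at K randomly chosen customers, overall variance for every class.
    p_z = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    var = np.tile(X.var(axis=0), (K, 1)) + 1e-6
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior P(z_k | x_1i, x_2i, x_3i), computed in log space.
        sq = (X[:, None, :] - mu[None, :, :]) ** 2               # (N, K, D)
        log_pdf = -0.5 * (np.log(2.0 * np.pi * var)[None] + sq / var[None])
        log_joint = np.log(p_z)[None, :] + log_pdf.sum(axis=2)   # (N, K)
        log_px = np.logaddexp.reduce(log_joint, axis=1)          # log P(x_1i, x_2i, x_3i)
        resp = np.exp(log_joint - log_px[:, None])
        ll = float(log_px.sum())                                 # log-likelihood
        # M-step: responsibility-weighted updates, as in Equations (7) and (8).
        Nk = resp.sum(axis=0) + 1e-12                            # guard against empty classes
        p_z = Nk / Nk.sum()
        mu = (resp.T @ X) / Nk[:, None]
        sq = (X[:, None, :] - mu[None, :, :]) ** 2
        var = np.einsum("ik,ikd->kd", resp, sq) / Nk[:, None] + 1e-6  # variance floor
        # Stop when the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return p_z, mu, var, resp, ll


# Synthetic data with two well-separated groups (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([5.0, 30.0, 100.0], [2.0, 5.0, 10.0], size=(300, 3)),
    rng.normal([100.0, 3.0, 10.0], [20.0, 1.0, 3.0], size=(300, 3)),
])
p_z, mu, var, resp, ll = fit_rfm_latent_class(X, K=2)
```

Each row of `resp` holds the soft assignment of one customer to the latent classes. Working in log space is a design choice of this sketch: it avoids numerical underflow when the three densities are multiplied for customers far from a class mean.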

## 4. ANALYSIS OF PURCHASE HISTORY DATA BASED ON PROPOSED METHOD

The parameters of the proposed model are estimated from purchase history data. In this section, we analyze the real purchase history data of a major Japanese retail company by applying the proposed model. The data for this demonstration was provided by the 2015 Data Analyzing Competition, held by the Joint Association Study Group of Management Science.

### 4.1 Data and Analysis Settings

The proposed method is applied to real purchase history data to verify the effectiveness of our proposal. The data provided by the Joint Association Study Group of Management Science in Japan is used. We used purchase history data recorded from July 1, 2013, to June 30, 2014, as the training data, which covers 10 stores of a major Japanese retail company. The number of customers is $N = 113{,}381$. In addition, the number of latent classes of the model is set to $K = 20$, which was determined empirically.

### 4.2 Result and Discussion

The result for the RFM data of the 113,381 customers is shown in Table 4. Table 4 lists the 20 latent classes with the averages of the RFM variables and the rate of customers belonging to each class.

For example, latent class 1 is the segment whose last purchase was made 110.9 days ago on average, with a purchase frequency of 4.0 times and a total spend of 5,798.30 yen; 3.44% of the customers belong to this segment. To make the result easier to grasp, it is visualized in Figure 3, which shows the 20 customer segments; the size of each circle indicates the rate of customers belonging to each latent class.

The best customers are represented by the circle in the upper left because the averages of their F and M variables are the largest, and the average of their R variable is the smallest. Conversely, the worst customers are represented by the circle at the lower front because the averages of their F and M variables are the smallest, and the average of their R variable is the largest.

Moreover, the number of worst customers is relatively large. In addition, the circle in the middle denotes estranged customers, as the averages of all R, F, and M variables are relatively large. Additionally, the proposed model can be used to analyze the differences between stores’ characteristics. Here, we compare the two stores with the largest numbers of customers and the two stores with the smallest numbers of customers. Table 5 displays the characteristics of each segment and the rate of customers belonging to each segment for the four stores. In segment c18, which refers to the best customers, and segment c5, which refers to good customers, the rate of customers belonging is higher in the smaller stores than in the larger stores. In contrast, in segment c16, which refers to the worst customers, and segments c17 and c19, which refer to bad customers, the rate of customers belonging is lower in the smaller stores than in the larger stores. This result indicates that smaller stores have a higher share of frequent customers. The differences in stores’ characteristics can be investigated in this manner, and such information is useful for developing a channel and branch strategy.

In this section, we clarified that it is possible to create customer segments based on latent classes. The proposed model avoids the drawback of the conventional RFM analysis that analysts must decide proper thresholds to assign discrete scores to the three variables, as in Table 1. Therefore, the proposed model can be regarded as creating customer segments from a statistical viewpoint, and the size of each customer segment is estimated automatically. The number of customers in each segment is important when considering promotional strategies.

## 5. CONCLUSION AND FUTURE WORK

In this paper, we proposed a new latent class model based on the PLSA and RFM analysis. Because the conventional RFM analysis does not assume a generative model, it has three main problems: (i) the thresholds for the RFM scores must be decided arbitrarily, (ii) the distribution of the RFM scores is not considered when deciding the clusters, and (iii) the result of a hard clustering approach may be difficult to interpret. Our approach addresses these three issues, which enables a more adequate analysis of customers’ purchasing data and is expected to provide stronger support for marketing decisions. The effectiveness of the proposed method was demonstrated using real data from a major Japanese retail company. The customers are segmented by latent classes, and the characteristics of each segment can be analyzed with the estimated model. Moreover, we compared the rate of typical customers between several stores and found that smaller stores have a higher rate of good customers and a lower rate of bad customers. The demonstration using real purchase history data from a Japanese company is expected to be helpful for practitioners, because case studies provide useful information for readers who work in companies. The proposed method is based on a generative model of the R, F, and M scores. This is an advantage for analysts, because the estimated model can also be used to predict the future purchasing behaviors of customers. The model can also be used to generate artificial data for simulation analysis and for decision making based on the result. The original RFM analysis is not based on a generative model, so it does not offer these benefits.
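The generation of artificial data mentioned above can be illustrated with a short sketch: a latent class is drawn for each virtual customer, and the R, F, and M values are then drawn from the normal distributions of that class. The parameter values below are hypothetical stand-ins for an estimated model, and the function name `sample_virtual_customers` is introduced here for illustration.

```python
import numpy as np


def sample_virtual_customers(p_z, mu, var, size, seed=0):
    """Draw virtual (R, F, M) records from a fitted latent class model."""
    rng = np.random.default_rng(seed)
    # Draw a latent class z for each virtual customer according to P(z_k).
    z = rng.choice(len(p_z), size=size, p=p_z)
    # Draw R, F, M independently from the normal distributions of that class.
    x = rng.normal(loc=mu[z], scale=np.sqrt(var[z]))   # shape (size, 3)
    return z, x


# Hypothetical parameters for K = 2 classes (illustrative values, not estimates).
p_z = np.array([0.7, 0.3])
mu = np.array([[100.0, 3.0, 5000.0], [10.0, 30.0, 80000.0]])
var = np.array([[400.0, 1.0, 1.0e6], [9.0, 25.0, 1.0e8]])
z, x = sample_virtual_customers(p_z, mu, var, size=1000)
```

Because the normal distribution has unbounded support, some generated values can be negative, which relates to the remark on non-negativity in the future work below.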

Future work includes investigating other suitable probability distributions in place of the normal distribution; for example, the exponential distribution may be effective because the data are non-negative. In addition, future work could also involve analyzing the latent classes in more detail and clarifying customers’ preferences.
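For illustration only, one way the exponential alternative could be written is shown below. This is a sketch under the assumption that each conditional distribution in Equation (2) is replaced by an exponential density; the rate parameter $\lambda_{nk}$ is a symbol introduced here and does not appear in this study, and no results for this variant are reported.

$P(x_{ni} \mid z_k) = \lambda_{nk} \exp(-\lambda_{nk} x_{ni}), \quad x_{ni} \geq 0$

Under this assumption, maximizing the responsibility-weighted log-likelihood in the M-step would give the rate as the inverse of the weighted mean, in place of Equations (7) and (8):

$\lambda_{nk} = \frac{\sum_{i=1}^{N} P(z_k \mid x_{1i}, x_{2i}, x_{3i})}{\sum_{i=1}^{N} x_{ni} P(z_k \mid x_{1i}, x_{2i}, x_{3i})}$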

## ACKNOWLEDGMENTS

The authors wish to acknowledge all members of Goto Laboratory, Waseda University, and the Joint Association Study Group of Management Science for their support of our research. This study was partially supported by JSPS KAKENHI, Grant Numbers 26282090 and 26560167.

## Figure

Figure 1. Graphical model of PLSA.

Figure 2. The proposed model.

Figure 3. A visualization of the experiment results.

## Table

Table 1. A standard of delimiting for RFM analysis

Table 2. An example of customer data

Table 3. The result of value-scores to customer

Table 4. Results of the experiment

Table 5. The comparison of 4 stores

## REFERENCES

1. Apichottanakula, A., Goto, M., and Pathumnakul, S. (2013), Applications of group RFM and pLSA models for customer segmentation and characteristic analysis in Thai pork processor, a manuscript in research projects, Thailand.
2. Beane, T. P. and Ennis, D. M. (1987), Market segmentation: A review, European Journal of Marketing, 21(5), 20-42.
3. Bhatnagar, A. and Ghose, S. (2004), A latent class segmentation analysis of e-shoppers, Journal of Business Research, 57(7), 758-767.
4. Birant, D. (2011), Data mining using RFM analysis, Knowledge-Oriented Applications in Data Mining, InTech, Rijeka, Croatia, 92-108.
5. Bult, J. R. and Wansbeek, T. (1995), Optimal selection for direct mail, Marketing Science, 14(4), 378-394.
6. Chang, H. C. and Tsai, H. P. (2011), Group RFM analysis as a novel framework to discover better customer consumption behavior, Expert Systems with Applications, 38(12), 14499-14513.
7. Cheng, C. H. and Chen, Y. S. (2009), Classifying the segmentation of customer value via RFM model and RS theory, Expert Systems with Applications, 36(3), 4176-4184.
8. Green, P. E., Carmone, F. J., and Wachspress, D. P. (1976), Consumer segmentation via latent class analysis, Journal of Consumer Research, 3(3), 170-174.
9. Hofmann, T. (1999), Probabilistic latent semantic analysis, Proceedings of the fifteenth International Joint Conference on Uncertainty in Artificial Intelligence, 289-296.
10. Hofmann, T. and Puzicha, J. (1999), Latent class models for collaborative filtering, Proceedings of the sixteenth International Joint Conference on Artificial Intelligence, 688-693.
11. Hosseini, S. M. S., Maleki, A., and Gholamian, M. R. (2010), Cluster analysis using data mining approach to develop CRM methodology to assess the customer loyalty, Expert Systems with Applications, 37(7), 5259-5264.
12. Hu, Y. H. and Yeh, T. W. (2014), Discovering valuable frequent patterns based on RFM analysis without customer identification information, Knowledge-Based Systems, 61, 76-88.
13. Khajvand, M., Zolfaghar, K., Ashoori, S., and Alizadeh, S. (2011), Estimating customer lifetime value based on RFM analysis of customer purchase behavior: Case study, Procedia Computer Science, 3, 57-63.
14. Liu, D. R. and Shih, Y. Y. (2005), Integrating AHP and data mining for product recommendation based on customer lifetime value, Information & Management, 42(3), 387-400.
15. McCarty, J. A. and Hastak, M. (2007), Segmentation approaches in data-mining: A comparison of RFM, CHAID, and logistic regression, Journal of Business Research, 60(6), 656-662.
16. Niyagas, W., Srivihok, A., and Kitisin, S. (2006), Clustering e-banking customer using datamining and marketing segmentation, ECTI Transaction CIT, 2(1), 63-69.
17. Swait, J. and Adamowicz, W. (2001), The influence of task complexity on consumer choice: A latent class model of decision strategy switching, Journal of Consumer Research, 28(1), 135-148.
18. Train, K. E. (2009), Discrete Choice Methods with Simulation (2nd ed.), Cambridge University Press, Cambridge.
19. Tsai, C. Y. and Chiu, C. C. (2004), A purchase-based market segmentation methodology, Expert Systems with Applications, 27(2), 265-276.
20. Wang, C. H. (2010), Apply robust segmentation to the service industry using kernel induced fuzzy clustering techniques, Expert Systems with Applications, 37(12), 8395-8400.
21. Wei, J. T., Lin, S. Y., Weng, C. C., and Wu, H. H. (2012), A case study of applying LRFM model in market segmentation of a children’s dental clinic, Expert Systems with Applications, 39(5), 5529-5533.