1. INTRODUCTION
It has recently become easier to use various kinds of customer information, such as customers’ purchase history and ID card information, owing to the development of information technology. Market segmentation using a wide variety of data, e.g., customers’ purchase history, is especially useful in the marketing field for devising strategies to improve business performance (Beane and Ennis, 1987). Many companies employ customer segmentation to develop individual promotions for each customer segment.
Generally, it costs five times more to acquire a new customer than to keep an existing one, and the majority of a company’s earnings comes from existing customers. Hence, customer segmentation for service customization can increase customer loyalty and secure high customer retention at the lowest cost (Khajvand et al., 2011; Wei et al., 2012). The traditional RFM (recency, frequency, and monetary) variables can be applied to identify a target group of customers for direct mail and can be used in a functional relationship to model direct mail response (Bult and Wansbeek, 1995). Chang and Tsai (2011) proposed the group RFM analysis, which takes into account the characteristics of purchased products so that the calculated RFM values for customers are strongly related to their purchased items. Hu and Yeh (2014) developed an RFM analysis algorithm to discover a complete set of RFM patterns that can approximate the set of RFM customer patterns without customer identification information.
As mentioned above, RFM analysis is applied to many problems, and in many cases the scores are analyzed with a clustering approach. However, because the conventional RFM analysis does not assume a generative model, it has three problems: (i) the analyst must arbitrarily decide the thresholds for the RFM scores, (ii) the distribution of the RFM scores is not considered when deciding the clusters, and (iii) the results of a hard clustering approach may be difficult to interpret. When four thresholds are set to divide each of the three variables into five levels, there is no rule for choosing them; therefore, the analysis result may be arbitrary. In addition, when the cluster of each data point is decided from its RFM score, the clustering decision is made separately from the scoring decision; therefore, a unified approach that decides the RFM scores and the clusters together is required. Furthermore, the clusters built by conventional RFM analysis are generally hard clusters, and their interpretation is difficult, especially for samples that seem to have features of several clusters.
On the other hand, in the field of machine learning, probabilistic latent semantic analysis (PLSA) is widely used for soft clustering problems (Hofmann and Puzicha, 1999; Hofmann, 1999). PLSA is a powerful statistical technique for analyzing co-occurrence data, which was originally used in information retrieval and related areas. Recently, it has been used to predict customers’ purchase behavior based on latent user preferences. PLSA is a probabilistic model that introduces a latent variable to represent latent classes of user preferences for product items. Therefore, it enables us to cluster customers into latent classes and to calculate each customer’s assignment probability for each latent class. However, estimating the parameters of the assumed model becomes costly in time and computation when the model is applied directly to big data, such as users’ purchase history data for all product items. If the concept of RFM analysis and the merits of latent class models are combined, a soft clustering methodology based on RFM analysis should be useful for customer segmentation in practice. The idea of integrating PLSA and group RFM analysis was discussed by Apichottanakula et al. (2013) to segment customers from sparse purchasing history data. They applied PLSA to the sales data for each pork processor product, analyzed each item with RFM analysis, and finally integrated the two analysis results for interpretation. The results of their analysis are useful for companies in the pork product industry. However, a statistical model for the RFM analysis itself was not constructed, since they simply combined the results of the two methods (i.e., the PLSA approach and RFM analysis).
In this paper, we propose a new latent class model for RFM analysis based on purchase history data. The proposed model, which is a generative model of the RFM scores, enables us to construct customer segments based on the RFM variables from a statistical viewpoint. The EM algorithm is applied to the proposed latent class model to estimate suitable parameters. Hence, the proposed model makes it possible to decide the RFM scoring and segment customers at the same time, and the soft clustering approach helps the interpretation of the result. The clusters of the proposed model take the distribution of the RFM scores into account because the proposed model is a probabilistic model learned from a training data set. In other words, the proposed analysis is able to analyze customers’ purchasing data more adequately than conventional approaches and is expected to support marketing decisions more strongly. The main scientific value of this research is to propose a generative model for RFM analysis, a well-known technique in the marketing field. A generative model enables us to investigate principles and laws in a logical way. Given a generative model estimated from data, it is also possible to predict the future purchase behavior of customers, or to generate virtual data for a simulation analysis and make decisions based on the result. The effectiveness of the proposed model is clarified through the analysis of actual data, which shows that it is possible to generate latent classes that better represent the statistical properties of actual data. We demonstrate the analysis using real data from a major Japanese retail company to verify the effectiveness of the proposed method. These data were provided by the 2015 Data Analyzing Competition, held by the Joint Association Study Group of Management Science in Japan. This case study is expected to be helpful for practitioners.
2. PRELIMINARIES
2.1 RFM Analysis
RFM analysis is a marketing technique for segmenting customers into several groups based on recency, frequency, and monetary values. It uses how long it has been since the customer’s last purchase (recency), how many times the customer purchased (frequency), and how much the customer spent (monetary). This method is widely used to conduct appropriate customer segmentation for personalization services (Birant, 2011), and to identify customers who are more likely to respond to promotions (McCarty and Hastak, 2007).
For RFM analysis, analysts must decide the scoring of the RFM variables in advance and then calculate the scores of the customers. An example of the scores and thresholds is illustrated in Table 1. By using Table 1, the score of each customer can be calculated (e.g., if a customer’s last purchase was within a week, the customer’s purchase frequency is more than 40, and the customer has spent more than 100,000 yen, then all three scores for this customer are 5). If the customer purchase history dataset is as in the example in Table 2, then the calculated value scores shown in Table 3 are obtained.
RFM analysis assigns value scores to each customer based on their past behavior. By using Table 1 as explained above, a maximum of 125 different scores (5×5×5) can be assigned to customers. A customer’s score can therefore range from (5, 5, 5) as the highest to (1, 1, 1) as the lowest. The best customer score is (5, 5, 5), and the customers with this score have purchased most recently, most frequently, and have spent the most money. Conversely, the worst customer value score is (1, 1, 1), and the customers with this score have purchased least recently, least frequently, and have spent the least amount of money. Typical customer groups can be constructed using this method.
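The threshold-based scoring described above can be sketched in a few lines. This is only an illustration: the cut points below are hypothetical and are not the values of Table 1, except that they respect the example in the text (a last purchase within a week, more than 40 purchases, and more than 100,000 yen each give a score of 5). The function and constant names are likewise assumptions.

```python
from bisect import bisect_left

# Hypothetical cut points (NOT the values of Table 1): four ascending
# thresholds split each variable into five score levels.
RECENCY_CUTS = [7, 30, 90, 180]                    # days since last purchase
FREQUENCY_CUTS = [5, 10, 20, 40]                   # number of purchases
MONETARY_CUTS = [5_000, 20_000, 50_000, 100_000]   # yen spent


def rfm_score(recency_days, frequency, monetary):
    """Return the (R, F, M) score triple, each component in 1..5."""
    # Recency: smaller is better, so a value at or below the first cut scores 5.
    r = 5 - bisect_left(RECENCY_CUTS, recency_days)
    # Frequency and monetary: larger is better; a value must strictly
    # exceed a cut point to move up one level.
    f = 1 + bisect_left(FREQUENCY_CUTS, frequency)
    m = 1 + bisect_left(MONETARY_CUTS, monetary)
    return (r, f, m)
```

With these assumed cut points, `rfm_score(3, 45, 120_000)` gives (5, 5, 5), whereas a customer sitting exactly on the boundaries, `rfm_score(8, 40, 100_000)`, drops to (4, 4, 4). Small shifts in the thresholds therefore move customers between groups.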
However, the customer segments depend on the thresholds that delimit the RFM variables, and these thresholds are decided arbitrarily by the analysts. In conventional RFM approaches, there are three main issues:

(i) To conduct RFM analysis, analysts have to decide the thresholds for the scores of R, F, and M to make clusters; however, the analysis result may be arbitrary, because it changes when different thresholds are used.

(ii) When analysts decide the cluster of each data point based on the RFM scores, the distribution of the RFM scores is not considered.

(iii) When analysts make clusters by using the conventional RFM analysis, the hard clustering approach is generally applied; however, interpretation based on hard clustering is difficult, especially for samples that seem to have features of several clusters.

In this study, we try to solve these issues with our proposed model. The proposed analysis is able to analyze customers’ purchasing data more adequately than conventional approaches and is expected to support marketing decisions more strongly.
2.2 Probabilistic Latent Semantic Analysis (PLSA)
Probabilistic latent semantic analysis (PLSA) is one of the topic models, and it was initially used for text-based applications, such as information retrieval or text clustering. This model is a probabilistic latent class model, and it assumes latent classes that link users (customers) who have similar preferences with product items that have a similar purchase tendency. The model additionally assumes that the users and the product items belong to each latent class stochastically; that is, they are allowed to belong to several different latent classes. The diversity of user preferences and the tendencies of product items are represented based on this assumption. Here, let ${u}_{r}(r=1,\hspace{0.17em}\dots ,\hspace{0.17em}m)$ be users, ${a}_{j}(j=1,\hspace{0.17em}\dots ,\hspace{0.17em}n)$ be the product items, and ${z}_{k}(k=1,\hspace{0.17em}\dots ,\hspace{0.17em}K)$ be the latent classes. The graphical model of PLSA is described in Figure 1.
The co-occurrence event of the user $u_{r}$ and the product item $a_{j}$ can be modeled in PLSA by the probabilities $P(z_{k})$ and the conditional probabilities $P(u_{r} \mid z_{k})$ and $P(a_{j} \mid z_{k})$. The probabilistic model is formulated by the following equation:
$$P(u_{r}, a_{j}) = \sum_{k=1}^{K} P(z_{k})\, P(u_{r} \mid z_{k})\, P(a_{j} \mid z_{k}) \qquad (1)$$
where $P(z_{k})$ satisfies ${\sum}_{k=1}^{K}\text{P}\left({z}_{k}\right)=1$.
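As a small numerical illustration of the PLSA mixture, the co-occurrence probability is the sum over latent classes of P(z_k) P(u_r | z_k) P(a_j | z_k). The sketch below evaluates that sum for toy parameters; the values, the function name, and the variable names are made up for illustration and are not taken from the paper.

```python
def plsa_joint(r, j, p_z, p_u_given_z, p_a_given_z):
    """P(u_r, a_j) = sum over k of P(z_k) * P(u_r | z_k) * P(a_j | z_k)."""
    return sum(
        pz * pu[r] * pa[j]
        for pz, pu, pa in zip(p_z, p_u_given_z, p_a_given_z)
    )


# Toy parameters (illustrative only): K = 2 classes, m = 2 users, n = 3 items.
P_Z = [0.6, 0.4]                                   # P(z_k), sums to 1
P_U_GIVEN_Z = [[0.7, 0.3], [0.2, 0.8]]             # row k holds P(u_r | z_k)
P_A_GIVEN_Z = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]   # row k holds P(a_j | z_k)

# 0.6 * 0.7 * 0.5 + 0.4 * 0.2 * 0.1 = 0.218
p_00 = plsa_joint(0, 0, P_Z, P_U_GIVEN_Z, P_A_GIVEN_Z)
```

Because every row of the conditional tables sums to one, the joint probabilities over all user and item pairs also sum to one.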
2.3 Related Works
RFM analysis is one of the most well-known and effective marketing tools. In recent years, several methods of clustering have been reported using RFM variables. For example, Tsai and Chiu (2004) used a designated RFM model to analyze the relative profitability of each customer cluster, which is made by combining a clustering algorithm with a purchase-based similarity measure. They demonstrated, through a practical marketing implementation, the effectiveness of their proposed method, including the RFM profitability analysis. Liu and Shih (2005) proposed an approach that applied two hybrid methods: a weighted RFM-based method and the preference-based collaborative filtering method, and found recurring patterns. Niyagas et al. (2006) combined an association rule-mining technique and the RFM analysis method to analyze e-banking usage of historical data for a bank in Thailand. Cheng and Chen (2009) utilized the RFM model to yield quantitative value as input attributes, applied the k-means algorithm to cluster customer value, and employed rough sets to find classification rules. Wang (2010) used RFM analysis to validate the proposed method based on a hybrid approach that incorporates kernel-induced fuzzy clustering techniques. Hosseini et al. (2010) proposed a method based on an expanded RFM model by joining the weighted RFM-based method to the k-means algorithm, applied in DM with the k-optimum according to the Davies-Bouldin index. Wei et al. (2012) extended the RFM model to the LRFM (length, recency, frequency, and monetary) model for a children’s dental clinic in Taiwan to segment its dental patients. The self-organizing maps (SOM) technique is adopted to make clusters.
In this paper, we apply a latent class model to the RFM variables to cluster customers. The latent class model enables soft clustering of customers without deciding any thresholds for the RFM scores. With the proposed model, we expect to solve the three issues listed in Subsection 2.1.
The latent class model is one of the effective probabilistic models in marketing. The effectiveness of the latent class model, which is a discrete type of latent variable model, has been widely recognized in the research literature (Green et al., 1976; Swait and Adnmowicz, 2001; Bhatnagar and Ghose, 2004; Train, 2009). Latent customer clusters can be modeled by introducing a latent class model, and this assumption is consistent with marketing models (Train, 2009). This paper attempts to combine latent class segmentation analysis and RFM analysis. Note that, as stated in the Introduction, there is a study that integrated PLSA and RFM analysis (Apichottanakula et al., 2013) to segment customers for pork processor products. That study applied PLSA to the sales data for each pork processor product, analyzed each item with RFM analysis, and finally integrated the two analysis results for interpretation; that is, it simply combined the results of the two methods (i.e., the PLSA approach and RFM analysis). In contrast, this paper constructs a latent class model for the RFM analysis itself in order to decide the RFM scoring statistically. This is a clear difference between the two studies.
3. PROPOSED METHOD
PLSA assumes latent classes between users and product items. In this study, we propose a new latent class model for RFM analysis, a modification of the conventional PLSA model, that uses the feature values of the R, F, and M variables simultaneously to represent customer purchasing behavior and to segment customers. A method based on the EM algorithm is used to estimate the parameters from the purchase history.
3.1 Proposed Method
In this section, we propose a new latent class model as follows.
We denote the R, F, and M variables as $x_{ni}$ ($i \in \{1, \dots, L\}$, $n \in \{1, 2, 3\}$), where $n = 1$ represents the R variable, $n = 2$ represents the F variable, and $n = 3$ represents the M variable. The purchase behavior of customer $i$ is then denoted by the vector $({x}_{1i},\hspace{0.17em}{x}_{2i},\hspace{0.17em}{x}_{3i})$. The proposed probabilistic model is formulated by equation (2):
$$P(x_{1i}, x_{2i}, x_{3i}) = \sum_{k=1}^{K} P(z_{k}) \prod_{n=1}^{3} P(x_{ni} \mid z_{k}) \qquad (2)$$
The graphical model of our proposed method is shown by Figure 2.
This model is described by the conditional probabilities $P(x_{ni} \mid z_{k})$ of the three variables in RFM analysis. That is, the model assumes that the three variables are conditionally independent of one another given a latent class. Denoting the mean as $\mu_{nk}$ and the variance as ${\sigma}_{nk}^{2}$, the probabilities $P(x_{ni} \mid z_{k})$ can be represented by the following normal density:
$$P(x_{ni} \mid z_{k}) = \frac{1}{\sqrt{2\pi\sigma_{nk}^{2}}} \exp\left(-\frac{(x_{ni} - \mu_{nk})^{2}}{2\sigma_{nk}^{2}}\right) \qquad (3)$$
The average μ_{nk} and the variance ${\sigma}_{nk}^{2}$ depend on z_{k} and have the subscript k because these conditional probabilities are conditioned by a latent class z_{k}.
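To make the model concrete, the sketch below evaluates the mixture density of one customer, the customer's membership probabilities for each latent class (by Bayes' rule), and a draw of a synthetic customer. It assumes that each conditional probability is a normal density with a class-specific mean and variance, which is the distribution the paper names in its future work. All function names and parameter values are hypothetical and are not estimates from the paper.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def class_joint(x, weights, means, variances):
    """Return P(z_k) * prod_n P(x_n | z_k) for every latent class k."""
    return [
        w * math.prod(normal_pdf(xn, m, v) for xn, m, v in zip(x, mu, var))
        for w, mu, var in zip(weights, means, variances)
    ]


def density(x, weights, means, variances):
    """Mixture density P(x_1, x_2, x_3): the sum over the latent classes."""
    return sum(class_joint(x, weights, means, variances))


def membership(x, weights, means, variances):
    """Soft assignment P(z_k | x) for every class, via Bayes' rule."""
    joint = class_joint(x, weights, means, variances)
    total = sum(joint)
    return [j / total for j in joint]


def sample_customer(weights, means, variances, rng):
    """Draw a latent class, then draw R, F, M independently given that class."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return tuple(rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k]))


# Made-up parameters for two latent classes; each tuple is ordered (R, F, M).
WEIGHTS = [0.7, 0.3]
MEANS = [(10.0, 30.0, 50000.0), (120.0, 3.0, 4000.0)]
VARIANCES = [(25.0, 36.0, 1.0e8), (900.0, 4.0, 1.0e6)]
```

For example, `membership((12.0, 28.0, 48000.0), WEIGHTS, MEANS, VARIANCES)` puts almost all of the probability on the first class. Note that `sample_customer` can return negative values because a normal density has unbounded support.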
3.2 The Method of Parameter Estimation
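These parameters can be estimated from the purchase history data by maximizing the log-likelihood written in Equation (4).
$$LL = \sum_{i=1}^{L} \log \sum_{k=1}^{K} P(z_{k}) \prod_{n=1}^{3} P(x_{ni} \mid z_{k}) \qquad (4)$$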
Note that $P(x_{1i}, x_{2i}, x_{3i})$, which appears in Equation (4), includes the latent variable $z_{k}$, which cannot be observed. Because the estimators of the model parameters cannot be written in closed form, it is necessary to employ an iterative procedure such as the EM algorithm. The EM algorithm estimates the parameters based on the maximum likelihood principle by an iterative procedure that uses only the observed data. The algorithm contains two steps: the expectation step (E-step) and the maximization step (M-step). The E-step calculates the conditional expectation of the log-likelihood given the observed data and the current parameters. The M-step maximizes this conditional expectation. By iterating these two steps, the log-likelihood converges to a local maximum, and the estimated parameters are obtained. Each step of this algorithm for the proposed model is formulated by the following equations:
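[The E-step]
$$P(z_{k} \mid x_{1i}, x_{2i}, x_{3i}) = \frac{P(z_{k}) \prod_{n=1}^{3} P(x_{ni} \mid z_{k})}{\sum_{k'=1}^{K} P(z_{k'}) \prod_{n=1}^{3} P(x_{ni} \mid z_{k'})}$$
The probabilities $P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})$ are updated after each parameter is estimated in the M-step.
[The M-step]
$$P(z_{k}) = \frac{1}{L} \sum_{i=1}^{L} P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})$$
$$\mu_{nk} = \frac{\sum_{i=1}^{L} P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})\, x_{ni}}{\sum_{i=1}^{L} P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})}$$
$$\sigma_{nk}^{2} = \frac{\sum_{i=1}^{L} P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})\, (x_{ni} - \mu_{nk})^{2}}{\sum_{i=1}^{L} P(z_{k} \mid x_{1i}, x_{2i}, x_{3i})}$$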
The EM algorithm is stopped when the log-likelihood function of equation (4) converges.
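The estimation procedure can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes normal conditional densities, uses the standard EM updates for a mixture with conditionally independent normal variables, and adopts an initialisation, a variance floor, and a stopping tolerance that are all choices made here for illustration. It also works with raw probabilities, so data on a very large scale would need a log-space version to avoid underflow.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_latent_class_rfm(data, n_classes, max_iter=200, tol=1e-6, var_floor=1e-6):
    """Fit the latent class model by EM; data is a list of (R, F, M) tuples."""
    size, dims = len(data), len(data[0])
    # Assumed initialisation (not from the paper): equal class weights, means
    # taken at evenly spaced positions of the data sorted by the first
    # variable, and the overall variance of each variable for every class.
    ordered = sorted(data)
    means = [list(ordered[int((k + 0.5) * size / n_classes)]) for k in range(n_classes)]
    centre = [sum(x[n] for x in data) / size for n in range(dims)]
    spread = [max(sum((x[n] - centre[n]) ** 2 for x in data) / size, var_floor)
              for n in range(dims)]
    variances = [spread[:] for _ in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: membership probabilities P(z_k | x_i) and the log-likelihood.
        resp, ll = [], 0.0
        for x in data:
            joint = [
                weights[k] * math.prod(
                    normal_pdf(x[n], means[k][n], variances[k][n]) for n in range(dims))
                for k in range(n_classes)
            ]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # Stop when the log-likelihood no longer improves by more than tol.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means, and variances from memberships.
        # A class that loses all of its mass is not handled in this sketch.
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / size
            for n in range(dims):
                means[k][n] = sum(r[k] * x[n] for r, x in zip(resp, data)) / mass
                variances[k][n] = max(
                    sum(r[k] * (x[n] - means[k][n]) ** 2 for r, x in zip(resp, data)) / mass,
                    var_floor)
    # resp and ll come from the most recent E-step.
    return weights, means, variances, resp, ll
```

The returned `resp` rows are the soft cluster assignments of the customers, and `weights` are the estimated shares of the latent classes. Since EM only reaches a local maximum, the result depends on the initialisation.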
4. ANALYSIS OF PURCHASE HISTORY DATA BASED ON PROPOSED METHOD
The parameters of the proposed model are estimated from purchase history data. In this section, we analyze the real purchase history data of a major Japanese retail company by applying the proposed model. The data for this demonstration were provided by the 2015 Data Analyzing Competition, held by the Joint Association Study Group of Management Science.
4.1 Data and Analysis Settings
The proposed method is applied to real purchase history data to verify its effectiveness. The data provided by the Joint Association Study Group of Management Science in Japan are used. We used purchase history data recorded from July 1, 2013, to June 30, 2014, at 10 stores of a major Japanese retail company as the training data. The number of customers is I = 113,381. In addition, the number of latent classes is set to K = 20, which was determined empirically.
4.2 Result and Discussion
The result for the RFM data of the 113,381 customers is shown in Table 4. Table 4 lists the 20 latent classes with the averages of the RFM variables and the rate of customers belonging to each class.
For example, latent class 1 is the segment in which the last purchase was made 110.9 days ago on average, the purchase frequency is 4.0 times, the total money spent on purchases is 5,798.30 yen, and the rate of customers in the segment is 3.44%. To make the result easier to understand, it is visualized in Figure 3, which shows the 20 customer segments; the size of each circle indicates the rate of customers belonging to that latent class.
The best customers are represented by the circle in the upper left because the averages of their F and M variables are the largest, and the average of their R variable is the smallest. Conversely, the worst customers are represented by the circle at the lower front because the averages of their F and M variables are the smallest, and the average of their R variable is the largest.
Moreover, the number of worst customers is relatively large. In addition, the circle in the middle denotes estranged customers, as the averages of all R, F, and M variables are relatively large. The proposed model can also be utilized to analyze the differences between stores’ characteristics. Here, we compare the two stores with the largest numbers of customers and the two stores with the smallest numbers of customers. Table 5 displays the characteristics of each segment and the rate of customers belonging to each segment for the four stores. In segment c18, which refers to the best customers, and segment c5, which refers to good customers, the rate of customers belonging is higher in the smaller stores than in the larger stores.
In contrast, in segment c16, which refers to the worst customers, and segments c17 and c19, which refer to bad customers, the rate of customers belonging is lower in the smaller stores than in the larger stores. This result suggests that smaller stores have a higher share of frequent customers. The differences in stores’ characteristics can be investigated in this manner, and such information is useful for developing a channel and branch strategy.
In this section, we clarified that it is possible to create customer segments based on latent classes. The proposed model avoids the drawback of conventional RFM analysis, namely that analysts must decide proper thresholds to give discrete scores to the three variables, as in Table 1. The proposed model therefore creates customer segments from a statistical viewpoint, and the size of each customer segment is estimated automatically. The number of customers in each segment is important when considering promotional strategies.
5. CONCLUSION AND FUTURE WORK
In this paper, we proposed a new latent class model based on PLSA and RFM analysis. Because the conventional method for RFM analysis does not assume a generative model, it has three main problems: (i) how to decide the thresholds for the RFM scores, (ii) the distribution of the RFM scores is not considered when deciding the clusters, and (iii) the results of a hard clustering approach may be difficult to interpret. Our approach solves these three issues of conventional RFM analysis. Solving them enables a more adequate analysis of customers’ purchasing data and is expected to support marketing decisions more strongly. The effectiveness of the proposed method was clarified by a demonstration using real data from a major Japanese retail company. The customers are segmented by latent classes, and the characteristics of each segment can be analyzed with the estimated model. Moreover, we compared the rates of typical customers between several stores and found that the smaller stores have a higher rate of good customers and a lower rate of bad customers. The demonstration using real purchasing history data from a Japanese company is considered helpful for practitioners, because case studies are useful information for readers who work for a company. The proposed method is based on a generative model of the R, F, and M scores. This is a merit for an analyst, because the estimated model can be used to predict the future purchasing behavior of customers. The model can also be utilized to generate artificial data for simulation analysis and for decision making based on the result. The original RFM analysis is not based on a generative model, so it does not have such benefits.
Future work includes investigating other suitable probability distributions in place of the normal distribution; for example, the exponential distribution may be effective because the data are nonnegative. In addition, future work could also involve analyzing the latent classes and clarifying customers’ preferences.