ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.20 No.1 pp.48-60
DOI : https://doi.org/10.7232/iems.2021.20.1.48

# A Study on Customer Purchase Behavior Analysis Based on Hidden Topic Markov Models

Mio Hotoda, Gendo Kumoi*, Masayuki Goto
Graduate School of Creative Science and Engineering, Waseda University, Tokyo, Japan
School of Creative Science and Engineering, Waseda University, Tokyo, Japan
*Corresponding Author, E-mail: moto-aries@ruri.waseda.jp
April 14, 2020 November 25, 2020

## ABSTRACT

With the recent development of the Internet society, purchasing on E-commerce (hereinafter called “EC”) sites has become common for many consumers. On the other hand, the conversion rate (hereinafter called “CVR”) on EC sites is known to be several percent at most, so many EC sites seek effective measures to improve CVR. In general, a user browses several pages on an EC site before deciding to purchase an item, and users’ intentions are considered to be reflected in their page transition tendencies. If a model that analyzes page transition data can extract users’ purchasing intentions, that information can be used to design effective promotion measures. Here, it is often better to assume latent classes behind users’ page transitions in order to understand their purchase intentions, because there are usually not only several user groups with different preferences but also multiple states of purchasing intention. However, previous models assume either a single latent topic for all pages in a session or a separate latent topic for every page. These models cannot handle situations in which users’ intentions change during browsing, but not frequently from page to page. In this study, we propose a purchasing behavior analysis model based on Hidden Topic Markov Models (HTMM). The proposed method divides a user’s browsing sequence into multiple subsequences with the same statistical characteristics according to latent topics estimated from page transitions. The purchase probability of each latent topic is then obtained by combining the estimated topics with the purchase results in the actual browsing history data. With these purchase probabilities, the proposed model makes it possible to estimate users’ purchase intentions in real time, and this information is useful for planning marketing measures. An experiment using real browsing history data is carried out, and the effectiveness of the proposed method is demonstrated.

## 1. INTRODUCTION

In recent years, with the development of the Internet society, purchasing products through E-commerce (hereinafter called “EC”) sites has become common for many consumers, and the market scale of the Internet economy is expanding (Ministry of Economy, Trade and Industry of Japan, 2019). Because all kinds of user actions on the Internet can be stored as access log data, consumers’ browsing histories are now being accumulated in the databases owned by EC sites. Against this background, previous studies (Bassi, 2007; Burt and Sparks, 2003; Wedel and Kamakura, 2012) have pointed out the importance of Web marketing measures that make use of this enormous amount of accumulated data. For example, Bhatnagar and Ghose (2004) showed that the impact of prices on people’s purchasing behavior is lower on EC sites than in actual stores.

On the other hand, it is well known that conversion rates (hereinafter called “CVR”; the probability of purchase) on EC sites are usually several percent at most (Kairos Marketing Inc., 2014). Therefore, many EC sites need measures to improve CVR in practice. For example, CVR can be expected to improve by grasping a user’s willingness to purchase and taking measures at an effective time, or by identifying an important page that is likely to lead to a purchase and guiding the user to that page.

In general, a user browses several pages on an EC site (searching for items, referring to item information and reviews, and so on) before deciding to purchase an item. It is therefore considered that users’ intentions are reflected in their page transition tendencies on an EC site. If a model that analyzes page transition data can estimate users’ purchasing intentions, that information can be used to design effective promotion measures that improve CVR. Here, it is better to assume latent classes behind users’ page transitions in order to understand their purchase intentions, because there are usually not only several user groups with different preferences but also multiple states of purchasing intention. In this study, assuming that each sequence of browsed pages on a site is generated depending on the user’s intention, we propose a method that estimates the unobservable intention from the observable browsed pages by using a latent class model. In previous research, many latent class models, such as Latent Segment Markov Chain models (hereinafter called “LSMC”) (Dias and Vermunt, 2007; Matsuzaki et al., 2017), the Hidden Markov Model (hereinafter called “HMM”) (Kawazu et al., 2016), and Latent Dirichlet allocation (hereinafter called “LDA”) (Jin et al., 2004), have been proposed for analyzing users’ browsing and purchasing behaviors. These models assume either the same latent class for every page in a session (a series of browsed pages from when a user accesses an EC site until the user leaves) or a latent class that is drawn anew for every page viewed in the session. In practice, however, it is quite possible that users’ intentions change during a session, although it is unlikely that they change frequently from page to page. Therefore, this study proposes a purchasing behavior analysis model based on Hidden Topic Markov Models (hereinafter called “HTMM”) (Gruber et al., 2007), which combine the ideas of LDA and HMM, to model users’ page transition behaviors by analyzing the browsing history data stored on the EC site.

HTMM was originally proposed as a document generation model that adds the features of a hidden Markov model to LDA, which does not consider the context of sentences. HTMM assumes multiple latent classes for a series of sentences; that is, the words in the same sentence are assumed to belong to the same latent class. For browsing history data, however, unlike LDA and HMM applied to document data, we should allow a different latent class for each page view while assuming that successive page views are more likely to share the same latent class. Under this assumption, applying HTMM to browsing history data divides the browsing sequence of each session into subsequences with the same characteristics.

In addition, we propose a method to calculate the purchase rate of each latent topic by combining the estimated topics with the purchase results obtained from actual data. Based on this purchase rate and the latent class obtained from the user’s current browsing history data, it becomes possible to estimate the purchase intention of the user in real time. Moreover, by applying HTMM to page transition data, it is also possible to detect the change points of each user’s latent class. This makes it possible to understand what page transitions occur when the purchase intention of the user changes. We conduct experiments using actual browsing history data to demonstrate the effectiveness of the proposed method.

## 2. PREPARATION

In this study, the browsing history data accumulated on the database of an EC site are treated. In this section, we explain the purchasing and browsing actions of users on EC sites and define the characteristics of EC sites targeted in this paper.

In recent years, the purchase of products through EC sites has become popular, and the market scale of the Internet economy is expanding. On an EC site, users can put items of interest in the shopping cart, carry out order operations (input of personal information, selection of payment method, etc.), and then purchase. In doing so, each user usually moves between and browses various types of pages on the EC site. An EC site can therefore acquire data not only on who purchased or did not purchase and when, but also on browsing behaviors such as page transitions. Such data have been accumulated in large volumes in databases, which makes a more detailed analysis of customer purchase behavior possible. By applying the proposed model to the actual data of an EC site, we demonstrate an analysis of customer purchase behavior that takes browsing behaviors on the EC site into account.

### 2.1 Overview of Analysis for EC Sites

In recent years, the purchase of products through EC sites has become popular, and the market scale of the Internet economy is expanding. On a typical EC site, users can put items of interest in their shopping cart, carry out order operations (input of personal information, selection of payment method, etc.), and then purchase. Before purchasing, each user will have browsed various types of pages sequentially on the EC site. An EC site can acquire data on users’ actions, including browsing behaviors such as page transitions and the time spent on each page, in addition to the purchasing history. A large amount of such data has been accumulated in databases, which makes a more detailed analysis of customer purchase behavior possible. Therefore, an analytical model of customer purchase behavior can be constructed that also takes browsing behaviors on the EC site into account.

### 2.2 Target Data for Our Case Study

This research presents a case study using actual browsing and purchasing log data from an EC site. The data were provided by VALUES, Inc. in Japan. In this section, we outline the target data.

There are several types of EC sites. One type, such as Amazon.com and Alibaba.com, deals with many products in various categories. In this study, we focus on a shopping-mall-type EC site that hosts many different shops and target “Rakuten market”, one of the most famous EC sites in Japan. According to Rakuten Inc. (2020), “Rakuten market” also deals with many products in various categories and includes more than 50,000 shops (as of March 2020).

For our analysis, the data concerning only “Rakuten market” were extracted from the web browsing history data provided by VALUES, Inc. The data collection period was the three months from August 1, 2017 to October 31, 2017. The data contain the browsing history of each user. The target data include 766 users, 35,958 sessions, and 190,754 browsed pages, of which 1,298 are pages shown when the purchase procedure is completed (hereinafter called “order completion pages”). The browsed pages comprise 53,423 unique pages. The ratio of purchasing sessions is called the CVR, and it is defined by equation (1).

$\mathrm{CVR} = \frac{\text{number of purchasing sessions}}{\text{total number of sessions}}$
(1)
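As a minimal sketch of how the CVR of equation (1) could be computed, the following Python function counts the share of sessions that contain at least one order completion page. Reading a "purchasing session" as a session that includes an order completion page is an assumption, as are the session representation (lists of page-type labels) and the label `order_complete`; none of these names come from the data set itself.

```python
from typing import Iterable, Sequence

# Hypothetical label for an order completion page (not from the original data).
ORDER_COMPLETE = "order_complete"


def conversion_rate(sessions: Iterable[Sequence[str]]) -> float:
    """Share of sessions that contain at least one order completion page."""
    total = 0
    purchasing = 0
    for pages in sessions:
        total += 1
        if ORDER_COMPLETE in pages:
            purchasing += 1
    if total == 0:
        raise ValueError("at least one session is required")
    return purchasing / total


# Toy sessions for illustration only; they are not taken from the paper's data.
toy_sessions = [
    ["search", "item_detail"],
    ["item_detail", "cart", "order_operation", "order_complete"],
    ["top"],
    ["search"],
]
print(conversion_rate(toy_sessions))  # one purchasing session out of four
```

A session with several order completion pages is still counted once here, which matches a per-session ratio rather than a per-purchase count.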

### 2.3 Pre-Analysis

Before proposing a new statistical model, we show the fundamental analysis of the browsing behavior of users on EC sites. First, we show how many pages a user browses during a data collection period (three months) or within one session. Table 1 shows the minimum, maximum, average, and median of the number of pages viewed during the data collection period and within one session.

In addition, Figures 1 and 2 show the distributions of the number of pages viewed during the data collection period and within one session, respectively. Figure 1 shows that more than half of the users browsed 50 or more pages during the data collection period. On the other hand, Figure 2 shows that in more than 40% of the sessions the user browsed only one page and then left the EC site. In addition, about 80% of the sessions have fewer than 5 pages. However, more than 10% of the sessions include more than 10 pages.
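The summary statistics and session-length shares discussed above could be obtained along the following lines. This is only a sketch: the function name, the dictionary keys, and the toy input are hypothetical, and the thresholds simply mirror the ones quoted in the text (one page, fewer than 5 pages, more than 10 pages).

```python
from statistics import mean, median


def session_length_summary(session_lengths):
    """Summarize the number of pages per session (input: list of positive ints)."""
    n = len(session_lengths)
    if n == 0:
        raise ValueError("at least one session is required")
    return {
        "min": min(session_lengths),
        "max": max(session_lengths),
        "mean": mean(session_lengths),
        "median": median(session_lengths),
        # Shares corresponding to the thresholds mentioned in the text.
        "share_single_page": sum(1 for x in session_lengths if x == 1) / n,
        "share_fewer_than_5": sum(1 for x in session_lengths if x < 5) / n,
        "share_more_than_10": sum(1 for x in session_lengths if x > 10) / n,
    }


# Toy session lengths for illustration only.
summary = session_length_summary([1, 1, 2, 3, 12])
print(summary)
```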

## 3. RELATED WORK

In this section, we explain the related conventional models as preparation for explaining the proposed model. Most of the topic models explained in this section were originally proposed for language processing. Although this paper discusses a model for analyzing customers’ browsing and purchasing behavior on EC sites, the conventional topic models are explained here in the terms used in the language processing field in order to respect their original meaning.

### 3.1 Hidden Markov Model

The Hidden Markov Model (HMM) is a Markov process model that represents a sequence of symbols by combining an unobservable Markov process over states with symbols generated depending on each state. In an HMM, the state sequence is assumed to follow a first-order Markov model on a finite set of states, and state transitions follow a certain transition probability. Further, since the symbol output from each state is generated according to a probability distribution that depends only on that state, the symbols are conditionally independent of each other given the states. Only the output symbol sequence is observed, and the model states cannot be observed directly. In other words, a symbol sequence generated by a hidden Markov model gives some information about the internal state sequence. The term “hidden” in the model name indicates that the state sequence through which the model has transitioned is not directly observed from the outside. HMM can be applied to various kinds of problems, such as speech, handwriting, and gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.

Figure 3 shows the graphical model of HMM. The random variable $x_t$ is the latent variable at time t, and the random variable $y_t$ is the observed value (symbol) at time t. The state $x_t$ depends only on the previous state $x_{t-1}$ at time t-1 and is not affected by the states at time t-2 or earlier. This is called the first-order Markov property. It can also be seen that the observed value $y_t$ depends only on the state $x_t$ at the same time and is not affected by other observed values or by the states at other times.
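The two dependency assumptions of the HMM (the next state depends only on the current state, and each symbol depends only on the state at the same time) can be illustrated by a small generative sketch. The function name, the integer coding of states and symbols, and the toy probability tables below are assumptions made for illustration only.

```python
import random


def sample_hmm(initial, transition, emission, length, seed=0):
    """Draw (states, symbols) of the given length from a discrete HMM.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][k]   : probability that state i emits symbol k
    """
    rng = random.Random(seed)
    states, symbols = [], []
    state = rng.choices(range(len(initial)), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # The symbol depends only on the current state.
        symbol = rng.choices(range(len(emission[state])), weights=emission[state])[0]
        symbols.append(symbol)
        # The next state depends only on the current state (first-order Markov).
        state = rng.choices(range(len(transition[state])), weights=transition[state])[0]
    return states, symbols


# Toy parameters (assumed): two "sticky" states with distinct emission profiles.
states, symbols = sample_hmm(
    initial=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    emission=[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]],
    length=10,
)
print(states)
print(symbols)
```

In an application, only `symbols` would be observed, and `states` would have to be inferred.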

### 3.2 Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is one of the typical multi-topic models, proposed by Blei et al. (2003) for the analysis of natural language data. LDA is a generative model that assumes a document is generated from a mixture of topics, where each topic is a distribution over words with certain characteristics. LDA assumes that one document contains multiple topics, each occurring with a certain probability, and that each word is drawn from the word distribution of a topic selected according to the document’s probability distribution over topics. Here, LDA assumes Dirichlet priors for the multinomial topic distribution of each document and for the multinomial word distribution of each topic. The general idea of LDA is based on the hypothesis that a document writer has specific topics in mind. Writing on a particular topic means choosing words with certain probabilities from a word pool. When the degree of association between a word and a topic is high, the probability that the word is selected is high, and when the degree of association is low, the probability is low. The entire document can thus be represented as a mixture of different topics.

Figure 4 shows the graphical model of LDA. In LDA, each document d has a topic distribution θ over the defined topic set. For each word in the document, a topic z is first selected according to the topic distribution θ, and the word w is then generated according to the word distribution λ corresponding to the topic z. K represents the number of topics, D represents the number of documents, and N represents the number of words in the document d. A topic distribution θ is generated for each document, and a word distribution λ is generated for each topic. Also, α and β are hyperparameters: α is the parameter of the Dirichlet distribution followed by the topic distribution θ, and β is the parameter of the Dirichlet distribution followed by the word distribution λ. Among these variables, the only one that is actually observed is the word w appearing in the document. Therefore, in practice, the other latent variables are estimated from the observed variable w.
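The generative process described above can be sketched as follows. This is an illustrative simulation under assumed toy sizes, not an estimation procedure; the function names and the symmetric Dirichlet parameters are hypothetical choices, and Dirichlet samples are obtained by normalizing gamma draws from the standard library.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def generate_lda_document(n_words, alpha, word_dists, rng):
    """Generate one document: theta ~ Dir(alpha); per word, z ~ theta and w ~ lambda_z."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(range(len(word_dists[z])), weights=word_dists[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
K, V = 3, 5  # toy sizes (assumed): 3 topics and a vocabulary of 5 words
# One word distribution lambda_z per topic, each drawn from Dir(beta) with beta = 0.5.
word_dists = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]
theta, topics, words = generate_lda_document(8, [1.0] * K, word_dists, rng)
print(topics)
print(words)
```

Because every word draws its topic independently from θ, consecutive words here carry no tendency to share a topic.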

### 3.3 Hidden Topic Markov Model

HTMM is a model that adds the first-order Markov property to the topics of the sentences in a document in LDA. To be closer to the way people actually produce natural language, this model assumes that consecutive sentences are highly likely to have the same topic. Unlike LDA, HTMM, in which topics cannot change frequently within a document, is no longer invariant to changes in the order of words in a document. In other words, under HTMM, a document in which consecutive sentences are likely to share the same topic is more probable than a random permutation of the same words. This enables us to perform inferences that are not possible with a bag-of-words model. For example, HTMM does not necessarily assign the same topic to the same word within a document, which is useful for disambiguating meaning in applications such as automatic translation. In addition, the topics of consecutive words are likely to differ in LDA, whereas in HTMM consecutive words tend to be assigned to the same topic, which helps to divide the document into subsections. It has also been shown that HTMM is superior to LDA in predictive performance (Gruber et al., 2007).

Figure 5 shows the graphical model of HTMM. K is the number of topics, D is the number of documents, and Nd is the number of words in document d. As in LDA, each document d in HTMM has a topic distribution θd. First, the topic zd,n is selected according to the topic distribution θd. Then, each word w in the document d is generated according to the word distribution λ corresponding to the topic zd,n. The topic distribution θd is generated for each document, and the word distribution λ is generated for each topic. In LDA, the word topics z generated from the topic distribution θd are independent of each other. HTMM, however, assumes the first-order Markov property for the word topics z. In other words, the (unobservable) state x in the HMM corresponds to the latent topic z, and the (observable) output y corresponds to the word w. Thus, as shown in Figure 5, each topic zd,n in HTMM forms a Markov chain with a transition probability that depends on the topic distribution θd and the topic transition variable Ψd,n. The topic transition variable Ψd,n is expressed by equation (2) and indicates whether each word takes over the topic of the previous word. The topic transition variable Ψd,n for the first word of a sentence is drawn from a binomial distribution with the hyperparameter ϵ, and the topic transition variable Ψd,n of every other word in the sentence is 0.

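$\Psi_{d,n} = \begin{cases} 1 & \text{(the topic } z_{d,n} \text{ is newly drawn from } \theta_d \text{)} \\ 0 & \text{(the previous topic is inherited: } z_{d,n} = z_{d,n-1} \text{)} \end{cases}$
(2)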

When a topic transition is forced between every pair of preceding and following words, in other words, when Ψn=1 is set for all n, this model is equivalent to LDA. On the other hand, if topic transitions are not allowed, in other words, if Ψn=0 is set for all n, this model becomes a mixture of unigrams, which assumes that all words in the document have the same topic.
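The role of the topic transition variable can be made concrete with a small simulation of the topic chain of one document. The sketch below is illustrative only: the function name and toy values are hypothetical, and it assumes that the very first word of a document always draws its topic from θd. With ϵ = 0 after the first word, the whole document shares one topic (the mixture-of-unigrams case), and with ϵ = 1 and one-word sentences, every word draws its own topic as in LDA.

```python
import random


def sample_htmm_topics(theta, epsilon, sentence_lengths, rng):
    """Sample (topics, psis) for the words of one document.

    theta            : topic distribution of the document (sums to 1)
    epsilon          : probability that a sentence-initial word draws a new topic
    sentence_lengths : number of words in each sentence, in order
    """
    topics, psis = [], []
    for length in sentence_lengths:
        for position in range(length):
            if not topics:
                psi = 1  # assumption: the first word of the document draws from theta
            elif position == 0:
                psi = 1 if rng.random() < epsilon else 0
            else:
                psi = 0  # words inside a sentence inherit the previous topic
            if psi == 1:
                topic = rng.choices(range(len(theta)), weights=theta)[0]
            else:
                topic = topics[-1]
            topics.append(topic)
            psis.append(psi)
    return topics, psis


rng = random.Random(0)
theta = [0.5, 0.3, 0.2]  # toy topic distribution (assumed)
# epsilon = 0: no transitions, so the whole document shares one topic.
single_topic, _ = sample_htmm_topics(theta, 0.0, [3, 2, 4], rng)
# epsilon = 1 with one-word sentences: every word draws its own topic, as in LDA.
lda_like, lda_psis = sample_htmm_topics(theta, 1.0, [1] * 6, rng)
print(single_topic)
print(lda_like)
```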

As with LDA, the only variable actually observed in HTMM is the word w appearing in the document. Therefore, the other latent variables and parameters must be estimated only from the observed sentences consisting of words. Gruber et al. (2007) applied the Expectation-Maximization (EM) algorithm, which is commonly used for HMM, as the estimation method. The EM algorithm consists of an E step and an M step. In the E step, the expectation of the likelihood is calculated using the probability distribution of the latent variables under the current parameter estimates. In the M step, the parameters are updated so as to maximize the likelihood obtained in the E step. Unobservable variables can be estimated by repeating the E step and the M step until the likelihood converges. The application of the EM algorithm to HTMM is described below. In the case of HTMM, the parameters to be estimated are θd, λ and ϵ. After θd and ϵ are obtained, the transition matrix is updated accordingly. The update formulas of each step of the EM algorithm in HTMM are given by equations (3)-(8).

E-step :

$\Pr(z_n, \Psi_n \mid d, w_1 \ldots w_{N_d}; \theta_d, \lambda, \epsilon)$
(3)

Equation (3) is calculated by applying the Forward-backward Algorithm with the values of θd, λ and ϵ output from the M-step.

M-step :

$\theta_{d,z} \propto E(C_{d,z}) + \alpha - 1$
(4)

$\lambda_{z,w} \propto E(C_{z,w}) + \beta - 1$
(5)

(6)

θd,z is normalized so that θd is a probability distribution, and $N_d^s$ is the number of sentences in document d. Cd,z is the number of times that the topic z is selected according to θd in the document d, and Cz,w is the number of times that the word w is selected from the topic z according to λz,w.
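The proportionality in equations (4) and (5) becomes a concrete update once the right-hand side is normalized. The following sketch shows this step for a single document or a single topic; the function name and the toy expected counts are hypothetical, and the value 1.001 is used only as an example of a Dirichlet hyperparameter slightly above 1.

```python
def m_step_normalize(expected_counts, prior):
    """Turn expected counts into a distribution proportional to E(C) + prior - 1."""
    weights = [count + prior - 1.0 for count in expected_counts]
    if any(weight < 0.0 for weight in weights):
        raise ValueError("E(C) + prior - 1 must be non-negative for every entry")
    total = sum(weights)
    if total == 0.0:
        raise ValueError("at least one weight must be positive")
    return [weight / total for weight in weights]


# Toy expected counts E(C_{d,z}) for one document with three topics (assumed values).
theta_d = m_step_normalize([3.0, 1.0, 0.0], prior=1.001)
print(theta_d)
```

With a hyperparameter only slightly above 1, the term α − 1 acts as a very small amount of smoothing: a topic with zero expected count still receives a small positive probability.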

(7)

(8)
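The bodies of equations (6)-(8) are not available in this text. For orientation only, the following LaTeX block gives forms that are consistent with the definitions of ϵ, $N_d^s$, Cd,z and Cz,w above and that follow our reading of the HTMM formulation of Gruber et al. (2007). The assignment of these expressions to the equation numbers, the index $w_{d,n}$ for the n-th word of document d, and the exact index ranges are assumptions and may differ from the original equations.

```latex
% Assumed reconstruction; not recovered from the source text.
% Update of the transition hyperparameter (presumably equation (6)):
\epsilon = \frac{\sum_{d=1}^{D} \sum_{n=2}^{N_d} \Pr(\Psi_{d,n} = 1 \mid d, w_1 \ldots w_{N_d})}
                {\sum_{d=1}^{D} (N_d^{s} - 1)}

% Expected number of times topic z is drawn from \theta_d (presumably equation (7)):
E(C_{d,z}) = \sum_{n=1}^{N_d} \Pr(z_{d,n} = z, \Psi_{d,n} = 1 \mid d, w_1 \ldots w_{N_d})

% Expected number of times word w is generated from topic z (presumably equation (8)):
E(C_{z,w}) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \Pr(z_{d,n} = z, w_{d,n} = w \mid d, w_1 \ldots w_{N_d})
```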

## 4. PROPOSED MODEL

In this section, we describe the proposed method for analyzing browsing history data, which introduces the idea of HTMM. We aim to extract users’ intentions from the browsing history data of EC sites in order to support appropriate CVR improvement measures. The purpose of this study is to propose a latent class model that infers unobservable user intentions from observable browsed pages.

### 4.1 Overview

In previous research (Matsuzaki et al., 2017), when a latent class model was applied to browsing history data, one latent topic was assumed for one session or one user. This means that users do not change the state of their preferences or emotions during a session. However, users’ preferences and emotions may well change during browsing. For example, while looking at items on an EC site, a user may recall a necessary item that he/she should purchase and then start looking for that item. It is also possible that something said by a friend, parent, brother, or sister triggers a change in the user’s purchasing intention and real-time preferences.

In order to express such a change in purchasing intention as a change in latent topics, it is necessary to assume multiple latent topics for a session. Therefore, we consider a latent topic for each browsed page in a session and assume a first-order Markov property for the latent topic sequence. In this case, every time the browsed page changes, the current latent topic zt is determined according to the previous latent topic zt-1 and the transition probability. That is, regardless of whether the latent topic actually changes, the current latent topic is drawn from the transition probability every time the browsed page changes. However, it is unlikely that users’ intentions change at every page view during a session. It is therefore necessary to allow for the case in which the latent topic is carried over unchanged, independently of the transition probability.

In this study, we introduce the idea of HTMM, which is a model with restrictions on the transition of latent topics. Conventionally, HTMM is a model applied to document data written in natural language. However, document data and browsing history data are considered to have similar characteristics, so it is appropriate to apply HTMM to the browsing history data of EC sites. For example, each word in a document is considered to reflect the purpose of the writer and the topic of the document. Similarly, each page in the browsing history is considered to reflect the user’s purchasing intention. HTMM assumes that successive sentences are likely to have the same latent topic by introducing a parameter that indicates whether the topic (here, the user’s intention) changes. Therefore, HTMM is considered suitable for analyzing browsing history data in consideration of the context.

On the other hand, HTMM assumes that all words in the same sentence belong to the same latent topic. In other words, HTMM has the constraint that Ψd,n=0 except for the first word of a sentence. Compared with document data, browsing history data has “a user” corresponding to “a document” and “a page” corresponding to “a word”, but has no unit corresponding to “a sentence”. Moreover, an EC site wants to know the purchase probability of each user in real time, so the analysis unit for browsing sequences should be as small as possible. Therefore, we consider a model that estimates a latent topic for each page. For this purpose, of the two constraints of HTMM, the proposed method removes the constraint that “words in the same sentence belong to the same topic (Ψd,n=0 except for the first word of the sentence)” and keeps the constraint that the latent topic is likely to be the same as the previous latent topic zt-1, so that the latent topic does not change at every page view.

In addition, in order to associate the latent topics with users’ purchasing intentions, we extract the purchase status (whether a purchase has occurred) from the browsing history data, and define the ease of purchasing in each latent topic as the “purchase rate”.

The proposed method is expected to identify whether a user is in a state in which a purchase is likely. In the proposed method, estimating the latent topics of the browsing history data with HTMM is defined as step 1, and calculating the purchase rate for each estimated latent topic is defined as step 2.

### 4.2 Proposed model

#### 4.2.1 Step 1 - Introducing HTMM for Browsing History Data

In step 1, we construct a model that estimates latent topics by using HTMM for browsing history data. Specifically, the proposed method removes the constraint that “words in the same sentence belong to the same topic (Ψd,n=0 except for the first word of the sentence)” of the two constraints of the HTMM, and represents a real situation in which the latent topic does not transition every page under the constraint that it is likely to be the same latent topic as the previous latent topic zt-1.

In the HTMM, there is a restriction that all words in the same sentence belong to the same topic. For the word at the beginning of a sentence, Ψd,n is determined to be 0 or 1 by a binomial distribution according to the hyperparameter ϵ. On the other hand, Ψd,n=0 for the other words in the sentence. Then, depending on Ψd,n, it is determined whether to generate the latent topic from the topic distribution θd or to inherit the previous latent topic zt-1 according to equation (2). In contrast, the proposed method determines Ψd,n for all words by the binomial distribution with the probability parameter ϵ.

By applying the proposed method to browsing history data, we can estimate latent topics for each page regardless of the session. Therefore, there is a possibility that the latent topic will change if the page changes even in the same session.
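To illustrate how a page-level topic chain of this kind could be used, the sketch below builds the transition matrix implied by drawing Ψ for every page with probability ϵ, namely P(zt = j | zt-1 = i) = ϵ θj + (1 − ϵ) δij, and runs a forward (filtering) pass that updates the belief over the current topic after each browsed page. This is not the EM estimation used in the paper: the parameters are treated as already known, the function names are hypothetical, pages are coded as integer page types, and the toy values are assumptions.

```python
def transition_matrix(theta, epsilon):
    """P(z_t = j | z_{t-1} = i) = epsilon * theta[j] + (1 - epsilon) * [i == j]."""
    k = len(theta)
    return [
        [epsilon * theta[j] + (1.0 - epsilon) * (1.0 if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]


def filter_topics(pages, theta, page_dists, epsilon):
    """Forward filtering: posterior over the current topic after each page.

    pages      : sequence of integer page-type ids
    page_dists : page_dists[z][page] is the probability of a page type under topic z
    """
    k = len(theta)
    trans = transition_matrix(theta, epsilon)
    posteriors = []
    belief = None
    for page in pages:
        if belief is None:
            prior = list(theta)  # the first page draws its topic from theta
        else:
            prior = [sum(belief[i] * trans[i][j] for i in range(k)) for j in range(k)]
        unnormalized = [prior[j] * page_dists[j][page] for j in range(k)]
        total = sum(unnormalized)
        if total == 0.0:
            raise ValueError("page has zero probability under every topic")
        belief = [u / total for u in unnormalized]
        posteriors.append(belief)
    return posteriors


# Toy setting (assumed): two topics, two page types, and a "sticky" chain.
posteriors = filter_topics(
    pages=[0, 0, 1, 1, 1],
    theta=[0.5, 0.5],
    page_dists=[[0.9, 0.1], [0.1, 0.9]],
    epsilon=0.2,
)
for step, belief in enumerate(posteriors, start=1):
    print(step, [round(p, 3) for p in belief])
```

In this toy run the belief stays on the first topic for the first two pages and moves to the second topic only after several pages of the other type, which is the kind of infrequent change the model is meant to capture.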

#### 4.2.2 Step 2 - Calculation of purchase rate

In step 2, we extract the purchase status from the browsing history data and define the ease of purchasing in each latent topic as the “purchase rate”. Because we want to know the user’s real-time purchase intention, this study does not analyze each session as a unit. Therefore, the purchase rate is used instead of the CVR defined in equation (1).

After the latent topics of all pages are estimated in step 1, the purchase rate of each latent topic is calculated using the order completion pages in the users’ browsing data. Equation (9) shows how the purchase rate is calculated. Sz is the total number of pages for which the latent topic was estimated to be z, and Cz is the number of pages among Sz that were followed by a purchase within 10 pages after browsing.

$\text{Purchase rate}_z = \frac{C_z}{S_z}$
(9)

Thus, it is possible to find latent topics in which a purchase is likely to follow the browsed page. In addition, by counting “pages for which a purchase occurred within 10 pages after browsing” instead of “pages on which the purchase occurred”, it is possible to predict from the estimated latent topic, even during browsing, whether the user is likely to purchase within the next few pages.
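The purchase rate Cz / Sz can be computed with a simple sliding look-ahead over a page sequence. The sketch below is one possible reading: it assumes that "within 10 pages after browsing" means the next `window` pages, excluding the page itself, and that each page carries an estimated topic and a flag marking order completion pages. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def purchase_rates(topics, is_order_completion, window=10):
    """Return {topic z: C_z / S_z} for one sequence of browsed pages.

    topics              : estimated latent topic of each page, in browsing order
    is_order_completion : True where the page is an order completion page
    window              : number of following pages in which a purchase is looked for
    """
    if len(topics) != len(is_order_completion):
        raise ValueError("topics and flags must have the same length")
    pages_per_topic = Counter()      # S_z
    purchases_per_topic = Counter()  # C_z
    for t, z in enumerate(topics):
        pages_per_topic[z] += 1
        # Assumption: look at the next `window` pages, not at the page itself.
        if any(is_order_completion[t + 1 : t + 1 + window]):
            purchases_per_topic[z] += 1
    return {z: purchases_per_topic[z] / pages_per_topic[z] for z in pages_per_topic}


# Toy sequence for illustration only, with a short window of 2 pages.
toy_topics = [0, 0, 1, 1, 1, 0]
toy_flags = [False, False, False, False, True, False]
rates = purchase_rates(toy_topics, toy_flags, window=2)
print(rates)
```

In this toy example, two of the three pages in topic 1 are followed by an order completion page within two pages, while no page in topic 0 is.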

## 5. ACTUAL DATA ANALYSIS

In this section, we show the effectiveness of the proposed method. For this purpose, we apply the proposed model to real browsing history data to see if there is a difference in purchasing rates between latent topics. In addition, we analyze from multiple perspectives using the pages included in each latent topic and the transition probability between the latent topics.

### 5.1 Details of Data

The analysis data is browsing history data for 766 persons in the three months from August 1, 2017 to October 31, 2017 on the EC site “Rakuten Ichiba”.

In this study, we analyze each page that is the smallest unit in the browsing history data, so that the EC site can take measures in real time, such as a real-time coupon (Matsuzaki et al., 2017). Therefore, users and sessions are not considered in the analysis.

The purchase rate for each latent topic is calculated by using equation (9). The hyperparameters are set to α = 1.001 and β = 1.0001, which are commonly used in HTMM. The number of topics is set to K = 8, which captures the features of each latent topic most clearly.

### 5.2 Result and Consideration

#### 5.2.1 Purchase Rate – Latent Topics

Figure 6 shows the purchase rate for each latent topic. Note that the latent topics are numbered in order of decreasing purchase rate. The broken line (4.98%) in the figure indicates the average purchase rate of the entire browsing history data.

From Figure 6, the purchase rate varies greatly depending on the latent topic. In particular, latent topic 1 has a significantly higher purchase rate than the other latent topics. This result suggests that identifying the latent topic with the proposed model makes it possible to grasp whether the user is likely to purchase within a few pages. For example, a user estimated to be in topic 1 is more likely to purchase within 10 pages than users in other latent topics. This shows that the proposed model is also effective for analyzing browsing history data. However, since the purchase rate of latent topic 1 is at most 20%, even a user estimated to be in latent topic 1 does not have a high purchase probability. Therefore, users currently estimated to be in latent topic 1 are considered suitable targets for CVR improvement measures. On the other hand, users estimated to be in latent topic 8 are less likely to purchase than users in other latent topics, and targeting these users with such measures is considered inefficient.

#### 5.2.2 Purchase Rate – Page Type

Next, we consider the relationship between the user’s intention and the contents of the browsed pages. To this end, we extract the page types that tend to appear or not to appear in each latent topic.

Table 2 shows, for some latent topics, the top 10 and bottom 10 page types ranked by how much higher their share within the latent topic is than their share in the overall data. Note that latent topics with a higher purchase rate have smaller numbers. The table shows latent topics 1, 2, and 8 as examples.

From the results, it can be seen that the percentage of pages that must be passed through when purchasing an item, such as the Cart page (where the user places items to be purchased) and the Order operation page (for entering the payment method and shipping address), has a positive correlation with the purchase rate. This is thought to be because users are likely to purchase items once they have put them in the cart or started ordering.

Pages related to specific item information (such as the Item detail page, Item image page, and Review page) rank high in latent topics with a high purchase rate, but they also rank high in latent topics with a low purchase rate. This may be because users’ purchase intentions differ even when they browse the same kind of page. For example, users belonging to a latent topic with a high purchase rate have a purchase intention, while users belonging to a latent topic with a low purchase rate have none and are simply window shopping.

Table 2 (i) shows latent topic 1, which has the highest purchase rate. The share of pages that users must pass through when purchasing and of pages related to item information is high. The share of Price comparison pages is also high. This indicates that users are investigating an item in detail or looking for a reasonably priced shop from which to purchase a desired item. At this stage, users often purchase within 10 pages. On the other hand, the bottom 10 include pages that are not directly related to purchasing (Member page, Survey page, etc.). The share of Search pages, which would seem to be related to purchasing, is also low. The Search page is therefore likely viewed before users decide to purchase or while they are unsure which item to purchase. In other words, users tend not to purchase immediately after browsing the Search page.

Table 2 (ii) shows latent topic 2, which has the second highest purchase rate. Its distinguishing feature is a high share of E-mail magazine pages. This suggests that users who subscribe to the E-mail magazine are loyal users who use Rakuten Ichiba daily and tend to purchase more readily than other users. However, because they use the site daily, a single visit is less likely to be directly linked to a purchase.

Table 2 (iii) shows latent topic 8, which has the lowest purchase rate. As with latent topic 1, users are searching for items or browsing the Shop page or Item detail page. This suggests that merely searching for items, even for a long time, does not immediately lead to a purchase; purchasing requires a deep interest in an item and a price consideration stage. In addition, the share of pages related to point cards and FAQs, which do not appear in other latent topics, is relatively high. Since these pages are not directly related to purchasing, users may be using the EC site for purposes other than shopping.

#### 5.2.3 Purchase Rate – Selection and Inheritance

Next, we consider the purchase rate of each latent topic in two cases: when the current latent topic is carried over from the previous latent topic (hereinafter called “inheritance”) and when it is newly selected according to the transition probability (hereinafter called “selection”). Table 3 shows the purchase rate of each latent topic in the case of “selection” and “inheritance”.

Table 3 shows that the latent topics with high purchase rates are the same for both “selection” and “inheritance”. Some topics have higher purchase rates under “inheritance” than under “selection”, and others the reverse; in other words, the timing at which measures should be taken differs depending on the latent topic. In many cases, however, including the latent topics with a high purchase rate, “selection” has a higher purchase rate than “inheritance”. Therefore, once a user is estimated to belong to a latent topic with a high purchase rate, measures should be taken as soon as possible.
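A breakdown like Table 3 could be computed along the following lines. This sketch assumes each page view carries an estimated indicator `psi` (1 for “selection”, 0 for “inheritance”) in addition to its topic, and that a purchase counts if an order completion page appears within the next `horizon` pages; both the data layout and the window definition are assumptions rather than the paper's stated procedure.

```python
from collections import defaultdict


def purchase_rate_by_mode(sessions, horizon=10):
    """Purchase rate per (topic, mode), with mode in {"selection", "inheritance"}.

    sessions: list of sessions; each session is a list of
              (topic, psi, is_purchase) tuples in viewing order (assumed layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pages in sessions:
        for i, (topic, psi, _) in enumerate(pages):
            # psi == 1: topic newly selected; psi == 0: topic inherited.
            mode = "selection" if psi == 1 else "inheritance"
            key = (topic, mode)
            totals[key] += 1
            window = pages[i + 1 : i + 1 + horizon]
            if any(purchased for _, _, purchased in window):
                hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

Comparing the two rates for the same topic shows whether a promotion is better timed at the moment of entry into the topic or later, while the topic persists.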

#### 5.2.4 Transitions between Latent Topics – Transition Probability

Next, we focus on the transitions of latent topics within a session. The average number of pages viewed during the data collection period was about 249, so transitions between latent topics were observed frequently. Table 4 shows the topic transition ratio when the latent topic changes (in the case of “selection”). Transition ratios that are at least 10% larger than the content rate of the corresponding latent topic are shown in bold.
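One way to build a table of this kind is sketched below. Several details are assumptions: a transition is counted when the page is a “selection” (`psi == 1`) and the topic differs from that of the previous page, the content rate is taken as each topic's share of all page views, and “at least 10% larger” is read as a difference of 0.10 against the destination topic's content rate. A relative reading of the 10% threshold would need a different comparison.

```python
from collections import Counter


def transition_ratios(sessions):
    """Return (ratios, content_rate) for latent topic changes.

    sessions: list of sessions; each session is a list of (topic, psi)
              tuples in viewing order (assumed layout).
    ratios[(i, j)]: share of changes out of topic i that go to topic j.
    content_rate[k]: share of all page views assigned to topic k.
    """
    pair_counts = Counter()
    topic_counts = Counter()
    for pages in sessions:
        topic_counts.update(topic for topic, _ in pages)
        for (prev, _), (cur, psi) in zip(pages, pages[1:]):
            if psi == 1 and cur != prev:
                pair_counts[(prev, cur)] += 1
    total_views = sum(topic_counts.values())
    content_rate = {t: c / total_views for t, c in topic_counts.items()}
    out_totals = Counter()
    for (prev, _), count in pair_counts.items():
        out_totals[prev] += count
    ratios = {
        (prev, cur): count / out_totals[prev]
        for (prev, cur), count in pair_counts.items()
    }
    return ratios, content_rate


def notable_transitions(ratios, content_rate, margin=0.10):
    """Pairs whose ratio exceeds the destination's content rate by `margin`."""
    return {
        pair for pair, ratio in ratios.items()
        if ratio - content_rate[pair[1]] >= margin
    }
```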

First, we focus on latent topics 7 and 8, which have low purchase rates. Latent topic 7 has a high probability of transitioning to latent topic 1, which has a high purchase rate. On the other hand, latent topic 8 mostly transitions to topics with a medium purchase rate. In addition, the transition ratio to the next latent topic is relatively low overall. This suggests that the willingness to purchase does not increase gradually but rather rises at a certain moment. By focusing on the transition probability between latent topics, it is possible to estimate not only the user’s current willingness to purchase but also future changes in the user’s intention.

We also examined the kinds of pages on which latent topic transitions frequently occur. The following characteristics were observed.

• Top page of the EC site itself or of each shop
  - The user seems to have reset his/her thoughts.

• Item detail page
  - The user is thought to have shifted from searching for products to narrowing down products.

• Search page
  - Users who had been using the EC site for purposes other than shopping (the point function, etc.) seem to have started shopping.

These pages appear frequently in the browsing history data. With the proposed method, however, they can be interpreted as also reflecting changes in users’ thinking and state. Although we could not examine each browsing sequence in detail, users’ latent topic transition tendencies could be understood by examining the details of these pages, the preceding and following pages, and their relationship with the latent topics.

## 6. DISCUSSION

This section discusses the study as a whole.

### 6.1 Proposed Model

The purpose of this research is to infer users’ unobservable thinking state from the observable browsed pages, as part of improving CVR on EC sites. To this end, we tried to extract the user’s intention from the browsing history data of an EC site and to capture how it changes over time. We therefore used a latent class model that assumes multiple latent topics behind the user’s browsing sequence. In our previous study (Hotoda et al., 2019), we analyzed the browsed pages themselves without assuming latent topics. However, the influence of the Cart page and Order operation page was so strong that we could not find characteristic page transitions leading to a purchase. This is because, among the user’s intentions, the purchase intention is the one most likely to appear in page transitions, whereas other intentions are difficult to extract from the page transitions themselves. Therefore, in this research, we decided to express the user’s intention not by the page itself but by the latent topic.

Previous research either assumes the same latent topic for all pages in a session or estimates a latent topic every time a page transition occurs. When the same latent topic is assumed for all pages in a session, changes in the thinking state cannot be captured. Conversely, since the user’s intention is not considered to change frequently, estimating a latent topic at every page transition is not appropriate either.

Therefore, we estimated whether or not the latent topic transitions before estimating the transition probability between latent topics. We expected that this would make the same latent topic persist over consecutive pages and allow changes in the user’s intention to be extracted more accurately. With this method, the browsing page sequence can be divided into subsequences by latent topic, so changes in the user’s intention can be grasped broadly. In addition, by extracting the order completion pages from the actual browsing history data used for learning, latent topics can be associated with purchases.
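The division of a browsing sequence into subsequences can be illustrated with a short sketch. It assumes a topic label has already been estimated for every page; the function name and input format are hypothetical. Note that this version merges adjacent runs when a newly selected topic happens to equal the previous one, which the model itself would treat as a separate “selection” event.

```python
from itertools import groupby


def split_by_topic(pages, topics):
    """Split a page sequence into maximal runs that share one latent topic.

    pages:  list of page types in viewing order.
    topics: list of estimated latent topics, one per page.
    Returns a list of (topic, [page, ...]) segments in viewing order.
    """
    if len(pages) != len(topics):
        raise ValueError("pages and topics must have the same length")
    segments = []
    # groupby groups consecutive items only, which is what a run requires.
    for topic, run in groupby(zip(topics, pages), key=lambda pair: pair[0]):
        segments.append((topic, [page for _, page in run]))
    return segments
```

For instance, pages `["Top", "Search", "Item detail", "Cart", "Order"]` with topics `[8, 8, 1, 1, 1]` yield one segment for topic 8 followed by one segment for topic 1.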

#### 6.1.1 Relevance of Results

We consider whether the analysis results of this study are appropriate and reasonable as descriptions of user behavior.

According to the analysis results, latent topic 1 had a significantly higher purchase rate than the other latent topics. Users in latent topic 1 also browsed many of the page types that are always browsed when making a purchase. The proposed model can therefore be considered to have separated the latent states of users who browse these pages from those of users who do not. We also calculated the purchase rate, interpreting the latent state of users who viewed these pages as a willingness to purchase, and found it to be relatively high. It can therefore be said that users who browse the Cart page and Order operation page have a strong desire to purchase. This result is reasonable, because users do not usually browse the Cart page and Order operation page without a willingness to purchase. In addition, in latent topic 8, which has the lowest purchase rate, these pages are viewed less than in other latent topics. This indicates that the Cart page and Order operation page are strongly related to purchasing.

Next, we consider the Item detail page. It appears in both latent topics with high purchase rates and latent topics with low purchase rates, so the user’s willingness to purchase cannot be determined from the Item detail page alone. Indeed, users often browse the Item detail page not only when deciding on an item to purchase but also when they have no intention of purchasing and simply enjoy looking at items. This result is therefore considered valid. However, it seems possible to determine whether a user is browsing the Item detail page with a willingness to purchase by analyzing it together with the other browsed pages.

#### 6.1.2 Effectiveness of the Proposed Model

In this study, we analyzed actual browsing history data and examined the important suggestions obtained from the results from the following five viewpoints.

1. Identify latent topics with high or low purchase rates

   - The results show a large difference in purchase rate across latent topics. This suggests that knowing the latent topic through the proposed method makes it possible to predict whether a user is likely to purchase within the next few pages. This is considered effective for determining target users in real time when the EC site takes measures. Although it was not possible in this study, the analysis would be more effective if combined with user information such as age and gender.

2. Relationship between latent topics and page types

   - We found page types characteristic of each latent topic. In particular, pages related to items were characteristic of both latent topics with a high purchase rate and latent topics with a low purchase rate. One would normally expect users in the latent topic with the lowest purchase rate to be uninterested in items; instead, they were interested in items but did not intend to purchase. The proposed model is considered effective because it enables such unexpected discoveries.

3. Timing to take measures

   - The results show that “selection” has a higher purchase rate than “inheritance” in many cases, including latent topics 1 and 2, which have a high purchase rate. It is therefore better to take measures as soon as the system finds that a user belongs to a latent topic with a high purchase rate. The likely reason is that the willingness to purchase does not last long and decreases while the user is selecting an item. The proposed model is thus effective in identifying the appropriate timing for implementing measures, which benefits real-time promotion systems such as real-time coupons.

4. Characteristics of transition between latent topics

   - Based on the transition probabilities between latent topics obtained by the analysis, not only the current latent topic but also the subsequent latent topics can be considered. In other words, by focusing on the transition probability between latent topics, it is possible to estimate not only the user’s current willingness to purchase but also future changes in the user’s intention. Thus, the proposed method is effective not only for real-time analysis but also for analysis of future behavior.

5. Relationship between latent topic transition and page type

   - The results show that after a latent topic transition occurs, specific pages, such as the Home page, are browsed many times. With the proposed model, such a page can be interpreted as reflecting a change in the user’s intention even if it does not seem to affect the analysis.

In this way, the proposed model seems to extract the important characteristics of users’ browsing tendencies. Although it was not possible in this research, users’ page transition tendencies could be understood better by analyzing the details of these pages, the preceding and following pages, and their relationship with the latent topics.
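The fourth viewpoint, looking ahead through the transition probabilities, can be made concrete with a small sketch. It is one possible way to combine the two quantities and is not a calculation taken from the paper: for each current topic, it averages the purchase rates of the possible next topics, weighted by the transition ratios. The function name and the dictionary layout are assumptions.

```python
def expected_rate_after_change(transition, purchase_rate):
    """Expected purchase rate after the next latent topic change.

    transition:    dict mapping topic i to a dict {topic j: ratio}, where the
                   ratios out of i sum to 1 (assumed layout).
    purchase_rate: dict mapping each topic to its purchase rate.
    """
    return {
        current: sum(
            ratio * purchase_rate[nxt] for nxt, ratio in row.items()
        )
        for current, row in transition.items()
    }
```

A topic whose own purchase rate is low but whose expected rate after a change is high would be a candidate for measures that anticipate the change rather than react to it.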

### 6.2 Example of CVR Improvement Measures

With the proposed model, it is possible to estimate in real time, from the browsing history data, whether or not a user is likely to purchase. However, even for latent topics with high purchase rates, the percentage of purchases within 10 pages is about 20%. In other words, many users move to other latent topics without purchasing, even though they are in a latent topic with a high purchase rate. We therefore think that CVR improvement can be expected from measures targeting such users.

The analysis of page types showed that users who subscribe to the E-mail magazine were relatively likely to purchase. It also showed that pages related to item details were widely viewed in the latent topics with a high purchase probability. In addition, since the willingness to purchase does not last long, measures should be taken as soon as possible after a transition to a latent topic with a high purchase rate. As an example of a measure to improve CVR, it is therefore conceivable to simplify the steps up to the purchase for users who have arrived from the E-mail magazine. A real-time coupon system is another possible measure. In addition, the widely viewed Item detail page could let users compare prices and easily consult the reviews they tend to refer to. Such a flow design would help users complete the purchase before their purchase motivation declines.
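The timing rule above (act immediately after a transition into a high purchase rate topic) can be sketched as a simple filter over a stream of per-page estimates. The online estimation of the topic and of the “selection” indicator is not shown and is assumed to be available from the model; the function name, the tuple layout, and the choice of target topics are hypothetical.

```python
def coupon_triggers(stream, target_topics):
    """Yield page indices at which a real-time measure could be applied.

    stream:        iterable of (page_index, topic, psi) estimates in viewing
                   order, where psi == 1 marks a newly selected topic.
    target_topics: set of latent topics with a high purchase rate.
    """
    for page_index, topic, psi in stream:
        # Trigger only at the moment of entry into a target topic,
        # not on later pages where the topic is merely inherited.
        if psi == 1 and topic in target_topics:
            yield page_index
```

For example, `list(coupon_triggers([(0, 8, 1), (1, 8, 0), (2, 1, 1), (3, 1, 0)], {1, 2}))` selects only page index 2, the first page of the high purchase rate topic.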

In this study, we focused on the statistical model of users’ browsing sequences. If the analysis by the proposed model is combined with analysis of other data, such as browsing time (time zone, length of time spent on a page, etc.), demographic information of the user (sex, age, membership or not, etc.), and browsing media (PC, smartphone, app, etc.), the range of CVR improvement measures is expected to expand.

### 6.3 Extension of the Proposed Model on EC Sites

There are various possible reasons for a user accessing an EC site. For example, in the case of Rakuten Ichiba, which is dealt with in this study, a user may access the top page of Rakuten Ichiba in order to purchase something or simply to enjoy window shopping. A user may also arrive at a page in Rakuten Ichiba from a link or advertisement on another site. Rakuten members may also access the site for purposes other than purchasing. Page transitions are thus thought to differ depending on how the user reaches the EC site, and this may influence whether or not the user purchases. Therefore, by analyzing the inflow source to the EC site, it is expected that users’ latent intentions can be extracted more accurately and that the inflow sources that lead to purchases can be found.

Moreover, EC sites on the Internet have different characteristics, so the user characteristics and site structure of each EC site are expected to differ. For example, when comparing Rakuten Ichiba and Amazon.com, it is said that Rakuten Ichiba presents a lot of information and suits women and others who want to enjoy shopping itself, whereas Amazon.com has a simple design and site structure and is said to suit men and others who want to purchase desired items as easily as possible. There are also many EC sites that deal only with specific items. For example, on an EC site that deals only with expensive items, users cannot easily make purchasing decisions. On such a site, changes in users’ intentions should be easier to capture because a relatively long browsing history can be observed before a purchase. The pattern of page transitions can thus be expected to differ depending on user characteristics and the purpose of use. Therefore, by applying the proposed model to EC sites other than Rakuten Ichiba, it may be possible to extract latent topics and transitions characteristic of each EC site.

## 7. CONCLUSION AND FUTURE WORK

This research constructed an analytical model that extracts the user’s intention from the browsing history data of an EC site and captures its change over time, in order to support the examination of CVR improvement measures. To infer users’ unobservable intentions from the observable browsed pages, we proposed a method that applies a latent class model based on HTMM and uses it together with the purchase history.

Specifically, we proposed an analysis method consisting of the following two steps. In step 1, we construct a model that estimates latent topics by applying HTMM to browsing history data. Of the two constraints of HTMM, the proposed method removes the constraint that “words in the same sentence belong to the same topic (Ψ_{d,n} = 0 except for the first word of the sentence)”. The model can still represent the situation in which the latent topic does not transition at every page, under the constraint that the latent topic is likely to be the same as the previous latent topic. In step 2, we extract the purchase status from the browsing history data and define the likelihood of purchase in each latent topic as the “purchase rate”. The proposed model makes it possible to extract users’ intentions (mainly the willingness to purchase).
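The behavior described in step 1 can be illustrated with a small generative sketch. It follows the standard HTMM generative process of Gruber et al. (2007) with every page treated as a possible switching point, which is how the relaxed constraint is read here. The parameter names (`epsilon`, `theta`, `beta`) are assumptions; in particular, the new topic is drawn from a session-level distribution `theta`, and if the proposed model instead conditions the draw on the previous topic, `theta` would become a row of a transition matrix.

```python
import random


def sample_session(n_pages, epsilon, theta, beta, seed=0):
    """Sample (topics, psis, pages) from an HTMM-style process.

    epsilon: probability that a new topic is selected at a page (psi = 1).
    theta:   dict mapping each topic to its selection probability.
    beta:    dict mapping each topic to a dict {page_type: probability}.
    """
    rng = random.Random(seed)
    topic_ids = list(theta)
    topics, psis, pages = [], [], []
    for n in range(n_pages):
        # The first page always selects a topic; later pages switch
        # with probability epsilon and otherwise inherit.
        psi = 1 if n == 0 or rng.random() < epsilon else 0
        if psi == 1:
            weights = [theta[t] for t in topic_ids]
            topic = rng.choices(topic_ids, weights=weights)[0]
        else:
            topic = topics[-1]
        page_types = list(beta[topic])
        page_weights = [beta[topic][p] for p in page_types]
        page = rng.choices(page_types, weights=page_weights)[0]
        topics.append(topic)
        psis.append(psi)
        pages.append(page)
    return topics, psis, pages
```

A small `epsilon` produces long runs of the same topic, which corresponds to the assumption that a user's intention does not change at every page.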

The effectiveness of the proposed model was verified by applying it to actual browsing history data of about 190,000 pages in Rakuten Ichiba. The purchase rate differed greatly depending on the latent topic, which suggests that the user’s intention can be extracted by the proposed model. In addition, we found that the page types included in the latent topics differed and that the transitions between latent topics were biased. From these results, we concluded that the proposed model provides valuable information for considering measures to improve CVR. For example, a site redesign that promotes effective page transitions on EC sites, or coupons issued individually to each user, can be considered as CVR improvement measures. Various findings were thus obtained with the proposed model, demonstrating its effectiveness.

In future work, we must confirm the validity of the results and the usefulness of the obtained knowledge through A/B testing or other additional experiments. In this study, the data were limited in terms of the collection period, target users, target EC sites, and so on. It is therefore desirable to confirm whether the proposed model is effective over a longer period, for more users, and on different EC sites. These remain as future work.

Mio Hotoda received her B.E. and M.E. degrees in Industrial and Management Systems Engineering from Waseda University, Tokyo, Japan, in 2018 and 2020, respectively. Her research interest is machine learning theory.

Gendo Kumoi is a research associate in the department of Industrial and Management Systems Engineering, Waseda University, Japan. He is studying in the field of applied information mathematics, machine learning, and text mining. He is a member of IEEE and the Information Processing Society of Japan, among others.

Masayuki Goto is a professor in the department of Industrial and Management Systems Engineering, Waseda University, Japan. He received his Dr.E. degree from Waseda University in 2000. He is studying in the field of data science, business analytics, machine learning, and Bayesian statistics. He is now a director of the Research Institute of Data Science, Waseda University. He has won several best paper awards at several conferences such as the 20th Asia Pacific Industrial Engineering and Management Systems (APIEMS 2019) and 16th Asian Network for Quality Congress.

## ACKNOWLEDGEMENT

The authors would like to express their gratitude to VALUES, Inc. for providing the browsing history data. They would also like to thank Dr. Kenta Mikawa, Dr. Haruka Yamashita, Mr. Tianxiang Yang, and all the members of Goto Laboratory, Waseda University, for their helpful comments on this research.

## Figure

The number of pages during the data collection period (three months).

The number of pages within one session.

HMM graphical model.

LDA graphical model.

HTMM graphical model.

The purchase rate of each latent topic.

## Table

The number of pages viewed in each period.

The top 10 and bottom 10 page types

The purchase rate of each latent topic in the case of “selection” and “inheritance”. (%)

The topic transition ratio when the latent topic changes. (%)

## REFERENCES

1. Bassi, F. (2007), Latent class factor models for market segmentation: An application to pharmaceuticals, Statistical Methods and Applications, 16(2), 279-287.
2. Bhatnagar, A. and Ghose, S. (2004), A latent class segmentation analysis of e-shoppers, Journal of Business Research, 57(7), 758-767.
3. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3(Jan), 993-1022.
4. Burt, S. and Sparks, L. (2003), E-commerce and the retail process: A review, Journal of Retailing and Consumer Services, 10(5), 275-286.
5. Dias, J. G. and Vermunt, J. K. (2007), Latent class modeling of website users’ search patterns: Implications for online market segmentation, Journal of Retailing and Consumer Services, 14(6), 359-368.
6. Gruber, A., Weiss, Y., and Rosen-Zvi, M. (2007), Hidden topic Markov models, Artificial Intelligence and Statistics, 2, 163-170.
7. Hotoda, M., Mizuochi, H., Kumoi, G., and Goto, M. (2019), Analytical model of customer purchase behavior considering page transitions on EC site, Total Quality Science, 5(1), 23-33.
8. Jin, X., Zhou, Y., and Mobasher, B. (2004), Web usage mining based on probabilistic latent semantic analysis, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 197-205.
9. Kairos Marketing Inc. (2014), Estimated conversion rate and average value for each industry, Available from: https://blog.kairosmarketing.net/contentmarketing/conversion-average-140320/.
10. Kawazu, H., Toriumi, F., Takano, M., Wada, K., and Fukuda, I. (2016), Analytical method of web user behavior using hidden Markov model, Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 2518-2524.
11. Matsuzaki, Y., Mikawa, K., and Goto, M. (2017), Latent semantic Markov model for effective promotion activities in EC sites, Journal of Information Processing (JIP), 58(12), 2034-2045.
12. Ministry of Economy, Trade and Industry of Japan (2019), Establishment of infrastructure for information and service conversion of Japanese economic society of 2018 (Market research on e-commerce).
13. Rakuten Inc. (2020), Rakuten Ichiba, Available from: https://www.rakuten.co.jp/.
14. Wedel, M. and Kamakura, W. A. (2012), Market Segmentation: Conceptual and Methodological Foundations (vol.8), Springer Science & Business Media.