1. INTRODUCTION
Thematic stocks generally refer to a group of stocks whose price movements are synchronized in the same direction around a single theme. Thematic stocks exist in various forms, including politics, science, technology, entertainment, environment, healthcare, pharmaceuticals, and resource development. These stocks are characterized by investors seeking short-term high returns rather than steady gains. In addition, their prices are often driven by investor supply and demand rather than by corporate financial conditions or performance. Thematic stocks occur more frequently in the domestic stock market than in advanced markets such as the United States, and theme-related external information or shifts in the direction of government policies often change a theme's return rate.
In the Republic of Korea (hereafter Korea), thematic stocks originate from the so-called "unprovoked investment" in construction company stocks in the late 1970s, when domestic construction companies were booming overseas, mainly in the Middle East. In late 1987, the term "thematic stocks" began to be used in earnest with the Great Wall thematic stocks: after rumors spread that the Chinese government would install a windbreak on the Great Wall, related stocks surged. Since then, the Four Rivers project stocks, policy-related thematic stocks in the 2007 presidential election, soared and brought significant profits to investors (Kwak and Yeo, 2019).
Research on these thematic stocks tends to focus on Korea, where investment in thematic stocks is widespread and is mainly related to political thematic stocks. Herron et al. (1999) analyzed the relationship between the 1992 US presidential election and sector-by-sector returns on the US stock market. Knight (2006) examined the policies of the Republican and Democratic candidates, George W. Bush and Al Gore, in the 2000 US presidential election and confirmed significant stock fluctuations. In the case of domestic research, Woo and Kim (2014) used event study methodology to calculate normal stock prices for thematic stocks, and Kang (2016) applied Fama–French's three-factor model to the thematic stocks of Korea's 18th presidential election. In addition, the Financial Services Commission (2017) and Financial Supervisory Service (2017) warned against the risk of investing in thematic stocks, noting that political thematic stocks related to presidential candidates plunged to market levels on the day of the 19th presidential election. Nam (2017) also confirmed the need for cumulative return analysis. Kwak and Yeo (2019) analyzed short- and long-term abnormal returns around the 19th presidential election using market models, Fama–French's three-factor model, and Carhart's four-factor model. Kim and Lim (2020) confirmed that the KOSPI 200 and particulate-matter-themed stocks responded to changes in PM10 concentration. Finally, Nam (2020) analyzed the cumulative abnormal return (CAR), suggesting that politically themed stocks continue to emerge around major political events, including the last presidential election, as a medium for connecting with leading politicians unrelated to corporate intrinsic value. Table 1 summarizes these previous papers and their methodologies for detecting thematic stocks.
In summary, prior research primarily analyzed the existence of thematic stocks through abnormal returns within a specific event-driven period, setting quantitative variables related to a particular theme as explanatory variables. Building on that research, this study illuminates thematic stock investment using daily sentiment scores and information theory, a new direction for research on individual thematic stocks over event-driven periods. Specifically, our objectives include analyzing external information using text mining techniques, verifying whether candidate stocks are related to the theme through the theme sentiment index (TSI) and causal analysis, and finding ways to apply the results. This study used, as external information, economic text data containing keywords such as "mask," search volumes for related keywords, news articles from which individual investors get most of their information, and private and economic broadcasting scripts. We verified the existence and magnitude of causal relationships between the stock prices and abnormal returns of candidate stocks and the text sentiment indices produced by combining search volume and text sentiment. Using the candidate stocks proven to have significant causal relationships, we constructed a network of thematic stocks based on the causal relationships among the candidate stocks belonging to the theme. Network theory was used to verify the influence of each listed stock within the constructed thematic stock network. Finally, we examined the developed thematic stock network and conducted an investment simulation by identifying dynamic changes across the network at particular times and dynamic changes in specific listed stocks (Choi and Kim, 2021).
In this study, we aimed to demonstrate empirically what previous studies analyzed: that profits can be generated at specific times or events, and that, in reality, the risk of thematic stock investment must be considered, as noted in those studies.
The remainder of this paper is organized as follows. Section 2 describes the research methodology, the subjects, and the data used in the composition and analysis of the thematic stock network. Section 3 schematizes the thematic stock network based on the derived results; this section also explores an investment methodology based on experimental results using the schematized thematic stock networks and the possibility of application in the RegTech and SupTech fields. Finally, Section 4 presents this study's summary, limitations, and conclusions.
2. DATA AND METHODOLOGY
2.1 Study Subject
2.1.1 Subject to Research
This study set the construction and utilization of thematic stock networks as its key objectives. Mask-themed stocks, which first drew attention owing to yellow dust and fine dust and were re-illuminated as demand for masks increased rapidly after the COVID-19 outbreak, were selected as the subjects for forming the network.
In this study, 20 candidate stocks (10 KOSPI-listed and 10 KOSDAQ-listed stocks) were selected to form a thematic stock network, considering how frequently they were classified as mask-themed stocks by the top 10 securities firms by sales. As the purpose of this study is not to encourage the purchase or sale of any particular company's stock, all candidate stocks in the candidate group are de-identified alphabetically.
2.1.2 Research Period
To form the network of mask-themed stocks, we set a period of 1,461 days (980 trading days), from December 1, 2016, to November 30, 2020, as the analysis period. This period is believed to have had the most significant effect on these thematic stocks in recent years. We then collected text and return data within that period. Within this period, candidate stocks with a significant causal relationship from mask-related text data were selected and used as the nodes of the mask-themed stock network.
2.2 Asset Pricing Model
In this study, we used abnormal returns of stocks to reduce the stock market's influence and focus on the individual-level performance of mask-themed stocks. The abnormal return of an asset is its historical return minus the expected return derived from an asset pricing model. For the results' consistency, we used three asset pricing models; the market model is described below.
2.2.1 Market Model
Brown and Warner (1980, 1985) introduced the market model (MM) to examine the properties of daily stock returns and how particular characteristics of those returns affect event study methodologies. Various finance studies have used this model to derive individual stocks' abnormal returns. The equation of the MM can be expressed as follows:

$r_{i,t} = \alpha_i + \beta_i r_{m,t} + \varepsilon_{i,t}$

where $r_{i,t}$ is the return of stock i on day t, and $r_{m,t}$ is the return of the market portfolio of risky assets on day t. $\alpha_i$ and $\beta_i$ are the intercept and the slope of the fitted line derived from linear regression; as in the original research, we estimate $\alpha_i$ and $\beta_i$ with ordinary least squares (OLS). $\varepsilon_{i,t}$ is the error term (a random variable) with expectation zero and finite variance. Moreover, $\varepsilon_{i,t}$ is uncorrelated with the market return $r_{m,t}$ and with the returns of other firms $r_{j,t}$ ($i \neq j$), homoscedastic, and not autocorrelated. The abnormal return is then $AR_{i,t} = r_{i,t} - (\hat{\alpha}_i + \hat{\beta}_i r_{m,t})$.
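As an illustration, the OLS fit and the abnormal-return computation over an estimation window can be sketched as follows (a minimal sketch with a hypothetical function name; the return series are assumed to be plain Python lists of equal length):

```python
def market_model_abnormal_returns(r_i, r_m):
    """Fit r_i = alpha + beta * r_m + eps by ordinary least squares, then
    return abnormal returns AR_t = r_{i,t} - (alpha + beta * r_{m,t})."""
    n = len(r_i)
    mean_i = sum(r_i) / n
    mean_m = sum(r_m) / n
    # OLS slope: sample covariance(r_i, r_m) / sample variance(r_m)
    beta = (sum((x - mean_i) * (y - mean_m) for x, y in zip(r_i, r_m))
            / sum((y - mean_m) ** 2 for y in r_m))
    alpha = mean_i - beta * mean_m
    abnormal = [x - (alpha + beta * y) for x, y in zip(r_i, r_m)]
    return abnormal, alpha, beta
```

With an intercept in the regression, the abnormal returns (OLS residuals) sum to zero over the estimation window by construction.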
2.3 Text Mining Technique
2.3.1 Latent Dirichlet Allocation (LDA)
Topic modeling methods are powerful, intelligent techniques widely applied in natural language processing (NLP) to discover topics and semantic mining from unordered documents. Specifically, LDA, one of the most popular topic modeling methods, is a generative probabilistic model for collections of discrete data, such as text corpora (Blei et al., 2003). LDA can generate a topic per document model and words per topic model based on Dirichlet distribution. Figure 1 shows the concept of LDA.
Many studies applied topic modeling methods based on LDA in various fields, such as keyword selection, source code analysis, opinion mining, event detection, music key profiling, image classification, recommendation system, and emotion classification.
We used LDA based on the Gibbs sampling method because of its rapid speed compared with the original model. Gibbs sampling is one of the Markov chain Monte Carlo algorithms for sampling conditional distributions of variables, approximated from an actual distribution when direct sampling is inefficient or difficult. Equation (1) is the update equation of LDA using Gibbs sampling for the probability that the kth topic is assigned to z_{d,i}, the ith word of the dth document (Griffiths and Steyvers, 2004; Darling, 2011):

$P(z_{d,i} = k \mid z_{-i}, w) \propto \underbrace{\left(n_{d,k} + \alpha_k\right)}_{A} \cdot \underbrace{\frac{v_{k,w_{d,n}} + \beta_{w_{d,n}}}{\sum_{j=1}^{V}\left(v_{k,j} + \beta_j\right)}}_{B}$ (1)

where $z_{-i}$ signifies leaving the ith assignment out of the calculation, w is the word vector of documents, $n_{d,k}$ is the number of words in the dth document assigned to the kth topic, $w_{d,n}$ is the nth word in the dth document, and $v_{k,w_{d,n}}$ is the number of times $w_{d,n}$ is assigned to the kth topic across the whole corpus. $\alpha_k$ and $\beta_k$ are the hyperparameters of the per-document topic proportions and the per-topic word distributions, following symmetric Dirichlet distributions. Equation (1) can be summarized as two parts, A and B: A expresses the relationship between the dth document and the kth topic, and B expresses the relationship between the nth word of the dth document and the kth topic.
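A toy version of this collapsed Gibbs sampler can be sketched as follows (the function name and toy corpus are hypothetical; symmetric priors and a fixed iteration count are assumed for brevity). Each token's topic is resampled in proportion to (n_dk + alpha) * (v_kw + beta) / (n_k + V*beta), with the token's own current assignment removed from the counts first:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.
    Returns per-token topic assignments and document-topic counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    n_dk = [[0] * K for _ in docs]                 # topic counts per document
    v_kw = [defaultdict(int) for _ in range(K)]    # word counts per topic
    n_k = [0] * K                                  # total tokens per topic
    z = []                                         # topic of each token
    for d, doc in enumerate(docs):                 # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            n_dk[d][k] += 1; v_kw[k][w] += 1; n_k[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                n_dk[d][k] -= 1; v_kw[k][w] -= 1; n_k[k] -= 1
                # unnormalized conditional of Eq. (1): A * B per topic
                weights = [(n_dk[d][j] + alpha) * (v_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k                        # add new assignment back
                n_dk[d][k] += 1; v_kw[k][w] += 1; n_k[k] += 1
    return z, n_dk
```

In practice, library implementations (e.g., gensim) are preferable; this sketch only makes the update rule of Eq. (1) concrete.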
For selecting the optimal number of topics of the LDA model, we considered perplexity and a topic coherence measure, C_{V}. Perplexity, generally used in language modeling, is originally an entropy-based measurement of how well a probability distribution or probability model predicts a sample; algebraically, perplexity equals the inverse of the geometric mean per-word likelihood. The C_{V} measure is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information and cosine similarity. We selected the topic number with the lowest perplexity and assigned a topic to each document based on Eq. (3):

$n_m = \arg\max_{k \in \{1, \dots, N\}} \theta_{mk}$ (3)

where $n_m$ is the topic number assigned to document m, and $\theta_{m1}, \dots, \theta_{mN}$ are the assigned topic probabilities from topic 1 to topic N of document m based on the LDA model.
This study used LDA to filter out neutral documents from the original document data when building the text data for sentiment analysis. We first classified raw documents into the optimal number of topics determined by perplexity. Then, we assigned a word-level polarity score to the words in the raw documents: zero for a neutral word and one for a word with positivity or negativity. We judged positivity, negativity, and neutrality with respect to the sentiment lexicon of the National Institute of the Korean Language and built the word-level polarity score matrix from them. Then, we computed the matrix product of the term frequency–inverse document frequency (TF-IDF) matrix of the raw documents and the word polarity matrix to calculate document-level polarity scores. As various versions of the TF-IDF equation exist, we chose the version of Eq. (4):

$\text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{n}{\text{df}(t)}$ (4)

where tf(t, d) is the count of word t in document d divided by the number of words in document d, and n is the total number of documents in the document set. df(t) is the document frequency of word t, that is, the number of documents in the document set that contain t.
Finally, we computed topic-level polarity scores as the average of the document-level polarity scores of documents belonging to the same topic. Then, we decided whether to add each topic's documents to the text data used for sentiment analysis. Figure 2 depicts a summary of the above process.
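The document-level scoring step can be sketched as follows (hypothetical function names; the TF-IDF variant follows Eq. (4), and the word-level polarity is 1 for sentiment-bearing words and 0 for neutral words, as described above):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """TF-IDF per Eq. (4): tf(t, d) = count(t, d) / len(d), idf(t) = log(n / df(t))."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append({w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()})
    return rows

def document_polarity(docs, sentiment_words):
    """Document-level polarity score: product of the TF-IDF row with the
    word-level polarity vector (1 for sentiment-bearing words, 0 for neutral)."""
    return [sum(wt for w, wt in row.items() if w in sentiment_words)
            for row in tfidf_matrix(docs)]
```

Topic-level scores are then just the mean of these document-level scores within each topic; documents in topics with near-zero scores are treated as neutral and filtered out.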
2.3.2 Sentiment Analysis
Sentiment analysis is a text mining methodology that analyzes the attitude or inclination of writing or speech to identify sentiments on a particular subject, usually in text data. This analysis mainly determines positive and negative opinions in data such as articles, movie reviews, or posts from social network services.
From the aspect of tasks and applications (Ravi and Ravi, 2015; Kumar and Ravi, 2016), sentiment analysis has been applied in broad areas, such as subjectivity classification, polarity determination, multilingual and cross-lingual sentiment analysis, cross-domain sentiment analysis, opinion spam detection, corpora creation, and opinion word and aspect extraction. In the finance domain, researchers have used sentiment analysis to predict movements of stock prices (Deng et al., 2011; Mittal and Goel, 2011; Nguyen et al., 2015; Pagolu et al., 2016; Khedr and Yaseen, 2017), support decisions (Wu et al., 2014; Hájek et al., 2014; Chan and Chong, 2017), and anticipate risks (Wang et al., 2013; Nopp and Hanbury, 2015).
We assumed that the sentiments within one document extracted from articles, editorials, comments, and posts were probability-distributed, in order to use sentiment analysis for calculating the TSI. We used a machine-learning-based approach for the sentiment analysis implementation. In this study, we used four popular transformer-based language models (Vaswani et al., 2017), known to achieve high accuracy on natural language understanding tasks, to conduct sentiment analysis for calculating the TSI. The descriptions of the four transformer-based models are as follows:
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is based on the transformer architecture. BERT is designed to pretrain bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pretrained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. BERT is pretrained on two NLP tasks: masked language modeling (MLM) and next sentence prediction.
Robustly Optimized BERT Approach (RoBERTa) (Liu et al., 2019) performs better than BERT by applying the following adjustments:

Adjustment 1: RoBERTa uses BookCorpus and English Wikipedia (16G), CC-News (76G), OpenWebText (38G), and Stories (31G) as training data, whereas BERT uses only the first (16G).

Adjustment 2: BERT masks the training data once for the MLM objective, whereas RoBERTa duplicates the training data 10 times and masks it differently each time.
The developers of RoBERTa presented a replication study of BERT pretraining (Devlin et al., 2018) that carefully measured the impact of key hyperparameters and training data size. They found that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it.
XLNet (Yang et al., 2019) is a generalized autoregressive (AR) model, in which the next token depends on all previous tokens. XLNet is generalized because it captures bidirectional context through a mechanism called permutation language modeling (PLM). An AR language model uses the context words to predict the next word. BERT outperforms previous language models, but XLNet outperforms BERT. BERT uses the [MASK] symbol in pretraining; however, these symbols are absent from actual data at fine-tuning time, resulting in a pretrain–finetune discrepancy. XLNet proposes a new way to avoid the disadvantages of BERT's [MASK] method: in the pretraining phase, XLNet uses the new PLM objective, which learns contextual text representations from permutations of the input. Overall, XLNet achieves state-of-the-art results on various downstream language tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) (Clark et al., 2020) is a novel pretraining approach that aims to match or exceed the downstream performance of an MLM-pretrained model while using significantly less compute for the pretraining stage. The pretraining task in ELECTRA is based on detecting replaced tokens in the input sequence. This setup requires two transformer-based models: a generator and a discriminator.

We then calculated the TSI, adopting the form of sentiment index most commonly used in financial research (Antweiler and Frank, 2004; Checkley et al., 2017; Giannini et al., 2019; Hiew et al., 2019; Liang et al., 2020). In that prior research, the sentiment index was usually used as an additional term in regression models; in contrast, we used the TSI as a criterion for selecting thematic stocks and as a signal for thematic investment.
The TSI can be expressed as Eq. (5):

$\text{TSI}_{i,t} = SV_{i,t} \cdot \frac{1}{V_{i,t}} \sum_{j=1}^{V_{i,t}} \frac{\left(P_{i,t}^{j} - N_{i,t}^{j}\right) + 1}{2} = SV_{i,t} \cdot \frac{1}{V_{i,t}} \sum_{j=1}^{V_{i,t}} P_{i,t}^{j}$ (5)

In Equation (5), $\text{TSI}_{i,t}$ is the TSI of politician i at time t, $P_{i,t}^{j}$ is the positive rate, and $N_{i,t}^{j}$ is the negative rate of the jth document of politician i at time t from the sentiment analysis results, which satisfy $P_{i,t}^{j} + N_{i,t}^{j} = 1$. $V_{i,t}$ is the number of documents related to politician i used in the analysis at time t, and $SV_{i,t}$ is the standardized search volume of politician i at time t. We transformed the original term $P_{i,t}^{j} - N_{i,t}^{j}$ from the scale [−1, 1] to [0, 1] for convenience of calculation, as expressed in the first expression of Eq. (5). Using the property $P_{i,t}^{j} + N_{i,t}^{j} = 1$, we can convert the TSI to the rightmost expression of Eq. (5). We obtained the daily TSI by calculating Eq. (5) and computed its rate of change (ROC) to demonstrate causal relationships with the abnormal returns of mask-themed stock candidates.
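A minimal sketch of the daily TSI computation follows (hypothetical function name; it assumes the form in which the TSI is the standardized search volume multiplied by the day's average positive rate, which follows from P + N = 1):

```python
def daily_tsi(pos_rates, sv):
    """TSI for one entity on one day: standardized search volume times the
    average positive rate over the day's V documents. Since P + N = 1,
    the rescaled term (P - N + 1) / 2 reduces to P."""
    v = len(pos_rates)
    if v == 0:          # no documents on that day
        return 0.0
    return sv * sum(pos_rates) / v
```

Here `pos_rates` would come from the best-performing transformer-based sentiment classifier, and `sv` from standardized search volume data.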
We divided the data into training and test sets and used perplexity to select the number of topics in the document data. We collected 15,337 articles containing the word "mask" and related keywords based on our research, such as "mask-themed stocks," from December 1, 2016, to November 30, 2020, using Big Kinds, developed by the Korea Press Foundation. Before classifying sentiments, we performed LDA for topic modeling and selected the optimal number of topics based on perplexity to filter out neutral text data. We also trained the machine learning models for sentiment analysis using training data comprising 126,703 articles randomly collected from 2010 to 2020, excluding our test data. We labeled an article's sentiment +1 if the summation of lexicon scores in the article was positive and −1 if negative, assuming that each article carries only one sentiment: positive or negative. After that, we conducted sentiment analysis using the four transformer-based models on the 126,703 randomly collected articles from the political section. To use the four transformer-based models, we pretrained them on data from the "Modu Corpus" developed by the National Institute of the Korean Language. Finally, we conducted LDA and sentiment analysis on the collected documents using the fine-tuned model with the best performance and calculated the TSI.
2.4 Entropy Measure
2.4.1 Effective Transfer Entropy (ETE)
To select and analyze thematic stocks, we needed to measure dependencies and causal relationships quantitatively. However, general dependency and causality measures, such as correlation coefficients and Granger causality (Granger, 1969), require assumptions about the data, such as normality, stationarity, and linearity, and return-based stock data do not usually satisfy these properties (Quigley, 2008; Sheikh and Qiao, 2009; Tsai, 2011). Therefore, we turned to econophysics and information theory, which can be used without the assumptions mentioned above and can capture both linear and nonlinear relationships between objectives when measuring correlations and causal relationships (Schinckus, 2010; Jovanovic and Schinckus, 2013). Accordingly, we used the entropy-based measures of mutual information, first suggested by Shannon (1948) and Kreer (1957), and transfer entropy (TE), proposed by Schreiber (2000). Specifically, we used TE based on the Shannon entropy in this study.
TE is a nonparametric measure for verifying the amount of information transferred between two variables based on Shannon entropy. In contrast to Granger causality, TE is framed not in terms of prediction but in terms of the resolution of uncertainty. "TE from Y to X" is the degree to which Y disambiguates the future of X beyond the degree to which X already disambiguates its own future. Therefore, an attractive symmetry exists between the notions "predicts" and "disambiguates."
TE represents a viable model-free tool to infer causal relationships between time series in two dynamical systems. TE can quantify causal relationships within systems and efficiently identify the source and target variables. Hence, TE has received significant attention and is widely used not only in information theory and physics but also in fields such as neuroscience, electrical engineering, and chemical engineering. In finance, TE has been widely used to determine causal relationships between financial assets and markets (Bossomaier et al., 2016); in particular, general stock market indices, exchange rates, stock prices, sector indices, and cryptocurrencies have been researched. In the 2000s, Marschinski and Kantz (2002) reported the causal relationship between the German DAX Xetra Stock Index (DAX) and the Dow Jones Industrial Average. Kwon and Yang (2008) showed the directionality of the information transfer and found that market indices influence individual stocks in the US stock market. Later, Dimpfl and Peter (2013) used ETE to analyze the causal relationship of the credit default swap market relative to the corporate bond market for the pricing of credit risk, and the dynamic relation between market risk and credit risk proxied by the VIX and the iTraxx Europe, from the perspective of pre-crisis, crisis, and post-crisis periods.
Moreover, Sandoval (2014) used ETE to examine the causal relationships among 197 worldwide financial companies. Sensoy et al. (2014) investigated the strength and direction of information flow between exchange rates and stock prices in several emerging countries using ETE. Based on TE, Bekiros et al. (2017) investigated the network dynamics in US equity and commodity markets, and Lim et al. (2017) analyzed the information flow between industrial sectors in credit default swaps and stock markets in the US from the aspects of intra- and inter-structures. Recently, Jang et al. (2019) studied the causal relationships among Bitcoin, gold, the S&P 500 index, and US dollars using TE. Yue et al. (2020) analyzed information transfers between stock market sectors in China and compared the US and China stock markets. These prior studies support our idea of using TE to measure causal relationships. Last, Choi and Kim (2021) used ETE to detect politically themed stocks and construct politically themed stock networks.
Based on the concepts mentioned earlier related to entropy, conditional entropy quantifies the amount of information needed to describe the outcome of a random variable X given that the value of another random variable Y is known. Here, the conditional entropy of X given Y can be expressed as follows:
$H(X \mid Y) = -\sum_{x \in X,\, y \in Y} p(x, y) \log p(x \mid y)$ (6)

Equation (6) can be interpreted as the uncertainty about X when Y is known, or the expected number of bits needed to describe X when Y is known to both the encoder and decoder. Based on the above definition, we can define the general form of the (k, l)-history TE between two variables $X_t$ and $Y_t$ for $x_t^{(k)} = (x_t, \dots, x_{t-k+1})$ and $y_t^{(l)} = (y_t, \dots, y_{t-l+1})$. The general (k, l)-history TE can be expressed as follows:

$\text{TE}_{Y \to X}^{(k,l)}(t) = \sum_{i} p\left(x_{t+1}, x_t^{(k)}, y_t^{(l)}\right) \log \frac{p\left(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}\right)}{p\left(x_{t+1} \mid x_t^{(k)}\right)}$

where $i = \{x_{t+1}, x_t^{(k)}, y_t^{(l)}\}$. $\text{TE}_{Y \to X}^{(k,l)}(t)$ is nonnegative, and we can drop the time-dependency argument t for stationary processes. $\text{TE}_{Y \to X}^{(k,l)}(t)$ is the information about the future state of X obtained from $X_t^{(k)}$ and $Y_t^{(l)}$ together, minus the information obtained from $X_t^{(k)}$ alone. Figure 4 shows the schematic representation of TE.
In this study, we focused on the TE with lags k = l = 1, a common choice because these lag settings can be safely assumed under the weak form of the efficient market hypothesis and the random-walk behavior of stock prices. The (1, 1)-history TE can then be expressed as follows:

$\text{TE}_{Y \to X} = \sum_{i} p(x_{t+1}, x_t, y_t) \log \frac{p(x_{t+1} \mid x_t, y_t)}{p(x_{t+1} \mid x_t)}$

where $i = \{x_{t+1}, x_t, y_t\}$.
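A plug-in estimator of the (1, 1)-history TE for symbolized (e.g., sign-discretized) return series can be sketched as follows (hypothetical function name; probabilities are estimated by empirical frequencies, and the logarithm is base 2, so the result is in bits):

```python
import math
from collections import Counter

def transfer_entropy(y, x):
    """(1,1)-history TE from Y to X for two symbolic series of equal length:
    sum over (x_{t+1}, x_t, y_t) of p * log2( p(x_{t+1}|x_t,y_t) / p(x_{t+1}|x_t) )."""
    n = len(x) - 1
    triples = Counter()    # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter()   # (x_t, y_t)
    pairs_x1x = Counter()  # (x_{t+1}, x_t)
    singles_x = Counter()  # x_t
    for t in range(n):
        triples[(x[t + 1], x[t], y[t])] += 1
        pairs_xy[(x[t], y[t])] += 1
        pairs_x1x[(x[t + 1], x[t])] += 1
        singles_x[x[t]] += 1
    te = 0.0
    for (x1, xt, yt), c in triples.items():
        p_joint = c / n
        p_x1_given_xy = c / pairs_xy[(xt, yt)]
        p_x1_given_x = pairs_x1x[(x1, xt)] / singles_x[xt]
        te += p_joint * math.log2(p_x1_given_xy / p_x1_given_x)
    return te
```

The effective TE would then subtract the average TE of shuffled surrogates from this estimate to correct for finite-sample bias, a step omitted here for brevity.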
2.5 Complex Network Analysis
Complex network analysis is usually used to describe a high degree of interdependence between objects. From this point of view, network analysis has various applications to financial systems. Most existing network-theory studies have focused on analyzing correlation, financial stability, and contagion phenomena. Moreover, most financial network studies have researched network effects rather than network formation (Allen and Babus, 2009). Recently, several papers have been published in new research areas, such as market analysis (Beije and Groenewegen, 1992; Namaki et al., 2011), social networks (Huang et al., 2009; Roy and Sarkar, 2011; Martin et al., 2011), investment decisions (Ojala and Hallikas, 2006; Lee et al., 2011), investment banking (Schnabel and Shin, 2004; Minoiu and Reyes, 2013; Gemici and Lai, 2019), and microfinance (Ohanyan, 2002; Fafchamps and Gubert, 2007; Tahmasebi and Askaribeazyeh, 2020).
Many complex network analysis methods exist. Link analysis is a subset of network analysis, exploring associations between objects. An example may be examining the addresses of suspects and victims, the telephone numbers they have dialed and financial transactions that they have partaken in during a given timeframe, and the familial relationships between these subjects as a part of a police investigation. Link analysis here provides the crucial relationships and associations among many objects of different types that are not apparent from isolated pieces of information. Computerassisted or fully automatic computerbased link analysis is increasingly employed by the following: banks and insurance agencies in fraud detection; telecommunication operators in telecommunication network analysis; medical sector in epidemiology and pharmacology; law enforcement investigations; search engines for relevance rating (and conversely by the spammers for spamdexing and by business owners for search engine optimization); and everywhere else, where relationships between many objects have to be analyzed. Links are also derived from the similarity of time behavior in both nodes.
Information about the relative importance of nodes and edges in a graph can be obtained through centrality measures, widely used in disciplines such as sociology. For example, eigenvector centrality uses the eigenvectors of the adjacency matrix corresponding to a network to determine nodes that tend to be frequently visited. Formally established centrality measures are degree centrality (DC), closeness centrality, betweenness centrality, eigenvector centrality, subgraph centrality, Katz centrality, and weblink centrality measures. The purpose or objective of analysis generally determines the type of centrality measure to be used. For example, if one is interested in network dynamics or the robustness of a network to node/link removal, a node’s dynamical importance is often the most relevant centrality measure.
Based on the above information about complex network analysis, we used minimum spanning trees and weighted directed networks to illustrate politically themed stock networks based on ETE. Then, we analyzed politically themed stock networks at the network and node levels and confirmed their network dynamics in real-world situations.
2.5.1 Networklevel Network Analysis
We used network density (ND), ETE, and the frequency distribution of ETE at the network level to analyze and summarize network dynamics. These networklevel measures have been generally used to analyze stock markets and can effectively illustrate the states of politically themed stock networks.
2.5.1.1 ND
The ND is defined as the ratio of the number of edges K to the number of possible connections in a network with N nodes. The idea of ND comes from the binomial coefficient. In sum, the ND of a directed graph is given by:

$\text{ND} = \frac{K}{N(N-1)}$
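For example (hypothetical function name; edges are given as directed node pairs):

```python
def network_density(edges, n_nodes):
    """Density of a directed network: K / (N * (N - 1)),
    where K is the number of distinct directed edges."""
    k = len(set(edges))  # deduplicate repeated edges
    return k / (n_nodes * (n_nodes - 1))
```

A fully connected directed network has density 1; an empty one has density 0.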
2.5.2 Nodelevel Network Analysis
There are many kinds of node-level measures for analyzing networks. Among them, the concept of centrality is particularly important. Centrality measures come in many types, and we focused on those that indicate the direct influences of politically themed stocks; that is, we considered centrality measurements capturing the intuitive influence of individual nodes. Accordingly, DC, node strength (NS), and PageRank (PR) are the methods used to analyze politically themed stock networks based on ETE (Liao et al., 2017). We used the normalized versions of the node-level measures to compare results across politically themed stock networks.
2.5.2.1 DC
In unweighted directed networks, DC represents the total number of edges connected with other nodes through which a node is connected. However, nodes of weighted directed networks have either ingoing edges, outgoing edges, or both. Generally, a node with many other nodes is called a hub, and a node with many other nodes pointing at it is called authority. Therefore, the DC is analyzed into two measures in weighted directed networks: indegree (DC_{in}) and outdegree (DC_{out}). In this study, we divided two original DC measures by N−1 to compare with other politically themed stock networks on the same scale as we mentioned. In sum, DC_{in} and DC_{out} of the j th stock can be defined as Eqs. (10) and (11), respectively.
In Eqs. (10) and (11), $a_{ij}$ is the element of the adjacency matrix $A \in \mathbb{R}^{N \times N}$. For the adjacency matrix A, $a_{ij} = 1$ if a link exists from stock i to stock j, and $a_{ij} = 0$ otherwise. In addition, $a_{ii} = 0$ for all i $(1 \le i \le N,\ 1 \le j \le N)$.
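The normalized in- and out-degree centralities can be read directly off the adjacency matrix. A minimal sketch with a made-up 3-node network (function name and data are illustrative):

```python
def degree_centralities(adjacency):
    """Normalized degree centralities of a directed network.

    DC_in(j)  = (1/(N-1)) * sum_i a[i][j]  (edges pointing at node j)
    DC_out(j) = (1/(N-1)) * sum_i a[j][i]  (edges leaving node j)
    """
    n = len(adjacency)
    dc_in = [sum(adjacency[i][j] for i in range(n)) / (n - 1) for j in range(n)]
    dc_out = [sum(adjacency[j][i] for i in range(n)) / (n - 1) for j in range(n)]
    return dc_in, dc_out

# Node 0 points at 1 and 2; node 1 points at 2
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
dc_in, dc_out = degree_centralities(A)
print(dc_in)   # [0.0, 0.5, 1.0] -- node 2 is the "authority" here
print(dc_out)  # [1.0, 0.5, 0.0] -- node 0 is the "hub" here
```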
2.5.2.2 NS
In general, NS is the sum of the weights of the links connected to a node. In weighted directed networks, the in-strength is the sum of inward link weights, and the out-strength is the sum of outward link weights. NS represents the influence features in politically themed stock networks and can be calculated as two measures, similar to DC: in-strength ($NS_{in}$) and out-strength ($NS_{out}$). We also divided by N−1 to compare with other politically themed stock networks. $NS_{in}$ and $NS_{out}$ of the jth stock can be expressed as Eqs. (12) and (13), respectively:

$NS_{in}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} w_{ij}$ (12)

$NS_{out}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} w_{ji}$ (13)
In Eqs. (12) and (13), $w_{ij}$ is the element of the weight matrix $W \in \mathbb{R}^{N \times N}$ $(1 \le i \le N,\ 1 \le j \le N)$, where $w_{ii} = 0$ for all $1 \le i \le N$.
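Node strength is the weighted analogue of degree centrality: the counts are replaced by link weights (here, significant ETE values). A minimal sketch with made-up weights:

```python
def node_strengths(weights):
    """Normalized in- and out-strength from an N x N weight matrix.

    NS_in(j)  = (1/(N-1)) * sum_i w[i][j]  (inward link weights of node j)
    NS_out(j) = (1/(N-1)) * sum_i w[j][i]  (outward link weights of node j)
    """
    n = len(weights)
    ns_in = [sum(weights[i][j] for i in range(n)) / (n - 1) for j in range(n)]
    ns_out = [sum(weights[j][i] for i in range(n)) / (n - 1) for j in range(n)]
    return ns_in, ns_out

# Toy weight matrix, e.g. significant ETE values between three stocks
W = [[0.0, 0.2, 0.4],
     [0.0, 0.0, 0.6],
     [0.0, 0.0, 0.0]]
ns_in, ns_out = node_strengths(W)
print(ns_in)   # roughly [0.0, 0.1, 0.5]
print(ns_out)  # roughly [0.3, 0.3, 0.0]
```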
2.5.2.3 PR
PR (Page et al., 1999) is an algorithm used by Google Search to rank web pages in its search results and was initially introduced to rank pages from the web graph. PR is one of the well-known eigenvector-based metrics that consider all network paths to determine node importance, and it improved on the disadvantages of similar metrics, such as eigenvector centrality and Katz centrality. PR defines a link analysis method for a directed network to evaluate a node's influence, considering both the immediate information flow and the information flow thereafter. PR has recently been used in finance as a systemic measure (Kuzubaş et al., 2014; Yun et al., 2019) and in stock market analysis (Tu, 2014; Tang et al., 2019). Suppose there are N stocks and $A \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix of the politically themed stock network. In mathematical terms, we can obtain the PR values of politically themed stocks as the column vector $r \in \mathbb{R}^{N}$ (Higham, 2005):

$r = \left(I - \alpha A^{T} D^{-1}\right)^{-1} \frac{1-\alpha}{N}\,\mathbf{1}$ (14)
In Eq. (14), $I \in \mathbb{R}^{N \times N}$ is the identity matrix, $\mathbf{1} \in \mathbb{R}^{N}$ is the column vector whose elements are all one, and $D \in \mathbb{R}^{N \times N}$ is $\mathrm{diag}(\deg_{i})$ with $\deg_{i} = \max(K_{out}^{i}, 1)$, where $K_{out}^{i}$ is the number of outgoing edges starting from node i. $\alpha$ is the damping factor, which ranges between zero and one. We set $\alpha = 0.85$, the conventional value from the original study (Page et al., 1999).
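The linear system in Eq. (14) is equivalently the fixed point of the iteration $r \leftarrow \alpha A^{T}D^{-1}r + \frac{1-\alpha}{N}\mathbf{1}$, which is how PR is usually computed in practice. A minimal pure-Python power-iteration sketch on a made-up 3-node network, using the paper's α = 0.85:

```python
def pagerank(adjacency, alpha=0.85, iters=100):
    """Power iteration for r = alpha * A^T D^{-1} r + (1 - alpha)/N * 1,
    where D = diag(max(out-degree, 1)) guards against dangling nodes."""
    n = len(adjacency)
    deg = [max(sum(row), 1) for row in adjacency]  # out-degrees, floored at 1
    r = [1.0 / n] * n                              # uniform starting vector
    for _ in range(iters):
        # New rank of j: damped sum of ranks flowing in along edges i -> j
        r = [alpha * sum(adjacency[i][j] * r[i] / deg[i] for i in range(n))
             + (1 - alpha) / n for j in range(n)]
    return r

# Edges: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; node 2 receives the most rank
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
r = pagerank(A)
print([round(x, 3) for x in r])
```

Because this toy network has no dangling nodes, the ranks remain a probability vector (they sum to one).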
3. RESULTS
3.1 Descriptive Statistics
Daily closing price data for the 20 candidate stocks selected within the above period were collected, and returns were calculated. In this study, log returns were used for return calculation and presentation because, unlike simple returns, positive and negative log returns are symmetrical and easy to aggregate. Table 2 presents the descriptive statistics for the abnormal returns.
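The symmetry and additivity of log returns can be seen in a short sketch (prices are made up): a move from 100 to 110 and back to 100 yields two log returns of equal magnitude and opposite sign, and multi-day returns are obtained by simple summation.

```python
import math

def log_returns(prices):
    """Daily log returns: r_t = ln(P_t / P_{t-1})."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

prices = [100.0, 110.0, 100.0]
r = log_returns(prices)
print(r)        # equal magnitude, opposite sign
print(sum(r))   # approximately 0.0 -- log returns aggregate additively
```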
As a result, none of the return distributions satisfied normality, making it reasonable to use TE, the entropy-based causality measure.
3.2 Calculating the Sentiment Index of MaskThemed Stocks
3.2.1 Preprocessing Text
Text preprocessing is a crucial task in NLP that prepares text according to the purpose of analysis. This study used text preprocessing techniques, such as cleaning, normalization, tokenization, stop-word removal, stemming, POS-based extraction, and keyword extraction, to transform the collected text data into forms suitable for analysis.
We carried out text preprocessing using the Python “nltk” and “KoNLPy” libraries to implement the above Korean text preprocessing techniques on the collected text data.
We first conducted normalization, which merges identical and similar vocabulary, together with a cleaning process that eliminates noise, such as typos, from the corpus of collected text data. Then, the normalized text data went through tokenization, and meaningless word tokens were removed from the tokens generated. In addition, through stemming and POS-based extraction of the processed word tokens, the text was converted into a form suitable for topic modeling with LDA and for sentiment analysis.
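The cleaning → normalization → tokenization → stop-word removal pipeline can be sketched as below. This is a minimal English-language stand-in: a regex tokenizer replaces the morphological analysis KoNLPy provides for Korean, and the stop-word list is hypothetical.

```python
import re

# Hypothetical stop-word list; the study used Korean stop words via KoNLPy.
STOPWORDS = {"the", "a", "of", "and", "is"}

def preprocess(text):
    """Minimal cleaning -> normalization -> tokenization -> stop-word removal."""
    text = text.lower()                        # normalization: unify case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # cleaning: strip punctuation/noise
    tokens = text.split()                      # tokenization (regex stand-in)
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The supply of masks is surging!"))
# ['supply', 'masks', 'surging']
```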
3.2.2 Selecting the Model
This study hypothesized that the mention volume and sentiment analysis results could affect the abnormal return. As already shown, Eq. (5) combines the probability that a text is judged positive with the probability that it is judged negative by the sentiment analysis. The sentiment score ranges from −100% to 100%: a score of 100% indicates that the text on that date has a 100% average probability of being positive, and −100% indicates a 100% average probability of being negative. The score converges to zero for text that is close to pure information transfer or that has a high proportion of objective sentences, so the index is less sensitive to such text.
We first labeled sentiment scores for the training data set. The training set consisted of 126,703 text data randomly collected from the political section using BigKinds. To score these unlabeled data, we gave an article +1 if the summation of lexicon scores in the article was positive and −1 if it was negative. Then, we conducted sentiment analysis on these data using four transformer-based models.
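The weak-labeling rule above can be sketched as follows. The lexicon here is a made-up English stand-in for the Korean sentiment lexicon the study used, and the 0 return for a tied score is an added assumption (the original rule only describes +1/−1).

```python
# Hypothetical sentiment lexicon; the study used a Korean lexicon.
LEXICON = {"surge": 1, "profit": 1, "shortage": -1, "crisis": -1}

def weak_label(tokens):
    """Label an article +1 if the summed lexicon score is positive, -1 if negative.
    A tie returns 0 here (an assumption; such articles would need handling)."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

print(weak_label(["mask", "shortage", "crisis", "profit"]))  # -1
```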
As mentioned above, we fine-tuned those transformer-based models. ELECTRA achieved the highest accuracy and macro F1 score on the test sets; therefore, we selected ELECTRA as the model for sentiment analysis. Table 3 compares the four fine-tuned transformer language models. The model’s accuracy and macro F1 score after 100 epochs were 88.31% and 82.93%, respectively. We used early stopping to prevent overfitting.
3.2.3 Topic Modeling Results
The textual data consisted of articles, editorials, and news scripts containing mask keywords from December 1, 2019, to August 26, 2020; a total of 15,337 were collected. Of these, 240 highly overlapping text data were removed first, the remaining text data were preprocessed, and then topic modeling was carried out. As a result of the perplexity analysis for determining the optimal number of topics, three topics were found to be optimal, and the texts were classified into three topics. Table 4 shows the top 10 keywords appearing most often for each topic. It also shows that Topics 2 and 3 have several keywords related to the supply and demand of masks, including production, demand, thematic stocks, stock prices, and increase. In contrast, Topic 1 can be inferred to contain text related to infections caused by noncompliance with prevention rules or mask shortages. Using the method of Figure 2, we calculated the polarity score and excluded the text data in the topic with low average polarity scores. Most of the text data in that topic were unsuitable for our research purposes, mainly reporting confirmed cases and infection routes. Therefore, sentiment analysis was performed on 13,389 text data, after the 1,948 texts belonging to Topic 1 were removed.
3.2.4 Calculating TSI and Abnormal Returns of Candidate Stocks
For screening candidate stocks through causal analysis, the daily search volume and sentiment scores were multiplied to produce the TSI, as in many previous studies. Figure 5 shows the final computed TSI and its fluctuation. Notably, the fluctuation is considerable between January and March 2020, when demand for masks surged because of COVID-19 and the so-called “mask crisis” occurred.
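The multiplicative construction of the TSI can be sketched directly (the volumes and sentiment scores below are made up for illustration):

```python
def theme_sentiment_index(search_volume, sentiment_score):
    """TSI sketch: daily search volume multiplied by the daily sentiment score,
    following the multiplicative construction described in the text."""
    return [v * s for v, s in zip(search_volume, sentiment_score)]

volume = [120, 300, 90]       # daily search volume for the theme keyword
sentiment = [0.4, -0.2, 0.1]  # daily average sentiment score in [-1, 1]
print(theme_sentiment_index(volume, sentiment))
```

A day with heavy search traffic but negative sentiment thus yields a large negative TSI value, while neutral coverage drives the index toward zero regardless of volume.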
The number of bins of the histogram for calculating the TE for the listed stocks is determined as follows (Hacine-Gharbi and Ravier, 2018):

$B_{X} = B_{Y} = \mathrm{round}\left(\frac{1}{\sqrt{2}}\sqrt{1 + \sqrt{1 + \frac{24N}{1-\hat{\rho}^{2}}}}\right)$

where $\hat{\rho}$ is the empirically estimated correlation coefficient between X and Y.
At this time, the alpha and beta estimated via the market model are as follows:
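The market-model estimation behind these values can be sketched with a pure-Python OLS fit; the returns below are made up, not the study's data. The abnormal return is the residual of the fitted market model.

```python
def market_model_abnormal_returns(stock, market):
    """OLS fit of the market model R_t = alpha + beta * R_m,t + e_t,
    then abnormal return AR_t = R_t - (alpha + beta * R_m,t)."""
    n = len(stock)
    mean_s = sum(stock) / n
    mean_m = sum(market) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock, market)) / n
    var = sum((m - mean_m) ** 2 for m in market) / n
    beta = cov / var                     # slope: sensitivity to the market
    alpha = mean_s - beta * mean_m       # intercept
    ar = [s - (alpha + beta * m) for s, m in zip(stock, market)]
    return alpha, beta, ar

market = [0.01, -0.02, 0.015, 0.005]
stock = [0.02, -0.03, 0.025, 0.005]      # roughly market-tracking with noise
alpha, beta, ar = market_model_abnormal_returns(stock, market)
print(round(beta, 3))
print(round(sum(ar), 10))  # residuals sum to ~0 by OLS construction
```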
3.2.5 Selecting MaskThemed Stocks
The time lag between the calculated TSI and the abnormal returns of the stocks in the candidate group was set to 1 day, and the calculated ETE values, p-values, and the existence of causality were then verified. Table 6 summarizes the TE values and p-values from the TSI to the abnormal returns of the individual candidate stocks. For p-values, “***” marks less than 0.001, “**” less than 0.01, and “*” less than 0.05 on the right side of the p-value. In this study, a significance level of 0.05 was set, based on which candidate stocks were included in the thematic stock network when the TSI was significantly causal to the candidate stock price. Candidate stocks were excluded from the thematic stock network if the p-value was greater than or equal to 0.05.
The causal analysis from the TSI to the candidate stocks showed that 19 of the 20 candidate stocks in the candidate group had significant causal relationships at the 0.05 level.
Candidate stock S, which did not have a significant causal relationship at this level, was excluded from the thematic stock network configuration. Stock S is a fashion mask company unrelated to medical masks. In other words, this nonlinear causal relationship from the TSI to the candidate stocks can be interpreted qualitatively in this case.
3.3 Constructing and Analyzing the MaskThemed Stock Network
3.3.1 Constructing the MaskThemed Stock Network of the Research Period
Each causal relationship was measured for the 19 mask-themed stocks selected based on the causal relationship with the TSI. The thematic stock networks constructed from these relationships are shown in Figures 6–8. Figure 6 is based on the significant ETE values at α = 0.1, Figure 7 at α = 0.05, and Figure 8 at α = 0.01. In the schematic networks of Figures 6–8, a thicker and darker connection line indicates a stronger causal relationship, and a larger circle means that a thematic stock has a larger degree value. We used the Kamada–Kawai (Kamada and Kawai, 1989) path-length cost function for positioning nodes. The ND of these three networks is 61.70% (α = 0.1), 61.40% (α = 0.05), and 40.06% (α = 0.01), respectively. At the 0.05 significance level, the mask-themed stock network is dense, with over 50% of possible connections present. In the next section, we illustrate the network analysis results at the significance level α = 0.05, based on the configured mask-themed stock network.
3.3.2 Network Analysis Results
We analyzed the network based on the ETE-based causal relationships that are statistically significant at the significance level α = 0.05.
From Figure 9, in the case of out-degrees, stocks C, H, I, L, and R were close to 0.8, led by J and K, which exceeded 0.8. In the case of in-degrees, the values were large in the order of H, K, I, R, O, and J. Similar results can be confirmed for the out-strength, in-strength, and PR values.
Generally, a node with connections to many other nodes going out from it is called a hub, and a node with many other nodes pointing at it is called an authority.
From these measures, stocks I and K have high values of 0.4 or more in nearly all network measures, and those values are statistically significant at the significance level α = 0.01. Moreover, their network measures show very strong positive correlations with the ETE values of the mask-themed stocks for all correlation coefficients. These results suggest that stocks with strong causal relationships from the TSI within the mask-themed stocks play the roles of hub and authority.
Moreover, if a stock’s connection to a theme is deeper than that of the other stocks involved, the stock is generally called the leading stock in Korea. These leading stocks are frequently examined in Korea when financial regulatory authorities look for signals of thematic stocks’ abnormal movements (Choi and Kim, 2021). From this perspective, the stocks with high network measure values can be regarded as the leading stocks in this case.
3.3.3 Analyzing the Longitudinal Changes of Thematic Stock Network
In particular, during the COVID-19 outbreak of Period 4, the thematic stock network was denser than in the other periods. This result confirms that the intensity of the causal relationships within the mask-themed stock network strengthened as interest in the mask theme increased. This finding provides quantitative evidence supporting the monitoring of thematic stocks during that period.
In addition, changes in the thematic stock network can be used to analyze a particular stock. For example, company R was mistakenly known as a mask-themed stock because of misinformation, delivered to shareholders, that it mass-produced masks. However, on February 24, 2020, company R officially announced that it was far from a mask-themed stock and did not mass-produce masks.
As a result, all network measures of company R decreased, and the overall ND also decreased. These results indicate that company R’s position within the mask-themed stock network narrowed, which can be understood as contributing to the decrease in the overall ND. Moreover, in the Kamada–Kawai layout, the more central a node’s position, the more important the node is within the network. From this perspective, stock R moved to the edge of the network because its importance decreased. Furthermore, among the network measures of all stocks, only stocks D and R showed an overall decline.
These results suggest that monitoring changes within the thematic stock network makes it possible to quantitatively confirm the thematic characteristics of individual stocks in future analyses.
3.3.4 Investment Strategy based on the Theme Sentiment Index (TSI) and Mask-Themed Stock Network
Based on previous studies of thematic stocks and the results of the previous sections, we obtained the insight that thematic stock prices change dramatically, with a positive CAR effect, when the theme keyword draws attention. Based on these known effects, we developed an investment strategy using the TSI and the mask-themed stock network as the class of assets (Kim et al., 2014) to verify the possibility of benchmarking KOSPI and KOSDAQ, the major market indices of the Korean stock market. The strategy performs the following steps:

Step 1: Check the daily ROC of TSI.

Step 2: Apply anomaly detection methodologies to the ROC of TSI.

Step 3: Optimize the portfolio with the stocks, invest at the anomaly periods based on the optimization result, and check the performance.
To apply this strategy, we assumed a daily rebalancing period, and the stock weights were obtained from the solution of the portfolio optimization problem that maximizes the Sharpe ratio. Maximizing the Sharpe ratio is one of the most commonly used portfolio optimization methodologies: it considers return and risk in the objective function simultaneously and can be computed directly from any observed time series of returns, regardless of additional information about the stocks. The portfolio optimization problem that maximizes the Sharpe ratio can be defined as follows:

$\max_{w} \frac{(\mu - r_{f})^{T} w}{\sqrt{w^{T} \Sigma w}} \quad \text{s.t.} \quad \mathbf{1}^{T} w = 1, \quad w \ge 0$
where $w \in \mathbb{R}^{N}$ is the column vector of asset weights, $\mu \in \mathbb{R}^{N}$ is the mean return vector, and $\Sigma \in \mathbb{R}^{N \times N}$ is the covariance matrix. The column vectors $r_{f} \in \mathbb{R}^{N}$ and $\mathbf{1} \in \mathbb{R}^{N}$ have all elements equal to the risk-free rate and to one, respectively. The first constraint means that the portfolio weights sum to one, and the second implies that short selling, which is hard for individual investors to carry out, is not allowed.
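To make the objective and constraints concrete, the sketch below solves the max-Sharpe problem by random search over long-only weight vectors on the simplex. This is purely illustrative with made-up inputs; in practice a solver for the convex reformulation cited in the text (Iyengar and Kang, 2005) would be used.

```python
import random

def max_sharpe_random_search(mu, cov, rf=0.0, trials=20000, seed=0):
    """Random-search sketch of max-Sharpe: sample feasible weights
    (1'w = 1, w >= 0) and keep the best (w'mu - rf) / sqrt(w'Sigma w)."""
    rng = random.Random(seed)
    n = len(mu)
    best_w, best_sharpe = None, float("-inf")
    for _ in range(trials):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        w = [x / total for x in raw]           # feasible: sums to 1, long-only
        excess = sum(wi * mi for wi, mi in zip(w, mu)) - rf
        var = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
        sharpe = excess / var ** 0.5
        if sharpe > best_sharpe:
            best_w, best_sharpe = w, sharpe
    return best_w, best_sharpe

mu = [0.001, 0.002]              # made-up mean daily returns of two stocks
cov = [[0.0004, 0.0001],
       [0.0001, 0.0009]]         # made-up return covariance matrix
w, sharpe = max_sharpe_random_search(mu, cov)
print([round(x, 2) for x in w], round(sharpe, 4))
```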
We used the stock return data from the preceding 20 trading days to calculate the average return vector and the covariance matrix. The South Korea 10-Year Government Bond yield was used as the risk-free rate (Goedhart, 2015). We also considered transaction costs to prevent overestimating performance, which is critical in a daily rebalancing portfolio optimization problem. Specifically, we assumed that at each rebalancing we sell all of the stocks held in the previous period and buy the new stocks. This assumption was applied even when the selected politically themed stocks were the same for several days in a row. Therefore, the cumulative return of the portfolio is a lower bound on the portfolio’s performance. We also optimized a portfolio over all politically themed stocks without considering the TSI and the networks to which they belong. Additionally, we considered a reformulated version of the problem to reduce computational time and achieve convexity of the objective function (Iyengar and Kang, 2005).
To ensure the robustness of selecting anomaly time points, we selected the anomaly periods using the ensemble hard-voting results of the following 13 outlier detection methods: angle-based outlier detection, clustering-based local outlier, connectivity-based local outlier, isolation forest, histogram-based outlier detection, k-nearest neighbors detector, local outlier factor, one-class support vector machine detector, principal component analysis, minimum covariance determinant, subspace outlier detection, deviation-based outlier detection, and copula-based outlier detection. We first conducted each outlier detection method and scored dates as 0 (normal) or 1 (outlier). The anomalies were calculated using only the periods before each time point, reflecting realistic conditions. Then, we summed the 13 outlier detection results and selected a date as an outlier if its score was seven or higher. Figure 13 shows the results of this method.
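The hard-voting rule reduces to a simple threshold on the summed detector flags. A minimal sketch with made-up flags (each inner list is one detector's 0/1 verdict per date):

```python
def hard_vote_outliers(detector_flags, threshold=7):
    """Ensemble hard voting: each detector marks a date 0 (normal) or 1 (outlier);
    a date is an outlier if at least `threshold` detectors agree."""
    n_dates = len(detector_flags[0])
    votes = [sum(flags[t] for flags in detector_flags) for t in range(n_dates)]
    return [1 if v >= threshold else 0 for v in votes]

# Toy example: 13 detectors voting over 4 dates
flags = [[0, 1, 1, 0]] * 8 + [[0, 0, 1, 0]] * 5
print(hard_vote_outliers(flags))  # [0, 1, 1, 0]
```

With 8 of 13 detectors flagging date 1 and all 13 flagging date 2, both dates clear the threshold of seven votes.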
As shown in Figure 14, the portfolio with our investment strategy obtained more profit than both market indices before and around the outlier time points with 20-day windows. As a result, it was confirmed that returns such as KOSPI and KOSDAQ could be benchmarked when short-term investments were made at outlier points. In particular, during the mask shortage period around March 2020, when outliers appeared frequently, the 20-day cumulative return of our investment strategy increased considerably. After the first 20 business days, the return of the mask-themed stock portfolio was 84.5%, about four times that of KOSPI (23.3%) and three times that of KOSDAQ (26.4%). These results support the findings of existing studies that a positive excess return may occur when a specific theme keyword draws attention. However, if we optimize a portfolio of mask-themed stocks for 20 business days starting from a random point in time within the research period, only 36.8% of the periods generate positive returns relative to the two indices.
Moreover, approximately half of the periods (48.28%) produced worse investment results than the two indices, as shown in Figure 15. In Figure 15, green and yellow backgrounds indicate that our investment strategy beat both market indices or only one, respectively, and a red background indicates that it underperformed both. This result is consistent with previous studies reporting that investing in thematic stocks carries risks that can generate losses (Financial Services Commission, 2017; Financial Supervisory Service, 2017; Nam, 2017; Nam, 2020; Choi and Kim, 2021).
4. CONCLUSION
This study presented a novel analysis method for thematic stocks using natural language processing techniques and network analysis. We tried to demonstrate empirically, as analyzed in previous studies, that profit can be generated around theme-related times or events and that, in reality, the risk of thematic stock investment must be considered. We set the configuration and utilization of thematic stock networks as the key objectives. In addition, mask-themed stocks, which first formed owing to yellow dust and fine dust and were re-illuminated as demand for masks increased rapidly after the COVID-19 outbreak, were selected as the subjects for network construction. To conduct this study, 20 candidate stocks (10 listed on KOSPI and 10 listed on KOSDAQ) were selected to form a thematic stock network, considering how frequently they were classified as mask-themed stocks by the top 10 securities firms by sales. As the purpose of this study is not to encourage the purchase or sale of any particular company’s stock, all candidate stocks in the candidate group are alphabetically de-identified.
For screening candidate stocks through causal analysis, the daily search volume and sentiment scores were multiplied to produce the theme sentiment index (TSI), calculated with natural language processing techniques such as sentiment analysis and topic modeling. As a result, the causal analysis from the TSI showed that 19 of the 20 candidate stocks in the candidate group had significant causal relationships at the 0.05 level. This result means that most of the actually known thematic stocks can be identified by quantified indices, such as the TSI, built from information that investors can obtain and from the concept of thematic stocks generated from it.
We also conducted an experiment using a portfolio optimization strategy based on signals from the TSI’s outliers. The result shows that some profit can be obtained when theme-related interest, quantified by the keyword’s search volume, rises. Overall, however, the experiment confirmed, as in previous studies, that the reported investment risks of thematic stocks can generate losses.
The limitations of this study can be summarized as follows. First, although the actual stock price is affected by several factors, the experiment assumed that prices are affected only by external factors, excluding internal factors such as financial status. Second, further research on consistency and availability is needed, as the network is constructed by statistical techniques, such as TE, that do not consider the linkages between real businesses or financial domain knowledge. Finally, when initially establishing the candidate group for thematic stock composition, the researchers’ judgment may be involved; distortion or subjectivity may occur, resulting in missing candidate stocks even when the causality is significant.
Moreover, the TSI may contain errors because the accuracy and F1 score in the training period were not perfect. Therefore, follow-up studies need to address these limitations to adequately decompose the internal and external factors of stock price formation. Follow-up studies could also further validate the consistency and availability of these thematic stock networks based on the findings of this study.
From the contribution perspective, this study’s thematic stock network has the advantage of schematization. The schematized thematic stock network makes it easy to check the causal relationships among thematic stocks. Furthermore, by improving on the limitations mentioned above, it may serve as a methodology to assist the detection of abnormal transactions in the RegTech and SupTech sectors. In detail, the need to detect abnormal transactions to prevent unfair trade and to maintain financial market efficiency is highlighted in Korea’s FinTech and RegTech sectors. In particular, existing methods for detecting abnormal transactions in themed stocks are mainly empirical, based on financial statement analysis, technical analysis of stock price charts, and psychological analysis. The thematic stock network developed in this study, together with the experimental results using text mining and TE, confirmed that this process can detect abnormal transactions through network-related indicators. Therefore, this method may be used as an abnormal transaction detection method for verifying thematic stocks in the RegTech and SupTech areas.