ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.21 No.2 pp.244-266
DOI : https://doi.org/10.7232/iems.2022.21.2.244

Analyzing and Utilizing Thematic Stocks based on Text Mining Techniques and Information Flow-Based Networks: An Example of the Republic of Korea’s Mask-Themed Stocks

Insu Choi, Woo Chang Kim*
Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
*Corresponding Author, E-mail: wkim@kaist.ac.kr
November 8, 2021 ; March 22, 2022 ; March 23, 2022

Abstract


In this study, we used a theme sentiment index (TSI) derived from text mining techniques to measure the causal relationship from the TSI to the abnormal returns of stock prices, with the aim of detecting thematic stocks. We selected mask-themed stocks as the experiment subject and chose 20 candidate stocks, considering how frequently they appeared as search terms associated with masks. Then, we collected search volumes for the keyword “mask” and related keywords from December 1, 2016, to November 30, 2020, to construct the TSI. In addition, we scraped 15,337 text documents, such as articles, editorials, and economic broadcast scripts. We also used the abnormal return data of the 20 selected stocks derived from the market model. Results show that 19 stocks have statistically significant causal relationships from the TSI to the abnormal returns of their stock prices when the effective transfer entropy is used. We constructed a thematic stock network using the 19 stocks to detect their inner causal relationships. Network- and node-level measures were computed for the constructed thematic stock network to select its core stocks. In addition, two experiments were conducted using the constructed thematic stock network. The results confirmed changes in the behavior and interconnectivity of the thematic stock network and confirmed that abnormalities such as misinformation about listed stocks could be detected and empirically analyzed.





    1. INTRODUCTION

    Thematic stocks generally refer to a group of stocks whose prices move in the same direction in response to a single theme. Thematic stocks exist in various forms, including politics, science, technology, entertainment, environment, healthcare, pharmaceuticals, resource development, and others. These stocks are characterized by investors seeking short-term high returns rather than steady gains. In addition, they are often driven by investors’ supply and demand rather than by corporate financial conditions or performance factors. Thematic stocks tend to occur more frequently in the domestic stock market than in those of advanced countries, such as the United States. In the domestic market, theme-related external information or the direction of government policies often changes the theme’s return rate.

    In the Republic of Korea (from now on referred to as Korea), thematic stocks originate from the so-called “unprovoked investment” in construction company stocks in the late 1970s, when domestic construction companies were booming overseas, mainly in the Middle East. In late 1987, the term “thematic stocks” began to be used in earnest with the Great Wall thematic stocks: after rumors spread that the Chinese government would install a windbreaker on the Great Wall, related stocks surged. Since then, the thematic stocks of the Four Rivers project, which were policy-related thematic stocks in the 2007 presidential election, soared and brought significant profits to investors (Kwak and Yeo, 2019).

    Research on these thematic stocks tends to focus on Korea, where investment in thematic stocks is widespread, and is mainly related to political thematic stocks. Herron et al. (1999) analyzed the relationship between the 1992 US presidential election and sector-by-sector returns on the US stock market. Knight (2006) examined the policies of the Republican and Democratic candidates, George W. Bush and Al Gore, in the 2000 US presidential election to confirm significant stock fluctuations. In the case of domestic research, Woo and Kim (2014) used event research methodology to calculate regular stock prices for thematic stocks, and Kang (2016) applied Fama–French’s three-factor model to the thematic stocks of Korea’s 18th presidential election. In addition, the Financial Services Commission (2017) and Financial Supervision Service (2017) warned against the risk of investing in thematic stocks. They noted that political thematic stocks related to presidential candidates plunged to market levels on the day of the 19th presidential election. Nam (2017) also confirmed the need for cumulative return analysis. Kwak and Yeo (2019) analyzed short- and long-term abnormal returns around the 19th presidential election using the market model, Fama–French’s three-factor model, and Carhart’s four-factor model. Kim and Lim (2020) confirmed that KOSPI 200 and particulate matter-themed stocks responded to changes in PM10 concentration. Finally, Nam (2020) analyzed the cumulative abnormal return (CAR), suggesting that politically themed stocks, which are linked to leading politicians through connections unrelated to corporate intrinsic values, continue to emerge around major political events, including the last presidential election. Table 1 summarizes these previous papers and their methodologies for detecting thematic stocks.

    In summary, prior research primarily tested for the existence of thematic stocks based on excess returns within a specific period around an incident, using quantitative variables related to a particular theme as explanatory variables. Building on this research, we examine thematic stock investments using daily sentiment scores and information theory, as a new direction for research on individual thematic stocks. The objectives include analyzing external information using text mining techniques, verifying whether candidate stocks are related to the theme through the theme sentiment index (TSI) and causal analysis, and finding ways to apply the results. As external information, this study used economic text data containing keywords such as “mask,” search volumes for related keywords, news articles, from which individual investors get the most information, and private and economic broadcasting scripts. In addition, we verified the existence and magnitude of causal relationships between the TSI, produced by combining search volume and text sentiment, and the abnormal returns of the stock prices of the candidate stocks. Using the candidate stocks with significant causal relationships, we constructed a thematic stock network based on the causal relationships among them. Network theory was used to verify the influence of each listed stock within the constructed network. Finally, we examined the developed thematic stock network and conducted an investment simulation by identifying dynamic changes across the network at a particular time and dynamic changes in specific listed stocks (Choi and Kim, 2021). In this way, we sought to demonstrate empirically what previous studies pointed out: thematic stocks generate profits at specific times or events, and the risk of thematic stock investments should be considered.

    This paper is further organized as follows. Section 2 describes the research methodology based on the foundational analysis results of the subjects and data used in the composition and analysis of the thematic stock network. Then, Section 3 schematizes the thematic stock network based on the results derived. In addition, this section seeks the investment methodology based on experimental results using schematic thematic stock networks and the possibility of application in RegTech and SupTech fields. Finally, Section 4 describes this study’s summaries, limitations, and conclusions.

    2. DATA AND METHODOLOGY

    2.1 Study Subject

    2.1.1 Subject to Research

    This study set the configuration and utilization of thematic stock networks as key objectives. Mask-themed stocks were selected as the subjects for forming the networks. They first drew attention owing to yellow dust and fine dust and regained attention as the demand for masks increased rapidly after the COVID-19 outbreak.

    In this study, 20 candidate stocks (10 listed KOSPI stocks and 10 listed KOSDAQ stocks) were selected to form a thematic stock network, considering the frequency of themed stocks classified as mask-themed stocks in the top 10 securities firms by sales. As the purpose of this study is not to encourage the purchase or sale of candidate stocks by a particular company, all candidate stocks belonging to the candidate group are alphabetically deidentified and described.

    2.1.2 Research Period

    To form the current network of mask-themed stocks, we set a period of 1,461 days (980 trading days) from December 1, 2016, to November 30, 2020, as the analysis period. This period is believed to have had the most significant effect on thematic stocks in recent years. We collected text and return data within that period. Candidate stocks with a significant causal relationship from mask-related text data within this period were selected and used as the nodes of the mask-themed stock network.

    2.2 Asset Pricing Model

    In this study, we used abnormal returns of stocks to reduce the stock market’s influence and focus on mask-themed stocks’ performance at the individual level. The abnormal return of an asset is the historical return minus the expected return derived from an asset pricing model. We used the market model as the asset pricing model.

    2.2.1 Market Model

    Brown and Warner (1980, 1985) introduced the market model (MM) to examine the properties of daily stock returns and how their particular characteristics affect event study methodologies. Various finance studies have used this model to derive individual stocks’ abnormal returns. The equation of the MM can be expressed as follows:

    $$E(r_{i,t}) = \alpha_i + \beta_i E(r_{m,t}) + \varepsilon_{i,t}$$
    (1)

    where $E(r_{i,t})$ is the return of stock $i$ on day $t$, and $E(r_{m,t})$ is the expected daily return of the market portfolio of risky assets on day $t$. $\alpha_i$ and $\beta_i$ are the intercept and the slope of the fitted regression line, estimated with ordinary least squares as in the original research. $\varepsilon_{i,t}$ is the error term (a random variable) with expectation zero and finite variance. Moreover, $\varepsilon_{i,t}$ is uncorrelated with the market return $E(r_{m,t})$ and with the returns of other firms $E(r_{j,t})$ for $i \neq j$, homoscedastic, and not autocorrelated.
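
    As an illustration of how abnormal returns can be obtained from Eq. (1), the following minimal sketch fits the market model by ordinary least squares and subtracts the fitted expected return. The function name, the synthetic data, and the choice of the first 200 days as the estimation window are assumptions made for this example; the paper does not state its estimation window.

```python
import numpy as np

def market_model_abnormal_returns(stock_ret, market_ret, est_slice):
    """Fit r_i = alpha + beta * r_m by OLS on an estimation window and
    return (alpha, beta, abnormal returns over the full sample)."""
    stock_ret = np.asarray(stock_ret, dtype=float)
    market_ret = np.asarray(market_ret, dtype=float)
    x = market_ret[est_slice]
    y = stock_ret[est_slice]
    # Design matrix with an intercept column; lstsq returns the OLS estimates.
    design = np.column_stack([np.ones_like(x), x])
    (alpha, beta), *_ = np.linalg.lstsq(design, y, rcond=None)
    # Abnormal return = realized return minus the market-model expected return.
    expected = alpha + beta * market_ret
    return alpha, beta, stock_ret - expected

# Synthetic daily returns (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=250)
stock = 0.001 + 1.2 * market + rng.normal(0.0, 0.005, size=250)
alpha, beta, abnormal = market_model_abnormal_returns(stock, market, slice(0, 200))
```

    Because the regression includes an intercept, the abnormal returns average to zero inside the estimation window; outside it they capture the part of the return that the market does not explain.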

    2.3 Text Mining Technique

    2.3.1 Latent Dirichlet Allocation (LDA)

    Topic modeling methods are powerful, intelligent techniques widely applied in natural language processing (NLP) to discover topics and semantic mining from unordered documents. Specifically, LDA, one of the most popular topic modeling methods, is a generative probabilistic model for collections of discrete data, such as text corpora (Blei et al., 2003). LDA can generate a topic per document model and words per topic model based on Dirichlet distribution. Figure 1 shows the concept of LDA.

    Many studies applied topic modeling methods based on LDA in various fields, such as keyword selection, source code analysis, opinion mining, event detection, music key profiling, image classification, recommendation system, and emotion classification.

    We used LDA based on the Gibbs sampling method because it is faster than the original model. Gibbs sampling is a Markov chain Monte Carlo algorithm that approximates a target distribution by sampling from the conditional distributions of variables when direct sampling is inefficient or difficult. Equation (2) is the update equation of LDA using Gibbs sampling for the probability that the $k$-th topic is assigned to $z_{d,i}$, the $i$-th word of the $d$-th document (Griffiths and Steyvers, 2004; Darling, 2011).

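    $$p(z_{d,i} = k \mid z_{-i}, w) = \frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} \left( n_{d,i} + \alpha_i \right)} \cdot \frac{v_{k,w_{d,n}} + \beta_{w_{d,n}}}{\sum_{j=1}^{V} \left( v_{k,j} + \beta_j \right)} = A \cdot B$$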
    (2)

    where $z_{-i}$ signifies leaving the $i$-th word out of the calculation, $w$ is the word vector of documents, $n_{d,k}$ is the number of words in the $d$-th document that are assigned to the $k$-th topic, $w_{d,n}$ is the $n$-th word in the $d$-th document, and $v_{k,w_{d,n}}$ is the number of times $w_{d,n}$ is assigned to the $k$-th topic over the whole corpus. $\alpha$ and $\beta$ are the hyperparameters of the per-document topic proportions and the per-corpus topic distributions, respectively, following symmetric Dirichlet distributions. Equation (2) can be summarized as two parts, $A$ and $B$: $A$ is the relationship between the $d$-th document and the $k$-th topic, and $B$ is the relationship between the $n$-th word of the $d$-th document and the $k$-th topic.
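
    The update in Eq. (2) can be sketched as a small function that evaluates parts A and B for every topic and normalizes the product. This is an illustrative fragment of a collapsed Gibbs sampler under the assumption that all counts already exclude the word being resampled; the function name and the toy counts are hypothetical and are not taken from the paper.

```python
import numpy as np

def topic_conditional(n_dk, v_kw, v_k, alpha, beta_w, beta_sum):
    """Normalized p(z = k | rest) for one word, following Eq. (2).

    n_dk     : (K,) words in document d assigned to each topic
    v_kw     : (K,) corpus-wide count of this word in each topic
    v_k      : (K,) corpus-wide count of all words in each topic
    alpha    : (K,) document-topic hyperparameters
    beta_w   : beta for this word
    beta_sum : sum of beta over the vocabulary
    """
    part_a = (n_dk + alpha) / (n_dk.sum() + alpha.sum())  # document-topic term A
    part_b = (v_kw + beta_w) / (v_k + beta_sum)           # topic-word term B
    p = part_a * part_b
    return p / p.sum()

# Toy counts for K = 2 topics and a vocabulary of 100 words (hypothetical).
p = topic_conditional(
    n_dk=np.array([3.0, 1.0]),
    v_kw=np.array([5.0, 0.0]),
    v_k=np.array([50.0, 50.0]),
    alpha=np.array([0.1, 0.1]),
    beta_w=0.01,
    beta_sum=1.0,
)
# One Gibbs step draws the new topic from this conditional distribution.
new_topic = np.random.default_rng(0).choice(len(p), p=p)
```

    In a full sampler this step is repeated for every word in every document, with the counts decremented before and incremented after each draw.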

    To select the optimal number of topics for the LDA model, we considered perplexity and a topic coherence measure, $C_V$. Perplexity, generally used in language modeling, is an entropy-based measurement of how well a probability distribution or probability model predicts a sample. Algebraically, perplexity is equal to the inverse of the geometric mean per-word likelihood. The $C_V$ measure is based on a sliding window, one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information and the cosine similarity. We selected the number of topics with the lowest perplexity and assigned a topic to each document based on Eq. (3):

    $$n_m = \arg\max_{n} \theta_{mn} \quad \text{for } 1 \le m \le M,\ 1 \le n \le N$$
    (3)

    where $n_m$ is the topic number assigned to document $m$, and $\theta_{m1}, \ldots, \theta_{mN}$ are the probabilities of topics 1 to $N$ for document $m$ under the LDA model.
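
    The two selection rules above can be sketched in a few lines: perplexity as the inverse geometric mean per-word likelihood, and the topic assignment of Eq. (3) as a row-wise argmax. The function names and the small document-topic matrix are assumptions for illustration only.

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(total log-likelihood) / (total number of words)), natural log."""
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))

def assign_topics(theta):
    """Eq. (3): 1-based index of the most probable topic for each document."""
    return np.argmax(np.asarray(theta), axis=1) + 1

# Hypothetical document-topic probabilities for M = 2 documents, N = 3 topics.
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
topics = assign_topics(theta)

# A 4-word document whose words each have likelihood 0.25 has perplexity 4.
pp = perplexity([4 * np.log(0.25)], [4])
```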

    This study used LDA to filter out neutral documents from the original document data and thereby build the text data for sentiment analysis. We first classified the raw documents into the optimal number of topics determined by perplexity. Then, we assigned a word-level polarity score to the words in the raw documents: zero for a neutral word and one for a word with positivity or negativity. We judged positivity, negativity, and neutrality with reference to the sentiment lexicon of the National Institute of the Korean Language and built the word-level polarity score matrix from these scores. Then, we computed the matrix product of the term frequency-inverse document frequency (TF-IDF) matrix of the raw documents and the word polarity matrix to calculate document-level polarity scores. As various versions of the TF-IDF equation exist, we chose the version in Eq. (4):

    $$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t) = \text{tf}(t, d) \times \left[ \ln\left( \frac{n + 1}{\text{df}(t) + 1} \right) + 1 \right]$$
    (4)

    where tf(t, d) is the count of word t in document d divided by the number of words in document d, and n is the total number of documents in the document set. df(t) is the document frequency of word t, that is, the number of documents in the document set that contain t.
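
    A minimal sketch of Eq. (4) and of the document-level polarity score follows. It assumes a raw term-count matrix and a 0/1 word polarity vector as inputs; the function name, the three-word vocabulary, and the counts are hypothetical.

```python
import numpy as np

def tfidf_matrix(counts):
    """counts: (documents x terms) raw term counts. Returns Eq. (4) weights."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_len = counts.sum(axis=1, keepdims=True)
    # tf(t, d): count of t in d divided by the number of words in d.
    tf = np.divide(counts, doc_len, out=np.zeros_like(counts), where=doc_len > 0)
    # df(t): number of documents that contain t.
    df = (counts > 0).sum(axis=0)
    idf = np.log((n_docs + 1) / (df + 1)) + 1.0
    return tf * idf

counts = np.array([[2, 1, 0],
                   [0, 1, 1]])
polarity = np.array([1, 0, 1])  # 1 = polar word, 0 = neutral word
doc_scores = tfidf_matrix(counts) @ polarity
```

    Averaging `doc_scores` over the documents of one topic gives a topic-level score; a topic whose score is close to zero consists mostly of neutral words.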

    Finally, we computed topic-level polarity scores as the average of the document-level polarity scores of the documents belonging to the same topic. Based on these scores, we decided whether to add each topic’s documents to the text data used for sentiment analysis. Figure 2 depicts a summary of the above process.

    2.3.2 Sentiment Analysis

    Sentiment analysis is a text mining methodology that analyzes the attitude or inclination expressed in writing or speech, usually text data, to identify sentiments on a particular subject. This analysis mainly determines positive and negative opinions in text data, such as articles, movie reviews, or posts from social network services.

    From the aspect of tasks and applications (Ravi and Ravi, 2015; Kumar and Ravi, 2016), sentiment analysis has been applied in broad areas, such as subjectivity classification, polarity determination, multilingual and cross-lingual sentiment analysis, cross-domain sentiment analysis, opinion spam detection, corpora creation, and opinion word and aspect extraction. Concerning the finance domain, researchers used sentiment analysis to predict movements of stock prices (Deng et al., 2011; Mittal and Goel, 2011; Nguyen et al., 2015; Pagolu et al., 2016; Khedr and Yaseen, 2017), support decisions (Wu et al., 2014; Hájek et al., 2014; Chan and Chong, 2017), and anticipate risks (Wang et al., 2013; Nopp and Hanbury, 2015).

    To use sentiment analysis for calculating the TSI, we assumed that the sentiments within one document extracted from articles, editorials, comments, and posts follow a probability distribution. We used a machine learning-based approach to implement sentiment analysis. In this study, we used four popular transformer-based language models (Vaswani et al., 2017) to conduct sentiment analysis for calculating the TSI. These models are known to achieve high accuracy on natural language understanding tasks. The descriptions of the four transformer-based models are as follows:

    Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is based on transformer architecture. BERT is designed to pre-train bidirectional representations from an unlabeled text by jointly conditioning the left and right contexts. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. BERT is pre-trained on two NLP tasks: Masked Language Models (MLM) and next sentence prediction.

    Robustly Optimized BERT Approach (RoBERTa) (Liu et al., 2019) performs better than BERT by applying the following adjustments:

    • Adjustment 1: RoBERTa uses BookCorpus and English Wikipedia (16G), CC-NEWS (76G), OpenWebText (38G), and Stories (31G) data, whereas BERT uses only BookCorpus and English Wikipedia as training data.

    • Adjustment 2: BERT masks training data once for the MLM objective, whereas RoBERTa duplicates training data 10 times and masks each copy differently.

    The developers of RoBERTa presented a replication study of BERT pre-training (Devlin et al., 2018) that carefully measures the impact of various key hyperparameters and training data size. They found that BERT was significantly undertrained and can match or exceed the performance of every model published after it.

    XLNet (Yang et al., 2019) is a generalized autoregressive (AR) model, where the next token is dependent on all previous tokens. An AR language model uses the preceding context words to predict the next word. XLNet is generalized because it captures a bidirectional context through a mechanism called permutation language modeling (PLM). BERT outperforms the previous language models, but XLNet outperforms BERT. BERT uses the [MASK] token in pre-training. However, these symbols are absent from actual data at fine-tuning time, resulting in a pre-train/fine-tune discrepancy. XLNet avoids this disadvantage of the [MASK] method by using PLM as its pre-training objective, which learns contextual text representations from permutations of the input. Overall, XLNet achieves state-of-the-art results on various downstream language tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

    Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) (Clark et al., 2020) is a pre-training approach that aims to match or exceed the downstream performance of an MLM pre-trained model while using significantly less compute for the pre-training stage. The pre-training task in ELECTRA is based on detecting replaced tokens in the input sequence. This setup requires two transformer-based models, that is, a generator and a discriminator.

    Then, we calculated the TSI using the form of sentiment index most commonly used in financial research (Antweiler and Frank, 2004; Checkley et al., 2017; Giannini et al., 2019; Hiew et al., 2019; Liang et al., 2020). In prior research, such an index was usually used as an additional term in regression models. In contrast, we used the TSI as a criterion for selecting thematic stocks and as a signal for thematic investment.

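    $$TSI_{i,t} = \frac{1}{2} \left( \sum_{j=1}^{V_{i,t}} \left( \frac{P_{i,t}^{j} - N_{i,t}^{j}}{P_{i,t}^{j} + N_{i,t}^{j}} + 1 \right) \right) \frac{SV_{i,t}}{V_{i,t}} = \left( \sum_{j=1}^{V_{i,t}} P_{i,t}^{j} \right) \frac{SV_{i,t}}{V_{i,t}}$$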
    (5)

    In Equation (5), $TSI_{i,t}$ is the TSI of theme $i$ at time $t$, $P_{i,t}^{j}$ is the positive rate, and $N_{i,t}^{j}$ is the negative rate of the $j$-th document of theme $i$ at time $t$ from the sentiment analysis results, which satisfy $P_{i,t}^{j} + N_{i,t}^{j} = 1$. $V_{i,t}$ is the number of documents related to theme $i$ used in the analysis at time $t$, and $SV_{i,t}$ is the standardized search volume of theme $i$ at time $t$. We rescaled the term built from $P_{i,t}^{j}$ and $N_{i,t}^{j}$ from [−1, 1] to [0, 1] for convenience of calculation, as expressed in the middle expression of Eq. (5). Using the property that $P_{i,t}^{j} + N_{i,t}^{j} = 1$, we can convert the TSI to the rightmost expression of Eq. (5). We obtained the daily TSI by calculating Eq. (5) and computed its receiver operating characteristic (ROC) to demonstrate causal relationships between the TSI and the abnormal returns of mask-themed stock candidates.
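
    The daily index in Eq. (5) can be sketched as follows for a single day. The function name and the sample rates are hypothetical, and returning zero on a day without documents is an assumption of this sketch, not something the paper specifies.

```python
import numpy as np

def theme_sentiment_index(pos, neg, search_volume):
    """Daily TSI per Eq. (5) for one theme on one day.

    pos, neg      : per-document positive / negative rates with pos + neg = 1
    search_volume : standardized search volume SV for that day
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    n_docs = pos.size
    if n_docs == 0:
        return 0.0  # assumption: no documents means no signal
    # Rescale (P - N) / (P + N) from [-1, 1] to [0, 1].
    rescaled = 0.5 * ((pos - neg) / (pos + neg) + 1.0)
    return float(rescaled.sum() * search_volume / n_docs)

# Two documents on one day with a standardized search volume of 1.5.
tsi = theme_sentiment_index([0.8, 0.6], [0.2, 0.4], 1.5)
```

    With $P + N = 1$ the rescaled term equals $P$, so the result is the mean positive rate of the day's documents scaled by the search volume, which matches the rightmost expression of Eq. (5).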

    We divided the document data into training data and test data to select the number of topics using perplexity. We collected 15,337 articles containing the word “mask” and related keywords identified in our research, such as “mask-themed stocks,” from December 1, 2016, to November 30, 2020, using Big Kinds, developed by the Korea Press Foundation. Before classifying sentiments, we performed LDA for topic modeling and selected the optimal number of topics based on perplexity to filter out neutral text data. We also trained the machine learning models for sentiment analysis using training data. The training data set comprised 126,703 randomly collected articles from 2010 to 2020, excluding our test data. We labeled the sentiment of each article +1 if the summation of lexicon scores in the article was positive and −1 if it was negative, assuming that each article carries only one sentiment, positive or negative. After that, we conducted sentiment analysis using the four transformer-based models on the 126,703 randomly collected text data from the political section. The four transformer-based models were pre-trained on the “Modu Corpus” developed by the National Institute of the Korean Language. Finally, we conducted LDA and sentiment analysis on the collected documents and calculated the TSI using the fine-tuned model with the best performance.

    2.4 Entropy Measure

    2.4.1 Effective Transfer Entropy (ETE)

    To select and analyze thematic stocks, we needed to quantitatively measure dependencies and causal relationships. However, the usual measures of dependency and causality, correlation coefficients and Granger causality (Granger, 1969), require data assumptions such as normality, stationarity, and linearity, and return-based stock data usually do not satisfy these properties (Quigley, 2008; Sheikh and Qiao, 2009; Tsai, 2011). Therefore, we utilized econophysics and information theory, which can be used without the assumptions mentioned above and can capture both linear and nonlinear relationships when measuring correlations and causal relationships (Schinckus, 2010; Jovanovic and Schinckus, 2013). Accordingly, we used two entropy-based measures: mutual information, first suggested by Shannon (1948) and Kreer (1957), and transfer entropy (TE), proposed by Schreiber (2000). Specifically, we used TE based on the Shannon entropy in this study.

    TE is a non-parametric measure of the amount of information transferred between two variables, based on Shannon entropy. In contrast to Granger causality, TE is framed not in terms of prediction but in terms of resolution of uncertainty. “TE from Y to X” is the degree to which Y disambiguates the future of X beyond the degree to which X already disambiguates its own future. Therefore, an attractive symmetry exists between the notions “predicts” and “disambiguates.”

    TE represents a viable model-free tool to infer causal relationships between time series in two dynamical systems. TE can quantify causal relationships within systems and efficiently identify the source and target variables. Hence, TE has received significant attention and is widely used not only in the information or physics field but also in fields such as neuroscience, electrical engineering, and chemical engineering. TE has been widely used to determine causal relationships between financial assets and markets in finance (Bossomaier et al., 2016). Specifically, general stock market indices, exchange rates, stock price, sector index, and cryptocurrency have been researched. In the 2000s, Marschinski and Kantz (2002) reported the causal relationship between the German DAX Xetra Stock Index (DAX) and Dow Jones Industrial Average. Kwon and Yang (2008) showed the directionality of the information transfer and found that the market indices influence individual stocks in the US stock market. After the 2000s, Dimpfl and Peter (2013) analyzed the causal relationship of the credit default swap market relative to the corporate bond market for the pricing of credit risk and the dynamic relation between market risk and credit risk proxied by the VIX and the iTraxx Europe from the perspective of pre-crisis, crisis, and post-crisis periods based on ETE.

    Moreover, Sandoval (2014) used ETE to examine the causal relationship among 197 worldwide financial companies. Sensoy et al. (2014) investigated the strength and direction of information flow between exchange rates and stock prices in several emerging countries using ETE. Based on TE, Bekiros et al. (2017) investigated the network dynamics in US equity and commodity markets, and Lim et al. (2017) analyzed the information flow between industrial sectors in credit default swaps and stock markets in the US from the aspects of intra- and inter-structures. Recently, Jang et al. (2019) studied the causal relationship among Bitcoin, gold, the S&P 500 index, and the US dollar using TE. Yue et al. (2020) analyzed information transfers between stock market sectors in China and compared the US and Chinese stock markets. These prior studies support our idea of using TE to measure causal relationships. Last, Choi and Kim (2021) used ETE to detect politically-themed stocks and construct politically-themed stock networks.

    Based on the concepts mentioned earlier related to entropy, conditional entropy quantifies the amount of information needed to describe the outcome of a random variable X given that the value of another random variable Y is known. Here, the conditional entropy of X given Y can be expressed as follows:

    $$H(X \mid Y) = -\sum_{x \in X,\, y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(y)}$$
    (6)

    Equation (6) can be interpreted as the remaining uncertainty about $X$ when $Y$ is known, or the expected number of bits needed to describe $X$ when $Y$ is known to both the encoder and decoder. Based on the above definition, we can define the general form of the $(k, l)$-history TE between two variables $X_t$ and $Y_t$ for $x_t^{(k)} = (x_t, \ldots, x_{t-k+1})$ and $y_t^{(l)} = (y_t, \ldots, y_{t-l+1})$. The general $(k, l)$-history TE can be expressed as follows:

    $$\begin{aligned} TE_{Y \to X}^{(k,l)}(t) &= H(X_{t+1} \mid X_t, \ldots, X_{t-k+1}) - H(X_{t+1} \mid X_t, \ldots, X_{t-k+1}, Y_t, \ldots, Y_{t-l+1}) \\ &= \sum_i p(x_{t+1}, x_t^{(k)}, y_t^{(l)}) \log_2 p(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}) - \sum_i p(x_{t+1}, x_t^{(k)}, y_t^{(l)}) \log_2 p(x_{t+1} \mid x_t^{(k)}) \\ &= \sum_i p(x_{t+1}, x_t^{(k)}, y_t^{(l)}) \log_2 \frac{p(x_{t+1} \mid x_t^{(k)}, y_t^{(l)})}{p(x_{t+1} \mid x_t^{(k)})}, \end{aligned}$$
    (7)

    where $i = \{x_{t+1}, x_t^{(k)}, y_t^{(l)}\}$. $TE_{Y \to X}^{(k,l)}(t)$ is non-negative, and we can drop the time dependency argument $t$ for stationary processes. $TE_{Y \to X}^{(k,l)}(t)$ is the information about the future state of $X$ that is obtained by subtracting the information retrieved from $X_t^{(k)}$ alone from the information gathered from $X_t^{(k)}$ and $Y_t^{(l)}$ together. Figure 4 shows the schematic representation of TE.

    In this study, we focused on the TE with lags $k = l = 1$, a common choice because it can be safely assumed under the weak form of the efficient market hypothesis and the random walk behavior of stock prices. Then, we can express the (1,1)-history TE as follows:

    $$TE_{Y \to X}^{(1,1)}(t) = \sum_i p(x_{t+1}, x_t, y_t) \log_2 \frac{p(x_{t+1} \mid x_t, y_t)}{p(x_{t+1} \mid x_t)} = \sum_i p(x_{t+1}, x_t, y_t) \log_2 \frac{p(x_{t+1}, x_t, y_t)\, p(x_t)}{p(x_{t+1}, x_t)\, p(x_t, y_t)},$$
    (8)

    where $i = \{x_{t+1}, x_t, y_t\}$.
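
    The (1,1)-history TE can be estimated from data by discretizing both series and counting joint occurrences, as in the following sketch. The equal-frequency binning into three symbols, the function names, and the synthetic series are assumptions for illustration. The effective TE shown here subtracts the average TE obtained after randomly shuffling the source series, which is the commonly used definition following Marschinski and Kantz (2002); it is assumed, not confirmed by this section, that the paper uses the same form.

```python
import numpy as np

def discretize(series, n_bins=3):
    """Map a series to integer symbols using equal-frequency (quantile) bins."""
    series = np.asarray(series, dtype=float)
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(series, edges)

def transfer_entropy(source, target, n_bins=3):
    """(1,1)-history TE from source Y to target X, in bits."""
    y = discretize(source, n_bins)
    x = discretize(target, n_bins)
    # Joint counts over (x_{t+1}, x_t, y_t).
    joint = np.zeros((n_bins, n_bins, n_bins))
    np.add.at(joint, (x[1:], x[:-1], y[:-1]), 1.0)
    p_xxy = joint / joint.sum()
    p_xx = p_xxy.sum(axis=2, keepdims=True)      # p(x_{t+1}, x_t)
    p_xy = p_xxy.sum(axis=0, keepdims=True)      # p(x_t, y_t)
    p_x = p_xxy.sum(axis=(0, 2), keepdims=True)  # p(x_t)
    mask = p_xxy > 0
    ratio = (p_xxy * p_x)[mask] / (p_xx * p_xy)[mask]
    return float(np.sum(p_xxy[mask] * np.log2(ratio)))

def effective_transfer_entropy(source, target, n_bins=3, n_shuffles=20, seed=0):
    """TE minus the mean TE over shuffled copies of the source (assumed ETE form)."""
    rng = np.random.default_rng(seed)
    te = transfer_entropy(source, target, n_bins)
    shuffled = [transfer_entropy(rng.permutation(source), target, n_bins)
                for _ in range(n_shuffles)]
    return te - float(np.mean(shuffled))

# Synthetic example: X follows Y with a one-step delay (hypothetical data).
rng = np.random.default_rng(1)
y = rng.normal(size=2000)
x = np.empty_like(y)
x[0] = rng.normal()
x[1:] = 0.8 * y[:-1] + 0.3 * rng.normal(size=1999)
ete_y_to_x = effective_transfer_entropy(y, x)
ete_x_to_y = effective_transfer_entropy(x, y)
```

    In this synthetic setup the information flows only from Y to X, so `ete_y_to_x` is clearly positive while `ete_x_to_y` stays near zero. Shuffling destroys the time ordering of the source, so the subtracted term estimates the small-sample bias of the plug-in TE estimate.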

    2.5 Complex Network Analysis

    Complex network analysis is usually used to describe a high degree of interdependence between objects. From this point of view, network analysis can be applied to financial systems in various ways. Most existing studies of network theory focused on analyzing correlation, financial stability, and contagion phenomena. Moreover, most financial network studies researched network effects rather than network formation (Allen and Babus, 2009). Recently, several papers have been published in new research areas, such as market analysis (Beije and Groenewegen, 1992; Namaki et al., 2011), social networks (Huang et al., 2009; Roy and Sarkar, 2011; Martin et al., 2011), investment decisions (Ojala and Hallikas, 2006; Lee et al., 2011), investment banking (Schnabel and Shin, 2004; Minoiu and Reyes, 2013; Gemici and Lai, 2019), and microfinance (Ohanyan, 2002; Fafchamps and Gubert, 2007; Tahmasebi and Askaribeazyeh, 2020).

    Many complex network analysis methods exist. Link analysis is a subset of network analysis, exploring associations between objects. An example may be examining the addresses of suspects and victims, the telephone numbers they have dialed and financial transactions that they have partaken in during a given timeframe, and the familial relationships between these subjects as a part of a police investigation. Link analysis here provides the crucial relationships and associations among many objects of different types that are not apparent from isolated pieces of information. Computer-assisted or fully automatic computer-based link analysis is increasingly employed by the following: banks and insurance agencies in fraud detection; telecommunication operators in telecommunication network analysis; medical sector in epidemiology and pharmacology; law enforcement investigations; search engines for relevance rating (and conversely by the spammers for spamdexing and by business owners for search engine optimization); and everywhere else, where relationships between many objects have to be analyzed. Links are also derived from the similarity of time behavior in both nodes.

    Information about the relative importance of nodes and edges in a graph can be obtained through centrality measures, widely used in disciplines such as sociology. For example, eigenvector centrality uses the eigenvectors of the adjacency matrix corresponding to a network to determine nodes that tend to be frequently visited. Formally established centrality measures are degree centrality (DC), closeness centrality, betweenness centrality, eigenvector centrality, subgraph centrality, Katz centrality, and weblink centrality measures. The purpose or objective of analysis generally determines the type of centrality measure to be used. For example, if one is interested in network dynamics or the robustness of a network to node/link removal, a node’s dynamical importance is often the most relevant centrality measure.

    Based on the above information about complex network analysis, we used minimum spanning trees and weighted directed networks to illustrate politically-themed stock networks based on ETE. Then, we analyzed politically-themed stock networks at the network and node levels and confirmed their network dynamics in real-world situations.

    2.5.1 Network-level Network Analysis

    We used network density (ND), ETE, and the frequency distribution of ETE at the network level to analyze and summarize network dynamics. These network-level measures have been generally used to analyze stock markets and can effectively illustrate the states of politically themed stock networks.

    2.5.1.1 ND

    ND is defined as the ratio of the number of edges K to the number of possible connections in a network with N nodes. The idea of ND comes from the binomial coefficient. For a directed graph, ND is given by:

    $$\mathrm{ND} = \frac{K}{N(N-1)}$$
    (9)
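
    As a minimal sketch of Eq. (9), the snippet below counts the directed edges K in a 0/1 adjacency matrix and divides by N(N−1). The function name `network_density` and the three-stock example are hypothetical illustrations, not data from this study.

```python
def network_density(adjacency):
    """Return ND = K / (N * (N - 1)) for a directed graph without self-loops."""
    n = len(adjacency)
    if n < 2:
        raise ValueError("need at least two nodes")
    # K counts directed edges; diagonal entries (self-loops) are ignored.
    k = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and adjacency[i][j] != 0
    )
    return k / (n * (n - 1))


# Hypothetical 3-stock network with edges 0->1, 0->2, 1->2, 2->0.
example_adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
density = network_density(example_adjacency)  # 4 of the 6 possible edges
```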

    2.5.2 Node-level Network Analysis

    There are many kinds of node-level measures to analyze networks. Among various node-level measures, the concept of centrality is important in node-level network analysis. Centrality measures also have many types, and we focused on centrality measures that indicate direct influences of politically themed stocks. Hence, we considered centrality measurements considering the intuitive influence of individual nodes. In other words, DC, node strength (NS), and PageRank (PR) are the methods used to analyze politically-themed stock networks based on ETE (Liao et al., 2017). We used the normalized versions of node-level measures to compare the results of other politically themed stock networks.

    2.5.2.1 DC

    In unweighted undirected networks, DC represents the total number of edges through which a node is connected with other nodes. However, nodes of weighted directed networks have ingoing edges, outgoing edges, or both. Generally, a node with many edges going out to other nodes is called a hub, and a node with many other nodes pointing at it is called an authority. Therefore, DC is split into two measures in weighted directed networks: in-degree (DCin) and out-degree (DCout). In this study, we divided the two original DC measures by N−1 so that they can be compared with other politically themed stock networks on the same scale. DCin and DCout of the j-th stock are defined in Eqs. (10) and (11), respectively.

    $$\mathrm{DC}_{in}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} a_{ij}$$
    (10)

    $$\mathrm{DC}_{out}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} a_{ji}$$
    (11)

    In Eqs. (10) and (11), $a_{ij}$ are the elements of the adjacency matrix $A \in \mathbb{R}^{N \times N}$. For the adjacency matrix A, if a link exists from stock i to stock j, then $a_{ij} = 1$; otherwise, $a_{ij} = 0$. In addition, $a_{ii} = 0$ for all $1 \le i \le N$.
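
    The normalized in-degree and out-degree of Eqs. (10) and (11) can be sketched as column and row sums of the adjacency matrix. The function name and the example matrix below are hypothetical; the convention that entry [i][j] equals 1 for a link from stock i to stock j follows the definition above.

```python
def degree_centralities(adjacency):
    """Return (DC_in, DC_out), each normalized by N - 1."""
    n = len(adjacency)
    scale = 1.0 / (n - 1)
    # In-degree of stock j: links i -> j, i.e. the sum over column j.
    dc_in = [scale * sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Out-degree of stock j: links j -> i, i.e. the sum over row j.
    dc_out = [scale * sum(adjacency[j][i] for i in range(n)) for j in range(n)]
    return dc_in, dc_out


# Hypothetical network with edges 0->1, 0->2, 1->2, 2->0.
example_adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
dc_in, dc_out = degree_centralities(example_adjacency)
```

    In this toy network, stock 0 has the largest out-degree (a hub-like node) and stock 2 the largest in-degree (an authority-like node).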

    2.5.2.2 NS

    In general, NS is the sum of the weights of the links connected to a node. In weighted directed networks, the in-strength is the sum of inward link weights, and the out-strength is the sum of outward link weights. NS represents the influence features in politically themed stock networks and, similar to DC, is split into two measures: in-strength (NSin) and out-strength (NSout). We also divided these measures by N−1 for comparison with other politically themed stock networks. NSin and NSout of the j-th stock can be expressed as Eqs. (12) and (13), respectively.

    $$\mathrm{NS}_{in}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} w_{ij}$$
    (12)

    $$\mathrm{NS}_{out}^{j} = \frac{1}{N-1}\sum_{i=1}^{N} w_{ji}$$
    (13)

    In Eqs. (12) and (13), $w_{ij}$ are the elements of the weight matrix $W \in \mathbb{R}^{N \times N}$ ($1 \le i \le N$, $1 \le j \le N$), where $w_{ii} = 0$ for all $1 \le i \le N$.
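
    Eqs. (12) and (13) differ from the degree measures only in that link weights replace the 0/1 adjacency entries. The sketch below assumes a hypothetical weight matrix whose entry [i][j] is the weight of the link from stock i to stock j (for example, an ETE value); the numbers are illustrative only.

```python
def node_strengths(weights):
    """Return (NS_in, NS_out), each normalized by N - 1."""
    n = len(weights)
    scale = 1.0 / (n - 1)
    # In-strength of stock j: total weight of links pointing at j.
    ns_in = [scale * sum(weights[i][j] for i in range(n)) for j in range(n)]
    # Out-strength of stock j: total weight of links leaving j.
    ns_out = [scale * sum(weights[j][i] for i in range(n)) for j in range(n)]
    return ns_in, ns_out


# Hypothetical link weights; the diagonal is zero by definition.
example_weights = [
    [0.0, 0.2, 0.1],
    [0.0, 0.0, 0.3],
    [0.4, 0.0, 0.0],
]
ns_in, ns_out = node_strengths(example_weights)
```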

    2.5.2.3 PR

    PR (Page et al., 1999) is an algorithm used by Google Search to rank web pages in its search engine results. PR is one of the best-known eigenvector-based metrics that consider all network paths to determine node importance, and it addresses disadvantages of similar metrics, such as the eigenvector centrality and the Katz centrality. PR defines a link analysis method for a directed network to evaluate a user’s influence, taking into account both the immediate information flow and the subsequent information flow. PR has recently been used in finance as a systematic measure (Kuzubaş et al., 2014; Yun et al., 2019) and in stock market analysis (Tu, 2014; Tang et al., 2019). Suppose there are N stocks and $A \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix of the politically themed stock network. In mathematical terms, the PR values of politically themed stocks form the column vector $r \in \mathbb{R}^{N}$ (Higham, 2005):

    $$r = (1-\alpha)\left(I - \alpha A^{T} D^{-1}\right)^{-1}\mathbf{1}$$
    (14)

    In Eq. (14), $I \in \mathbb{R}^{N \times N}$ is the identity matrix, $\mathbf{1} \in \mathbb{R}^{N}$ is the column vector whose elements are all one, and $D \in \mathbb{R}^{N \times N}$ is $\mathrm{diag}(\deg_i)$ with $\deg_i = \max(K_{out}^{i}, 1)$, where $K_{out}^{i}$ is the number of outgoing edges from node i. α is the damping factor, which ranges between zero and one. We set α=0.85, which is the conventional value from the original study (Page et al., 1999).
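
    Eq. (14) is the solution of the linear system r = α Aᵀ D⁻¹ r + (1 − α) 1. As a minimal sketch, the snippet below solves that system by fixed-point iteration instead of an explicit matrix inverse; this is a swapped-in numerical approach chosen to stay within the Python standard library, not necessarily how the authors computed PR. The function name, tolerance, and example network are assumptions.

```python
def pagerank(adjacency, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Iterate r <- alpha * A^T D^{-1} r + (1 - alpha) * 1 until convergence."""
    n = len(adjacency)
    # deg_i = max(out-degree of node i, 1), as in the definition of D.
    deg = [max(sum(row), 1) for row in adjacency]
    r = [1.0] * n
    for _ in range(max_iter):
        new_r = [
            (1 - alpha)
            + alpha * sum(adjacency[i][j] * r[i] / deg[i] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical network with edges 0->1, 0->2, 1->2, 2->0.
example_adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ranks = pagerank(example_adjacency)
```

    Because α < 1, the iteration is a contraction and converges to the same vector as the closed form. In this toy network, stock 2 receives links from both other stocks and obtains the largest PR value.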

    3. RESULTS

    3.1 Descriptive Statistics

    The daily closing price data for the 20 candidate stocks within the above period were collected, and returns were calculated. In this study, log returns were used because, unlike simple returns, positive and negative log returns are symmetrical and log returns are easy to aggregate over time. Table 2 presents the descriptive statistics for the abnormal returns.
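
    The symmetry and additivity of log returns can be seen in a short worked example. The price path below is hypothetical: a price that rises from 100 to 110 and falls back to 100 has simple returns that do not cancel, whereas its log returns sum to zero.

```python
import math

# Hypothetical closing prices: up 10, then back down to the start.
prices = [100.0, 110.0, 100.0]

simple_returns = [prices[t] / prices[t - 1] - 1 for t in range(1, len(prices))]
log_returns = [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]

# Log returns aggregate by summation: the total equals log(last / first).
total_log_return = sum(log_returns)
total_from_endpoints = math.log(prices[-1] / prices[0])
```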

    As a result, none of the return distributions satisfied normality, making it reasonable to use TE, an entropy-based causality measure.

    3.2 Calculating the Sentiment Index of Mask-Themed Stocks

    3.2.1 Pre-processing Text

    Text preprocessing is a crucial task in NLP; it refers to preparing text according to the purpose of the analysis. This study used text preprocessing techniques, such as cleaning, normalization, tokenization, stopword removal, stemming, lemmatization, and keyword extraction, to transform the collected text data into forms suitable for analysis.

    We carried out text preprocessing using the Python “nltk” and “KoNLPy” libraries to apply the above techniques to the collected Korean text data.

    We first normalized the text, merging identical and similar vocabulary as part of the cleaning process and eliminating noise, such as typos, from the corpus of collected text data. Then, the normalized text data were tokenized, and meaningless word tokens were removed from the generated tokens. In addition, stemming and lemmatization were applied to the stopword-filtered tokens to convert the text into a form suitable for topic modeling with LDA and for sentiment analysis.

    3.2.2 Selecting the Model

    This study hypothesized that the mention volume and the sentiment analysis results could affect the abnormal return. As shown earlier, Eq. (5) is built from the probability that sentiment analysis judges a text as positive and the probability that it judges the text as negative. The sentiment score ranges from −100% to 100%: a score of 100% indicates that the text on that date has a 100% average probability of being positive, and a score of −100% indicates that the text on that date has a 100% average probability of being negative. The index is less sensitive to neutral text because the score converges to zero when the text is close to pure information transfer or when objective sentences make up a high proportion.

    To calculate the sentiment, we first labeled the training data set with sentiment scores. The training set consisted of 126,703 text data randomly collected using Big Kinds. To score these unlabeled data, we gave an article +1 if the sum of the lexicon scores in the article was positive and −1 if it was negative. Then, we conducted sentiment analysis using four transformer-based models on the 126,703 randomly collected text data from the political section.
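
    The lexicon-based labeling rule can be sketched as follows. The lexicon entries and tokens are hypothetical placeholders, and returning `None` when the lexicon sum is exactly zero is an assumption, since the text does not state how ties were handled.

```python
def lexicon_label(tokens, lexicon):
    """Label an article +1 or -1 from the sign of its summed lexicon scores."""
    total = sum(lexicon.get(token, 0) for token in tokens)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return None  # assumption: a zero sum leaves the article unlabeled


# Hypothetical polarity lexicon and tokenized articles.
toy_lexicon = {"surge": 1, "growth": 1, "shortage": -1, "loss": -2}
positive_article = ["mask", "demand", "surge", "growth"]
negative_article = ["mask", "shortage", "loss", "growth"]
neutral_article = ["mask", "report"]

labels = [
    lexicon_label(positive_article, toy_lexicon),
    lexicon_label(negative_article, toy_lexicon),
    lexicon_label(neutral_article, toy_lexicon),
]
```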

    As mentioned above, we fine-tuned those transformer-based models. As a result, ELECTRA achieved the highest accuracy and macro F1 score on the test sets. Therefore, we selected ELECTRA as the model for conducting sentiment analysis. Table 3 compares the four fine-tuned transformer language models. The selected model’s accuracy and macro F1 score after 100 epochs were 88.31% and 82.93%, respectively. We used early stopping to prevent overfitting.

    3.2.3 Topic Modeling Results

    The textual data consisted of articles, editorials, and news scripts containing mask keywords from December 1, 2019, to August 26, 2020. A total of 15,337 text data were collected. Of these, 240 text data that were highly associated were removed first; the remaining text data were preprocessed, and then topic modeling was carried out. Based on the perplexity analysis used to determine the optimal number of topics, the text data were classified into three topics. Table 4 shows the top 10 most frequent keywords for each topic. Table 4 also shows that Topics 2 and 3 have several keywords related to the supply and demand of masks, including production, demand, thematic stocks, stock prices, and increase. In contrast, Topic 1 appears to contain text related to infections caused by non-compliance with prevention rules or a lack of masks. Using the method of Figure 2, we calculated the polarity score and excluded the text data in the topic with low average polarity scores. Most of the text data in that topic were not suitable for our research purposes, as they mainly reported confirmed cases and infection routes. Therefore, sentiment analysis was performed on the 13,389 text data that remained after the 1,948 text data belonging to Topic 1 were removed.

    3.2.4 Calculating TSI and Abnormal Returns of Candidate Stocks

    For screening candidate stocks through causal analysis, the daily search volume and sentiment scores were multiplied to produce the TSI, as in many previous studies. Figure 5 shows the final computed TSI and its fluctuation. Notably, the fluctuation is considerable between January and March 2020, when demand for masks surged and the so-called “mask crisis” occurred because of COVID-19.
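
    The construction of the TSI as a day-by-day product can be sketched as below. The search volumes and sentiment scores are hypothetical values, with sentiment expressed as a fraction between −1 and 1; the scaling of either series in the actual study is not reproduced here.

```python
def theme_sentiment_index(search_volume, sentiment_score):
    """Multiply daily search volume by the daily sentiment score."""
    if len(search_volume) != len(sentiment_score):
        raise ValueError("both series must cover the same dates")
    return [v * s for v, s in zip(search_volume, sentiment_score)]


# Hypothetical three-day sample.
daily_search_volume = [120, 300, 80]
daily_sentiment = [0.25, -0.5, 0.0]
tsi = theme_sentiment_index(daily_search_volume, daily_sentiment)
```

    A day with heavy search interest and negative sentiment produces a large negative TSI, while a day with neutral sentiment contributes zero regardless of search volume.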

    The number of bins of the histogram used to calculate the TE for the listed stocks is determined as follows (Hacine-Gharbi and Ravier, 2018):

    IEMS-21-2-244_EQ15.gif
    (15)

    where IEMS-21-2-244_I1.gif is the empirical estimated correlation coefficient between X and Y.

    The alpha and beta values calculated via the market model are reported in Table 5.

    3.2.5 Selecting Mask-Themed Stocks

    The lag between the calculated TSI and the abnormal returns of the stocks in the candidate group was set to 1 day, and the ETE values, p-values, and existence of causality were then verified. Table 6 summarizes the TE values and p-values from the TSI to the abnormal return of each individual candidate stock. For p-values, “***” for less than 0.001, “**” for less than 0.01, and “*” for less than 0.05 are marked on the right side of the p-value. In this study, a significance level of 0.05 was set, based on which candidate stocks were included in the thematic stock network when the TSI was significantly causal to the candidate stock price. Candidate stocks were excluded from the thematic stock network if the p-value was greater than or equal to 0.05.
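
    The star notation and the inclusion rule described above can be sketched in a few lines. The p-values below are hypothetical and do not correspond to any stock in Table 6.

```python
def significance_stars(p_value):
    """Return the marker placed to the right of a p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


def select_thematic_stocks(p_values, alpha=0.05):
    """Keep stocks with p < alpha; p >= alpha excludes the stock."""
    return [name for name, p in p_values.items() if p < alpha]


# Hypothetical p-values for three de-identified candidates.
hypothetical_p = {"X1": 0.0004, "X2": 0.03, "X3": 0.21}
selected = select_thematic_stocks(hypothetical_p)
stars = {name: significance_stars(p) for name, p in hypothetical_p.items()}
```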

    The causal analysis from the TSI to the candidate stocks showed that 19 of the 20 candidate stocks had significant causal relationships at the 0.05 level.

    Candidate stock S, which did not have a significant causal relationship at the 0.05 level, was excluded from the thematic stock network configuration. Stock S is a fashion mask company, which is not related to medical masks. In other words, the nonlinear causal relationship from the TSI to the candidate stocks can also be interpreted qualitatively in this case.

    3.3 Constructing and Analyzing the Mask-Themed Stock Network

    3.3.1 Constructing the Mask-Themed Stock Network of the Research Period

    Each causal relationship is measured among the 19 mask-themed stocks selected based on the causal relationship with the TSI. The resulting thematic stock networks are shown in Figures 6–8. Figure 6 is based on the significant ETE values when α = 0.1, Figure 7 when α = 0.05, and Figure 8 when α = 0.01. In Figures 6–8, a thicker and darker connection line indicates a stronger causal relationship. In addition, a larger circle means that a thematic stock has a larger degree value. We used the Kamada–Kawai (Kamada and Kawai, 1989) path-length cost function for positioning nodes. The ND values of these three networks are 61.70% (α = 0.1), 61.40% (α = 0.05), and 40.06% (α = 0.01), respectively. At the 0.05 significance level, the mask-themed stock network is dense, with over 50% of possible connections present. In the next section, we present the network analysis results for the mask-themed stock network configured at the significance level α = 0.05.

    3.3.2 Network Analysis Results

    We analyzed the network based on the ETE-based causal relationships statistically significant at the significance level α=0.05 .

    From Figure 9, in the case of out-degrees, stocks J and K exceeded 0.8, followed by stocks C, H, I, L, and R, which were close to 0.8. In-degrees were largest in the order of H, K, I, R, O, and J. Similar results can be confirmed for the out-strength, in-strength, and PR values.

    Generally, a node that has connections with many other nodes going out from it is called a hub, and a node with many other nodes pointing at it is called an authority.

    From these measures, stocks I and K have very high values of 0.4 or higher in nearly all network measures, and those values are statistically significant at the significance level α=0.01. Moreover, the network measures show very strong positive correlations with the ETE values of the mask-themed stocks. These results suggest that stocks with strong causal relationships from the TSI play the roles of hub and authority within the mask-themed stocks.

    Moreover, if a thematic stock’s connection to its theme is deeper than that of other stocks, then the stock is generally called a leading stock in Korea. These leading stocks are frequently used in Korea when financial regulatory authorities look for signals of thematic stocks’ abnormal movements (Choi and Kim, 2021). From this perspective, the stocks with high network measure values can be regarded as the leading stocks in this case.

    3.3.3 Analyzing the Longitudinal Changes of Thematic Stock Network

    In particular, during the COVID-19 outbreak of Period 4, the thematic stock network was denser than in the other periods. These results confirm that the intensity of the causal relationships within the mask-themed stock network strengthened as interest in mask-themed stocks increased. This finding can be regarded as quantitative evidence supporting the monitoring of thematic stocks in that period.

    In addition, the change of thematic stock network can be used to analyze the particular stock. For example, company R is mistakenly known as a mask-themed stock because of misinformation that they mass-produce masks, and such false information was delivered to shareholders. However, as of February 24, 2020, company R officially announced that they were far from mask-themed stocks and did not mass-produce masks.

    As a result, all network measures of company R decreased, and the overall ND also decreased. These results indicate that company R’s position within the mask-themed stock network narrowed, which contributed to the decrease in the overall ND. Moreover, in the Kamada–Kawai layout, a node placed closer to the center is more critical to the network. From this perspective, stock R moved to the edge of the network because its importance decreased. Furthermore, only stocks D and R showed an overall decline across the network measures.

    These results suggest that the thematic characteristics of a stock can be quantitatively confirmed by tracking changes within the thematic stock network.

    3.3.4 Investment Strategy based on the theme sentiment index (TSI) and Mask-Themed Stock Network

    Based on previous studies about thematic stocks and the results of the previous sections, we obtained the insight that thematic stocks show dramatic price changes with a positive CAR effect when the theme keyword attracts attention. Based on these effects, which are known in advance, we developed an investment strategy using the TSI and mask-themed stock networks as an asset class (Kim et al., 2014) to verify whether it can outperform the benchmarks KOSPI and KOSDAQ, which are the major market indices of the Korean stock market. The strategy consists of the following steps:

    • Step 1: Check the daily ROC of TSI.

    • Step 2: Apply anomaly detection methods to the ROC of TSI.

    • Step 3: Optimize the portfolio with the stocks, invest at the anomaly periods based on the optimization result, and check the performance.

    To apply this strategy, we assumed a daily rebalancing period, and the stock weights were obtained from the solution of a portfolio optimization problem that maximizes the Sharpe ratio. Maximizing the Sharpe ratio is one of the most commonly used portfolio optimization methodologies. The Sharpe ratio has the advantages that it considers return and risk simultaneously in the objective function and that it can be computed directly from any observed time series of returns without additional information about the stocks. The portfolio optimization problem that maximizes the Sharpe ratio can be defined as follows:

    $$\max_{w}\ \frac{w^{T}(\mu - r_f)}{\sqrt{w^{T}\Sigma w}} \quad \text{subject to} \quad \mathbf{1}^{T}w = 1,\ w \ge 0$$
    (16)

    where $w \in \mathbb{R}^{N}$ is the column vector of asset weights, $\mu \in \mathbb{R}^{N}$ is the mean return vector, and $\Sigma \in \mathbb{R}^{N \times N}$ is the covariance matrix. The column vectors $r_f \in \mathbb{R}^{N}$ and $\mathbf{1} \in \mathbb{R}^{N}$ have all elements equal to the risk-free rate and to one, respectively. The first constraint means that the portfolio weights sum to one, and the second implies that short selling, which is hard for individual investors to carry out, is not allowed.
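
    To make the objective and constraints of Eq. (16) concrete, the sketch below evaluates the Sharpe ratio for long-only weights that sum to one and picks the best point on a coarse grid for three assets. The brute-force grid search is a swapped-in illustration and is not the reformulation the authors solved; the mean returns, covariance matrix, and risk-free rate are hypothetical daily figures.

```python
import math


def sharpe_ratio(w, mu, cov, rf):
    """Excess portfolio return divided by portfolio standard deviation."""
    n = len(w)
    excess = sum(w[i] * (mu[i] - rf) for i in range(n))
    variance = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    return excess / math.sqrt(variance)


def grid_search_max_sharpe(mu, cov, rf, steps=20):
    """Search three-asset weights on a grid with w >= 0 and sum(w) = 1."""
    best_w, best_s = None, -math.inf
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            c = steps - a - b  # the remaining weight keeps the sum at one
            w = [a / steps, b / steps, c / steps]
            s = sharpe_ratio(w, mu, cov, rf)
            if s > best_s:
                best_w, best_s = w, s
    return best_w, best_s


# Hypothetical inputs (not estimates from this study).
mu_example = [0.0010, 0.0006, 0.0008]
cov_example = [
    [4.00e-4, 1.00e-4, 0.0],
    [1.00e-4, 2.25e-4, 5.0e-5],
    [0.0, 5.0e-5, 3.00e-4],
]
rf_example = 0.00005
best_w, best_s = grid_search_max_sharpe(mu_example, cov_example, rf_example)
```

    The grid scales poorly with the number of assets, which is one reason a convex reformulation is preferable in practice.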

    We used the stock return data from the previous 20 trading days to calculate the mean return vector and the covariance matrix. The yield of the South Korea 10-year government bond was used as the risk-free rate (Goedhart et al., 2015). We also considered transaction costs, which are critical in a daily rebalancing portfolio optimization problem, to avoid overestimating the performance. In other words, we assumed that we sell all the stocks held in the previous period and buy new stocks whenever we rebalance the portfolio. This assumption was applied even when the selected politically themed stock network was the same for several days in a row. Therefore, the cumulative return data of the portfolio are a lower bound of the portfolio’s performance. We also optimized the portfolio over all politically themed stocks without considering which TSI and network they belong to. Additionally, we used a reformulated version of the problem to reduce computation time and make the objective function convex (Iyengar and Kang, 2005).
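
    The 20-trading-day estimation window can be sketched as follows. The function uses only the days strictly before the rebalancing day t; the sample covariance with an n − 1 denominator is an assumption, and the return series is synthetic.

```python
def window_estimates(returns, t, window=20):
    """Mean vector and sample covariance from days t-window .. t-1."""
    if t < window:
        raise ValueError("not enough history before day t")
    sample = returns[t - window:t]
    n_assets = len(sample[0])
    mu = [sum(day[k] for day in sample) / window for k in range(n_assets)]
    cov = [
        [
            sum((day[a] - mu[a]) * (day[b] - mu[b]) for day in sample)
            / (window - 1)  # assumption: unbiased sample covariance
            for b in range(n_assets)
        ]
        for a in range(n_assets)
    ]
    return mu, cov


# Synthetic two-asset daily returns for 25 days (illustrative only).
daily_returns = [
    [0.001 * ((d % 5) - 2), 0.002 * ((d % 3) - 1)] for d in range(25)
]
mu_hat, cov_hat = window_estimates(daily_returns, t=20)
```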

    To make the selection of anomaly time points robust, we selected the anomaly periods using the ensemble hard-voting results of the following 13 outlier detection methods: angle-based outlier detection method, clustering-based local outlier method, connectivity-based local outlier method, isolation forest method, histogram-based outlier detection method, k-nearest neighbors detector method, local outlier factor method, one-class support vector machine detector method, principal component analysis method, minimum covariance determinant method, subspace outlier detection method, deviation-based outlier detection method, and copula-based outlier detection method. We first ran each outlier detection method and scored each date as 0 (normal) or 1 (outlier). To reflect realistic conditions, the anomalies at each time point were calculated using only the preceding periods. Then, we summed the 13 outlier detection results into a vote count and labeled a date as an outlier if the count was seven or higher. Figure 13 shows the results of this method.
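
    The hard-voting rule can be sketched independently of the individual detectors. The snippet below assumes each of the 13 detectors has already produced a 0/1 flag per date; the flags are hypothetical, and the detectors themselves are not implemented here.

```python
def hard_vote(flags_by_detector, min_votes=7):
    """Label a date 1 (outlier) when at least min_votes detectors flag it."""
    n_dates = len(flags_by_detector[0])
    votes = [
        sum(detector_flags[t] for detector_flags in flags_by_detector)
        for t in range(n_dates)
    ]
    labels = [1 if v >= min_votes else 0 for v in votes]
    return votes, labels


# Hypothetical flags from 13 detectors over three dates:
# date 0 is flagged by 7 detectors, date 1 by 6, and date 2 by all 13.
detector_flags = [
    [1 if d < 7 else 0, 1 if d < 6 else 0, 1] for d in range(13)
]
votes, ensemble_labels = hard_vote(detector_flags)
```

    With 13 voters and a threshold of seven, a date is labeled an outlier exactly when a strict majority of detectors agree.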

    The portfolio built with our investment strategy obtained higher returns than both market indices before and around the outlier time points with 20-day windows, as shown in Figure 14. This result confirms that the benchmarks KOSPI and KOSDAQ could be outperformed by short-term investments made at outlier time points. In particular, during the mask shortage period around March 2020, when outliers appeared frequently, the 20-day cumulative return of our investment strategy increased considerably. After the first 20 business days, the return of the mask-themed stock portfolio was 84.5%. This return is approximately four times that of KOSPI (23.3%) and three times that of KOSDAQ (26.4%). These results support existing studies reporting that a positive excess return may occur when a specific theme keyword draws attention. However, if we hold an optimized portfolio of mask-themed stocks for 20 business days from a random point in time within the research period, only 36.8% of the periods generate higher returns than the two indices.

    Moreover, approximately half of the periods (48.28%) produced worse investment results than the two indices, as shown in Figure 15. In Figure 15, the green and yellow background colors indicate that our investment strategy was better than both market indices and only one market index, respectively. The red background indicates that our investment strategy was worse than both market indices. This result is consistent with previous studies reporting that investing in thematic stocks carries risks that can generate losses (Financial Services Commission, 2017; Financial Supervisory Service, 2017; Nam, 2017; Nam, 2020; Choi and Kim, 2021).
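
    The color coding of Figure 15 amounts to counting how many of the two indices the strategy beats in each 20-day window. The sketch below uses the first-window returns reported above (84.5%, 23.3%, 26.4%) as one example; the remaining windows are hypothetical, and treating a tie as not beating an index is an assumption.

```python
def classify_window(strategy_ret, kospi_ret, kosdaq_ret):
    """Return 'green' (beats both), 'yellow' (beats one), or 'red' (beats none)."""
    beaten = int(strategy_ret > kospi_ret) + int(strategy_ret > kosdaq_ret)
    return {2: "green", 1: "yellow", 0: "red"}[beaten]


def share_by_color(windows):
    """Fraction of windows in each color class."""
    counts = {"green": 0, "yellow": 0, "red": 0}
    for strategy_ret, kospi_ret, kosdaq_ret in windows:
        counts[classify_window(strategy_ret, kospi_ret, kosdaq_ret)] += 1
    return {color: n / len(windows) for color, n in counts.items()}


# (strategy, KOSPI, KOSDAQ) 20-day returns; only the first row is from the text.
example_windows = [
    (0.845, 0.233, 0.264),
    (0.02, 0.05, 0.01),
    (-0.03, 0.01, 0.02),
    (0.00, 0.04, 0.06),
]
shares = share_by_color(example_windows)
```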

    4. CONCLUSION

    This study presented a novel analysis method for thematic stocks using natural language processing techniques and network analysis. We tried to demonstrate what previous studies analyzed empirically: that profits are generated at theme-related times or events and that, in reality, the risk of thematic stock investments should be considered. We set the configuration and utilization of thematic stock networks as key objectives. Mask-themed stocks were selected as the subject for forming networks; they first attracted attention owing to yellow dust and fine dust and were re-illuminated as the demand for masks increased rapidly after the COVID-19 outbreak. For this study, 20 candidate stocks (10 listed KOSPI stocks and 10 listed KOSDAQ stocks) were selected to form a thematic stock network, considering how frequently the stocks were classified as mask-themed stocks by the top 10 securities firms by sales. As the purpose of this study is not to encourage the purchase or sale of the stock of any particular company, all candidate stocks are de-identified with alphabetical labels.

    For screening candidate stocks through causal analysis, the daily search volume and sentiment scores were multiplied to produce the theme sentiment index (TSI), with the sentiment scores calculated using natural language processing techniques such as sentiment analysis and topic modeling. As a result, 19 of the 20 candidate stocks had significant causal relationships from the TSI at the 0.05 level. This result means that most of the stocks actually known as theme stocks can be identified by quantified indices such as the TSI, which are built from information that investors can obtain.

    Also, we conducted an experiment using a portfolio optimization strategy based on signals from the TSI’s outliers. The result shows that some profit can be obtained when theme-related interest, quantified by the keyword’s search volume, rises. Overall, however, the experiment showed that investing in thematic stocks carries risks that can generate losses, as reported in previous studies.

    The limitations of this study can be summarized as follows. First, although the actual stock price is affected by several factors, the experiment was conducted under the assumption that the stock price is affected only by external factors, excluding internal factors such as financial status. Second, further research on consistency and availability is needed because the network is constructed by statistical techniques, such as TE, that do not consider the business linkages or financial relationships between real companies. Finally, when initially establishing the group of candidate stocks for the thematic stock composition, the judgment of the experimenters may be involved. As a result, distortion or subjectivity may occur, and candidate stocks may be missed even when their causality is significant.

    Moreover, the TSI may contain errors because the accuracy and F1 score obtained in training were not perfect. Therefore, follow-up studies need to address these limitations and adequately decompose the internal and external factors of stock prices. Follow-up studies could also further validate the consistency and availability of these thematic stock networks against the findings of this study.

    In terms of contribution, this study’s thematic stock network has the advantage of being schematized, which makes it easy to check the causal relationships among thematic stocks. Furthermore, once the limitations mentioned above are addressed, the schematized thematic stock network may be used as a methodology to assist the detection of abnormal transactions in the RegTech and SupTech sectors. In detail, the need to detect abnormal transactions to prevent unfair trade and to improve financial market efficiency through FinTech and RegTech is being highlighted in Korea. In particular, existing methods for detecting abnormal transactions in themed stocks are mainly empirical, relying on financial statement analysis, technical analysis of stock price charts, and psychological analysis. The thematic stock network developed in this study and the experimental results using the statistical techniques of text mining and TE confirmed that this process could detect abnormal transactions through network-related indicators. Therefore, this method may be used as an anomaly transaction detection method for verifying thematic stocks in the RegTech and SupTech areas.

    Figures

    Figure 1. Intuition behind LDA (Blei et al., 2003).

    Figure 2. Selecting text data for performing sentiment analysis.

    Figure 3. Model architecture of the transformer (Vaswani et al., 2017).

    Figure 4. Schematic representation of transfer entropy.

    Figure 5. Calculated TSI (rate of change).

    Figure 6. Mask-themed stock network of the whole research period (α = 0.1).

    Figure 7. Mask-themed stock network of the whole research period (α = 0.05).

    Figure 8. Mask-themed stock network of the whole research period (α = 0.01).

    Figure 9. Network measures of the mask-themed stock network.

    Figure 10. Network change of mask-themed stock network.

    Figure 11. Network change before and after information correction of company R.

    Figure 12. ROC of TSI and anomalies.

    Figure 13. Cumulative return chart of the investment strategy based on TSI and mask-themed stocks and market indices in the whole research period.

    Figure 14. Twenty business day-long investment when anomalies occur.

    Figure 15. Twenty business day-long investment strategy results.

    Tables

    Table 1. Summary of previous studies about thematic stocks

    Table 2. Descriptive statistics for abnormal returns of mask-themed candidate stocks

    Table 3. Comparison of fine-tuned four transformer language models

    Table 4. Most frequent keywords and distribution of raw text data

    Table 5. Calculated alpha and beta of mask-themed candidate stocks

    Table 6. TSI and the TE values and p-values for individual candidate stocks

    Table 7. Correlation values between ETE from and network analysis results

    Table 8. Network density values of each period

    Table 9. Change of network measures before and after information correction of company R

    Table 10. Cumulative return and standard deviation results of the investment strategy based on TSI and mask-themed stocks and market indices in the whole research period

    References

    1. Allen, F. and Babus, A. (2009), Networks in Finance, The Network Challenge: Strategy, Profit, and Risk in an Interlinked World, In: P. R. Kleindorfer, Y. Wind, and R. E. Gunther (eds.), Wharton School Publishing.
    2. Antweiler, W. and Frank, M. Z. (2004), Is all that talk just noise? The information content of internet stock message boards, The Journal of Finance, 59(3), 1259-1294.
    3. Beije, P. R. and Groenewegen, J. (1992), A network analysis of markets, Journal of Economic Issues, 26(1), 87-114.
    4. Bekiros, S. , Nguyen, D. K. , Junior, L. S. , and Uddin, G. S. (2017), Information diffusion, cluster formation and entropy-based network dynamics in equity and commodity markets, European Journal of Operational Research, 256(3), 945-961.
    5. Blei, D. M. , Ng, A. Y. , and Jordan, M. I. (2003), Latent dirichlet allocation, Journal of Machine Learning Research, 3(1), 993-1022.
    6. Bossomaier, T. , Barnett, L. , Harré, M. , and Lizier, J. T. (2016), Transfer Entropy, An Introduction to Transfer Entropy, 65-95.
    7. Brown, S. J. and Warner, J. B. (1980), Measuring security price performance, Journal of Financial Economics, 8(3), 205-258.
    8. Brown, S. J. and Warner, J. B. (1985), Using daily stock returns: The case of event studies, Journal of Financial Economics, 14(1), 3-31.
    9. Chan, S. W. K. and Chong, M. W. C. (2017), Sentiment analysis in financial text data, Decision Support Systems, 94, 53-64.
    10. Checkley, M. S. , Higón, D. A. , and Alles, H. (2017), The hasty wisdom of the mob: How market sentiment predicts stock market behavior, Expert Systems with Applications, 77, 256-263.
    11. Choi, I. and Kim, W. C. (2021), Detecting and analyzing politically-themed stocks using text mining techniques and transfer entropy: Focus on the Republic of Korea’s case, Entropy, 23(6), 734.
    12. Clark, K. , Luong, M. T. , Le, Q. V. , and Manning, C. D. (2020), ELECTRA: Pre-training Text Encoders as Discriminators Rather than Generators, arXiv preprint arXiv:2003.10555.
    13. Darling, W. M. (2011), A Theoretical and practical implementation tutorial on topic modeling and Gibbs sampling, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 642-647.
    14. Deng, S. , Mitsubuchi, T. , Shioda, K. , Shimada, T. , and Sakurai, A. (2011), Combining technical analysis with sentiment analysis for stock price prediction, Proceedings of the IEEE 9th International Conference on Dependable, Autonomic and Secure Computing, 800-807.
    15. Devlin, J. , Chang, M. W. , Lee, K. , and Toutanova, K. (2018), BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
    16. Dimpfl, T. and Peter, F. J. (2013), Using transfer entropy to measure information flows between financial markets, Studies in Nonlinear Dynamics and Econometrics, 17(1), 85-102.
    17. Fafchamps, M. and Gubert, F. (2007), The formation of risk sharing networks, Journal of Development Economics, 83(2), 326-350.
    18. Financial Services Commission (2017), Results of countermeasure against politically-themed stocks of 19th presidential election [Press Release]. May 17.
    19. Financial Supervisory Service (2017), Survey on unfair trade in politically-themed stocks of the 19th presidential election [Press Release]. Sep 29. Available from: https://eiec.kdi.re.kr/policy/materialView.do?num=169425.
    20. Gemici, K. and Lai, K. P. (2019), How ‘global’ are investment banks? An analysis of investment banking networks in Asian equity capital markets, Regional Studies, 54(2), 149-161.
    21. Giannini, R. , Irvine, P. , and Shu, T. (2019), The convergence and divergence of investors’ opinions around earnings news: Evidence from a social network, Journal of Financial Markets, 42, 94-120.
    22. Goedhart, M. , Koller, T. , and Wessels, D. (2015), Valuation: Measuring and Managing the Value of Companies, John Wiley & Sons.
    23. Granger, C. W. J. (1969), Investigating causal relations by econometric models and cross-spectral methods, Econometrica: Journal of the Econometric Society, 37(3), 424-438.
    24. Griffiths, T. L. and Steyvers, M. (2004), Finding scientific topics, Proceedings of the National Academy of Sciences, 101(Suppl 1), 5228-5235.
    25. Hacine-Gharbi, A. and Ravier, P. (2018), A binning formula of bi-histogram for joint entropy estimation using mean square error minimization, Pattern Recognition Letters, 101, 21-28.
    26. Hájek, P. , Olej, V. , and Myskova, R. (2014), Forecasting corporate financial performance using sentiment in annual reports for stakeholders’ decision-making, Technological and Economic Development of Economy, 20(4), 721-738.
    27. Herron, M. C. , Lavin, J. , Cram, D. , and Silver, J. (1999), Measurement of political effects in the United States economy: A study of the 1992 presidential election, Economics and Politics, 11(1), 51-81.
    28. Hiew, J. Z. G. , Huang, X. , Mou, H. , Li, D. , Wu, Q. , and Xu, Y. (2019), BERT-based financial sentiment index and LSTM-based stock return predictability, arXiv preprint arXiv:1906.09024.
    29. Higham, D. J. (2005), Google PageRank as mean playing time for pinball on the reverse web, Applied Mathematics Letters, 18(12), 1359-1362.
    30. Huang, W. Q. , Zhuang, X. T. , and Yao, S. (2009), A network analysis of the Chinese stock market, Physica A: Statistical Mechanics and Its Applications, 388(14), 2956-2964.
    31. Iyengar, G. and Kang, W. (2005), Inverse conic programming with applications, Operations Research Letters, 33(3), 319-330.
    32. Jang, S. , Yi, E. , Kim, W. C. , and Ahn, K. (2019), Information flow between Bitcoin and other investment assets, Entropy, 21(11), 1116.
    33. Jovanovic, F. and Schinckus, C. (2013), The emergence of econophysics: A new approach in modern financial theory, History of Political Economy, 45(3), 443-474.
    34. Kamada, T. and Kawai, S. (1989), An algorithm for drawing general undirected graphs, Information Processing Letters, 31(1), 7-15.
    35. Kang, S. (2016), An event study on the abnormal return of political thematic stock during presidential election, Master Thesis, Soongsil University, Republic of Korea.
    36. Khedr, A. E. and Yaseen, N. (2017), Predicting stock market behavior using data mining technique and news sentiment analysis, International Journal of Intelligent Systems and Applications, 9(7), 22.
    37. Kim, M. J. and Lim, G. G. (2020), Bigdata analysis of fine dust thematic stock price volatility according to PM10 concentration change, Journal of Service Research and Studies, 10(1), 55-67.
    38. Kim, W. C. , Lee, Y. , and Lee, Y. H. (2014), Cost of asset allocation in equity market: How much do investors lose due to bad asset class design?, Journal of Portfolio Management, 41(1), 34-44.
    39. Knight, B. (2006), Are policy platforms capitalized into equity prices? Evidence from the Bush/Gore 2000 presidential election, Journal of Public Economics, 90(4-5), 751-773.
    40. Kreer, J. (1957), A question of terminology, IRE Transactions on Information Theory, 3(3), 208-208.
    41. Kumar, B. S. and Ravi, V. (2016), A survey of the applications of text mining in financial domain, Knowledge-Based Systems, 114, 128-147.
    42. Kuzubaş, T. U. , Ömercikoğlu, I. , and Saltoğlu, B. (2014), Network centrality measures and systemic risk: An application to the Turkish financial crisis, Physica A: Statistical Mechanics and Its Applications, 405, 203-215.
    43. Kwak, H. S. and Yeo, E. J. (2019), An event study on the politically-themed stocks on the 19th presidential election in Korea, Korean Journal of Financial Management, 36(2), 209-245.
    44. Kwon, O. and Yang, J. S. (2008), Information flow between composite stock index and individual stocks, Physica A: Statistical Mechanics and Its Applications, 387(12), 2851-2856.
    45. Lee, W. S. , Huang, A. Y. , Chang, Y. Y. , and Cheng, C. M. (2011), Analysis of decision making factors for equity investment by DEMATEL and analytic network process, Expert Systems with Applications, 38(7), 8375-8383.
    46. Liang, C. , Tang, L. , Li, Y. , and Wei, Y. (2020), Which sentiment index is more informative to forecast stock market volatility? Evidence from China, International Review of Financial Analysis, 71, 101552.
    47. Liao, H. , Mariani, M. S. , Medo, M. , Zhang, Y. C. , and Zhou, M. Y. (2017), Ranking in evolving complex networks, Physics Reports, 689, 1-54.
    48. Lim, K. , Kim, S. , and Kim, S. Y. (2017), Information transfer across intra/inter-structure of CDS and stock markets, Physica A: Statistical Mechanics and Its Applications, 486, 118-126.
    49. Liu, Y. , Ott, M. , Goyal, N. , Du, J. , Joshi, M. , Chen, D. , and Stoyanov, V. (2019), RoBERTa: A robustly optimized BERT pre-training approach, arXiv preprint arXiv:1907.11692.
    50. Marschinski, R. and Kantz, H. (2002), Analyzing the information flow between financial time series, European Physical Journal B–Condensed Matter and Complex Systems, 30(2), 275-281.
    51. Martin, V. , Zhou, X. , Marshall, E. , Jia, B. , Fusheng, G. , Francodixon, M. A. , DeHaan, N. , Pfeiffer, D. U. , Magalhães, R. J. S. , and Gilbert, M. (2011), Risk-based surveillance for avian influenza control along poultry market chains in South China: The value of social network analysis, Preventive Veterinary Medicine, 102(3), 196-205.
    52. Minoiu, C. and Reyes, J. A. (2013), A network analysis of global banking: 1978-2010, Journal of Financial Stability, 9(2), 168-184.
    53. Mittal, A. and Goel, A. (2011), Stock prediction using twitter sentiment analysis, Stanford University CS229, 15.
    54. Nam, G. (2017), Politically-themed stocks: Characteristics and investment risks, KCMI Issue Report, 2(2).
    55. Nam, G. (2020), Concerns over political-themed stocks ahead of Korea’s 21st general election, Capital Market Focus, 5(2).
    56. Namaki, A. , Shirazi, A. H. , Raei, R. , and Jafari, G. R. (2011), Network analysis of a financial market based on genuine correlation and threshold method, Physica A: Statistical Mechanics and Its Applications, 390(21-22), 3835-3841.
    57. Nguyen, T. H. , Shirai, K. , and Velcin, J. (2015), Sentiment analysis on social media for stock movement prediction, Expert Systems with Applications, 42(24), 9603-9611.
    58. Nopp, C. and Hanbury, A. (2015), Detecting risks in the banking system by sentiment analysis, Proceedings of the 20th Conference on Empirical Methods in Natural Language Processing, 591-600.
    59. Ohanyan, A. (2002), Post-conflict global governance: The case of microfinance enterprise networks in Bosnia and Herzegovina, International Studies Perspectives, 3(4), 396-416.
    60. Ojala, M. and Hallikas, J. (2006), Investment decision-making in supplier networks: Management of risk, International Journal of Production Economics, 104(1), 201-213.
    61. Page, L. , Brin, S. , Motwani, R. , and Winograd, T. (1999), The PageRank citation ranking: Bringing order to the web, Technical Report, Stanford InfoLab.
    62. Pagolu, V. S. , Reddy, K. N. , Panda, G. , and Majhi, B. (2016), Sentiment analysis of twitter data for predicting stock market movements, Proceedings of 12th International Conference on Signal Processing, Communication, Power and Embedded Systems, 1345-1350.
    63. Quigley, L. (2008), Statistical analysis of the log returns of financial assets, Financial Mathematic, University of Limerick, 32.
    64. Ravi, K. and Ravi, V. (2015), A survey on opinion mining and sentiment analysis: Tasks, approaches, and applications, Knowledge-Based Systems, 89, 14-46.
    65. Roy, R. B. and Sarkar, U. K. (2011), Identifying influential stock indices from global stock markets: A social network analysis approach, Procedia Computer Science, 5, 442-449.
    66. Sandoval, L. (2014), Structure of a global network of financial companies based on transfer entropy, Entropy, 16(8), 4443-4482.
    67. Schinckus, C. (2010), Is econophysics a new discipline? The neopositivist argument, Physica A: Statistical Mechanics and Its Applications, 389(18), 3814-3821.
    68. Schnabel, I. and Shin, H. S. (2004), Liquidity and Contagion: The Crisis of 1763, Journal of the European Economic Association, 2(6), 929-968.
    69. Schreiber, T. (2000), Measuring information transfer, Physical Review Letters, 85(2), 461.
    70. Sensoy, A. , Sobaci, C. , Sensoy, S. , and Alali, F. (2014), Effective transfer entropy approach to information flow between exchange rates and stock markets, Chaos, Solitons and Fractals, 68, 180-185.
    71. Shannon, C. E. (1948), A mathematical theory of communication, Bell System Technical Journal, 27(3), 379-423.
    72. Sheikh, A. Z. and Qiao, H. (2009), Non-normality of market returns: A framework for asset allocation decision making, Journal of Alternative Investments, 12(3), 8-35.
    73. Tahmasebi, A. and Askaribezayeh, F. (2020), Microfinance and social capital formation: A social network analysis approach, Socio-Economic Planning Sciences, 76, 100978.
    74. Tang, Y. , Xiong, J. J. , Luo, Y. , and Zhang, Y. C. (2019), How do the global stock markets influence one another? Evidence from finance big data and granger causality directed network, International Journal of Electronic Commerce, 23(1), 85-109.
    75. Tsai, C. S. (2011), The real world is not normal, Morningstar Alternative Investments Observer.
    76. Tu, C. (2014), Cointegration-based financial networks study in Chinese stock market, Physica A: Statistical Mechanics and Its Applications, 402, 245-254.
    77. Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A. , Kaiser, Ł. , and Polosukhin, I. (2017), Attention is all you need, Advances in Neural Information Processing Systems, 30, 5998-6008.
    78. Wang, C. J. , Tsai, M. F. , Liu, T. , and Chang, C. T. (2013), Financial sentiment analysis for risk prediction, Proceedings of the 6th International Joint Conference on Natural Language Processing, 802-808.
    79. Woo, M. and Kim, M. (2014), Estimating normal price in event study: In the case of thematic stocks, The Korean Journal of Securities Law, 15(3), 353-375.
    80. Wu, D. D. , Zheng, L. , and Olson, D. L. (2014), A decision support approach for online stock forum sentiment analysis, IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(8), 1077-1087.
    81. Yang, Z. , Dai, Z. , Yang, Y. , Carbonell, J. , Salakhutdinov, R. R. , and Le, Q. V. (2019), XLNet: Generalized autoregressive pre-training for language understanding, Advances in Neural Information Processing Systems, 32, 5754-5764.
    82. Yue, P. , Fan, Y. , Batten, J. A. , and Zhou, W. X. (2020), Information transfer between stock market sectors: A comparison between the USA and China, Entropy, 22(2), 194.
    83. Yun, T. S. , Jeong, D. , and Park, S. (2019), “Too Central to Fail” systemic risk measure using PageRank algorithm, Journal of Economic Behavior and Organization, 162, 251-272.