Sentiment-semantic word vectors: A new method to estimate management sentiment

This paper introduces a novel method to extract the sentiment embedded in the Management's Discussion and Analysis (MD&A) section of 10-K filings. The proposed method outperforms traditional approaches in terms of sentiment classification accuracy. Utilizing this method, the MD&A sentiment is found to be a strong negative predictor of future stock returns, demonstrating consistency in both in-sample and out-of-sample settings. By contrast, if traditional sentiment extraction methods are used, the MD&A sentiment exhibits no predictive ability for stock markets. Additionally, the MD&A sentiment is associated with dividend-related macroeconomic channels regarding future stock return prediction.


Introduction
Serving as the main focus of numerous studies, the Management's Discussion and Analysis (MD&A) section is undoubtedly one of the most important parts of 10-K/Q filings (Bochkay & Levine, 2019; Brown & Tucker, 2011; Cohen, Malloy, and Nguyen, 2020; Davis & Tama-Sweet, 2012; Feldman, Govindaraj, Livnat, and Segal, 2010; Li, 2010a; Loughran & McDonald, 2011; Tavcar, 1998).1 It purports to "...provide investors and other users with material information that is necessary to an understanding of the company's financial condition and operating performance, as well as its prospects for the future" (SEC, 2003, Chapter III.B, p. 75,059). In this scenario, it is natural to expect the MD&A section to encapsulate insights that may influence stock market dynamics. Surprisingly, only a few studies explore the power of the MD&A section to predict future stock returns. In this paper, we explore stock return predictability using solely the sentiment of the MD&A section, which we term management sentiment.2 In particular, we investigate a behavioral implication of management sentiment in asset pricing: the hypothesis is that information in a corporate disclosure with misleading sentiment is absorbed by investors, leading to an overvaluation of the stock price. When the true stock fundamentals are gradually disclosed to the public, the price reverses, implying that management sentiment negatively predicts future stock returns in the long run. This hypothesis is theoretically modeled by De Long et al. (1990) and empirically confirmed by Jiang et al. (2019), who use 10-K/Q filings and conference calls to represent management sentiment. However, 10-K/Q filings are a mixture of informative statements and boilerplate content (Li, 2010b). Since the MD&A section is a valuable part of 10-K/Q filings (Tavcar, 1998), whether its stand-alone sentiment is predictive of future stock returns remains an open question.
Phan, Swiss Journal of Economics and Statistics (2024) 160:9

We construct a management sentiment index from the MD&A section of 10-K filings using a word-representing model with novel adaptations. Specifically, we introduce a method that integrates both word sentiment and semantics into a pre-defined set of word representations (i.e., vectors). This method results in another set that reflects both sentiment and semantic connotations. To achieve this target, our method relies on three components: (i) a word representation model embracing rich word semantics, which is pre-trained with a massive dataset; (ii) a knowledge distillation technique (Hinton, Vinyals, and Dean, 2015); and (iii) a dataset with sentiment labels. The first component acts as a "semantic anchor" for the word vectors, while the second component seeks to infuse the sentiment meanings, carried by the third component, into these vectors. Intuitively, the word vectors we obtain inherit word semantics from a pre-trained word representation model and, simultaneously, absorb nuanced sentiment information from the labeled dataset. First, our proposed approach successfully obtains a new set of word vectors that captures both word sentiment and semantics; henceforth, these vectors are referred to as sentiment-semantic word vectors. In a word-level sentiment classification, the sentiment-semantic word vectors outperform another set of word vectors carrying only word semantics, which we term semantic-only word vectors, in clustering words into sentiment categories. Furthermore, our sentiment-semantic word vectors demonstrate a superior capability in document sentiment classification, outperforming competing methods, including semantic-only word vectors and the Loughran-McDonald dictionary (Loughran & McDonald, 2011). In particular, the sentiment-semantic word vectors achieve an F1 score of 0.68 in a sentiment classification task using the Financial Phrasebank dataset (Malo, Sinha, Korhonen, Wallenius, and Takala, 2014). Meanwhile, the corresponding scores for the Loughran-McDonald dictionary (which ignores word semantics) and the semantic-only word vectors (which ignore word sentiment) are 0.58 and 0.64, respectively. These findings underscore the importance of integrating sentiment and semantic information into word vectors for accurate sentiment analysis.
Second, the variations of our management sentiment index, constructed from the sentiment-semantic word vectors, reflect business cycles and historical events, unlike the indexes built from the semantic-only word vectors. In concrete terms, our sentiment index reflects the fact that firm managers express pessimism during recessions; the index based on semantic-only word vectors strongly exhibits seasonal patterns without clear associations to historical economic regimes. Our findings are in line with those of Jiang et al. (2019), who also document a downward trend in management sentiment during the 2008 financial crisis.
Our evidence furthermore suggests that the dot-com crisis depressed management sentiment. These findings align with the nature of economic recessions.
Third, we find that our management sentiment index serves as a strong predictor of future stock returns, directly confirming the above-mentioned behavioral hypothesis. This result is twofold. First, with the same MD&A corpus, the management sentiment extracted by the sentiment-semantic word vectors encompasses predictive information beyond that derived from the method based on the Loughran-McDonald dictionary. Importantly, this result holds in both in-sample and out-of-sample settings and is robust to the choice of stock market index. Additionally, we find that our management sentiment index outperforms the powerful historical average model (Campbell & Thompson, 2008) in predicting out-of-sample future stock returns. Second, our measurement of management sentiment, unlike that of Jiang et al. (2019), relies solely on the MD&A section of 10-K filings. Despite the difference in the input data, the two studies arrive at similar conclusions. This similarity suggests that the MD&A section contains useful sentiment signals for future stock return prediction, provided the sentiment is measured accurately. We further find that the predictive power of our management sentiment index relates to the information provided by firm managers regarding dividend payment plans in the MD&A section.
In conclusion, by introducing sentiment-semantic word vectors, our work highlights the importance of both word sentiment and semantics in achieving an accurate sentiment estimation of a document. The utilization of sentiment-semantic word vectors unlocks valuable sentiment insights within the MD&A section of 10-K filings that strongly predict future stock returns. These valuable pieces of information may be overlooked by methods that ignore either the sentiment or the semantics of words.

Related literature and contributions
The past two decades have witnessed a blooming of research on the economic implications of corporate disclosures and the connection of these disclosures to the equity markets (Dyer, Lang, and Stice-Lawrence, 2017; Frankel, Jennings, and Lee, 2022; Henry, 2008; Jegadeesh & Wu, 2013; Jiang, Lee, Martin, and Zhou, 2019; Li, 2010a; Loughran & McDonald, 2011; Price, Doran, Peterson, and Bliss, 2012). Henry (2008) was among the first authors to analyze press releases on earnings using a word-count method. By introducing lists of positive and negative words, Henry (2008) discovers a relationship between the sentiment of earnings press releases and investors' reactions. In a similar vein, Loughran and McDonald (2011) introduce a comprehensive sentiment lexicon tailored to the financial context, hereafter referred to as the Loughran-McDonald dictionary. They find that only negative words within 10-K filings are associated with contemporaneous stock returns. Jegadeesh and Wu (2013) argue that words in the Loughran-McDonald dictionary should be subject to weighting. Accordingly, they develop a market-dependent scheme of word weighting and show that stock returns are influenced by both positive and negative words in 10-K filings as long as those words are appropriately weighted. Using the Loughran-McDonald dictionary, Jiang et al. (2019) show that management sentiment extracted from 10-K/Q filings and conference calls is predictive of future stock returns. Frankel et al. (2022) compare the information contained in corporate disclosures using machine learning and dictionary-based methods.
Studies linking the sentiment of the MD&A section and the market reaction are surprisingly infrequent. Loughran and McDonald (2011), besides 10-K filings, provide evidence of a significant relationship between the MD&A section and stock returns via negative words. Using the Loughran and McDonald (2011) lexicon, Feldman et al. (2010) detect a significant association between short-window market reactions around 10-K filing dates and changes in the MD&A sentiment. Deviating from sentiment, Brown and Tucker (2011) find that changes in the MD&A content are positively correlated with the magnitude of stock market reactions. Another line of studies on the MD&A section documents its connection to firm characteristics (Bochkay & Levine, 2019; Fengler & Phan, 2023; Li, 2010a; Mayew, Sethuraman, and Venkatachalam, 2015).
So far, studies have documented a link between the MD&A sentiment and contemporaneous market reactions. This is partially in line with the intention of the US Securities and Exchange Commission (SEC) that the MD&A section should provide explanatory information to investors regarding current firm conditions (SEC, 2003). However, another important part of the SEC's intention, regarding the future implications of the MD&A section, has not been fully explored by the current literature, despite the potential for the MD&A section to predict the stock market (Feldman, Govindaraj, Livnat, and Segal, 2010). Attempting to fill this gap, our work contributes to the extant literature by providing predictive analyses of the MD&A section in 10-K filings regarding future stock returns. We also contribute to the burgeoning literature on the techniques used in economic and financial sentiment analysis. The current state of the literature in this area is dominated by lexicon-based methods because of their simplicity (Feldman, Govindaraj, Livnat, and Segal, 2010; Henry, 2008; Jiang, Lee, Martin, and Zhou, 2019; Loughran & McDonald, 2011). Although several efforts have been made to move away from reliance on a pre-defined sentiment lexicon (Chen, Fengler, Härdle, and Liu, 2022; Frankel, Jennings, and Lee, 2022; Jegadeesh & Wu, 2013; Li, 2010a), the underlying techniques for textual feature extraction are still based on word counts. Because these methods ignore word semantics, they may overlook the potential sentiment resulting from word interactions (Huang, Wang, and Yang, 2023).
We seek to overcome this downside of the word-count methods by using the Word2Vec model (Mikolov, Chen, Corrado, and Dean, 2013). Word2Vec, a method based on neural networks, represents words in the form of semantic numerical vectors in which two synonyms tend to be located adjacently in the vector space. Although Word2Vec captures word semantics, its ability to represent sentiment connotations is still questioned. To enhance the adaptability and proficiency of the Word2Vec model in sentiment analysis, we propose an additional component embedded in the modeling process that functions as sentiment guidance for the model. The inclusion of additional components in the likelihood function to capture sentiment is the core idea of many techniques for learning word sentiment representations (Maas et al., 2011; Labutov & Lipson, 2013; Tang, Wei, Qin, Zhou, and Liu, 2014). However, these techniques are constrained by the need for large datasets (Maas et al., 2011; Tang, Wei, Qin, Zhou, and Liu, 2014) or are limited to binary classifications (Labutov & Lipson, 2013). Our approach not only extends these methods to multi-label classification but also demonstrates that it is effective with small sentiment datasets. By enhancing the capabilities of sentiment classification, our proposed technique provides deeper insights into MD&A documents, surpassing the limitations of current dictionary-based methods. This novel adaptation serves as our main methodological contribution, and full details are given in the next section.

Methodology
The ultimate goal of our proposed method is to obtain a set of word vectors capturing both the sentiment and the semantic meanings of words. It is thus expected to enhance sentiment extraction from a document. To this end, our method relies on three building blocks: (i) word vectors derived using the Word2Vec model (Mikolov, Chen, Corrado, and Dean, 2013); (ii) a technique that distills the knowledge of a large model into a smaller model, known as knowledge distillation (Hinton, Vinyals, and Dean, 2015); and (iii) the Financial Phrasebank dataset (Malo, Sinha, Korhonen, Wallenius, and Takala, 2014), which serves as sentiment guidance for the Word2Vec model. The first building block functions as an initial model by representing the general semantics of words as numerical vectors, with synonyms tending to be represented adjacently in the vector space. The second aims to inject the financial context and the sentiment meanings (which are extracted from the third building block) into the word vectors while preserving the general semantics captured by the initial model.

The Word2Vec model
Since Word2Vec was introduced by Mikolov et al. (2013), studies in economics and finance that adopt this method to explore financial documents have gained in popularity; see Das et al. (2022), Li et al. (2021), Ma et al. (2023), and Miranda-Belmonte et al. (2023), among others. The ability to capture the immediate context when representing words is the key feature that sets Word2Vec apart from count-based word representation methods, which have been widely used in economic research using textual data (Henry & Leone, 2016; A. H. Huang, Zang, and Zheng, 2014; Jegadeesh & Wu, 2013; Jiang, Lee, Martin, and Zhou, 2019; Loughran & McDonald, 2011). However, despite the success of Word2Vec in capturing word semantics, word sentiment representation is still beyond its capabilities. To illustrate this downside of the vanilla Word2Vec model, Table 1 presents the top ten most similar words to the word "bad," based on the Google pre-trained Word2Vec, the Word2Vec model trained on the MD&A corpus, and FinText (Rahimikia, Zohren, and Poon, 2021), which is a Word2Vec model specially designed for financial contexts.3 At first glance, the words most similar to "bad" are "good" and "not-bad." While this result seems logical in the sense of semantic similarity, it is counterintuitive when the polarized sentiments of these words are considered. Intuitively, these Word2Vec models tend to group together words with opposite sentiments. To be effective for sentiment representation, a Word2Vec model must instead cluster words with similar sentiments together.
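To illustrate how nearest-neighbor retrieval by cosine similarity can place antonyms side by side, the following minimal Python sketch uses tiny made-up vectors; the words and values are illustrative assumptions, not actual FinText or Google embeddings:

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def most_similar(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = {w: cosine_sim(query, v) for w, v in vectors.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy 3-d embeddings (hypothetical values): "good" and "bad" share a
# distributional context, so a purely semantic model places them close
# together despite their opposite sentiment.
toy_vectors = {
    "bad":      np.array([0.9, 0.1, 0.0]),
    "good":     np.array([0.8, 0.2, 0.1]),
    "terrible": np.array([0.7, 0.0, 0.3]),
    "bond":     np.array([0.0, 0.9, 0.8]),
}

nearest = most_similar("bad", toy_vectors, top_n=2)  # → ['good', 'terrible']
```

In this toy geometry, the nearest neighbor of "bad" is its antonym "good", mirroring the pattern reported in Table 1.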

Knowledge distillation
However, leveraging Word2Vec for comprehensive semantic representation while incorporating sentiment meanings is challenging for the following reasons. On the one hand, to integrate sentiment meanings into a Word2Vec model, data with sentiment labels are required (Maas et al., 2011). Labeled data are, however, scarce in economics and finance, and the datasets are typically small, because expensive and time-consuming human annotation is required (Lutz, Pröllochs, and Neumann, 2020). On the other hand, training a Word2Vec model from scratch requires massive data in order to capture word semantics sufficiently well (Rodriguez & Spirling, 2022). To resolve this paradoxical situation, we need a technique to construct a model that (i) inherits the knowledge of word semantics from a pre-trained Word2Vec model and, at the same time, (ii) integrates this knowledge with the sentiment information carried by a small labeled dataset. Knowledge distillation (Hinton, Vinyals, and Dean, 2015) is a suitable technique for this purpose. Specifically, it allows us to obtain a model that internalizes the knowledge of a pre-trained model while being encouraged to acquire the supervised information in a labeled dataset autonomously.
For our problem, we want a new set of word vectors, denoted by W_SS, that captures both the semantics and the sentiment of words. The pre-trained model in our case is the set of FinText word vectors,4 denoted by W_Fin, because this set is trained with a massive dataset of news stories in the Dow Jones Newswires Text News Feed (2,733,035 unique tokens) covering various economic and financial topics (Rahimikia, Zohren, and Poon, 2021).5 Finally, we resort to the Financial Phrasebank dataset as the sentiment guidance for W_SS. The knowledge distillation technique applied to our problem seeks to maximize the following log-likelihood function:

    L(W_SS, θ | X) − λ Δ(W_SS, W_Fin),    (1)

in which s_i is the sentiment label of document i; X is the information set, {X_1, X_2, ..., X_N}, of N documents, and X_i is the set of features extracted from document i; Δ(W_SS, W_Fin) is the average distance between the vectors corresponding to W_SS and W_Fin for the same word; λ is a trade-off hyperparameter; θ and W_SS are trainable parameters chosen to maximize the log-likelihood function; and W_Fin remains fixed during the training process.

Footnote 3: We follow the suggestion of Mukherjee et al. (2021), carefully handling negations before proceeding to the sentiment analysis. In particular, we first locate the sentiment words defined by the Loughran-McDonald sentiment dictionary in the MD&A documents. After that, we determine whether a negation term, namely "not," "no," "none," "neither," "nor," or "never," appears within a five-adjacent-word window around the sentiment word. If this is the case, the "not-" prefix is added to the sentiment word. This explains why the word "not-bad" appears in Table 1.
The first term, L(W_SS, θ | X), which functions as a document sentiment classifier, integrates the sentiment information encoded by s_i into the word vectors W_SS. The second term imposes a semantic penalty when W_SS deviates from the FinText pre-trained word vectors W_Fin, which carry rich information on word semantics. These competing terms create a trade-off between the amounts of sentiment and semantic information captured by W_SS during the training process. The trade-off is controlled by λ, which is optimally chosen based on the accuracy of the sentiment classification on a validation set.
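As a concrete illustration, the penalized objective in equation (1) can be sketched in a few lines of numpy. The dimensions, random inputs, and initialization below are illustrative assumptions, and a real implementation would maximize this quantity with gradient-based updates rather than merely evaluate it:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def objective(W_ss, W_fin, theta, X, s, lam):
    """Distillation objective: classification log-likelihood minus
    lam times the average cosine distance between W_ss and W_fin."""
    logits = X @ W_ss @ theta                      # (N docs, M classes)
    probs = softmax(logits)
    loglik = np.log(probs[np.arange(len(s)), s]).sum()
    # Semantic penalty: average cosine distance per word
    num = (W_ss * W_fin).sum(axis=1)
    den = np.linalg.norm(W_ss, axis=1) * np.linalg.norm(W_fin, axis=1)
    penalty = (1.0 - num / den).mean()
    return loglik - lam * penalty

rng = np.random.default_rng(0)
V, d, M, N = 6, 4, 3, 5                # vocab size, dim, classes, docs
W_fin = rng.normal(size=(V, d))        # pre-trained vectors (held fixed)
W_ss = W_fin.copy()                    # initialize at the semantic anchor
theta = rng.normal(size=(d, M))
X = rng.random(size=(N, V))            # tf.idf-style document features
s = rng.integers(0, M, size=N)         # sentiment labels

val = objective(W_ss, W_fin, theta, X, s, lam=0.5)
```

Initializing W_ss at W_fin makes the penalty exactly zero, so training starts from the semantic anchor and drifts away only when the classification term rewards it.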

The Financial Phrasebank dataset and the parameterization of the likelihood function
The first question that arises is how to choose the information set X_i and the corresponding sentiment labels s_i.
Various methods in the literature address this problem. Li (2010a) utilizes corporate disclosures as the information set. Subsequently, he obtains sentiment labels through human annotations. Similarly, Huang et al. (2023) rely on manual labeling of analysts' reports to determine X_i and s_i. A significant disadvantage of this approach is the high cost and time-consuming nature of manual labeling. Another line of studies uses corporate disclosures or newspapers as the information set and employs the associated stock returns as proxies for the sentiment labels (Frankel, Jennings, and Lee, 2022; Lutz, Pröllochs, and Neumann, 2020; Jegadeesh & Wu, 2013). While this method addresses the high cost of human annotation, it potentially introduces noisy sentiment labels, since stock returns can be influenced by many non-text factors (Huang, Wang, and Yang, 2023). Consequently, we use the Financial Phrasebank dataset to acquire s_i and X_i, inspired by Chen et al. (2022). This approach addresses the drawbacks of the above-mentioned methods: the dataset is publicly available, so the labeling costs are zero, and it is labeled by financial experts, ensuring accurate sentiment labels. There are three sentiment classes in the dataset: negative (1), neutral (2), and positive (3). Accordingly, s_i is assumed to follow a multinomial distribution with M = 3 levels. The conditional likelihood function becomes:

    L(W_SS, θ | X) = Σ_{i=1}^{N} log p(s_i | X_i, W_SS, θ),    (2)

in which

    p(s_i = m | X_i, W_SS, θ) = exp(X_i^⊤ W_SS θ_m) / Σ_{l=1}^{M} exp(X_i^⊤ W_SS θ_l).    (3)

Technically, W_SS is a |V| × d matrix; θ is a d × M matrix; and θ_m is column m of θ, with m = 1, 2, 3. The matrix multiplication X_i^⊤ W_SS serves as an aggregation of the vectors of the words appearing in document i into a single vector representing the document.
It is worth noting that although we use a linear model, X_i^⊤ W_SS θ_m, to parameterize the likelihood function, the proposed method can be extended to more advanced approaches, including state-of-the-art language models. Specifically, W_SS serves as the word embedding layer of the language model, X_i represents the set of tokens encoded by the language model, and θ_m denotes the language model parameters.

Textual feature extraction and choice of the distance measure
Two problems remain: (i) how to extract X_i from document i; and (ii) how to choose the distance measure Δ. For the first problem, inspired by Jegadeesh and Wu (2013), we rely on a method called tf.idf, which stands for term frequency-inverse document frequency (Manning & Schutze, 1999). Despite a lack of theoretical justification, Manning and Schutze (1999) suggest that the tf.idf representation is useful in document retrieval applications. Technically, we define V as the vocabulary of the FinText model, and |V| as the number of distinct words in V. X_i can now be represented by a |V|-dimensional vector, (X_i1, X_i2, ..., X_i|V|)^⊤. Each element of this vector, X_ij, is calculated from the occurrences of word w_j in document i (tf_ij) and the transformed count of the documents containing word w_j (df_ij). We follow the computation specified by Schütze et al. (2008), with a unit smoothing factor for df_ij to avoid division by zero. Formally,

    X_ij = tf_ij × log( N / (1 + df_ij) ),    (4)

in which N is the number of documents in the Financial Phrasebank training set. To address the second problem, we choose cosine similarity as the distance measure between W_SS and W_Fin.6 This choice is motivated by the fact that Word2Vec learns words that are adjacent to each other in terms of cosine similarity (Levy & Goldberg, 2014). Technically,

    Δ(W_SS, W_Fin) = (1/|V|) Σ_{j=1}^{|V|} [ 1 − sim(w_j^SS, w_j^Fin) ],    (5)

in which w_j^k is the vector representation of word w_j based on W_k, for k ∈ {SS, Fin}. Putting all these components together gives us the following log-likelihood function:

    ℓ(W_SS, θ) = L(W_SS, θ | X) − λ Δ(W_SS, W_Fin),    (6)

where p is parameterized by equations (2) and (3), and X_i is defined by equation (4).
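A minimal sketch of the tf.idf feature construction may help fix ideas. Note that the smoothed idf used here follows one common convention (adding one inside the logarithm and to the result, which also keeps the weights non-negative) and may differ in detail from the exact Schütze et al. (2008) variant used in the paper; the toy corpus is an illustrative assumption:

```python
import math
from collections import Counter

def tfidf_features(docs, vocab):
    """tf.idf vectors over a fixed vocabulary: term frequency within the
    document times a smoothed inverse document frequency."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency of each word
    X = []
    for doc in docs:
        tf = Counter(doc)              # term frequency within the document
        X.append([tf[w] * (math.log((1 + N) / (1 + df[w])) + 1.0)
                  for w in vocab])
    return X

# Toy corpus of tokenized documents over a three-word vocabulary
docs = [["cash", "debt"], ["cash", "cash", "bond"]]
vocab = ["cash", "debt", "bond"]
X = tfidf_features(docs, vocab)
```

Because "cash" appears in every document, its smoothed idf collapses to 1, so its feature value equals its raw term frequency; rarer words like "debt" and "bond" receive a larger weight.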
To prevent overfitting, we randomly split the Financial Phrasebank dataset into training, validation, and testing parts. The training part is used to estimate W_SS and θ by maximizing the log-likelihood function (6). The validation part is used to optimize the trade-off hyperparameter λ. We use the testing part to compare the sentiment classification power of W_SS and W_Fin. We provide a comprehensive discussion of this comparison in Sect. 4.2.
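The split-and-tune procedure can be sketched as follows; the split proportions, the seed, and the idea of a λ grid are assumptions for illustration, since the paper does not report these details here:

```python
import random

def three_way_split(items, frac_train=0.6, frac_val=0.2, seed=42):
    """Random train/validation/test split of a labeled dataset.
    The proportions and seed are illustrative assumptions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(frac_train * len(items))
    n_val = int(frac_val * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 4,846 document indices, matching the Financial Phrasebank size
train, val, test = three_way_split(range(4846))

# In practice one would, for each candidate lambda in a grid, fit W_SS and
# theta on `train`, pick the lambda with the best validation accuracy, and
# report the final W_SS vs. W_Fin comparison on `test`.
```

The three parts are disjoint by construction, so the test comparison never touches documents used for estimation or tuning.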

Data
We estimate our sentiment-semantic word vectors by utilizing the Financial Phrasebank dataset (Malo, Sinha, Korhonen, Wallenius, and Takala, 2014).7 This dataset was constructed to address the scarcity of high-quality labeled data specifically for financial sentiment analysis. It consists of English news articles centered around Finnish firms listed on the Nasdaq Helsinki stock exchange and comprises 4,846 documents. The dataset was manually annotated by 16 people with financial expertise, who categorized the documents into three sentiment classes: negative, neutral, and positive. The Financial Phrasebank dataset features a high imbalance in the distribution of documents by labeled sentiment (604 negative, 2,879 neutral, and 1,363 positive documents). In line with Malo et al. (2014), we adopt the F1 score as the evaluation metric for our approach, to accommodate the lack of balance in the dataset.
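For completeness, a per-class F1 computation on a toy prediction set is sketched below. The aggregation here is an unweighted (macro) average, which is an assumption; Malo et al. (2014) may aggregate per-class scores differently (e.g., weighted by class frequency):

```python
def f1_scores(y_true, y_pred, labels):
    """Per-class F1 scores and their unweighted (macro) average."""
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = (2 * prec * rec / (prec + rec)
                        if prec + rec else 0.0)
    return per_class, sum(per_class.values()) / len(labels)

# Toy labels with three classes, mimicking the dataset's imbalance
y_true = ["neu", "neu", "pos", "neg", "pos"]
y_pred = ["neu", "pos", "pos", "neg", "pos"]
per_class, macro_f1 = f1_scores(y_true, y_pred, ["neg", "neu", "pos"])
```

Averaging per-class F1 scores keeps the rare negative class from being drowned out by the dominant neutral class, which is exactly why F1 suits an imbalanced dataset better than raw accuracy.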
After obtaining the sentiment-semantic word vectors from the Financial Phrasebank dataset, we construct the management sentiment index using the corpus of the Management's Discussion and Analysis (MD&A) sections of 10-K filings of US firms from 1994 to 2018. The 10-K filings can be downloaded from the Notre Dame Software Repository for Accounting and Finance (SRAF).8 The SRAF page also provides additional resources for textual data analysis, such as stopword lists and the Loughran-McDonald dictionary. The SRAF data consists of both 10-K and 10-Q filings in text-file format with HTML tags removed. We construct our management sentiment index based only on 10-K filings because their information is acknowledged to be more significant than that of 10-Q filings (Griffin, 2003). We extract the MD&A section from each 10-K file following the advice of Loughran and McDonald (2016), and manage to extract 68% of all the 10-K files in the corpus.9 Compared with the extraction rate of Loughran and McDonald (2011), which is roughly 50%, our rate is reasonable. We further discard MD&A documents that have fewer than 250 words. After these purges, we retain 124,133 MD&A documents spanning the period 1994:01 to 2018:12.10 To supplement our regression analyses in Sect. 6, we further employ multiple sources of numerical data, including:

• the Standard and Poor's (S&P) 500 and the value-weighted CRSP indexes; both include dividends and are queried from the Wharton Research Data Service (WRDS); and
• the one-month US Treasury bill rate used as the risk-free rate, available from Kenneth R. French's data collection.11

Footnote 6: The cosine similarity between two vectors x and y is defined as sim(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖), where ⟨x, y⟩ is the inner product of the two vectors and ‖x‖ is the Euclidean norm of x. To use this as a distance measure between x and y, one usually subtracts the cosine similarity from one, i.e., 1 − sim(x, y).
Footnote 8: Available at: https://sraf.nd.edu/.

Footnote 9: Extracting the MD&A section from a 10-K file is not trivial, although it may seem as straightforward as searching for "Item 7. Management Discussion and Analysis." The phrase can appear in the Table of Contents or in other items, complicating the search. Even when the phrase is correctly located, identifying the subsequent item remains challenging, as it could be "Item 7A" or "Item 8." Another hurdle in this process is that the MD&A does not always appear as "Item 7." These issues often result in the incomplete or inaccurate extraction of the MD&A section from 10-K filings. For a detailed discussion of these problems, refer to Section 6 of Loughran and McDonald (2016).

Footnote 10: From now on, we use the format yyyy:mm to indicate the month mm in the year yyyy.

Footnote 11: Available at: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.
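The word-count filter in the data-cleaning step above is straightforward; a minimal sketch follows, where whitespace tokenization is an assumption, since the paper does not specify its tokenizer:

```python
def filter_mdna(documents, min_words=250):
    """Discard MD&A documents with fewer than min_words tokens,
    mirroring the cleaning step described in the text."""
    return [doc for doc in documents if len(doc.split()) >= min_words]

# Toy corpus: one sufficiently long document and one that is too short
corpus = ["liquidity " * 300, "no material changes"]
kept = filter_mdna(corpus)
```

Such a length floor removes near-empty or boilerplate-only MD&A sections that would otherwise yield unreliable sentiment estimates.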
The predictive power of the management sentiment in relation to stock returns could be rooted in the reflections of firm managers concerning the business cycle or macroeconomic conditions. To delve into the macroeconomic channels associated with the stock return predictability based on management sentiment, we leverage the monthly macroeconomic dataset provided by Welch and Goyal (2008).12 As it is expected to connect directly with macroeconomic fundamentals, this dataset has gained popularity in the stock return forecasting literature that uses macroeconomic variables (Chen, Pelger, and Zhu, 2023; Cochrane, 2011; Gu, Kelly, and Xiu, 2020; D. Huang, Jiang, Tu, and Zhou, 2015; Jiang, Lee, Martin, and Zhou, 2019). In particular, the dataset includes 14 monthly macroeconomic variables: the log dividend-price ratio (DP), log dividend yield (DY), log earnings-price ratio (EP), log dividend-payout ratio (DE), stock return variance (SVAR), book-to-market ratio (B/M), net equity expansion (NTIS), Treasury bill rate (TBL), long-term bond yield (LTY), long-term bond return (LTR), term spread (TMS), default yield spread (DFY), default return spread (DFR), and inflation rate (INFL). Detailed definitions of these variables are given in Section 2.2 of Jiang et al. (2019).

Empirical results
This section provides empirical evidence about the effectiveness of the sentiment-semantic word vectors, W_SS, in sentiment analyses. As mentioned in Sect. 2, semantic-only word vectors, that is, W_Fin, tend to group together words with semantic similarity regardless of sentiment. We therefore expect that W_SS, which captures more sentiment meaning via our proposed method, can mitigate this problem by clustering words with similar sentiments together. Nevertheless, we also show that W_SS retains word semantics in the financial context to a large extent. Moreover, we show that W_SS, which excels in sentiment and semantic encapsulation, classifies document sentiment more accurately than W_Fin and the Loughran-McDonald dictionary-based method.

How well does W_SS cluster words by sentiment?
We first assess the proficiency of W_SS at capturing both the sentiment and the semantics of words through a comparison with W_Fin. To this end, we implement a sentiment classification at the word level. This classification relies on the presumption that if word vectors are more proficient at capturing sentiment, positive (negative) words will tend to be surrounded by more words delivering positive (negative) sentiment.13 We employ the Loughran-McDonald dictionary to determine the set of positive and negative words. The choice of the Loughran-McDonald dictionary guarantees the relevance of these sentiment words in the financial context. It is worth noting that the Loughran-McDonald dictionary is used in this study only to validate the word vectors, thus ensuring our approach is fully data-driven. After that, for each sentiment word, we examine the sentiment types of its neighboring words using W_SS and W_Fin. To determine neighboring words, we combine two criteria: (i) the top n most similar words, denoted by n, and (ii) a pre-defined similarity threshold above which two words are deemed similar, denoted by τ. Because W_SS is expected to capture sentiment meanings more effectively than W_Fin, we anticipate that more positive (negative) words and fewer negative (positive) words will be found in the neighborhood of positive (negative) words with W_SS compared to W_Fin. To enhance the robustness of the classification, we apply various values of the criteria to choose the neighboring words.
Table 2 reports the confusion matrix of the sentiment classification described above. The table presents the average numbers of correct and incorrect assignments regarding the word classification of two sentiment categories: negative and positive. The first entry of 2.68 means that, for every positive word defined by the Loughran-McDonald dictionary, W_SS yields on average 2.68 other positive words exhibiting a cosine similarity above 0.2 within the top 10 words most similar to the given word. Consequently, the classification counts this number as the true positives of W_SS under the corresponding set of criteria values. Similarly, based on W_SS, on average 0.26 negative words are found within the neighborhood of positive words given the same set of criteria; this is the false positive count of W_SS in this particular case. With the same interpretation, the true negative and false negative counts of W_SS with this set of criteria are 3.45 and 0.40, respectively.
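The neighborhood construction behind Table 2 can be sketched as follows, with neighbors defined jointly by the top-n rank and the similarity threshold τ as in the text; the toy vectors and word lists are illustrative assumptions, not the actual embeddings or the Loughran-McDonald lists:

```python
import numpy as np

def neighbor_sentiment_counts(word, vectors, pos_words, neg_words,
                              top_n=10, tau=0.2):
    """Count positive and negative words among the neighbors of `word`,
    where a neighbor must rank in the top_n by cosine similarity AND
    exceed the similarity threshold tau."""
    q = vectors[word]
    sims = {}
    for w, v in vectors.items():
        if w == word:
            continue
        sims[w] = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(sims, key=sims.get, reverse=True)[:top_n]
    neighbors = [w for w in ranked if sims[w] > tau]
    pos = sum(w in pos_words for w in neighbors)
    neg = sum(w in neg_words for w in neighbors)
    return pos, neg

# Toy 2-d embeddings (hypothetical values for illustration only)
vectors = {
    "gain":    np.array([1.0, 0.0]),
    "profit":  np.array([0.9, 0.1]),
    "improve": np.array([0.8, 0.3]),
    "loss":    np.array([-1.0, 0.2]),
    "decline": np.array([-0.9, 0.1]),
}
pos_words = {"gain", "profit", "improve"}
neg_words = {"loss", "decline"}

pos_count, neg_count = neighbor_sentiment_counts(
    "gain", vectors, pos_words, neg_words)
```

Here the neighborhood of the positive word "gain" contains only positive words, which would be tallied as true positives; averaging such counts over all dictionary words yields the entries of Table 2.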
At first glance, W SS outperforms W Fin in allocating words into the correct sentiment categories. In particular, W SS has higher true positive/negative and lower false positive/negative counts than W Fin. Put differently, W SS clusters words into the corresponding sentiment categories more accurately than W Fin, demonstrating the superiority of W SS in capturing the sentiment meanings of words. Moreover, these results are robust to varying values of the cluster size n and the similarity threshold τ.
So far, W SS has demonstrated superior proficiency in capturing word sentiment compared to W Fin. However, the question of whether W SS preserves word semantics in the financial context remains. To provide an impression of how well W SS maintains word semantics in the financial context, we retrieve the top ten most similar words for each given word and then qualitatively assess their coherence and relevance in the financial context. For robustness, we combine words with strong financial meanings (e.g., "cash," "debt") and words whose meaning in the financial context differs from their everyday meaning (e.g., "bond," "capital," "share"). This assessment, although prone to some subjectivity, is widely used in word representation research to evaluate the quality of word vectors regarding semantics (Das, Donini, Zafar, He, and Kenthapadi, 2022; Dieng, Ruiz, and Blei, 2020; Li, Mai, Shen, and Yan, 2021; Mikolov, Chen, Corrado, and Dean, 2013).
Table 3 presents the top ten most similar words to "bank," "bond," "capital," "cash," "debt," "inflation," "interest," "liability," "share," and "yield," based on cosine similarity and retrieved using W SS and W Fin. Overall, these words are surrounded by words with strong economic and financial meanings, even those such as "bond," "capital," and "share" whose meanings depend on the context. This serves as compelling evidence that W SS efficiently preserves word semantics in the financial context.
Comparing the top similar words generated by our method with those from the FinText word vectors underlines the preservation of word semantics by our method. Specifically, we find that the top ten most similar words for these chosen words remain consistent between W SS and W Fin, albeit with minor differences in word order. These findings extend to the other benchmark words, indicating the reliability of our method in maintaining semantic coherence.
In conclusion, the sentiment-semantic word vectors W SS derived using our approach outperform the semantic-only word vectors W Fin in capturing sentiment. Moreover, while proficiently conveying word sentiment, W SS effectively retains the word semantics inherent in W Fin.

How accurately does W SS classify document sentiment?
W SS has demonstrated greater proficiency than W Fin in capturing both word sentiment and semantics. However, how it performs in sentiment classification remains unanswered. In conjunction with the findings in Sect. 4.1, a superior performance of W SS compared to W Fin in sentiment classification will add robustness to our approach to calibrating word vectors for effective sentiment and semantic representation. Indeed, many studies validate their proposed models by sentiment classification (Huang, Wang, and Yang, 2023; Li, 2010a; Lutz, Pröllochs, and Neumann, 2020), demonstrating the efficacy of this validation method in evaluating a novel approach. Formally, we maximize a log-likelihood function for the sentiment classification task, in which the probability that document i belongs to sentiment class m, given the word vectors W k, is p(s i = m | φ k m, W k, X i). To estimate φ k, we use the training part of the Financial Phrasebank dataset that was used to estimate W SS. The testing part is then used to evaluate the classification performance. Besides W SS and W Fin, we implement a sentiment classification model using the Loughran-McDonald dictionary. While comparing the classification capability of W SS with that of W Fin makes manifest the importance of capturing sentiment when classifying sentiment using word vectors, the comparison of the word vectors (i.e., W SS and W Fin) with the Loughran-McDonald dictionary demonstrates the significance of word semantics in sentiment classification. Following the convention used in many studies (Henry, 2008; Jiang, Lee, Martin, and Zhou, 2019; Loughran & McDonald, 2011), the predicted sentiment class of document i using the Loughran-McDonald dictionary is determined by a word count: the class is positive if #(pos) i > #(neg) i, negative if #(neg) i > #(pos) i, and neutral otherwise, in which #(pos) i and #(neg) i are, respectively, the numbers of positive and negative words defined by the Loughran-McDonald dictionary that appear in document i, and 1, 2, and 3 indicate negative, neutral, and positive sentiments.
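The dictionary-based classification rule just described can be sketched as follows; `pos_words` and `neg_words` stand in for the Loughran-McDonald word lists (a minimal illustration, not the paper's code):

```python
def lm_sentiment_class(tokens, pos_words, neg_words):
    """Word-count sentiment rule: 1 = negative, 2 = neutral, 3 = positive.

    pos_words / neg_words stand in for the Loughran-McDonald word lists
    (illustrative; the real lists contain thousands of financial terms).
    """
    n_pos = sum(t in pos_words for t in tokens)
    n_neg = sum(t in neg_words for t in tokens)
    if n_pos > n_neg:
        return 3                 # positive
    if n_neg > n_pos:
        return 1                 # negative
    return 2                     # neutral
```

Because the rule treats words as independent counts, it ignores semantics entirely, which is exactly the contrast the comparison with the word vector classifiers is designed to expose.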
Like Malo et al. (2014), we opt for the F1 score as the evaluation metric for this classification. We present both class-wise and global F1 scores across sentiment categories for a more comprehensive assessment. Since the F1 score is traditionally used in binary classification, one needs to adapt it for multi-class classification problems via aggregation. Specifically, we apply two types of aggregation: micro- and macro-averages.14

Table 4 reports a wide range of F1 scores for sentiment classification within the Financial Phrasebank dataset. It includes the results from the Loughran-McDonald dictionary-based approach and the word vector approaches using W Fin and W SS. In general, the word vector approach using W SS outperforms its competitors in classifying sentiment. Together with the results of Sect. 4.1, this reinforces the success of our approach in injecting sentiment meanings into semantic-only word vectors, and subsequently in producing a more accurate document sentiment classification.
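For concreteness, the micro- and macro-averaged F1 scores can be computed from a multi-class confusion matrix as follows (a sketch; for single-label multi-class problems the micro-averaged F1 reduces to accuracy):

```python
import numpy as np

def f1_scores(conf):
    """Per-class, micro-, and macro-averaged F1 from a confusion matrix.

    conf[i, j] counts documents with true class i predicted as class j.
    Micro-averaging pools all counts (for single-label multi-class data it
    equals accuracy); macro-averaging is the unweighted mean of per-class F1.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / conf.sum(axis=0)       # column sums: predicted counts
    recall = tp / conf.sum(axis=1)          # row sums: true counts
    per_class = 2 * precision * recall / (precision + recall)
    micro = tp.sum() / conf.sum()           # pooled counts
    macro = per_class.mean()
    return per_class, micro, macro
```

The macro-average weights the three sentiment classes equally, which matters for the unbalanced Financial Phrasebank dataset; the micro-average instead reflects overall accuracy.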
A more thorough examination of Table 4 reveals deeper insights into the sentiment classification capabilities of the methods under consideration. Surprisingly, with a high F1 score in the negative class relative to its competitors, the Loughran-McDonald dictionary performs fairly well in identifying pessimistic documents. This proficiency in detecting negative sentiment may explain the findings of Loughran and McDonald (2011), wherein only the pessimism embedded in 10-K filings is associated with stock returns.
Second, the semantic-only word vector W Fin performs the worst in classifying pessimism, even compared with the dictionary-based approach, and by a large margin: its F1 score is only 0.13, against 0.36 and 0.45 for the other approaches. Combined with the lowest macro-average F1 score of W Fin, this result suggests that relying exclusively on word semantics is inadequate for gauging nuanced sentiment expressions precisely.
Third, the superior performance of W SS in most cases suggests that, in order to measure sentiments accurately, both word sentiment and semantics are required. Comparing W SS with the Loughran-McDonald approach reveals that word semantics, when standing alone, may not be powerful but are still crucial in classifying sentiment precisely. Our findings correspond with many criticisms of bag-of-words methods for their treatment of words as independent units; see Huang et al. (2023), Mikolov et al. (2013), and Li et al. (2021), among others. Together with the empirical results shown in Sect. 4.1, two conclusions can be drawn. First, our approach successfully obtains a set of word vectors that captures both word sentiment and semantics in the financial context. Furthermore, the small size of the Financial Phrasebank dataset highlights the adaptability of our approach in handling small and domain-specific data. Second, both the captured sentiment and semantics play crucial roles in accurately identifying sentiment. Ultimately, the next question is how an accurate sentiment measurement can be applied to explore economic value or answer financial puzzles. Subsequent sections delve into this topic.

Table 4 compares the performances of three approaches for sentiment classification in the Financial Phrasebank dataset: (i) the Loughran-McDonald dictionary-based approach, (ii) the word vector approach based on W Fin, and (iii) the word vector approach based on W SS. The first three columns present the component F1 scores of the three sentiment classes (negative, neutral, and positive); the last two columns exhibit the global F1 scores using the micro- and macro-averages. The bold numbers indicate the best method for each evaluation metric.

Construction of the management sentiment index
To examine the predictive effects of management sentiment on stock markets, it is natural to construct an index that conveys firm managers' sentiment through corporate disclosures, and then to investigate the connection between this index and future stock returns. Attempts of this nature are commonly based on bag-of-words approaches, in which the sentiment of a document is a projection onto a pre-defined lexicon (Feldman, Govindaraj, Livnat, and Segal, 2010; Henry, 2008; X. Huang, Teoh, and Zhang, 2014; Jiang, Lee, Martin, and Zhou, 2019; Loughran & McDonald, 2011; Price, Doran, Peterson, and Bliss, 2012). Other studies obtain sentiment labels by human annotation (Li, 2010a) or from the associated market reactions (Frankel, Jennings, and Lee, 2022; Jegadeesh & Wu, 2013). We instead measure management sentiment using our word vectors W SS and the MD&A corpus. In particular, we apply the sentiment classification model using W SS (i.e., model C in Table 4) to produce predictions of the sentiment classes (i.e., negative, neutral, and positive) on the MD&A corpus. The sentiment score of each MD&A document is computed as the weighted expected value of its predicted sentiment.15 We then construct our management sentiment index, hereafter denoted S SS, as a simple average of the management sentiment scores from the MD&A documents released in a given month. Following Jiang et al. (2019), we smooth the management sentiment index with a four-month moving average to mitigate the effects of idiosyncratic noise. The moving average is implemented retrospectively, utilizing data from the past four months to prevent look-ahead bias. We provide the details of the sentiment estimation in Appendix A.
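The aggregation into a monthly index can be sketched with pandas; `scores` is assumed to be a Series of per-document sentiment scores indexed by filing date (an illustrative input, not the paper's data):

```python
import pandas as pd

def management_sentiment_index(scores, window=4):
    """Monthly management sentiment index from document-level scores.

    scores: Series of per-document sentiment scores indexed by filing date
    (illustrative input). Documents are averaged within each calendar month,
    then smoothed with a trailing moving average over the past `window`
    months, so no future information enters a given month's value.
    """
    monthly = scores.groupby(scores.index.to_period("M")).mean()
    return monthly.rolling(window=window, min_periods=window).mean()
```

Using a trailing (rather than centered) rolling window mirrors the retrospective smoothing in the text: the index value for month t depends only on months t − 3 through t.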
For the sake of methodological comparison, we also construct two other management sentiment indexes similar to S SS: (i) the Loughran-McDonald sentiment index, S LM, and (ii) the semantic-only sentiment index, S Fin. The first is constructed using a word-count approach based on the Loughran-McDonald dictionary (Loughran & McDonald, 2011). The second is formed in the same way as S SS but using the FinText word vectors, W Fin, instead of W SS. Both indexes are derived from the MD&A corpus and aggregated in the same way as S SS, as described in the previous paragraph. As the final step, the three management sentiment indexes are standardized to have zero mean and unit variance to eliminate the effects of scale differences.16 As shown by the empirical results in Sect. 4, S SS is a more advantageous representation of management sentiment than S Fin or S LM because of the more effective sentiment representation of W SS over W Fin and the Loughran-McDonald approach.
Figure 1 presents the variations of the three sentiment indexes over time. At first glance, S LM and S Fin, the sentiment indexes built with the Loughran-McDonald dictionary and the FinText word vectors, respectively, exhibit strong seasonality throughout most of the data sample. While S LM only shows a decline during the dot-com crisis, S Fin displays a local decreasing trend during the financial crisis and remains steady during the dot-com recession. This implies that S LM and S Fin do not adequately capture explanatory information about business cycles or historical events. This observation is expected because the Loughran-McDonald dictionary and W Fin, which are the cores of S LM and S Fin, capture only one aspect of the sentiment-semantic trade-off.
Unlike S LM and S Fin, S SS aligns well with business states, especially the essence of the two recessions. In particular, the management sentiment based on S SS starts low but gradually increases until just before the dot-com crisis. During the dot-com crisis, the management sentiment drops, and it then remains low until around 2003. This period coincides with several high-profile accounting fraud cases (e.g., Enron and WorldCom) coming to light, which may have driven down management sentiment. Following this period, S SS rises and then remains high until just before the 2008 financial crisis, implying that firm managers tended to use an optimistic tone in their MD&A sections during this time. During the financial crisis, S SS again displays a decrease in management sentiment. As opposed to S LM and S Fin, S SS does not exhibit noticeable seasonality across the data sample.

15 The weights are the inverse proportions of the sentiment classes in the Financial Phrasebank dataset. We use the weighted average because the distribution of the sentiment classes in the MD&A corpus may differ from that in the Financial Phrasebank dataset, which is well known to be unbalanced. Consequently, biases caused by a distributional shift may occur if weights are not applied.

16 The standardization is implemented for the whole time series for the visualization and in-sample regression studies in Sect. 6.1. For the out-of-sample studies, index standardization is executed in a recursive-window manner to avoid look-ahead bias; see Sect. 6.2 for further details of the study design.

Fig. 1 The market-level management sentiment indexes extracted from the MD&A section of 10-K filings. The first plot depicts the management sentiment index S SS constructed using the sentiment-semantic word vectors trained on the Financial Phrasebank dataset. The second plot is of the sentiment index S LM constructed using the bag-of-words method based on the Loughran-McDonald dictionary (Loughran & McDonald, 2011). The third plot depicts the sentiment index S Fin built by the FinText word vectors. We also present the series of log returns on the S&P 500 index. The vertical gray bars indicate the economic recessions defined by the NBER. The data sample spans the period from 1994:01 to 2018:12.

Predictive regression analysis
In this section, we provide empirical evidence regarding the stock return predictability of our sentiment-semantic management sentiment index, S SS. This goal is achieved through comparative analyses between our management sentiment index and the index built with the Loughran-McDonald dictionary-based method, S LM. We focus our comparative analyses on S LM instead of S Fin because the Loughran-McDonald dictionary is widely used to extract sentiment in the current literature (Loughran & McDonald, 2011; Jiang, Lee, Martin, and Zhou, 2019; Sautner, Van Lent, Vilkov, and Zhang, 2023) and thus serves as a strong benchmark. We implement the analyses in both an in-sample and an out-of-sample manner to guarantee the robustness of our findings.

In-sample market return predictability
We first examine the market return predictability of the sentiment indexes, S SS and S LM. To empirically test the market return predictability of the sentiment indexes, we design the following set of predictive regressions,

CER_{t→t+h} = α + β_{SS} S^{SS}_t + β_{LM} S^{LM}_t + γ·Recession_t + Σ_m δ_m·Month_{m,t} + ε_{t→t+h},   (10)

where CER_{t→t+h} is the cumulative excess market return (i.e., the monthly returns on (i) the value-weighted average CRSP index and (ii) the S&P 500 index, in excess of the risk-free rate, from month t to month t + h); Recession_t is a dummy variable indicating the economic recessions defined by the National Bureau of Economic Research; and Month_{m,t} are monthly dummies. The univariate specifications include only one of S^{SS}_t and S^{LM}_t; the bivariate specification includes both.
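Predictive regressions of this type are conventionally estimated by OLS with Newey-West standard errors, since overlapping h-month returns induce serial correlation in the errors. Below is a minimal NumPy sketch of the estimator (Bartlett kernel; monthly dummies omitted for brevity; names illustrative, not the paper's code):

```python
import numpy as np

def ols_newey_west(y, X, lags):
    """OLS coefficients with Newey-West (HAC) standard errors.

    X should already include a constant column; `lags` is the truncation
    lag of the Bartlett kernel. A minimal sketch of the estimator behind
    predictive regressions with overlapping return horizons.
    """
    y, X = np.asarray(y, float), np.asarray(X, float)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    u = y - X @ beta                               # residuals
    Z = X * u[:, None]                             # score contributions x_t * u_t
    S = Z.T @ Z                                    # lag-0 term
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)                 # Bartlett weight
        G = Z[l:].T @ Z[:-l]
        S += w * (G + G.T)
    cov = XtX_inv @ S @ XtX_inv
    return beta, np.sqrt(np.diag(cov))
```

In practice, a library routine such as statsmodels' HAC covariance option would typically be used; the hand-rolled version is shown only to make the Bartlett weighting explicit.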
Our experiment is inspired by Jiang et al. (2019) yet possesses two important differences. First, in addition to the S&P 500 index, we also, like Jegadeesh and Wu (2013), use the value-weighted CRSP index to compute the market returns. This additional index is expected to enhance the robustness of the test results. Second, we control for recession fixed effects and monthly fixed effects in all equations to capture the potential variations caused by seasonality and business cycles. As shown in Fig. 1, the recessions negatively affect both management sentiment and the S&P 500 index.17 Therefore, omitting the recession control may lead to inconsistent estimates. While the seasonality is not visible in S SS, it is pronounced in S LM. As a result, we include monthly dummies in all equations to obtain fair comparisons.

Table 5 presents the regression results of Eq. (10) over h-month horizons with h = 1, 3, 6, 9, 12.18 First, for the univariate regressions, the coefficients on S SS are negative and significant at the 5% level for the semi-annual to one-year horizons. We do not, however, observe any significant coefficients on S LM. The level of significance is stronger with the S&P 500 index than with the value-weighted CRSP index, as evidenced by significant coefficients across all horizons. Intuitively, the sentiment index that integrates word sentiment and semantics is negatively associated with future cumulative excess market returns. In contrast, the index based solely on word sentiment shows no correlation with market returns.
The bivariate regression results reveal deeper insights into the superior predictive capacity of S SS in comparison with S LM. In particular, adding S LM to the models with only S SS results in limited changes to the sign and significance of the coefficients on S SS in all regressions. Moreover, the R 2 values of the bivariate models are similar to those of the corresponding regressions on S SS alone in most cases (e.g., with the S&P 500 index at the semi-annual horizon, the R 2 for the two models is 16.8% and 16.4%, respectively). With a substantial correlation of −0.499 between S SS and S LM, these findings suggest that S SS possesses predictive insights regarding future market returns beyond those of S LM.
Economically, at the semi-annual horizon, an increase of one standard deviation in management sentiment is associated with a decrease of 2.3% in the cumulative returns on the value-weighted CRSP index and a decrease of 2.7% on the S&P 500 index. Furthermore, the estimated coefficient on S SS increases in absolute value as h increases. This result implies that S SS consistently and significantly predicts the cumulative excess market returns in the long run. In addition, the predictive power of S SS becomes stronger as the horizon gets longer. Across the horizons, the in-sample R 2 of the regressions on S SS ranges from 10.0% to 18.4% with the value-weighted CRSP index, and from 10.3% to 20.0% with the S&P 500 index. This means that S SS is a factor that can explain large in-sample variations of future excess market returns. Moreover, the out-of-sample tests presented in Sect. 6.2 show that this result is maintained out-of-sample.

Table 5 reports these regressions over h-month horizons with h = 1, 3, 6, 9, 12. The dependent variable, CER t→t+h, is the cumulative excess market return, i.e., the monthly returns on (i) the value-weighted average CRSP index (Panel A) and (ii) the S&P 500 index (Panel B) in excess of the risk-free rate, from month t to month t + h. S SS and S LM are the management sentiment indexes extracted from the MD&A section of 10-K filings using, respectively, the sentiment-semantic word vectors and the Loughran-McDonald dictionary (Loughran & McDonald, 2011). A constant term (α) and a recession dummy (Recession) are also included in each regression equation. The coefficients, Newey-West heteroscedasticity- and autocorrelation-robust t-statistics (in parentheses), and R 2 are reported. The data sample spans the period from 1994:01 to 2018:12. *, **, and *** denote significance at the 10%, 5%, and 1% levels, respectively.

In conclusion, these results complement those of Jiang et al. (2019), who discover a negative correlation between future market returns and sentiment extracted from 10-K/Q reports and conference calls. Our finding suggests that the sentiment information derived exclusively from the MD&A section of 10-K filings is also a strong negative predictor of the stock market, provided that the sentiment is precisely measured.

Out-of-sample market return predictability
In numerous predictive analyses, researchers discover substantial predictive evidence with in-sample data, yet struggle to obtain significant predictive power with out-of-sample data (Inoue & Kilian, 2005). Additionally, out-of-sample analyses tend to be more resilient to the econometric issues described in Sect. 6.1 (Busetti & Marcucci, 2013). Therefore, to provide a robust validation of the predictive power of our management sentiment index, S SS, we conduct several out-of-sample return predictive analyses at the market level. According to Welch and Goyal (2008), in stock return forecasting, a historical average of stock returns frequently outperforms regression models of stock returns on economic predictors. Therefore, whether S SS outperforms the historical average benchmark model is of great interest. Moreover, as demonstrated in Sect. 6.1, S SS encompasses predictive information additional to that of S LM for future market returns based on in-sample tests. If this result is preserved out-of-sample, it can be demonstrated that S SS holds significant economic value when compared to S LM. Accordingly, we compare the market return predictive power of: (i) each of the sentiment indexes S SS and S LM with that of the historical average benchmark; and (ii) the model combining S SS and S LM with that containing only S LM. Technically, we conduct the following two tests,

Test A: Model 1A: CER_{s+1→s+h} = α + ε_{s+1→s+h} versus Model 2A: CER_{s+1→s+h} = α + β S^k_s + ε_{s+1→s+h}, k ∈ {SS, LM};

Test B: Model 1B: CER_{s+1→s+h} = α + β_{LM} S^{LM}_s + ε_{s+1→s+h} versus Model 2B: CER_{s+1→s+h} = α + β_{SS} S^{SS}_s + β_{LM} S^{LM}_s + ε_{s+1→s+h};

in which model 1 in each test is a parsimonious model and model 2 nests the corresponding model 1. As in the in-sample studies, we regress {CER_{s+1→s+h}}_{s=1}^{t+1−h} on predictor variables {X_s}_{s=1}^{t+1−h}, in which X_s is a combination of a constant term, S^{SS}_s, and S^{LM}_s, depending on the model.
Unlike the in-sample studies, the recession dummy is not included in any model because recessions are typically determined ex post using macroeconomic variables, which often suffer from delays and revisions in their publication (Clements, 2019). By excluding the recession dummy, the out-of-sample prediction resembles a real-time prediction. We also exclude monthly dummies from the models in Test A for two reasons. First, Model 1A without monthly dummies serves as the robust historical average benchmark described by Welch and Goyal (2008). Second, including monthly dummies only in Model 2A would skew the comparison of forecasting power between the sentiment indexes and the benchmark, as the forecasting power of Model 2A would be partly driven by the monthly dummies. For Test B, the results remain robust to the inclusion of monthly dummies; see Appendix C for further details.
The tests are implemented in a recursive-window manner (West & McCracken, 1998), in which the data from 1994:01 to 1999:12 form the initial training set, and the period from 2000:01 to 2018:12 is used as the evaluation period.19 It is worth noting that the sentiment indexes are also standardized recursively to avoid look-ahead bias. In particular, within one window, we execute the following steps in order: (i) standardize the index; (ii) estimate the model; (iii) standardize the index values in the prediction part by the statistics of the index in the training part; and (iv) make predictions of returns.
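Steps (i)-(iv) can be sketched as follows, assuming the sentiment index has been aligned so that the observation at time t is paired with the target return to be forecast (all names illustrative, not the paper's code):

```python
import numpy as np

def recursive_forecasts(cer, index, start):
    """Recursive-window forecasts with leakage-free index standardization.

    cer[t] is the target return aligned with index[t] (illustrative
    alignment). For each t >= start: (i) standardize the index on the
    training sample only; (ii) estimate OLS on the training sample;
    (iii) standardize index[t] with the training mean/std; (iv) forecast.
    """
    cer, index = np.asarray(cer, float), np.asarray(index, float)
    preds = []
    for t in range(start, len(cer)):
        mu, sd = index[:t].mean(), index[:t].std()
        s_train = (index[:t] - mu) / sd                      # step (i)
        X = np.column_stack([np.ones(t), s_train])
        beta = np.linalg.lstsq(X, cer[:t], rcond=None)[0]    # step (ii)
        s_next = (index[t] - mu) / sd                        # step (iii)
        preds.append(beta[0] + beta[1] * s_next)             # step (iv)
    return np.array(preds)
```

Because the mean and standard deviation in steps (i) and (iii) come only from the training window, no information from the evaluation period leaks into the forecast.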
We define the mean squared prediction error (MSPE) as a measure of prediction accuracy.20 In both tests A and B, we test the null hypothesis that the MSPE of the parsimonious model is smaller than or equal to that of the nested model against the alternative hypothesis that the nested model has a smaller MSPE. To this end, we use the Campbell and Thompson (2008) out-of-sample R 2 (R 2 OS), which is defined as follows:

R 2 OS = 1 − [Σ_{t=P}^{T−h} (CER_{t+1→t+h} − CER_{2,t+1→t+h})²] / [Σ_{t=P}^{T−h} (CER_{t+1→t+h} − CER_{1,t+1→t+h})²],   (11)

in which P is the starting time point of the evaluation dataset, which is 2000:01 in our case; and CER_{j,t+1→t+h} with j = 1, 2 are the out-of-sample forecasts produced by, respectively, the parsimonious (model 1) and the nested (model 2) models in each test. By definition, R 2 OS lies in the range (−∞, 1]. A significantly positive R 2 OS leads to the conclusion that the nested model has better forecasting ability than the parsimonious model, implying that the additional variables improve on the benchmark variables in predicting stock returns. Accordingly, the above-mentioned testing hypotheses become H 0: R 2 OS ≤ 0 against H A: R 2 OS > 0. We adopt the adjusted MSPE statistic proposed by Clark and West (2007), which is the difference between the MSPE statistics of models 1 and 2 with a bias adjustment, to test the significance of R 2 OS. Clark and West (2007) show that the adjusted MSPE statistic asymptotically follows a standard normal distribution, and the null hypothesis is rejected if the statistic exceeds +1.282, +1.645, and +2.323 for a one-sided test at the 10%, 5%, and 1% significance levels, respectively.

Phan, Swiss Journal of Economics and Statistics (2024) 160:9
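The out-of-sample R 2 and the Clark-West adjusted-MSPE statistic can be computed as follows (a minimal sketch; the variance of the adjusted-MSPE differentials is not HAC-corrected here, which overlapping longer horizons would require):

```python
import numpy as np

def r2_os(actual, pred_parsimonious, pred_nested):
    """Campbell-Thompson out-of-sample R^2: 1 - MSPE(nested)/MSPE(parsimonious)."""
    a, p1, p2 = (np.asarray(v, float) for v in (actual, pred_parsimonious, pred_nested))
    return 1.0 - np.sum((a - p2) ** 2) / np.sum((a - p1) ** 2)

def clark_west_stat(actual, pred_parsimonious, pred_nested):
    """Clark-West (2007) adjusted-MSPE t-statistic for nested comparisons.

    f_t = e1_t^2 - (e2_t^2 - (p1_t - p2_t)^2); the statistic is the t-ratio
    of the mean of f_t, compared with one-sided standard normal critical
    values. The variance here is i.i.d.-style, not HAC-adjusted.
    """
    a, p1, p2 = (np.asarray(v, float) for v in (actual, pred_parsimonious, pred_nested))
    e1, e2 = a - p1, a - p2
    f = e1 ** 2 - (e2 ** 2 - (p1 - p2) ** 2)
    return f.mean() / np.sqrt(f.var(ddof=1) / len(f))
```

The bias-adjustment term (p1 − p2)² removes the extra estimation noise that the larger nested model carries under the null, which is why the statistic can reject even when the raw MSPEs are close.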
Table 6 reports the results of tests A and B on the value-weighted CRSP and the S&P 500 indexes. For the value-weighted CRSP index, we observe that S SS significantly (at the 10% level) improves on the historical average in predicting the cumulative excess market returns at the three-month horizon. In contrast, S LM exhibits no predictive power for the cumulative market returns. With significant R 2 OS at the three-month, semi-annual, nine-month, and one-year horizons in test B, we see that S SS adds significant predictive information to the model with only S LM in the medium and long run.
For the S&P 500 index, S SS possesses even stronger predictive power for future market returns than it does for the value-weighted CRSP index. In comparison with the historical average benchmark, the model with S SS produces more precise forecasts of cumulative excess market returns across all the time horizons considered except the one-month horizon. We also observe a monotonic increase in R 2 OS along the expanding horizons in this case, implying that S SS increasingly predicts future market returns when the portfolio is held for a longer period.
We still do not observe a significantly positive R 2 OS with S LM in any model, suggesting that the sentiment of the MD&A documents, as extracted by the Loughran-McDonald dictionary, does not contain more predictive information for future market returns than the historical average model. Test B conducted on the S&P 500 index further corroborates the findings regarding the enhanced predictive capability of S SS over S LM. More concretely, the presence of numerous significant and positive R 2 OS statistics highlights the considerable predictive capacity contributed by S SS to models solely reliant on S LM. These models, which were previously shown to lack predictive power regarding future market returns, now exhibit enhanced predictive ability because of the inclusion of S SS.

Jiang et al. (2019) find that management sentiment in the current month t contains predictive information beyond that of the historical average benchmark for predicting the market return in the next month t + 1, but our results suggest the opposite. We conjecture that the difference is rooted in the inclusion of more abundant data sources in the Jiang et al. (2019) management sentiment index. This inclusion allows them to "...examine manager sentiment on a more timely basis" (Jiang et al., 2019, p. 129). Consequently, their management sentiment index can exploit earlier effects of the sentiment on the stock markets. However, our results show that exploiting the sentiment exclusively from the MD&A section of 10-K filings is capable of capturing mispricing information in stock prices. Our findings contribute to the literature on stock return predictability and corporate disclosures by showing that the mispricing information contained in the management sentiment embedded in 10-K filings may to some extent be concentrated in the MD&A section.

Table 6 reports the out-of-sample performance of the management sentiment indexes, S SS and S LM, in predicting the cumulative excess market returns, i.e., the monthly returns on (i) the value-weighted average CRSP index (Panel A) and (ii) the S&P 500 index (Panel B) in excess of the risk-free rate, from month t + 1 to month t + h. S SS and S LM are the management sentiment indexes extracted from the MD&A section of 10-K filings using, respectively, the sentiment-semantic word vectors and the Loughran-McDonald dictionary (Loughran & McDonald, 2011). Test A evaluates the predictive performance of S SS and S LM in comparison with the historical average benchmark. Test B evaluates the predictive performance of S SS in addition to S LM. R 2 OS is the out-of-sample Campbell and Thompson (2008) R 2. The adjusted MSPE statistic (in parentheses) is the mean squared prediction error statistic introduced by Clark and West (2007) to test the null hypothesis that the MSPE of the parsimonious models (i.e., the historical average model in test A, and the S LM-only model in test B) is smaller than or equal to the MSPE of the nested models. The tests are implemented in a recursive-window manner (West & McCracken, 1998), in which the data from 1994:01 to 1999:12 form the initial training set, and the period from 2000:01 to 2018:12 is used as the evaluation period. * and ** denote significance at the 10% and 5% levels, respectively.

Management sentiment and macroeconomic channels
So far, the management sentiment index S SS has been found to negatively predict future stock returns. According to Jiang et al. (2019), the negative predictive power of management sentiment may be due to investors' misjudgment of future firm earnings. This section aims to provide another angle on this finding, through the lens of macroeconomic channels.
To this end, we examine the complementary predictive power of S SS in addition to the 14 macroeconomic variables provided by Welch and Goyal (2008). In particular, we re-implement the in-sample and out-of-sample predictive regression analyses used in Sect. 6, with S LM replaced by each of the 14 macroeconomic variables. It should be noted that, within this section, we use only the S&P 500 index as the market index, because several macroeconomic variables are derived from the S&P 500 index.21

We first implement the in-sample analysis regarding the predictive information covered by S SS in relation to the 14 macroeconomic variables, using the following set of regressions:

CER_{t→t+h} = α + β S^{SS}_t + ψ X_t + γ·Recession_t + ε_{t→t+h},   (12)

in which X_t is one of the 14 macroeconomic variables described in Sect. 3. Unlike the model in Sect. 6.1, we exclude monthly dummies here because neither S SS nor the macroeconomic variables show seasonal patterns.

Table 7 reports the estimation results for the above regression equations. We observe that S SS exhibits significant correlations with future stock returns at the 5% level when nested with all the macroeconomic variables except the dividend-price ratio (DP) and the dividend yield (DY). These results imply that the management sentiment index S SS may capture information relating to the dividend payments of S&P 500 firms. With significant and negative coefficients in the regressions other than those with DP and DY, S SS demonstrates that its predictive information is orthogonal to that of the other macroeconomic variables, even those with strong stock return predictability such as the book-to-market ratio (B/M) and the default return spread (DFR).

The out-of-sample test results, which are detailed in Table 8, reinforce these findings. We find that S SS makes a limited contribution to the predictive ability of the dividend-related variables (the dividend-price ratio (DP), the dividend yield (DY), and the dividend-payout ratio (DE)). As with the in-sample findings, S SS is found to add significant power to the other macroeconomic variables in predicting out-of-sample future stock returns.

We conjecture that this result is rooted to some extent in the discussions of firm managers in the MD&A section regarding dividend payment plans. For example, in the MD&A section of Apple Inc.'s 10-K filing in 2015, the company wrote, "...In April 2014, the Company increased its share repurchase authorization to $90 billion and the quarterly dividend was raised to $0.47 per common share, resulting in an overall increase in its capital return program from $100 billion to over $130 billion. During 2014, the Company utilized $45 billion to repurchase its common stock and paid dividends and dividend equivalents of $11.1 billion... ... The Company currently anticipates the cash used for future dividends, the share repurchase program, and debt repayments will come from its current domestic cash, cash generated from ongoing U.S. operating activities and from borrowings..." Another example can be found in the MD&A of the 2012 10-K filing of Microsoft Corporation, in which the company wrote, "...Cash used for financing increased $1.0 billion to $9.4 billion due mainly to a $6.0 billion net decrease in proceeds from issuances of debt and a $1.2 billion increase in dividends paid, offset in part by a $6.5 billion decrease in cash used for common stock repurchases... ... 
We expect existing domestic cash, cash equivalents, short-term investments, and cash flows from operations to continue to be sufficient to fund our domestic operating activities and cash commitments for investing and financing activities, such as regular quarterly dividends, debt repayment schedules, and material capital expenditures, for at least the next 12 months and thereafter for the foreseeable future..." In general, we provide evidence that the predictive power of the management sentiment index S SS is fully absorbed by the information about the dividend payment plans of S&P 500 firms. This absorption can be attributed to the discussions of dividends by firm managers in the MD&A section of 10-K filings. Our findings, however, are not equivalent to the assertion that the dividend-related information located in the MD&A section is a cause of the predictive power of the management sentiment. The causal effects are left for future studies.

Conclusion
This paper sheds light on the ability of the sentiment contained in the MD&A section of 10-K filings from January 1994 to December 2018 to predict future returns. Unlike most existing studies, we introduce a novel method for accurately gauging the MD&A sentiment. In particular, our method relies on three components: (i) the Google pre-trained Word2Vec model to anchor word representations to initial semantic information; (ii) the knowledge distillation method; and (iii) a dataset with sentiment labels acting as sentiment guidance. The result of our approach is a set of word vectors capturing both sentiment and semantic meanings.
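To make the document-level use of these word vectors concrete, the predicted class probabilities described in Appendix A can be sketched as a softmax over class scores built from a tf.idf-weighted document embedding. The array shapes and names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def document_sentiment_probs(x_tfidf, W, phi):
    """Predicted probability of each sentiment class for one document.

    x_tfidf : (V,) tf.idf weights over the vocabulary (hypothetical shape)
    W       : (V, d) sentiment-semantic word vectors
    phi     : (3, d) per-class parameters, one row per sentiment class
    Returns a length-3 probability vector from a softmax over class scores.
    """
    doc_vec = x_tfidf @ W           # tf.idf-weighted document embedding, shape (d,)
    scores = phi @ doc_vec          # one scalar score per sentiment class, shape (3,)
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```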
Our proposed method enhances sentiment classification at both the word and document levels. Explicitly, we suggest that omitting either sentiment or semantic meanings leads to inefficient sentiment classification. This result underlines the importance of these two facets in obtaining an accurate sentiment measurement.
By using the sentiment-semantic word vectors, we build a management sentiment index whose variations match well, conceptually, with different economic episodes. The index based on the semantic-only approach is, however, unable to produce meaningful interpretations of the states of the economy. This observation reaffirms the importance of the sentiment nuances captured by word vectors in exploring the economic implications of MD&A documents.
Finally, our proposed management sentiment index is a strong negative predictor of future stock returns. Moreover, we show that it embraces predictive insights concerning future stock returns beyond the dictionary-based sentiment index. These findings hold in both in-sample and out-of-sample setups. Based on these results, three conclusions are drawn concerning the sentiment analysis of MD&A documents. First, it is crucial to have an accurate measurement to obtain meaningful sentiment information. Second, the MD&A section of 10-K filings contains information regarding firm conditions that may lead to stock mispricing. Third, the predictive power of the management sentiment of MD&A documents relates to the information about dividend payment plans.
A potential limitation, however, is that our model, although based on semantic word representation, remains statically contextualized: a word is encoded by a single numerical vector regardless of the surrounding context in a sentence or paragraph. This limitation suggests an extension of the current work in which language models like FinBERT (Huang, Wang, and Yang, 2023) are combined with our proposed method. The dynamic contextualization of language models is anticipated to uncover more insights into corporate disclosures.

Appendix A: Construction of the management sentiment index
Denote the tf.idf representation of the MD&A document i that is released in month t as X^MDA_{i,t}. Following the instructions in Sect. 4.2, the predicted probability of each sentiment class m, with m = 1, 2, 3, conditioning on W SS is p(s_i = m | φ̂^SS_m, W^SS, X^MDA_{i,t}). It is worth noting that φ̂^SS_m is the estimated parameter of model C in Table 4. The sentiment score of this MD&A document based on W SS is computed as

S_{i,t} = Σ_{m=1}^{3} ω_m · m · p(s_i = m | φ̂^SS_m, W^SS, X^MDA_{i,t}),

where ω_1 = 1/604, ω_2 = 1/2879, and ω_3 = 1/1363, which are the inverse proportions of the sentiment classes in the Financial Phrasebank dataset.

Table 8
This table reports the out-of-sample stock return predictability of the management sentiment index S SS, in addition to the 14 macroeconomic variables (Welch & Goyal, 2008). The stock market returns are computed as the monthly returns on the S&P 500 index in excess of the risk-free rate, from month t + 1 to month t + h. The definitions of the 14 macroeconomic variables are given in Sect. 3. R²_OS is the out-of-sample Campbell and Thompson (2008) R². The adjusted MSPE statistic (in parentheses) is the mean squared prediction error statistic introduced by Clark and West (2007) to test the null hypothesis that the parsimonious models (i.e., the models containing exclusively the macroeconomic variables) have smaller or equal MSPE than the nested models. The tests are implemented in a recursive-window manner (West & McCracken, 1998).
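Under our reading of the sentiment-score formula, an ω-weighted sum over the three classes with no further normalization (an assumption, since the source rendering of the equation is garbled), the computation can be sketched as:

```python
import numpy as np

# Inverse class proportions from the Financial Phrasebank dataset
# (classes m = 1, 2, 3 with 604, 2879, and 1363 examples, per Appendix A).
OMEGA = np.array([1 / 604, 1 / 2879, 1 / 1363])

def sentiment_score(probs):
    """Score of one MD&A document: sum over classes m of
    omega_m * m * p(s_i = m | .). `probs` is a length-3 vector of
    predicted class probabilities."""
    m = np.arange(1, 4)  # class labels 1, 2, 3
    return float(np.sum(OMEGA * m * np.asarray(probs, dtype=float)))
```

The inverse-proportion weights down-weight the dominant neutral class, so a document confidently assigned to a rare class moves the score more than one assigned to a common class.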

Table 1
This table reports the top ten most similar words to the word "bad" based on the Google pre-trained word vectors, the word vectors trained on our MD&A corpus, and FinText.

Table 2
This table reports the confusion matrix of the word-level sentiment classification using W SS and W Fin with different values of the top n most similar words and similarity thresholds τ. The positive and negative words are determined by the Loughran-McDonald dictionary. The bold numbers indicate which of the word vectors W SS and W Fin is more proficient in capturing sentiment, measured by classification accuracy.
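The word-level classification this table evaluates can be illustrated with a toy nearest-neighbor rule: label a word by the sentiment seed words (here standing in for Loughran-McDonald entries) found among its top-n most similar neighbors above a similarity threshold τ. The majority-vote rule and the tiny vocabulary below are our assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def classify_word(word, vectors, pos_seeds, neg_seeds, top_n=10, tau=0.3):
    """Label `word` by majority vote of seed words among its top-n most
    similar neighbors with similarity at least tau. `vectors` maps
    word -> embedding; the seed lists are hypothetical stand-ins for a
    sentiment dictionary."""
    target = vectors[word]
    sims = {w: cosine(target, v) for w, v in vectors.items() if w != word}
    ranked = sorted(sims.items(), key=lambda kv: -kv[1])[:top_n]
    neighbors = [w for w, s in ranked if s >= tau]
    pos = sum(w in pos_seeds for w in neighbors)
    neg = sum(w in neg_seeds for w in neighbors)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unclassified"
```

Varying `top_n` and `tau` as in the table trades coverage against precision: a stricter threshold keeps only confident neighbors at the cost of leaving more words unclassified.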

Table 3
This table reports the top ten most similar words to the corresponding words based on the sentiment-semantic word vectors (W SS) and FinText (W Fin).

For every document i, the predicted sentiment class ŝ^k_i based on the word vectors W^k, with k ∈ {SS, Fin}, is the one associated with the highest predicted probability; the other notations are defined in Sect. 2. It should be noted that, with this sentiment classification, the word vectors W^k are fixed and are not subject to further training. The classification accuracies reported in the table are calculated using the testing part of the Financial Phrasebank dataset.

Footnote 14 (continued): The micro-averaged F1 score aggregates the counts of true positives, false positives, and false negatives across all sentiment categories, thereby accounting for the imbalanced labels. The global F1 score under this aggregation is then derived from these aggregated counts. In contrast, the macro-averaged F1 score is computed by first determining the F1 score for each sentiment class individually; the global F1 score under this aggregation is then obtained as the unweighted average of these individual F1 scores, treating each class equally regardless of its size. Consequently, we prioritize the micro-averaged F1 score as our main evaluation metric because of the imbalanced nature of the Financial Phrasebank dataset. We refer readers to Grandini et al. (2020) and Takahashi et al. (2023) for further technical details of F1 scores.
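The contrast between the two aggregations can be checked with a small generic sketch (not tied to the paper's code): micro-averaging pools true positives, false positives, and false negatives across classes, while macro-averaging averages per-class F1 scores with equal weight.

```python
from collections import Counter

def f1_scores(y_true, y_pred, classes):
    """Return (micro_f1, macro_f1). Micro pools TP/FP/FN over all classes;
    macro averages per-class F1 with equal class weight."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro
```

On an imbalanced toy sample the two diverge: predicting the majority class everywhere keeps micro F1 (which equals accuracy in single-label classification) high while macro F1 collapses.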

Table 9
This table reports the correlations between the management sentiment index S SS and the 14 macroeconomic variables. The definitions of the 14 macroeconomic variables are given in Sect. 3. The data sample spans the period from 1994:01 to 2018:12.

Table 10
This table reports the out-of-sample Test B in Sect. 6.2 when monthly dummies are included. R²_OS is the out-of-sample Campbell and Thompson (2008) R², reported for horizons of 1 month, 3 months, 6 months, 9 months, and 1 year, in which the data from 1994:01 to 1999:12 is the initial training set, and the data from 2000:01 to 2018:12 is used as the evaluation period. * and ** denote significance at the 10% and 5% levels, respectively.