Extracting quantitative signals from text is transforming research across fields. Social scientists have built newspaper-based indicators of economic crises, partisan conflict, geopolitical risk, and economic policy uncertainty, to name just four examples. In a similar vein, marketing companies attempt to extract consumer insights from social media by identifying “trending” topics. Most such signals simply track the frequency of documents matching carefully crafted keyword searches. While this approach is simple and intuitive, it suffers from some important shortcomings that other methods avoid. At Kensho, we use machine learning algorithms to derive quantitative signals from documents, and these signals consistently outperform keyword-based methods at surfacing relevant results while filtering out noise.
One problem with keyword searches is their inability to account for the context in which search terms appear, making them count articles they shouldn’t. For example, the plot below shows a naive signal tracking coups, built from a simple text search of our news corpus for the word “coup.” As a first approximation, the index is decent: it registers the August 2013 coup against Morsi in Egypt; the 2014 coup attempts in Libya; the July 2016 coup attempt in Turkey; and the removal from power of Zimbabwe’s Mugabe in December 2017. However, the noticeable uptick in 2018 should raise some eyebrows, as it was a coup-free year. What’s going on?
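The keyword-based index described above can be sketched in a few lines. The mini-corpus below is invented for illustration (the dates and headlines are not real data), but the counting logic is the essence of the naive approach:

```python
from collections import Counter
from datetime import date

# Invented mini-corpus of (publication date, article text) pairs.
articles = [
    (date(2016, 7, 16), "Turkish military factions attempt a coup against the government."),
    (date(2016, 7, 17), "Markets steady after the failed coup attempt in Turkey."),
    (date(2018, 8, 13), "Turkey's lira slides amid deepening economic turmoil."),
    (date(2018, 8, 14), "Crisis worsens in Turkey two years after the failed coup."),
]

def keyword_signal(docs, keyword):
    """Count matching articles per day -- the naive keyword approach.

    Note: plain substring matching would also hit words like "coupon";
    a real query needs word-boundary handling, one more source of clunk.
    """
    counts = Counter()
    for pub_date, text in docs:
        if keyword in text.lower():
            counts[pub_date] += 1
    return dict(counts)

signal = keyword_signal(articles, "coup")
# The 2018 article mentions the coup only as historical context,
# yet it still inflates the index -- exactly the problem described above.
```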
Essentially, 2018 saw geopolitically important events occur in countries where coups had taken place, and thus the word “coup” appeared as historical context in otherwise unrelated news stories. We saw elections in Thailand, Pakistan, the Maldives, and other developing countries that have experienced past coups; a brewing economic crisis in Turkey after the failed ouster of Erdogan; protests in Sudan against Bashir, who first seized power in a 1989 coup; and talk of “political coups” in the parliaments of Britain and Australia. Articles with these instances of “coup” didn’t feature regime change as a major theme, but counting them in our naive coup index caused its jump last year.
A second limitation of keyword-based signals is their inability to track topics whose recognition requires implicit knowledge. A classic example is authoritarian politics, which a reader can readily identify as a theme of an article but which is not easily encapsulated in a list of search terms. A query for “state of emergency” would be a reasonable step in the right direction, since declaring one is an oft-used tactic of autocrats to entrench their hold on power. However, not all states of emergency reflect authoritarian tendencies, especially those that follow natural disasters. More importantly, the clearest indications of an authoritarian turn—crony capitalism, the erosion of judicial independence, and the suppression of free speech—do not lend themselves to description via keywords.
A final, related issue is that keyword searches are inherently clunky. Even for well-defined topics, crafting a precise enough search query is time-consuming. For example, an energy analyst searching for news about crude oil would need to ensure her results did not include “peanut oil,” “palm oil,” and the like, requiring a cumbersomely long query string.
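The oil example makes the clunkiness concrete. A sketch of such a query follows; the exclusion list is invented, and in practice it never stops growing:

```python
import re

# Hypothetical "crude oil" query: match "oil", then bolt on exclusions
# for every cooking oil the analyst has encountered so far.
OIL = re.compile(r"\boil\b", re.IGNORECASE)
EXCLUDE = re.compile(r"\b(peanut|palm|olive|coconut|vegetable)\s+oil\b", re.IGNORECASE)

def matches_crude_oil(text):
    # An article mentioning both crude oil and palm oil is dropped
    # entirely -- blunt exclusion rules cut both ways.
    return bool(OIL.search(text)) and not EXCLUDE.search(text)
```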
To solve the inherent problems with keyword-based indicators, we at Kensho have developed our own suite of topic tags, a collection of supervised machine learning models that identify diverse geopolitical themes in text. For example, our “Coups” topic tag is a binary classifier trained on hundreds of news stories that our experts hand-labeled to indicate whether or not they feature coups d'état as a major theme. Articles about the failed drone attack on Maduro last October, which do not include the word “coup” but describe an attempted one in all but name, are correctly classified as coup-related stories by our model. However, articles with only passing mentions of recent coups are not labeled with our Coups tag.
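To give a flavor of what a supervised binary topic classifier looks like, here is a minimal multinomial Naive Bayes with add-one smoothing, trained on a handful of invented snippets. Our production tags are considerably more sophisticated and trained on expert-labeled news stories; this is only a sketch of the general technique:

```python
import math
from collections import Counter

# Invented training snippets: 1 = coup-related, 0 = not. These are
# illustrative stand-ins, not real labeled news data.
train = [
    ("soldiers seize the presidential palace and detain the president", 1),
    ("military ousts the government in an overnight takeover", 1),
    ("armed officers announce they have removed the leader from power", 1),
    ("central bank raises interest rates to curb inflation", 0),
    ("voters head to the polls in a peaceful general election", 0),
    ("currency slides as economic crisis deepens", 0),
]

def train_nb(data):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in data:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log-prior plus smoothed log-likelihood of each known word.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # unseen words carry no signal here
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(train)
```

Even this toy model labels a sentence about detaining a president as coup-related without needing the word “coup” to appear, which is the behavior that keyword searches cannot replicate.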
Counting the number of articles with our Coups tag every day allows us to build a quantitative signal from our model, and comparing this to the keyword-based signal above highlights the ability of our approach to filter out noise. In statistical terms, our method is much less likely to produce false positives, where stories unrelated to coups are incorrectly classified as such. We see this gain in precision most clearly in 2018, when no coups occurred and our tag index displayed little variation, especially relative to the keyword-based coup index. At the same time, our tag-based index picked up major past coup events equally well. Our Coups tag achieved an F1 score of 0.91, nearly matching human performance in classifying coup-related articles, while a keyword search for the word “coup” gave an F1 score of only 0.87.
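As a reminder of the arithmetic behind those scores, F1 is the harmonic mean of precision and recall. The computation below uses invented toy labels purely to show the mechanics (the 0.91 and 0.87 figures come from our own evaluation set, which is not reproduced here):

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented ground truth: 1 = coup-related article, 0 = not.
truth   = [1, 1, 1, 0, 0, 0, 0, 0]
tagged  = [1, 1, 1, 0, 0, 0, 0, 0]  # model: no false positives here
keyword = [1, 1, 1, 0, 1, 1, 0, 0]  # keyword: catches coups, but noisy
```

On these toy labels the keyword search pays for its two false positives with a lower F1, which mirrors (in exaggerated form) the precision gap described above.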
While far from perfect, we believe machine-learning-based approaches generally outperform keyword searches in finding relevant articles with minimal noise, and we continue to train models to identify a growing list of economic and geopolitical topics.