Online forums, ranging from Twitter to political prediction markets, offer valuable real-time information about geopolitical developments. One underappreciated source for this kind of information is Wikipedia -- the world’s largest open-source knowledge graph -- and this post introduces the geopolitical trend scores we’ve developed from the activity of millions of Wikipedia users.
The idea behind our Wikipedia trend scores is that revisions to Wikipedia pages are a rough proxy for the changing importance of their subjects, whether individuals, groups, places, events, or something else. Sometimes, heavy revisions to particular Wikipedia pages coincide with particular events, but not always. Our goal is not to predict events with Wikipedia, but to quantify the salience of geopolitical themes across countries over time.
Just as equity indices are composed of baskets of stocks, Koto’s Wikipedia trend scores consist of constituent Wikipedia pages, and identifying the pages in a given country-topic score is a two-step process. In step one, we make use of Wikipedia’s graph structure to determine which pages are strongly associated with the country of interest. In step two, we identify which pages in this set also relate to the geopolitical topic of interest using Koto’s Topic Tags, a collection of supervised machine learning models we’ve trained to identify diverse geopolitical themes in text. The result is a subset of Wikipedia pages highly relevant to both the country and geopolitical topic under consideration. For example, suppose we’d like to track corruption in Italy. In this case, our aim is to identify all Wikipedia pages relating to Italy that our models also suggest are about corruption. This would exclude from our Italy + corruption index irrelevant pages like "Sfogliatella", a delicious Italian pastry unrelated to corruption.
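To make the two-step selection concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Koto's actual pipeline: the function name `select_constituents`, the mutual-link rule used for country association, the keyword-based `toy_corruption_score`, and the 0.5 threshold are all hypothetical stand-ins for the graph analysis and trained Topic Tags models described above.

```python
def select_constituents(links, page_text, country_page, topic_score, threshold=0.5):
    """Return the pages tied to a country that also score high on a topic.

    links:       dict mapping a page title to the set of titles it links to
    page_text:   dict mapping a page title to its text
    topic_score: callable returning a relevance score in [0, 1] for a text
    """
    # Step 1 (assumed rule): a page is "associated" with the country if the
    # country page links to it and it links back to the country page.
    outgoing = links.get(country_page, set())
    associated = {t for t in outgoing if country_page in links.get(t, set())}

    # Step 2: keep only associated pages that the topic model flags.
    return sorted(
        t for t in associated if topic_score(page_text.get(t, "")) >= threshold
    )


def toy_corruption_score(text):
    """Hypothetical stand-in for a trained topic classifier."""
    keywords = ("corruption", "bribery", "kickback")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.0


# Toy data for illustration only.
links = {
    "Italy": {"Mani pulite", "Sfogliatella"},
    "Mani pulite": {"Italy"},
    "Sfogliatella": {"Italy"},
}
page_text = {
    "Mani pulite": "Nationwide judicial investigation into political corruption.",
    "Sfogliatella": "A shell-shaped Italian pastry.",
}

constituents = select_constituents(links, page_text, "Italy", toy_corruption_score)
print(constituents)
```

In this toy setup both pages pass the country step, but only the page about a corruption investigation survives the topic step, which mirrors how the pastry page is dropped from the Italy + corruption index.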
After determining the constituent Wikipedia pages of a given country-topic trend score, we use Wikipedia’s open-source API to compile the pages’ complete revision histories. The most basic aggregation of this unstructured text into a high-frequency score is the total number of characters edited across constituent pages each day. Though this raw tally could be normalized in a number of ways, it offers plenty of interesting insights on its own.
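The daily aggregation can be sketched as follows. This is a rough illustration rather than our production code: it assumes each revision has already been fetched as a timestamp plus the page size after the edit, and it approximates "characters edited" by the absolute change in page size between consecutive revisions. The function name `daily_edit_volume` and the input layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_edit_volume(revisions_by_page):
    """Sum the absolute size change of every revision, grouped by day.

    revisions_by_page: dict mapping a page title to a list of
    (ISO 8601 UTC timestamp, page size after the edit) tuples.
    """
    totals = defaultdict(int)
    for revisions in revisions_by_page.values():
        previous_size = 0  # a page's first revision counts in full
        # ISO 8601 timestamps in a single format sort chronologically as strings.
        for timestamp, size in sorted(revisions):
            day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()
            totals[day] += abs(size - previous_size)
            previous_size = size
    return dict(sorted(totals.items()))


# Toy revision histories for two hypothetical constituent pages.
example = {
    "Page A": [
        ("2014-02-24T10:00:00Z", 100),
        ("2014-02-25T09:00:00Z", 400),
        ("2014-02-25T12:00:00Z", 350),
    ],
    "Page B": [("2014-02-25T08:00:00Z", 50)],
}

scores = daily_edit_volume(example)
for day, total in scores.items():
    print(day, total)
```

Net size change understates edits that replace text with new text of similar length, so a diff-based count would be a closer measure if that distinction matters.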
An example Wikipedia trend score based on this methodology -- for protests in Venezuela -- nicely demonstrates the power of these scores to capture shifts in the importance of geopolitical themes. The timing of the trend score’s peaks and the Wikipedia activity that generated them are revealing. First, the marked jump beginning in late 2005 coincides with the increased political turmoil preceding Chavez’s re-election in December 2006. In fact, the Wikipedia edits responsible for the heightened activity during this period overwhelmingly come from the pages "Hugo Chavez" and "2002 Venezuelan coup attempt", the latter covering the failed coup after which Chavez was restored to power. After Maduro took office in April 2013, the trend score picks up again, reflecting the increased incidence of protests and opposition activity. The spike on February 25, 2014, which came one day after a nationwide protest, and the spike on June 10, which followed two protests earlier that month, were both driven by large edits to the page "Venezuelan protests (2014–present)." Similar spikes driven by that page occurred on April 1, 5 and 9, 2017, during a month of sustained protests.
Other peaks are less obviously related to demonstrations in the streets, but correspond to important moments in the opposition's political battle against Maduro. The spike on July 14, 2014 coincided with the sale of the leading opposition newspaper El Universal, which sparked new fears about press freedom in Venezuela. The February 9, 2016 spike came at a low point for living standards in Venezuela as the country approached sovereign default, and the October 30, 2017 peak coincided with opposition parties’ announcement that they would boycott the December 2017 municipal elections.
Besides being informative barometers of geopolitical salience, these trend scores might be useful in other contexts. One possibility is incorporating them into forecasting models to see, for example, if our trend score covering organized crime in Mexico helps predict monthly crime rates. Another potential application is event discovery, where analysts can search for more obscure events based on the dates of spikes and the edits responsible for them.
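As a rough illustration of the event-discovery idea, the sketch below flags days whose score sits far above the recent baseline. The trailing-window z-score rule, the function name `find_spikes`, and the default window and threshold are assumptions chosen for simplicity, not a description of how the spikes discussed above were identified.

```python
from statistics import mean, pstdev


def find_spikes(daily_scores, window=7, threshold=3.0):
    """Return the days whose score is an outlier relative to the prior window.

    daily_scores: chronologically ordered list of (day, value) pairs.
    A day is flagged when its value exceeds the mean of the previous
    `window` days by at least `threshold` standard deviations.
    """
    spikes = []
    for i in range(window, len(daily_scores)):
        history = [value for _, value in daily_scores[i - window:i]]
        baseline = mean(history)
        spread = pstdev(history)
        day, value = daily_scores[i]
        # Skip flat windows, where the z-score is undefined.
        if spread > 0 and (value - baseline) / spread >= threshold:
            spikes.append(day)
    return spikes


# Toy series: a quiet week followed by one day of heavy editing.
series = [
    ("d1", 10), ("d2", 12), ("d3", 11), ("d4", 9),
    ("d5", 10), ("d6", 12), ("d7", 11), ("d8", 200), ("d9", 12),
]
spike_days = find_spikes(series)
print(spike_days)
```

An analyst could then look up which pages and edits account for each flagged day, as in the Venezuela examples.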
Over the next few months, interested readers will be able to explore the full set of these geopolitical trend scores through Koto Korpus, our flagship geopolitical analysis platform. To find out more, including the set of geopolitical themes covered by our Topic Tags, feel free to contact us.