Introducing CountryData.io: Our new platform for better, more efficient data sourcing

Bernhard Obenhuber
Aug 21, 2023

Photo by Burak The Weekender from Pexels

Data and analysis are critical to how we help our clients make better country risk decisions. But so are the attractive price points of our products and services, which we can only deliver through efficient end-to-end processes. And, all too often, there’s significant tension between the need for quality data and the need for efficiency. Indeed, we find that data cleaning and pre-processing often account for 80% of the time we spend on research projects, with just 20% left for the actual analysis that yields insights. Our readers might be familiar with this ratio.

A few months ago, we decided to fix this problem, both for ourselves and our clients. Our goal was to create a tool that enables us and our community to spend most of our time being curious and creative, not whipping our data into shape. 

The solution we came up with is CountryData.io, a new product which is now available as a public beta. In this blog post, we’ll explain why we devoted so many resources to its development and how you can use CountryData.io. 

Why, why, why?

On the CountryRisk.io Insights Platform, we offer datasets for the sovereign rating model consisting of around 300 time series per country. We also give our members access to our country risk scores for a range of objectives, from tracking sovereign default or money laundering country risk to identifying supply chain issues at a country level.

To do all this, we need to process data from many different sources. For sovereign risk, we usually process time-series economic data, such as fiscal balance and debt ratios. We can easily access most of these indicators via an API. But even here, some datasets are only available as bulk downloads, such as IMF reserves adequacy data and—one of the worst examples—Organisation for Economic Co-Operation and Development (OECD) country risk rating data. If you want to use these datasets for country risk analysis, you first need to parse a PDF table that makes up for a dearth of standardised ISO country codes with a plethora of footnotes—and you must do so manually. 

The potential for frustration is even higher for compliance officers and supply chain analysts, as much of the data they need are only available in poorly structured spreadsheets or buried in a PDF or webpage. This would be a minor issue if the data in question were esoteric. However, we’re also talking about a lot of information that every financial institution and payment service provider is obliged to consider, including Financial Action Task Force (FATF) assessments and the EU’s list of high-risk countries. 

Then there are other sources, such as NGOs and think tanks, whose data are highly valuable, interesting, unique, or all of the above. These organisations put a lot of effort into collecting and analysing their data, but they often stop short of making them both publicly available and easy to work with. This is a problem because, to maximise their potential impact, such data must be integrated into automated workflows: project approval and risk monitoring in the case of country risk data, or transaction screening and onboarding processes in the case of money laundering risk data. And if someone has to maintain the integrated data manually by regularly checking the relevant webpage for an update, downloading the necessary files, combining them, and then uploading the new data to another tool, the efficiency of all that automation and, in turn, the potential impact of the data are significantly reduced.

Spurred on by these issues, we contacted several large, public organisations to find out whether any of them had plans to offer some kind of API. They all said ‘no’. So, we decided to build a solution ourselves. 

The frustrations we fixed

If you’re reading this, you’re likely all too familiar with the primary pain points associated with processing data acquired from somewhere other than one of the more expensive providers. Here are some of the issues that frustrate us most, along with how we built CountryData.io to fix them.

Country names

When it comes to identifying the country to which a particular indicator refers, many sources just use the name of the country. This isn’t very helpful, as there are usually many ways to refer to a specific country, and combining datasets with different naming conventions is no fun. Is it Swaziland, or Eswatini? Or worse: is it the US, the USA, the United States, or the United States of America? Using ISO 3166 codes would help a lot, but even then, some data sources use different country codes: XKK, KOS, or RKS for Kosovo, anyone?

When we built CountryData.io, we implemented alternative country names and codes and linked them to our reference table. We also designed the platform to notify us whenever it spots a country name or code in newly imported data that’s not yet in our reference table, so we can decide whether we need to add it to the platform and repeat the import. Crucially, this all happens in the background, so community members never have to worry about mismatched country identifiers when using CountryData.io.
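To illustrate the idea, here is a minimal Python sketch of an alias lookup that also collects unknown labels for review. The reference table, the function names, and the choice of XKK as the canonical Kosovo code are assumptions made for illustration, not CountryData.io’s actual implementation.

```python
# Hypothetical reference table: canonical code -> alternative names and codes.
# Entries are illustrative; XKK as the canonical Kosovo code is an arbitrary choice.
REFERENCE = {
    "USA": ["US", "United States", "United States of America"],
    "SWZ": ["Swaziland", "Eswatini"],
    "XKK": ["KOS", "RKS", "Kosovo"],
}


def normalise(label):
    """Lower-case a label and collapse whitespace so lookups are forgiving."""
    return " ".join(label.casefold().split())


# Build a flat alias -> canonical code index once, up front.
ALIASES = {}
for code, names in REFERENCE.items():
    ALIASES[normalise(code)] = code
    for name in names:
        ALIASES[normalise(name)] = code


def resolve(labels):
    """Map raw labels to canonical codes; return unknown labels separately."""
    resolved, unknown = {}, []
    for label in labels:
        code = ALIASES.get(normalise(label))
        if code is None:
            unknown.append(label)  # would trigger a review before re-importing
        else:
            resolved[label] = code
    return resolved, unknown


resolved, unknown = resolve(["United States", " usa ", "Eswatini", "RKS", "Atlantis"])
print(resolved)
print(unknown)
```

Keeping the unknown labels in a separate list, rather than dropping or guessing them, mirrors the review step described above: a person decides whether a new alias belongs in the reference table before the import is repeated.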

Dates

Unclear date information can also cause headaches. In many cases, it isn’t clear which date format is being used, or whether a date refers to data for that day only or for a longer period beginning or ending on it.

Before we add any data to CountryData.io, we clarify the date format and make this information available for downstream processing. We can even store multiple dates per observation in additional fields that we call dimensions. Dimensions are a powerful tool for tracking a range of meta-information, such as whether you’re looking at data for historic observations or forecasts, a specific date or range of dates, or even whether a default is in a local or foreign currency.
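One way to picture dimensions is as extra key-value fields attached to each observation. The sketch below is a hypothetical data model: the class, the field names, and the placeholder values are assumptions for illustration, not the platform’s real schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Observation:
    """One data point with an explicit period and free-form dimensions."""
    country: str            # canonical country code
    indicator: str
    value: float
    period_start: date      # first day the value applies to
    period_end: date        # last day the value applies to (same day for point data)
    dimensions: dict = field(default_factory=dict)

    def covers(self, day):
        """True if the observation's period includes the given day."""
        return self.period_start <= day <= self.period_end


def select(observations, **dims):
    """Keep observations whose dimensions match every given key-value pair."""
    return [
        o for o in observations
        if all(o.dimensions.get(k) == v for k, v in dims.items())
    ]


# Placeholder values, not real data.
observations = [
    Observation("AAA", "example_indicator", 1.0,
                date(2022, 1, 1), date(2022, 12, 31), {"status": "historic"}),
    Observation("AAA", "example_indicator", 2.0,
                date(2024, 1, 1), date(2024, 12, 31), {"status": "forecast"}),
]

forecasts = select(observations, status="forecast")
print([o.value for o in forecasts])
```

Storing the start and end of the period explicitly removes the ambiguity about whether a date marks a single day or the boundary of a longer period.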

Manually updating messy data

If you’ve ever grown frustrated at the amount of time you must regularly spend manually checking several websites for new data, only to have to then spend even more time cleaning it up, we feel your pain. Unstructured data, in particular, is arguably the most significant pain point in our field, and is often why such information may be ignored.

We built CountryData.io to solve both of these problems for our community. The platform automatically notifies us whenever one of our sources uploads a new dataset—highlighting any new data, indicators, and changes to past data—which we make available to our community on CountryData.io.
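The kind of change report described above can be sketched as a comparison of two snapshots of the same source. This is a simplified illustration under assumed data structures (a dictionary keyed by country, indicator, and period); it says nothing about how the platform detects updates internally.

```python
def diff_snapshots(old, new):
    """Compare two {(country, indicator, period): value} snapshots."""
    # Keys that only exist in the new snapshot are new data points.
    added = {k: v for k, v in new.items() if k not in old}
    # Keys in both snapshots whose value changed are revisions to past data.
    revised = {k: (old[k], v) for k, v in new.items() if k in old and old[k] != v}
    # Keys that disappeared from the source.
    removed = sorted(k for k in old if k not in new)
    # Indicator names that did not appear at all in the old snapshot.
    new_indicators = sorted({k[1] for k in new} - {k[1] for k in old})
    return {
        "added": added,
        "revised": revised,
        "removed": removed,
        "new_indicators": new_indicators,
    }


# Placeholder snapshots, not real data.
old = {("AAA", "ind_a", "2021"): 1.0, ("AAA", "ind_a", "2022"): 2.0}
new = {
    ("AAA", "ind_a", "2021"): 1.5,   # revised past value
    ("AAA", "ind_a", "2022"): 2.0,   # unchanged
    ("AAA", "ind_b", "2022"): 3.0,   # new indicator
}

report = diff_snapshots(old, new)
print(report)
```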

While we’re big fans of the latest developments in artificial intelligence and PDF-to-CSV tools, we’ve yet to find a solution that would enable us to automate the data cleaning stage. Until we do, we’ll keep doing the tedious job of manually cleaning the data we publish on CountryData.io, all of which must first go through our four-eyes checking process to eliminate any human errors, so you don’t have to.

Data types

Before we started building our own data platform, we looked at some off-the-shelf solutions to make sure we weren’t about to reinvent the wheel. Unfortunately, these existing solutions were too expensive or only designed to handle numerical data. 

Creating our own platform turned out to be far more cost-effective, enabling us to maintain our competitive pricing. And it allowed us to build in all of the customisation we need. CountryData.io can handle a wide range of data types, from Boolean (e.g. FATF blacklist: true or false) and categorical data (e.g. Trafficking in Persons Rating: Tier 1, Tier 2, Tier 2 WL, or Tier 3) to numbers, text, and metadata.
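A small Python sketch shows what handling mixed data types can look like in practice. The indicator names, the schema layout, and the accepted spellings of true and false are all assumptions for illustration; only the two examples (FATF blacklist and the Trafficking in Persons tiers) come from the text above.

```python
# Hypothetical per-indicator schema; names and layout are illustrative only.
SCHEMA = {
    "fatf_blacklist": {"type": "boolean"},
    "tip_rating": {
        "type": "categorical",
        "categories": ["Tier 1", "Tier 2", "Tier 2 WL", "Tier 3"],
    },
    "example_ratio": {"type": "number"},
    "notes": {"type": "text"},
}


def coerce(indicator, raw):
    """Convert a raw value to the type declared for its indicator."""
    spec = SCHEMA[indicator]
    kind = spec["type"]
    if kind == "boolean":
        if isinstance(raw, bool):
            return raw
        text = str(raw).strip().lower()
        if text in ("true", "yes", "1"):
            return True
        if text in ("false", "no", "0"):
            return False
        raise ValueError(f"{indicator}: cannot read {raw!r} as a boolean")
    if kind == "categorical":
        text = str(raw).strip()
        if text not in spec["categories"]:
            raise ValueError(f"{indicator}: unknown category {raw!r}")
        return text
    if kind == "number":
        return float(raw)  # raises ValueError on non-numeric input
    if kind == "text":
        return str(raw)
    raise ValueError(f"{indicator}: unsupported type {kind!r}")


print(coerce("fatf_blacklist", "Yes"))
print(coerce("tip_rating", " Tier 2 WL "))
print(coerce("example_ratio", "60.5"))
```

Rejecting values that fall outside the declared categories, instead of storing them as free text, keeps categorical indicators usable in automated screening rules.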

Try it out

We invite you to take our new platform for a spin at countrydata.io. Once you’ve selected the data sources you’re interested in, we’ll send you an API key for testing purposes (we’ve also published the API documentation). The data are returned in JSON format, which you can then convert into a pandas DataFrame using a single line of code.
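As a rough sketch of that conversion, the example below parses a JSON payload and builds a DataFrame from it. The payload shape and field names are assumptions for illustration (the real response format is defined in the API documentation), and the values are placeholders. No request is made, so no endpoint or API key appears here.

```python
import json

import pandas as pd

# Hypothetical response body: a flat list of records. Real field names may differ.
payload = """
[
  {"country": "AAA", "indicator": "example_indicator", "date": "2021-12-31", "value": 1.0},
  {"country": "AAA", "indicator": "example_indicator", "date": "2022-12-31", "value": 2.0}
]
"""

records = json.loads(payload)

# The single line that turns the parsed JSON into a DataFrame.
df = pd.DataFrame(records)

# Optional follow-up: parse the date strings into proper timestamps.
df["date"] = pd.to_datetime(df["date"])

print(df)
```

If the response turns out to be nested rather than a flat list of records, `pd.json_normalize` is the usual alternative to the plain `pd.DataFrame` constructor.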

We launched CountryData.io as a public beta to give our community the chance to help us improve the platform before we launch it officially. So, should you run into any problems, fail to find the data sources you want, or believe we should add more API endpoints, you can let us know by emailing [email protected].
