Recent Posts

Wednesday, May 2, 2018

Loading OECD Data With Python

One of the projects I have been juggling is building a package to import data from the OECD. The OECD helpfully provides an API which allows for custom queries into their large data sets: https://data.oecd.org/api/ (for free!). Unless you really like XML or JSON (which I do not), you want to find a wrapper for the query and download protocols. As a Python developer, the solution that worked best for me was the pandaSDMX Python library:  https://pandasdmx.readthedocs.io/en/latest/

The package is highly object oriented, which means that I probably should have read the documentation. However, I was able to grab the data I wanted with a bit of experimentation.

This was my initial stab of grabbing the LEI data and then dumping it into a CSV (based on code from a question on Stack Exchange). (The 'df' variable is a Pandas dataframe.)

oecd = pandasdmx.Request('OECD')
data_response = oecd.data(resource_id='MEI_CLI', key='all?startTime=2018')
df = data_response.write(data_response.data.series, parse_time=False)
df.to_csv('c:\\Temp\\test_lei.txt', sep='\t')


The OECD data provides a couple of challenges to work with.

The first issue is that the OECD does not provide some of the meta data navigation tools provided by the other public databases that pandaSDXM connects to. The data navigation calls done in the pandaSDMX documentation will not work on the OECD data (which is another reason I largely skipped over the documentation).

The next challenge with the OECD data is finding out where the data you want live. The leading indicators are found with the "Main Economic Indicators" (MEI) database, but that database is big, and the communication protocol used (SDMX-ML) is bloated. In order to keep the data transmission reasonable, you need to use a smaller subset of the database, which has its own database identifier (such as 'MEI_CLI'). Of course, trying to find that database identifier takes a bit of work on the web page: you need to find the database in question, and then extract the database code.

For example, the Composite Leading Indicators page is: https://www.oecd-ilibrary.org/economics/data/main-economic-indicators/composite-leading-indicators_data-00042-en

If you then go to "Data", you get sent to: http://stats.oecd.org/viewhtml.aspx?datasetcode=MEI_CLI&lang=en, where you can read off the 'MEI_CLI' code (or get the query for data by choosing the appropriate API export option). This is somewhat painful to do if you want data scattered across a variety of data sets.

I have not tested it yet, but the OECD API documentation shows an extremely useful option: you can query for data that has been changed since a particular date. For financial data that is not revised, one can just query for data after the last data points you have locally archived. However, this is not sufficient for economic data, as back history can be revised. According to the documentation, those revised data points will also be returned. This greatly reduces the burden for both sides for an automated update of data sets.

The major challenge of such work is not the raw query, rather marshaling the data so that it can be imported into a database. You need to map the series meta data to a whatever the retrieval mechanism used in your database (typically some internal ticker system). The reason why you want to use a single package for interfacing with all the data sources is that it easier to set up the mapping code, which will (hopefully!) work for all the supported data providers.

One reason I put this up was that finding this package took some time. Before I found pandaSDMX, I looked at a few other options, and all of them had defects. Since I did not want to waste my time seeing whether there were fixes for the problems I ran into, it may be that my decision to reject them was unfair. As a result, I will not name any names with regards to the other options.

(c) Brian Romanchuk 2018

No comments:

Post a Comment

Note: Posts may be moderated, and there may be a considerable delay before they appear.

Although I welcome people who disagree with me, please be civil.

Please note that my spam comment filter appears to dislike long "anonymous" posts. I get no warning about this, and only go through my "spambox" infrequently. The best bet it to keep comments short, and if you think the spam filter struck, let me know with a short comment.