Skip to content
13 min read · Tags: api multiple-files grouping apply-function geopandas

BW #77: Paris Olympics (solution)

Get better at: Working with APIs, multiple files, grouping, applying functions to a data frame, and GeoPandas

BW #77: Paris Olympics (solution)

[Reminder: If you haven't done so already, please fill out my course survey, at https://www.surveymonkey.com/r/2024-learn-survey ! I'll be announcing new Python and Pandas courses next week, and this is your chance to influence the topics that I'll teach. ]

The 2024 Olympic Games are on! This massive sporting event gives us a chance to think about friendly competition among countries, rather than the economic, social, and political problems that we face in our day-to-day lives.

Of course, here at Bamboo Weekly, we're less interested in the sports than the data about the sports. And with something as big as the Olympics, you can be sure that huge amounts of data are being generated – per team, athlete, sport, and event.

This week, we thus looked this week at data coming from the Olympics. Unless you're doing these exercises after the games close in mid-August 2024, you should expect that even if your queries are exactly the same as mine, you might well get different data, as the numbers are updated.

Data and six questions

This week's data came from two different sources:

We'll also use the pycountry package on PyPI (https://pypi.org/project/pycountry/).

Here are this week's six tasks and questions. As always, the Jupyter notebook that I used to solve the problems are at the bottom of this post.

Using the API from apis.codante.io, download all of the per-country medal information. As of this writing, the country API has a total of five pages to download; you'll want to combine them into a single data frame. Set the index to be the 3-letter country ID.

Before doing anything else, I loaded Pandas and a few other modules and names:

import pandas as pd
from pandas import Series, DataFrame
import requests 
import pycountry

Why did I import each of these?

With that in place, I can start to retrieve Olympics data via the API. The easiest way to do this is with requests, since I can just say requests.get(URL), for a given URL. In the case of this API, though, we'll need to retrieve five pages of data, with each page (according to the API documentation) specified by passing a page name-value pair along with an integer.

We can do that with requests by passing not only the URL, but also {'pages':1}, a dict containing the key-value pairs we want to add to our request. The integer passed along with 'pages' will have to change, with values 1-5, as we retrieve each page of results.

The results themselves will come as JSON. Fortunately, we can easily turn most JSON data into a data frame by simply passing it to DataFrame. We'll thus end up with one data frame for each page. If we create a list of data frames, we can then combine them into a single one with pd.concat.

Let's start by setting up a base URL and an empty list, all_data, where we'll collect the data frames:


url_base = 'https://apis.codante.io/olympic-games'
all_data = []

Next, we'll use requests to retrieve each of the five pages:

for page_number in range(1, 6):
    print(f'Getting page {page_number}')
    r = requests.get(f'{url_base}/countries', {'page':page_number})
    all_data.append(DataFrame(r.json()['data']))

Notice that when we get a response back from requests, we can invoke json on it to get Python data structures (lists and dicts). I originally tried to invoke DataFrame directly on the result of invoking r.json(), but saw that the actual data was under the 'data' dict key. So I ran DataFrame(r.json()['data']), giving me a data frame; I then appended it to all_data.

Notice that I added a call to print, indicating what page was being retrieved in each iteration of the for loop. I often do that when things will take a while, so that I can know where the code stands – and where it had a problem, if something goes wrong.

Note that I used range(1,6) to iterate over the numbers I wanted, starting with 1 and ending with 5 – because range, like most Python methods, always counts "up to and not including" the final value.

With all_data in place, I ran pd.concat on it, and then ran set_index to use id as the index on the new data frame:

df = (pd
      .concat(all_data)
      .set_index('id')
     )
     

The resulting data frame has 203 rows and 9 columns.

What countries don't seem to have any continent? What's the deal with them?

While exploring this data set, I decided to see how many countries are on each continent:

df['continent'].value_counts()

I got the following result:

continent
AFR    53
EUR    47
ASI    44
AME    41
OCE    15
        2
-       1
Name: count, dtype: int64

I could identify the first five continents, but didn't understand what the blank continent was, or the one marked with -. I thus did some digging:

(
    df
    .loc[
        lambda df_: df_['continent'].isin(['', '-']), 
        ['continent', 'name']
      ]
)

In the above code, I use loc to retrieve a subset of rows and columns

For the row selector, I used lambda, creating an anonymous function that takes a single argument, a data frame. We then, inside of the function, run isin(['', '-']) on the data frame, getting a boolean series back. The series is True when the continent is either an empty string or just -. Specifying a boolean series in this way is often more natural and flexible than other methods.

For the column selector, I pass a list of strings, the names of the columns we want to see.

The result:

    continent   name
id                  
EOR              EOR
AIN              AIN
SAM         -  Samoa

We thus see that three of the teams competing in the Olympics have no continent. Which are they?

I had no idea about these latter two groups, and am glad that I got a chance to learn about them.