BW #77: Paris Olympics (solution)

BW #77: Paris Olympics (solution)

[Reminder: If you haven't done so already, please fill out my course survey, at https://www.surveymonkey.com/r/2024-learn-survey ! I'll be announcing new Python and Pandas courses next week, and this is your chance to influence the topics that I'll teach. ]

The 2024 Olympic Games are on! This massive sporting event gives us a chance to think about friendly competition among countries, rather than the economic, social, and political problems that we face in our day-to-day lives.

Of course, here at Bamboo Weekly, we're less interested in the sports than the data about the sports. And with something as big as the Olympics, you can be sure that huge amounts of data are being generated – per team, athlete, sport, and event.

This week, we thus looked this week at data coming from the Olympics. Unless you're doing these exercises after the games close in mid-August 2024, you should expect that even if your queries are exactly the same as mine, you might well get different data, as the numbers are updated.

Data and six questions

This week's data came from two different sources:

We'll also use the pycountry package on PyPI (https://pypi.org/project/pycountry/).

Here are this week's six tasks and questions. As always, the Jupyter notebook that I used to solve the problems are at the bottom of this post.

Using the API from apis.codante.io, download all of the per-country medal information. As of this writing, the country API has a total of five pages to download; you'll want to combine them into a single data frame. Set the index to be the 3-letter country ID.

Before doing anything else, I loaded Pandas and a few other modules and names:

import pandas as pd
from pandas import Series, DataFrame
import requests 
import pycountry

Why did I import each of these?

  • Obviously, I imported pandas so that I could analyze the data.
  • I then imported the specific names Series and DataFrame, which are convenient if and when I want to create a series.
  • I'll be using requests to perform API calls.
  • Finally, I'll be using pycountry (installed from PyPI) for country-specific information.

With that in place, I can start to retrieve Olympics data via the API. The easiest way to do this is with requests, since I can just say requests.get(URL), for a given URL. In the case of this API, though, we'll need to retrieve five pages of data, with each page (according to the API documentation) specified by passing a page name-value pair along with an integer.

We can do that with requests by passing not only the URL, but also {'pages':1}, a dict containing the key-value pairs we want to add to our request. The integer passed along with 'pages' will have to change, with values 1-5, as we retrieve each page of results.

The results themselves will come as JSON. Fortunately, we can easily turn most JSON data into a data frame by simply passing it to DataFrame. We'll thus end up with one data frame for each page. If we create a list of data frames, we can then combine them into a single one with pd.concat.

Let's start by setting up a base URL and an empty list, all_data, where we'll collect the data frames:


url_base = 'https://apis.codante.io/olympic-games'
all_data = []

Next, we'll use requests to retrieve each of the five pages:

for page_number in range(1, 6):
    print(f'Getting page {page_number}')
    r = requests.get(f'{url_base}/countries', {'page':page_number})
    all_data.append(DataFrame(r.json()['data']))

Notice that when we get a response back from requests, we can invoke json on it to get Python data structures (lists and dicts). I originally tried to invoke DataFrame directly on the result of invoking r.json(), but saw that the actual data was under the 'data' dict key. So I ran DataFrame(r.json()['data']), giving me a data frame; I then appended it to all_data.

Notice that I added a call to print, indicating what page was being retrieved in each iteration of the for loop. I often do that when things will take a while, so that I can know where the code stands – and where it had a problem, if something goes wrong.

Note that I used range(1,6) to iterate over the numbers I wanted, starting with 1 and ending with 5 – because range, like most Python methods, always counts "up to and not including" the final value.

With all_data in place, I ran pd.concat on it, and then ran set_index to use id as the index on the new data frame:

df = (pd
      .concat(all_data)
      .set_index('id')
     )
     

The resulting data frame has 203 rows and 9 columns.

What countries don't seem to have any continent? What's the deal with them?

While exploring this data set, I decided to see how many countries are on each continent:

df['continent'].value_counts()

I got the following result:

continent
AFR    53
EUR    47
ASI    44
AME    41
OCE    15
        2
-       1
Name: count, dtype: int64

I could identify the first five continents, but didn't understand what the blank continent was, or the one marked with -. I thus did some digging:

(
    df
    .loc[
        lambda df_: df_['continent'].isin(['', '-']), 
        ['continent', 'name']
      ]
)

In the above code, I use loc to retrieve a subset of rows and columns

For the row selector, I used lambda, creating an anonymous function that takes a single argument, a data frame. We then, inside of the function, run isin(['', '-']) on the data frame, getting a boolean series back. The series is True when the continent is either an empty string or just -. Specifying a boolean series in this way is often more natural and flexible than other methods.

For the column selector, I pass a list of strings, the names of the columns we want to see.

The result:

    continent   name
id                  
EOR              EOR
AIN              AIN
SAM         -  Samoa

We thus see that three of the teams competing in the Olympics have no continent. Which are they?

  • Samoa, a country that I think would classify itself as being in Oceania, much like Australia and New Zealand. I'm guessing (hoping) that the data I got was just a glitch.
  • EOR, a team of Olympic athletes who are refugees. This team has participated in Olympic games since 2016, and was previously known as ROT, for "Refugee Olympic Team." France insisted that the team be known by its French initials this time around, and the acronym for "Équipe olympique des réfugiés" is EOR.
  • AIN, a team of Russian and Belorusian athletes, since their national teams were banned in the wake of Russia's invasion of Ukraine. In English, they were known as INA ("Individual Neutral Athletes"), but this year they are known by AIN, the French initials of "Athlètes Individuels Neutres."

I had no idea about these latter two groups, and am glad that I got a chance to learn about them.