BW #77: Paris Olympics (solution)
[Reminder: If you haven't done so already, please fill out my course survey, at https://www.surveymonkey.com/r/2024-learn-survey ! I'll be announcing new Python and Pandas courses next week, and this is your chance to influence the topics that I'll teach. ]
The 2024 Olympic Games are on! This massive sporting event gives us a chance to think about friendly competition among countries, rather than the economic, social, and political problems that we face in our day-to-day lives.
Of course, here at Bamboo Weekly, we're less interested in the sports than the data about the sports. And with something as big as the Olympics, you can be sure that huge amounts of data are being generated – per team, athlete, sport, and event.
We thus looked this week at data coming from the Olympics. Unless you're doing these exercises after the games close in mid-August 2024, you should expect that even if your queries are exactly the same as mine, you might well get different data, as the numbers are updated.
Data and six questions
This week's data came from two different sources:
- The main data comes from Codante (https://codante.io/), a Brazilian company that offers a free API for Olympic information. Because it's free, it's limited to 100 requests/minute (which should be more than enough). If they feel that you're making too many requests, they might block your IP address, so please be nice to them. The API is documented at https://docs.apis.codante.io/olympic-games-english .
- Geographic data about the Olympic venues comes from Clockwork Micro (https://www.clockworkmicro.com/), which made shapefiles available in a GitHub repository at https://github.com/clockworkmicro/parisolympics2024 .
We'll also use the pycountry package on PyPI (https://pypi.org/project/pycountry/).
Here are this week's six tasks and questions. As always, the Jupyter notebook that I used to solve the problems is at the bottom of this post.
Using the API from apis.codante.io, download all of the per-country medal information. As of this writing, the country API has a total of five pages to download; you'll want to combine them into a single data frame. Set the index to be the 3-letter country ID.
Before doing anything else, I loaded Pandas and a few other modules and names:
import pandas as pd
from pandas import Series, DataFrame
import requests
import pycountry
Why did I import each of these?
- Obviously, I imported pandas so that I could analyze the data.
- I then imported the specific names Series and DataFrame, which are convenient if and when I want to create a series or data frame.
- I'll be using requests to perform API calls.
- Finally, I'll be using pycountry (installed from PyPI) for country-specific information.
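For example, here's a minimal, illustrative use of pycountry (not part of the solution itself), looking up a country by its three-letter code:
import pycountry

# Look up a country by its ISO 3166-1 alpha-3 code
france = pycountry.countries.get(alpha_3='FRA')
print(france.name)      # France
print(france.alpha_2)   # FR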
With that in place, I can start to retrieve Olympics data via the API. The easiest way to do this is with requests, since I can just say requests.get(URL) for a given URL. In the case of this API, though, we'll need to retrieve five pages of data, with each page (according to the API documentation) specified by passing a page name-value pair along with an integer.
We can do that with requests by passing not only the URL, but also {'page': 1}, a dict containing the key-value pairs we want to add to our request. The integer passed along with 'page' will have to change, with values 1-5, as we retrieve each page of results.
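As a quick sanity check, it can help to fetch a single page and peek at the structure of the response before building any data frames (a minimal sketch, assuming the API behaves as described above):
import requests

url_base = 'https://apis.codante.io/olympic-games'

# Retrieve the first page of per-country results and inspect the response
r = requests.get(f'{url_base}/countries', {'page': 1})

print(r.status_code)      # 200 indicates success
print(r.json().keys())    # the per-country records are under the 'data' key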
The results themselves will come as JSON. Fortunately, we can easily turn most JSON data into a data frame by simply passing it to DataFrame. We'll thus end up with one data frame for each page. If we create a list of data frames, we can then combine them into a single one with pd.concat.
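To illustrate that pattern with toy data (not the Olympic data itself), concatenating a list of data frames looks like this:
import pandas as pd
from pandas import DataFrame

# Two small frames with the same columns...
df1 = DataFrame({'name': ['a', 'b'], 'gold': [1, 2]})
df2 = DataFrame({'name': ['c'], 'gold': [3]})

# ...stacked one on top of the other, into a single frame
combined = pd.concat([df1, df2])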
Let's start by setting up a base URL and an empty list, all_data, where we'll collect the data frames:
url_base = 'https://apis.codante.io/olympic-games'
all_data = []
Next, we'll use requests to retrieve each of the five pages:
for page_number in range(1, 6):
    print(f'Getting page {page_number}')
    # Request one page of country data, and turn the 'data' portion
    # of the JSON response into a data frame
    r = requests.get(f'{url_base}/countries', {'page': page_number})
    all_data.append(DataFrame(r.json()['data']))
Notice that when we get a response back from requests, we can invoke json on it to get Python data structures (lists and dicts). I originally tried to invoke DataFrame directly on the result of invoking r.json(), but saw that the actual data was under the 'data' dict key. So I ran DataFrame(r.json()['data']), giving me a data frame; I then appended it to all_data.
Notice that I added a call to print, indicating what page was being retrieved in each iteration of the for loop. I often do that when things will take a while, so that I can know where the code stands – and where it had a problem, if something goes wrong.
Note that I used range(1, 6) to iterate over the numbers I wanted, starting with 1 and ending with 5 – because range, like most Python functions and methods, always counts "up to and not including" the final value.
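A quick demonstration of that behavior:
list(range(1, 6))     # [1, 2, 3, 4, 5]; the endpoint (6) is excluded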
With all_data in place, I ran pd.concat on it, and then ran set_index to use id as the index on the new data frame:
df = (pd
    .concat(all_data)
    .set_index('id')
)
The resulting data frame has 203 rows and 9 columns.
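A quick way to confirm this (remember that the exact numbers may differ if you run the queries later):
print(df.shape)     # (203, 9) as of this writing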
What countries don't seem to have any continent? What's the deal with them?
While exploring this data set, I decided to see how many countries are on each continent:
df['continent'].value_counts()
I got the following result:
continent
AFR 53
EUR 47
ASI 44
AME 41
OCE 15
2
- 1
Name: count, dtype: int64
I could identify the first five continents, but didn't understand what the blank continent was, or the one marked with -. I thus did some digging:
(
    df
    .loc[
        lambda df_: df_['continent'].isin(['', '-']),
        ['continent', 'name']
    ]
)
In the above code, I use loc to retrieve a subset of rows and columns.
For the row selector, I used lambda, creating an anonymous function that takes a single argument, a data frame. Inside of the function, we run isin(['', '-']) on that data frame's continent column, getting a boolean series back. The series is True when the continent is either an empty string or just -. Specifying a boolean series in this way is often more natural and flexible than other methods.
For the column selector, I pass a list of strings, the names of the columns we want to see.
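As an aside, since df already exists as a variable, the same selection could be written without the lambda, by passing the boolean series directly (an equivalent formulation, not the one I used above):
df.loc[df['continent'].isin(['', '-']), ['continent', 'name']]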
The result:
continent name
id
EOR EOR
AIN AIN
SAM - Samoa
We thus see that three of the teams competing in the Olympics have no continent. Which are they?
- Samoa, a country that I think would classify itself as being in Oceania, much like Australia and New Zealand. I'm guessing (hoping) that the data I got was just a glitch.
- EOR, a team of Olympic athletes who are refugees. This team has participated in Olympic games since 2016, and was previously known as ROT, for "Refugee Olympic Team." France insisted that the team be known by its French initials this time around, and the acronym for "Équipe olympique des réfugiés" is EOR.
- AIN, a team of Russian and Belarusian athletes competing as neutrals, since their national teams were banned in the wake of Russia's invasion of Ukraine. In English, the team is known as the "Individual Neutral Athletes," but its abbreviation, AIN, comes from the French initials of "Athlètes Individuels Neutres."
I had no idea about these latter two groups, and am glad that I got a chance to learn about them.