BW #91: Roller coasters (solution)
[I'll soon be starting the sixth cohort of my four-month, online Python Data Analytics Bootcamp. This is the best (and most intensive) course that I offer, giving you a chance to level up your Python and Pandas knowledge with frequent meetings, interactions, and collaboration. Learn more at https://PythonDAB.com, or join me at a free webinar I'm giving one hour from now: https://store.lerner.co.il/pythondab-6-info-webinar-1 . ]
This week, we took a (hopefully) somewhat amusing detour from the serious news that has been surrounding us over the last few weeks. We looked at roller coasters — where they are located, and what they're made out of. I hope that any screaming you did while trying to solve these problems was because you identified with the amusement-park rides, rather than out of frustration with trying to solve the problems.
Data and six questions
This week's data comes from RCDB, the Roller Coaster Database, at https://rcdb.com . RCDB has a very extensive database of roller-coaster information. However, it's not downloadable directly; instead, you need to scrape the page, and ask Pandas to turn it into a data frame. Along the same lines, we used (in one question) the Wikipedia page for the G20 countries, at https://en.wikipedia.org/wiki/G20.
Below are my solutions and explanations for these six questions. As always, a link to the Jupyter notebook I used to solve them is at the bottom of this post.
The "Locations" page at RCDB, https://rcdb.com/location.htm, lists every country and territory, and the number of roller coasters that it has. Create a Pandas data frame from the table on this page. Make the "Location" column the index, and make the "Roller Coasters" column an integer value, treating missing values as 0.
The first thing we'll have to do is load Pandas:
import pandas as pd
With that in place, we can now import the location list from RCDB into a Pandas data frame, using read_html. This top-level (pd) method goes to a URL and returns each of the HTML tables it finds there as a Pandas data frame. It's sometimes surprising to find just how many tables there are on a page; HTML tables are sometimes used for layout purposes, not just to organize data into rows and columns.
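If you're curious how many tables read_html finds on a given page, a quick check like this one (my own sketch, not part of the original solution) can help you locate the table you actually want:

import pandas as pd

url = 'https://rcdb.com/location.htm'
all_tables = pd.read_html(url)     # a list of data frames, one per HTML table

print(len(all_tables))             # how many tables did read_html find?
print(all_tables[1].head())        # a peek at the second table, the one we want here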
I set the URL, and then used read_html to retrieve a list of data frames. After a quick examination, I found that the second data frame was the one that I wanted. I thus retrieved index [1] from the list that we got back:
url = 'https://rcdb.com/location.htm'
df = (pd
    .read_html(url)[1]
)
The result is indeed a data frame with a Location column – the column that I asked you to make into its index. We can do that by invoking set_index on the data frame:
url = 'https://rcdb.com/location.htm'
df = (pd
    .read_html(url)[1]
    .set_index('Location')
)
I then asked you to make some changes to the Roller Coasters column, which indicates how many roller coasters are in a given country. By default, the column has a dtype of object, which means that Pandas couldn't interpret the values as either integers or floats, and thus treated them as strings. This is because RCDB's missing values are indicated with '-'.

In order to set the dtype to be int, we'll need to do three things (there's a small stand-alone sketch of these steps just after the list):
- Replace the '-' values with NaN
- Replace NaN values with 0
- Set the dtype to be int
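Here is a tiny stand-alone sketch of those three steps on a hypothetical series of strings (my own example, not from the notebook); in the actual solution, they're folded into the read_html call and the assign step, as described below:

import pandas as pd

s = pd.Series(['5', '-', '12'])    # hypothetical raw values, stored as strings
cleaned = (s
    .replace('-', pd.NA)           # step 1: treat '-' as missing
    .fillna(0)                     # step 2: missing values become 0
    .astype(int)                   # step 3: convert to integers
)
print(cleaned)                     # an integer series: 5, 0, 12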
I did this first by passing the na_values keyword argument to read_html, indicating that '-' should be treated as NaN. I then used assign to create a new column, roller_coasters. That column is the result of invoking fillna(0) (turning NaN values into 0) and then astype(int), getting back an integer series.

Notice that because of Python's syntax, and the fact that assign uses keyword arguments to name the column to which we want to assign, I had to rename the column to roller_coasters:
url = 'https://rcdb.com/location.htm'
df = (pd
    .read_html(url, na_values='-')[1]
    .set_index('Location')
    .assign(roller_coasters=lambda df_: df_['Roller Coasters'].fillna(0).astype(int))
)
We now have two columns indicating how many roller coasters are in each country, one containing integers (roller_coasters) and one containing strings (Roller Coasters). I decided to remove the latter column with drop.
In the past, I've named the columns I wanted to remove, and then passed the keyword argument axis='columns'. But I've recently learned that there's a cleaner and nicer way to do this, just passing the keyword argument columns, and giving it the list of columns we want to remove:
url = 'https://rcdb.com/location.htm'
df = (pd
    .read_html(url, na_values='-')[1]
    .set_index('Location')
    .assign(roller_coasters=lambda df_: df_['Roller Coasters'].fillna(0).astype(int))
    .drop(columns=['Roller Coasters'])
)
The result is a 177-row, 6-column data frame.
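(The older style I mentioned would have been .drop(['Roller Coasters'], axis='columns') at the same point in the chain; the two spellings are equivalent.) Here's a quick sanity check on the result, my own addition rather than part of the original solution:

print(df.shape)                      # (177, 6)
print(df['roller_coasters'].dtype)   # an integer dtype, e.g. int64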
Three country names came out garbled, with non-alphabetical characters. Which ones? How can we resolve this?
read_html works quite well. But it's not infallible, and I noticed when reviewing the results that three country names contained weird characters. How can we identify these countries?
I decided to use regular expressions, a powerful tool for describing patterns of text. (If you don't know regular expressions, you can learn them via my free Regular Expressions Crash Course, at https://regexpcrashcourse.com/.)
My logic was as follows:
- Go through each element in the index (df.index), and check if it contains only characters that we would want/expect to be in country names. The regular expression will be ^[- \w\']+$, which means:
  - Must match from the start (^)
  - We then have a character class, meaning that we want a character from the set that we define. The set here includes - (minus), ' ' (space), an apostrophe ('), and \w (any letter, digit, or _, including characters from non-Latin character sets)
  - We want one or more elements from the character class (+)
  - We want to anchor the regexp to the end of the string ($)
- If the country name does not match that expectation, then we want to keep it
- I put this in a list comprehension, iterating over every element in df.index
- The result was a Python list, which I then put through sorted to get the country names in alphabetical order.
In the end, the code looks like this:
import re

sorted([one_name
        for one_name in df.index
        if not re.search(r'^[- \w\']+$', one_name)])
This worked just fine, and gave me the following results:
['CuraÃ§ao', "CÃ´te d'Ivoire", 'SÃ£o TomÃ© and PrÃ\xadncipe']
As we can see, these names contain some weird characters. But why these characters? What makes them so weird?
The answer, in a word, is "encoding." Computers use bytes to store information, and if you're using English, then one byte is the same as one character. But in order to handle other languages and character sets, we often need to use more than one byte per character. How we translate several bytes into a character is a question that has multiple answers, each known as an "encoding."
The most common encoding used in modern Python is UTF-8, which gives us access to the full Unicode character set – allowing us to use nearly any character from most languages, as well as emojis and similar symbols. UTF-8 uses a single byte for characters that existed in ASCII, but two, three, or more bytes for other characters. Here, though, those multi-byte sequences were decoded one byte at a time, which is why we're seeing two funny characters in these countries' names where there should be a single one. And some of those stray characters are symbols rather than letters, which is why they don't match the \w character class in our regular expression.
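To see what's going on at the byte level, here's a small sketch of the problem (my own illustration; I'm using latin-1 as a stand-in for whichever single-byte encoding was actually applied):

import re

s = 'ç'                        # a single accented character
b = s.encode('utf-8')          # b'\xc3\xa7' -- two bytes in UTF-8
print(b.decode('latin-1'))     # 'Ã§' -- decoded one byte at a time, we get two characters

print(re.search(r'\w', '§'))   # None -- '§' is a symbol, not a letter, so \w doesn't match it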
The simplest solution to this problem is to return to our call to read_html, telling it that we want to interpret the bytes we downloaded from the Web site in UTF-8. We can do that by passing encoding='utf-8' as a keyword argument:
df = (pd
    .read_html(url, na_values='-', encoding='utf-8')[1]
    .set_index('Location')
    .assign(roller_coasters=lambda df_: df_['Roller Coasters'].fillna(0).astype(int))
    .drop(columns=['Roller Coasters'])
)
Sure enough, this does the trick! If I re-run my list comprehension, I get an empty list – meaning that all of the characters in all of the country names are now legitimate.
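For completeness, the re-check is just the same comprehension as before, run against the newly loaded data frame:

import re

sorted([one_name
        for one_name in df.index
        if not re.search(r'^[- \w\']+$', one_name)])    # now returns an empty list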