BW #88: Hot summers (solution)
This week, we looked at a data set describing temperatures in June, July, and August of this year, showing how much hotter it was than usual. The data came from Climate Central (https://www.climatecentral.org), an organization aimed at explaining climate change to the general public, whose recently released report showed that this summer was particularly hot.
I heard the CEO of Climate Central, Ben Strauss, on last week's Slate Political Gabfest podcast (https://slate.com/podcasts/political-gabfest/2024/10/election-kamala-harris-donald-trump-vote-hurricane-milton-helene-climate-change-legal-win). He said, among other things, that the temperature of ocean water is rising to never-before-seen levels, leading in part to the extremely destructive hurricanes we've seen over the last few weeks.
The data all points to climate change as the reason for increased temperatures. But it would seem that meteorologists who mention that as a reason for the huge storms are receiving threats (https://www.nytimes.com/2024/10/14/us/meteorologists-threats-conspiracy-theories-hurricanes.html?unlocked_article_code=1.SU4.uwM2.zFc50u9q7LVB&smid=url-share).
Data and seven questions
This week, we'll look at the data set from Climate Central. Their summary of the data appears on this page:
https://www.climatecentral.org/report/people-exposed-to-climate-change-june-august-2024
The data itself can be downloaded via the "Download data" link at the bottom of the page (i.e., the final bullet point), which points to an Excel file.
Here are my detailed solutions to the seven questions I asked this week. As usual, a link to my Jupyter notebook is at the bottom of the post.
Read the Excel file into a Pandas data frame. We want a three-level multi-index, with the continent, region, and country names.
We'll start by loading Pandas:
import pandas as pd
To read the Excel file into a data frame, we can use read_excel:
filename = 'Climate_Central_global_temperature_attribution_data_June-August_2024.xlsx'
df = pd.read_excel(filename)
That works fine, but I asked you to create a data frame with a three-level multi-index. When using read_csv, we can pass the index_col keyword argument. This tells the method which column (a string or integer) or columns (a list of strings, or a list of integers) should be used for the index.
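With a CSV version of this data, for instance, we could have written something like the following (the file name and column names here are invented, just to illustrate the string-based form):

import pandas as pd

# Hypothetical CSV example: read_csv lets us name the index columns
# directly, building a three-level multi-index from the named columns.
df_from_csv = pd.read_csv('climate_data.csv',
                          index_col=['continent', 'region', 'name'])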
read_excel also has an index_col keyword argument, and I imagined that I could similarly give it a list of strings. But no, it turns out that in read_excel, you pass index_col an integer (for a one-column index) or a list of integers (for a multi-index). I ended up with the following:
filename = 'Climate_Central_global_temperature_attribution_data_June-August_2024.xlsx'
df = pd.read_excel(filename,
                   index_col=[23, 24, 0])
The result was a 218-row, 23-column data frame.
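As a quick sanity check, we can confirm that we got a three-level index and the expected dimensions:

# Confirm the three-level multi-index and the data frame's dimensions
print(df.index.nlevels)   # 3
print(df.shape)           # (218, 23)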
The column names are rather long. Shorten or remove repetitive (or unnecessary) text. For example, you can replace the text "Climate Shift Index level 2 (heat made 2x more likely by climate change)" with "CS2".
I was quite surprised (and even amazed) by the length and detail of the column names in this data set. They make the purpose of each column clear... but they're rather long and annoying to use in queries. For this task, I thus asked you to rename the columns to something shorter.
The short and easy way to do this would probably be to define a list of 23 strings, and then assign them to df.columns. But why do things the fast and easy way, when we can have some fun? After all, there are a number of patterns that repeat themselves in the column names; perhaps we can exploit that repetition, and let the computer do some of the hard work for us?
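For the record, that brute-force approach would look something like this (a toy data frame, since listing all 23 real names here would be tedious):

import pandas as pd

# Toy example of the brute-force approach: assign a full list of new
# names to df.columns. With the real data, the list would need all 23
# short names, in the right order.
toy_df = pd.DataFrame([[1, 2]],
                      columns=['a very long column name',
                               'another very long column name'])
toy_df.columns = ['short1', 'short2']
print(toy_df.columns)   # Index(['short1', 'short2'], dtype='object')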
It's especially true that there are column names that are identical to one another, save for one or two numbers. Such names would be annoying to identify as specific strings, but with regular expressions, it becomes much easier.
I decided that we could do the following:
- Define a dict in which the keys and values are both strings. The keys will be the regular expressions for which we're searching, and the values will be the strings (occasionally including backreferences to the search pattern) we'll use for replacing. The re.sub function in the re module (for regular expressions) is just what we want.
- However, that's only good for performing a single replacement. We'll want to apply each key-value pair from our replacements dict. For that, we'll use a for loop, using the dict.items method to get each key-value pair in turn.
- Finally, we can use df.rename to rename the columns of our data frame. df.rename returns a new data frame with the new columns; we'll assign the result back to df in each iteration. df.rename accepts a lambda expression (i.e., a function), which we'll use to apply re.sub to each column name in the data frame.
In other words, we'll create a dict of replacement text, iterate over that dict, and use df.rename along with re.sub to perform the substitutions.
You undoubtedly had your own substitutions; here are the ones that I came up with:
import re
replacements = {
'temperatrures': 'temperatures',
'population exposed to at least': 'pop >=',
'number of days of temperatures': 'days',
r'reaching Climate Shift Index level (\d).*$': r'CS\1',
'of temperatures at or above the 90th percentile': '>= 90 pct',
'historical ranking.*': 'historical ranking',
'mean temperature anomaly.*': 'mean anomaly',
'days at or above the 90th percentile temperature.*$': 'days >= 90 pct',
r'number of \*additional\* days.*$': 'additional days >= 90 pct',
r'\s+': ' '
}
for key, value in replacements.items():
df = df.rename(columns=lambda colname_: re.sub(key, value, colname_))
In one case, I used capturing to identify a digit – (\d) – and then referred to it in the replacement string, 'CS\1'. In a few, I used .* to indicate that I wanted to grab the text through the end of the line, which I made even clearer by anchoring it there with $. Wherever I had backslashes in my regexp string, I used a raw Python string (i.e., a string with r before the opening quote), which tells Python to leave backslashes untouched, avoiding confusion or warnings about backslashed characters.
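Here's a small, standalone demonstration of the capturing and backreference (with an invented input string, in the spirit of the real column names):

import re

# The digit captured by (\d) is reinserted via the backreference \1;
# .*$ swallows everything from the match through the end of the line.
text = 'days reaching Climate Shift Index level 3 (heat made 3x more likely)'
print(re.sub(r'reaching Climate Shift Index level (\d).*$', r'CS\1', text))
# days CS3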
I also fixed their misspelled version of "temperatures", and combined multiple whitespace characters into a single space. Note that because a dict retains its insertion order, we know that the whitespace-compression regexp will be the final replacement executed, so it can clean up any extra whitespace left behind by the earlier substitutions.
After renaming, the columns are:
Index(['historical ranking', 'mean anomaly', 'days CS2', 'days CS3',
'days CS5', 'pop >= 60 days of temperatures CS5', 'days >= 90 pct',
'additional days >= 90 pct', 'pop >= 1 day >= 90 pct',
'pop >= 7 days >= 90 pct', 'pop >= 30 days >= 90 pct',
'pop >= 61 days >= 90 pct', 'pop >= 1 day of temperatures CS2',
'pop >= 1 day >= 90 pct AND CS2', 'pop >= 7 days >= 90 pct AND CS2',
'pop >= 30 days >= 90 pct AND CS2', 'pop >= 61 days >= 90 pct AND CS2',
'pop >= 1 day >= 90 pct AND CS3', 'pop >= 7 days >= 90 pct AND CS3',
'pop >= 30 days >= 90 pct AND CS3', 'pop >= 61 days >= 90 pct AND CS3',
'population', 'ISO3'],
dtype='object')
This might be a bit too terse, and you undoubtedly did it differently, but the technique is useful.
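The payoff comes when querying. For example (assuming the renamed data frame from above), finding the places with the greatest mean temperature anomaly is now easy to type:

# Short names make queries easier to write and read: the five places
# with the greatest mean temperature anomaly this summer
print(df['mean anomaly'].sort_values(ascending=False).head())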