BW #92: Climate disaster costs (solution)
Administrative notes:
- Like many others, I've been getting active on BlueSky. Follow me there, at https://bsky.app/@lernerpython.com!
- Pandas office hours, where you can ask me any questions you want about Pandas and data, will take place on Sunday. It's open to all paid members, which includes members of my Python+Data subscription service. I'll send a Zoom link in the coming day or so.
And now, back to calculating the cost of climate change...
This week, as the annual UN conference on climate change (COP 29, https://unfccc.int/cop29/about-cop29) opened in Baku, Azerbaijan, we learned that climate disasters have had massive economic implications.
Specifically, the International Chamber of Commerce released a report (downloadable from https://iccwbo.org/news-publications/policies-reports/new-report-extreme-weather-events-cost-economy-2-trillion-over-the-last-decade/) showing that climate-related disasters have cost more than $2 trillion over the last decade. The report was partly based on information from EM-DAT (https://www.emdat.be/), "The international disaster database".
To get their data, you'll need to register (for free) on the site, and then download it from https://public.emdat.be/data, choosing both classifications (natural and technological), all regions of the world, and all years from 2000 through 2024.
The data dictionary for this information is at https://doc.emdat.be/docs/data-structure-and-content/emdat-public-table/.
Data and seven questions
This week, I have seven tasks and questions for you, based on the EM-DAT data, which I downloaded in Excel format.
As always, a link to the Jupyter notebook I used to solve these problems is at the bottom of this post.
Download the data. Set the index to use region, subregion, and country.
First and foremost, let's load up Pandas. We'll also be using NumPy down below, for doing linear regression, so we can load that, too:
import numpy as np
import pandas as pd
Next, we'll load the Excel file from EM-DAT into a data frame. The original filename was incredibly long, so I changed it to emdat.xlsx. I then used read_excel to read the data into a data frame:
filename = 'emdat.xlsx'
df = pd.read_excel(filename)
This actually worked, but the Excel document has more than one sheet: one with the data, and another with a bit of metadata. I decided to specify, using the sheet_name keyword argument, that we wanted to retrieve the first one. Rather than name the sheet with a string, I used 0, the numeric (zero-based) index of the sheet that was of interest:
df = pd.read_excel(filename,
sheet_name=0)
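If you're ever unsure which sheets a workbook contains, one quick check (a small sketch, using pandas' ExcelFile class) is to list the sheet names before deciding which one to read:

# List the names of the sheets in the Excel workbook
pd.ExcelFile(filename).sheet_names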
Finally, I asked you to create a three-part multi-index with the Region, Subregion, and Country columns. As with read_csv, we can pass the index_col keyword argument, indicating which column or list of columns should be treated as the index. But unlike read_csv, read_excel requires that we pass the numeric indexes of the columns we want to use for our multi-index. I thus passed:
df = pd.read_excel(filename,
sheet_name=0,
index_col=[12, 11, 10])
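If the numeric positions feel opaque, an equivalent approach is to read everything and then call set_index with the column names. Here's a sketch, assuming the columns are actually named 'Region', 'Subregion', and 'Country', as in the EM-DAT data dictionary:

# Equivalent: read the sheet, then build the multi-index by column name
df = (
    pd.read_excel(filename, sheet_name=0)
    .set_index(['Region', 'Subregion', 'Country'])
)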
I'll note that there wasn't much of a practical advantage or use to having this multi-index, at least for the questions I asked this week. But there are a lot of questions we could potentially ask that would benefit from such a multi-index.
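For example, once the multi-index is in place, we can pull out all of the rows for a single region, or take a cross-section on a single subregion, in one step. A sketch; the labels and level name here are assumptions, so check df.index for the actual values in your download:

# All rows for one region (the outermost index level)
df.loc['Africa']

# All rows for one subregion, via a cross-section on that level
df.xs('Western Europe', level='Subregion')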
How much in total economic damage, in adjusted dollars, did we see for climate disasters (see "Disaster Subgroup") in the last decade (see "Start Year")? For what proportion of disasters do we even have damage amounts? Does this mean the report is likely understating or exaggerating the amounts?
The statistic that originally caught my eye for this data set was, as mentioned above, the idea that climate disasters have cost us more than $2 trillion in the last decade. In this question, we try to replicate that number using the data.
To answer it, we first need to keep only those rows where Disaster Subgroup
has as value of 'Climatological'
. We can do that with a combination of loc
and lambda
:
(
df
.loc[lambda df_: df_['Disaster Subgroup'] == 'Climatological']
)
We can then use loc and lambda again to keep only years from 2014 onward. Note that in both of these lambda expressions, we have a parameter called df_. This helps us to remember that we aren't working with the global df variable, but rather with the data frame that we got from the previous line, thanks to method chaining. (Using df would actually give us an error, because of the mismatched index.)
The expanded query is thus:
(
df
.loc[lambda df_: df_['Disaster Subgroup'] == 'Climatological']
.loc[lambda df_: df_['Start Year'] >= 2014]
)
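If the lambda style feels unfamiliar, here's an equivalent sketch without method chaining, assigning each intermediate result to its own variable:

# Same two filters, written without method chaining
climatological = df[df['Disaster Subgroup'] == 'Climatological']
recent_climatological = climatological[climatological['Start Year'] >= 2014]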
We want to find out the total amount of damage done in this period, in adjusted dollars. We can thus select just that one column, using [], and then run sum on it:
(
df
.loc[lambda df_: df_['Disaster Subgroup'] == 'Climatological']
.loc[lambda df_: df_['Start Year'] >= 2014]
['Total Damage, Adjusted (\'000 US$)']
.sum()
)
I got a result of 248,717,368. We need to multiply it by 1,000 to get the actual amount, which comes out to about $249 billion. That's certainly a lot of money – but it's not $2 trillion. Where did the number in the report come from?
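(As a quick sanity check on that arithmetic, here's the conversion spelled out, using the sum from the run above.)

# The damage column is in '000 US$, so multiply by 1,000 to get dollars
total_thousands = 248_717_368
print(f'${total_thousands * 1_000:,}')   # about $249 billion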
It seems that the $2 trillion figure is accurate if we look at all natural disasters, i.e., those with a Disaster Group of 'Natural':
(
df
.loc[lambda df_: df_['Disaster Group'] == 'Natural']
.loc[lambda df_: df_['Start Year'] >= 2014]
['Total Damage, Adjusted (\'000 US$)']
.sum()
)
The result from this is 2,094,600,017 – that is, just over 2 billion. After multiplying it by 1,000, we get the $2 trillion figure that the report discussed.
The data dictionary (https://doc.emdat.be/docs/data-structure-and-content/disaster-classification-system/) breaks the different subgroups down. While I thought it was reasonable to look just at the 'Climatological' data, it seems that the report looked more broadly at all natural disasters. Given that the breakdown in the data dictionary includes climate-related issues such as floods, storms, and epidemics, that strikes me as pretty reasonable.
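To see how much each subgroup contributes to that total, we could group the filtered rows by Disaster Subgroup and sum the damage column. A quick sketch along the same lines as the queries above:

# Total adjusted damage per disaster subgroup, natural disasters since 2014
(
    df
    .loc[lambda df_: df_['Disaster Group'] == 'Natural']
    .loc[lambda df_: df_['Start Year'] >= 2014]
    .groupby('Disaster Subgroup')['Total Damage, Adjusted (\'000 US$)']
    .sum()
)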
However, do we have dollar amounts reported for all disasters? Or just for some? A lack of data shows up as NaN, and we can check for that with the isna method. This returns True or False, which we can then feed to value_counts to find out how many of each appear. Better yet, we can pass the normalize=True keyword argument, which gives us proportions (rather than raw counts):
(
df
.loc[lambda df_: df_['Disaster Group'] == 'Natural']
.loc[lambda df_: df_['Start Year'] >= 2014]
['Total Damage, Adjusted (\'000 US$)']
.isna()
.value_counts(normalize=True)
)
The result:
Total Damage, Adjusted ('000 US$)
True 0.697168
False 0.302832
Name: proportion, dtype: float64
In other words, we only have dollar amounts for about 30 percent of the disasters tracked in this database. That means the total dollar amount is likely much higher than $2 trillion; we just don't have the numbers for it.
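By the way, because the mean of a boolean series is the fraction of True values, we could get the proportion of missing values a bit more directly with mean. A sketch:

# Fraction of rows with no damage figure (True means the value is missing)
(
    df
    .loc[lambda df_: df_['Disaster Group'] == 'Natural']
    .loc[lambda df_: df_['Start Year'] >= 2014]
    ['Total Damage, Adjusted (\'000 US$)']
    .isna()
    .mean()
)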
How much higher? If we take the median (not the mean) disaster cost from the values we do have, we get $141,261,000. Let's replace the NaN values in the cost column with that:
df['Total Damage, Adjusted (\'000 US$)'].fillna(141261.0).sum()
And we find that the total cost is $6,418,361,060,000, or roughly $6.4 trillion. We know that the number in this data set is understating things, but it might be overstating things to multiply it by 3. Even so... we know that natural disasters are costing a great deal of money – as well as injuries and lives!
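Rather than hard-coding 141,261, we could also compute the median and fill in a single step. A sketch; note that this takes the median of the entire column (all years and all disaster groups), so if the figure above came from a filtered subset, the result will differ slightly:

# Fill missing damage values with the column's median, then sum and
# convert from thousands of dollars to dollars
damage = df['Total Damage, Adjusted (\'000 US$)']
damage.fillna(damage.median()).sum() * 1_000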