BW #64: Coal power (solution)

The G7 just announced a plan to dramatically reduce the use of coal for power plants in the next decade. What countries use coal, and are newer plants cleaner than older ones?

We keep hearing and talking about climate change as something that will happen, or that will affect us in the future — but things are already changing, all over the world. One way to reduce and slow climate change is by reducing the degree to which we generate energy using fossil fuels.

Just this week, representatives from the Group of Seven (“G7”) nations met in Turin, Italy, and announced a plan to reduce the use of coal, a particularly popular (and thus problematic) fossil fuel. Their plan would largely eliminate the use of coal by the year 2035. (An Associated Press report is here: https://www.msn.com/en-us/money/companies/g7-nations-commit-to-phasing-out-coal-by-2035-but-give-japan-some-flexibility/ar-AA1nVWIO )

Upon reading this news, I started to wonder just how much coal is used, and which countries are using it. I soon discovered Global Energy Monitor (https://globalenergymonitor.org/), whose Global Coal Plant Tracker contains lots of interesting and useful data about precisely this subject. That's the data we'll be looking at this week, exploring the history and future of coal-powered plants, and trying to understand which countries are still running coal plants, what kind of coal and process they're using, and how much in the way of emissions they're creating.

Data and seven questions

The Global Energy Monitor's Global Coal Plant Tracker has its home page here:

https://globalenergymonitor.org/projects/global-coal-plant-tracker/

To download the data, you'll need to go to the "download data" link:

https://globalenergymonitor.org/projects/global-coal-plant-tracker/download-data/

If you prefer (and you probably will), you can skip that step; I already downloaded the data via their e-mail link, and it's available here:

/content/files/wp-content/uploads/2024/02/global-coal-plant-tracker-january-2024.xlsx

It takes 6-8 hours to research and write each edition of Bamboo Weekly. Thanks to those of you who support my work with a paid subscription! Even if you can’t, it would mean a lot if you would share BW with your Python- and Pandas-using colleagues. Thanks!

Download the Excel spreadsheet, and load the "Units" sheet from that document into a data frame. We'll only want the following columns: "Country", "Capacity (MW)", "Status", "Start year", "Combustion technology", "Coal type", "Region", and "Annual CO2 (million tonnes / annum)".

Before doing anything else, I loaded up Pandas:

import pandas as pd

I then wanted to load data from the Excel file. The file, as is often the case with Excel, contains several different “sheets,” which you can think of as sub-documents in one spreadsheet, or tabs in a browser window. We can specify which sheet we want to load by passing the “sheet_name” keyword argument, either as the name of the sheet (a string) or as an integer index. Here, the “Units” sheet has a simple enough name, so I passed it as a string to “read_excel”:

filename = 'Global-Coal-Plant-Tracker-January-2024.xlsx'

df = pd.read_excel(filename, sheet_name='Units')

The above worked just fine. Moreover, “read_excel” is smart enough to use the first row of the spreadsheet as the column names. (If it doesn’t, then you can always pass the “header” keyword argument, to indicate which row should be used for the column names.)
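For example, if the column names had been on the third row of the sheet rather than the first (they aren't, in this file), we could have pointed “read_excel” at that row. This is purely a hypothetical sketch:

# Hypothetical: if the headers were on the third row of the sheet
# (i.e., row index 2), we could tell read_excel where to find them
df = pd.read_excel(filename, sheet_name='Units', header=2)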

We aren't going to be using all of the columns in the spreadsheet in our analysis, however. For that reason, I thought it might be a good idea to limit which columns we load. We can do that by passing the “usecols” keyword argument, with a list of strings indicating which columns we want to keep:

filename = 'Global-Coal-Plant-Tracker-January-2024.xlsx'

df = pd.read_excel(filename, sheet_name='Units',
                  usecols=["Country", "Capacity (MW)", "Status", 
                           "Start year", "Combustion technology", 
                           "Coal type", "Region", 
                           "Annual CO2 (million tonnes / annum)"])

With this in place, we’re now set to do our analysis.

How much memory does the data frame take up? How much memory do you save by turning columns into categories? Which columns are most (and least) likely to save us memory in this way? Are there any columns that we *could* turn into categories, but shouldn't?

How big is our data frame? The easiest way to find out is to invoke the “memory_usage” method, which returns one result for each column in the data frame. I ran:

df.memory_usage()

And here’s what I get back:

Index                                     132
Country                                111248
Capacity (MW)                          111248
Status                                 111248
Start year                             111248
Combustion technology                  111248
Coal type                              111248
Region                                 111248
Annual CO2 (million tonnes / annum)    111248
dtype: int64

Does this seem a bit suspicious to you? It should; I find it a bit hard to believe that every single column (except for the index) uses the same amount of memory. I mean, the “Start year” column contains integers, whereas the “Country” column contains strings. Could it be that they use precisely the same amount of memory?

Moreover, the data frame contains 13,906 rows. If we divide 111,248 by 13,906, we get 8, meaning that each of these is an 8-byte (i.e., 64-bit) value. Something seems weird here.
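Here's a small sanity check of that arithmetic, using the numbers that Pandas itself reports:

# Confirm the row count, and the (shallow) bytes-per-row for one column
print(len(df))                                  # 13906 rows
print(df.memory_usage()['Country'] / len(df))   # 8.0 bytes per row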

And indeed, something is weird here, namely the way that we're calculating memory usage. When Pandas uses NumPy as a backend (the default), it doesn't use NumPy's fixed-width strings. Instead, it uses Python strings, storing pointers to those Python objects in a NumPy array. That's what it means for a dtype to be “object” in Pandas; we're referring to Python objects.
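We can see this by asking for the dtypes; under the NumPy backend, every text column shows up as “object”:

# Text columns appear as "object", i.e., references to Python strings
print(df.dtypes)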

When we call “memory_usage”, Pandas doesn’t by default calculate the usage of each individual Python object, because that’ll take far longer. So it returns the size of the NumPy-allocated memory, instead. If we want “memory_usage” to truly gather the size of each Python string, we need to tell it to go deep with the “deep=True” keyword argument:

df.memory_usage(deep=True)

The result, this time, is a bit different:

Index                                     132
Country                                778963
Capacity (MW)                          111248
Status                                 801876
Start year                             111248
Combustion technology                  839171
Coal type                              800189
Region                                 747644
Annual CO2 (million tonnes / annum)    111248
dtype: int64

Sure enough, we now have very different amounts of memory. Because “memory_usage” returns its results as a series, we can get the total memory usage by running “sum” on the output:

before_memory_usage = df.memory_usage(deep=True).sum()
print(f'Before:\t{before_memory_usage:>12,}')

I asked you to calculate the total memory usage, and this is the best / easiest way to do so. The result, on my computer:

Before:	   4,301,719

I asked you to consider using categories to shrink this memory usage. You can think of a category in Pandas as similar to an “enum” in another programming language, allowing us to store small integers instead of (large) strings. A string column with many repeated values will benefit greatly from being turned into a category, because we'll store each distinct string once, along with many small integers referring to it.
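Here's a toy sketch (not from our data set) showing what a category column looks like under the hood:

s = pd.Series(['China', 'India', 'China', 'China']).astype('category')

# The distinct strings are stored once...
print(s.cat.categories)   # Index(['China', 'India'], dtype='object')

# ...and each row holds a small integer code pointing into that list
print(s.cat.codes)        # 0, 1, 0, 0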

Can we turn non-string columns into categories in Pandas? The answer is “yes,” but you have to be careful. That's because if you turn an integer column into a category, you won't be able to perform mathematical operations on it any more. (By contrast, you can still perform string operations on categorical string columns.) So while it's theoretically a good idea to crunch down numeric columns using categories, you should be a bit careful about doing so.
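A quick toy demonstration (again, not from our data) of both points:

years = pd.Series([2000, 2010, 2010]).astype('category')

try:
    years + 1                 # arithmetic on a categorical column...
except TypeError as e:
    print(f'Math on categories fails: {e}')   # ...raises a TypeError

names = pd.Series(['Asia', 'Europe', 'Asia']).astype('category')
print(names.str.upper())      # string operations still work fine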

Let's thus turn all non-numeric columns in our data frame into categories. We'll first select all of the columns without a “float64” dtype using “select_dtypes”. This method lets us select all columns with a particular dtype (using “include”) or without a particular dtype (using “exclude”). Here, I'm going to ask for all columns that are not float64. I'll then run “astype('category')” on each of those columns, assigning the result back to the original column:

for one_column in df.select_dtypes(exclude='float64').columns:
    df[one_column] = df[one_column].astype('category')

After this transformation, I can then again run “memory_usage” and sum up the size of the data frame:

after_memory_usage = df.memory_usage(deep=True).sum()
print(f'After:\t{after_memory_usage:>12,}')

The result:

After:	     416,504

That looks like quite a savings of memory! How much did we save, in total?

print(f'Saved:\t{(before_memory_usage - after_memory_usage):>12,}')

Which gives us output of:

Saved:	   3,885,215

In other words, we saved more than 3.8 MB of memory by switching to categories for those columns. It took almost no time, and doesn’t affect the functionality or performance. That’s why I often point to categories as one of the easiest and fastest optimizations you can do in Pandas.

By the way, I decided to try specifying the use of PyArrow for the dtype backend, rather than NumPy, to see what sort of memory savings we might enjoy:

dfa = pd.read_excel(filename, sheet_name='Units',
                  usecols=["Country", "Capacity (MW)", "Status", 
                           "Start year", "Combustion technology", 
                           "Coal type", "Region", 
                           "Annual CO2 (million tonnes / annum)"],
                   dtype_backend='pyarrow')

I then calculated the memory usage. Because PyArrow stores all of the values in Apache Arrow, and not as Python objects, there's no need to pass “deep=True”. Let's see how much we saved relative to the original NumPy-backed data frame:

arrow_memory_usage = dfa.memory_usage().sum()
print(f'Arrow:\t{(before_memory_usage - arrow_memory_usage):>12,}')

The result:

Arrow:	   3,130,846

So Arrow still saved us a lot of memory, and we didn't have to use categories, although it didn't save quite as much as categories did. That said, I was able to run “astype('category')” on my Arrow-backed string columns, and I got the same memory usage as with the NumPy-backed categories.
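Here's a minimal sketch of what I mean, converting the text columns by name (since the Arrow-backed dtypes have different names than their NumPy equivalents, I'm not using “select_dtypes” here):

# Convert the text columns of the Arrow-backed data frame to categories
for one_column in ['Country', 'Status', 'Combustion technology',
                   'Coal type', 'Region']:
    dfa[one_column] = dfa[one_column].astype('category')

print(f'{dfa.memory_usage(deep=True).sum():>12,}')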