BW #87: Nuclear power (solution)

This week, we looked at power plants around the world, and specifically at nuclear power plants. My interest was piqued by a recent set of stories describing how Microsoft is investing in the re-activation of Three Mile Island's nuclear power plant, in order to generate enough electricity for OpenAI's data needs without generating too many carbon emissions. (More details are in this New York Times article: https://www.nytimes.com/2024/09/20/climate/three-mile-island-reopening.html?unlocked_article_code=1.Qk4.saFR.wp4WwwvU0CYJ&smid=url-share.)

Moreover, this gave us a chance to work a bit more with GeoPandas – creating a GeoDataFrame from non-geographical data, and then creating interactive maps with information about nuclear power plants.

Data set and six questions

This week's data set comes from the World Resources Institute (https://wri.org). Specifically, the data comes from their Global Power Plant database:

https://datasets.wri.org/datasets/global-power-plant-database

You can use that Web page to explore the map of power plants around the world. But of course, we're looking to download the data and analyze it. On the main database page, make sure that the "data files" tab is selected. There are several versions of the database schema, as well as of the data itself. We're looking for version 1.3.0, which is at the end of the list. As of this writing, the data was last updated on September 25th. Click on the "download" button, which will download a zipfile from Amazon S3. That file, when opened, includes a CSV file that we'll use for our data.

Load the data into a GeoDataFrame, using the `longitude` and `latitude` columns as inputs for a `Point` in the `geometry` column.

For starters, we'll need to load some libraries. This week, because we'll be using not only Pandas, but also GeoPandas, we'll need to load both of them.

import pandas as pd
import geopandas as gpd

Now let's load the data from the downloaded CSV file into a Pandas data frame using read_csv. The only unusual thing here is that I passed the low_memory=False keyword argument. By default, read_csv processes the file in chunks to save memory, and infers a dtype for each chunk separately. Without this keyword argument, read_csv warned me that some columns had mixed dtypes, because different chunks had led to different guesses. Passing low_memory=False tells Pandas that there's enough memory to read the entire file at once, so it can examine each column in full before deciding which dtype to assign:

filename = 'global_power_plant_database.csv'

df = pd.read_csv(filename, low_memory=False)
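An alternative, if you already know which columns are ambiguous, is to declare their dtypes yourself so that Pandas doesn't have to guess. The following is only a sketch: the sample data is made up, and the choice of wepp_id as the troublesome column is an assumption for illustration, not something the warning told me.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has many more rows and columns.
sample_csv = io.StringIO(
    "name,wepp_id,capacity_mw\n"
    "Plant A,1001,33.0\n"
    "Plant B,1002|1003,10.5\n"
)

# Declaring the dtype up front means Pandas never has to infer it,
# so values such as 1001 stay as text rather than becoming integers.
sample_df = pd.read_csv(sample_csv, dtype={"wepp_id": str})

print(sample_df["wepp_id"].tolist())
```

This keeps memory use low while still giving a consistent dtype, at the cost of having to know the column names in advance.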

For most queries against this data set, that would be enough. But I asked you to turn it into a GeoDataFrame. A GeoDataFrame is just like a regular DataFrame object, except that it has a geometry column containing the geometric data we want to work with. If we call a regular Pandas method, then a GeoDataFrame works just like a DataFrame. But if we call a special GeoPandas method, then it uses the data in the geometry column to perform its calculations.

We can create a GeoDataFrame based on an existing data frame, but we need to supply a geometry column. How can we do that?

One way would be to create a list of Point values, using the shapely library. We could iterate over the longitude and latitude columns in df, creating a Point for each one, and using the resulting list to create our GeoDataFrame.
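As a rough sketch of that manual approach, here is what it might look like on a tiny stand-in data frame. The names toy_df and toy_gdf, and the coordinates themselves, are invented for illustration.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Tiny stand-in for df; names and coordinates are made up.
toy_df = pd.DataFrame({
    "name": ["Plant A", "Plant B"],
    "longitude": [-76.72, 2.35],
    "latitude": [40.15, 48.86],
})

# Point takes x (longitude) first, then y (latitude).
points = [
    Point(lon, lat)
    for lon, lat in zip(toy_df["longitude"], toy_df["latitude"])
]

# Hand the list of Points to GeoDataFrame as its geometry column.
toy_gdf = gpd.GeoDataFrame(toy_df, geometry=points, crs="EPSG:4326")

print(toy_gdf.geometry.iloc[0])
```

This works, but it builds each Point in a Python loop, which gets slow on a data frame with tens of thousands of rows.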

But GeoPandas comes with a convenience method, gpd.points_from_xy, which we can use to create our geometry column more easily and efficiently:

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs="EPSG:4326"
)

The result of this is that gdf is now a GeoDataFrame containing the same information as df did, but with an additional geometry column containing Point values, the location of each power plant.

It's important to remember, though, that geographic data can be described in a variety of ways, using a number of different coordinate systems. It's thus not enough for us to indicate the geometry column. We also need to indicate what "coordinate reference system," or CRS, we plan to use. In this case, because we got data in real-world longitude and latitude, we can use "EPSG:4326", the designation for such coordinates.
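To see why the CRS matters, it can help to inspect it and to reproject the same point into a different system. This is a minimal sketch on a one-row GeoDataFrame with made-up coordinates. The choice of EPSG:3857 (Web Mercator) as the target is mine, purely for illustration, and isn't needed for this week's questions.

```python
import geopandas as gpd

# One-row stand-in with made-up coordinates, declared as longitude/latitude.
toy_gdf = gpd.GeoDataFrame(
    {"name": ["Plant A"]},
    geometry=gpd.points_from_xy([-76.72], [40.15]),
    crs="EPSG:4326",
)
print(toy_gdf.crs)

# Reproject to Web Mercator, whose units are meters rather than degrees.
mercator_gdf = toy_gdf.to_crs("EPSG:3857")
print(mercator_gdf.geometry.iloc[0])
```

The point describes the same place in both cases, but the numbers are very different, which is why GeoPandas needs to be told which system the original values use.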

The result is a GeoDataFrame containing 34,936 rows and 37 columns. (This is one more column than df, thanks to the geometry column.) If we run gdf.dtypes, to get the dtype of each column, we find:

country                             object
country_long                        object
name                                object
gppd_idnr                           object
capacity_mw                        float64
latitude                           float64
longitude                          float64
primary_fuel                        object
other_fuel1                         object
other_fuel2                         object
other_fuel3                         object
commissioning_year                 float64
owner                               object
source                              object
url                                 object
geolocation_source                  object
wepp_id                             object
year_of_capacity_data              float64
generation_gwh_2013                float64
generation_gwh_2014                float64
generation_gwh_2015                float64
generation_gwh_2016                float64
generation_gwh_2017                float64
generation_gwh_2018                float64
generation_gwh_2019                float64
generation_data_source              object
estimated_generation_gwh_2013      float64
estimated_generation_gwh_2014      float64
estimated_generation_gwh_2015      float64
estimated_generation_gwh_2016      float64
estimated_generation_gwh_2017      float64
estimated_generation_note_2013      object
estimated_generation_note_2014      object
estimated_generation_note_2015      object
estimated_generation_note_2016      object
estimated_generation_note_2017      object
geometry                          geometry
dtype: object

Notice that the final column, geometry, has a dtype of geometry, which is exactly what we wanted.

How many power plants are there in the world? Which 10 countries have the most plants? If we compare the mean and median number per country, what do we see, and what can we learn from that?

To find out how many power plants there are, we can use the count method on a column – say, on the country_long column, which contains the full country names:

gdf['country_long'].count()

This returns 34,936, the number of non-NaN values in the series.
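The distinction between count and the total number of rows only matters when a column has missing values. Here's a small sketch, using a made-up series, showing how the two can differ:

```python
import pandas as pd

# Made-up series with one missing value.
s = pd.Series(["France", None, "Spain"])

print(s.count())  # counts only the non-missing values
print(len(s))     # counts every row, missing or not
```

In our data set the two numbers agree for country_long, since count returns the same value as the number of rows.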

But beyond that, we want to know the number of plants per country. There are a few ways to do this, among them using groupby – but Pandas provides us with a more convenient function, value_counts, which gives us the number of times each value appears in country_long. The values are automatically sorted from most to least common, so we can get the top 10 with head:

(
    gdf
    ['country_long'].value_counts()
    .head(10)
)

Here's what I get:

country_long
United States of America    9833
China                       4235
United Kingdom              2751
Brazil                      2360
France                      2155
India                       1589
Germany                     1309
Canada                      1159
Spain                        829
Russia                       545
Name: count, dtype: int64
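For comparison, here is a sketch of the groupby version, run on a small made-up data frame rather than on the real data. It should give the same counts as value_counts, but we have to do the sorting ourselves:

```python
import pandas as pd

# Made-up data: China appears 3 times, Brazil twice, India once.
toy = pd.DataFrame({
    "country_long": ["Brazil", "China", "China", "India", "China", "Brazil"]
})

# value_counts sorts from most to least common automatically.
by_value_counts = toy["country_long"].value_counts()

# groupby + size gives the same numbers, but sorted by group name,
# so we sort by the counts explicitly.
by_groupby = (
    toy
    .groupby("country_long")
    .size()
    .sort_values(ascending=False)
)

print(by_groupby)
```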

We see that the United States has more than twice as many power plants as China, which in turn has about one and a half times as many as the UK. India, which has roughly the same population as China, has fewer power plants than France, which I wouldn't have expected.

Of course, this doesn't necessarily mean anything, because a country could have fewer, larger (i.e., generating more megawatts) plants rather than many smaller plants.
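One way to check this would be to look at the capacity_mw column alongside the plant count. The following sketch uses a made-up data frame with two hypothetical countries, "A" and "B", to show the idea. I haven't run it against the real data here, so it says nothing about how actual countries compare.

```python
import pandas as pd

# Made-up data: country A has three small plants, country B has one big one.
toy = pd.DataFrame({
    "country_long": ["A", "A", "A", "B"],
    "capacity_mw": [10.0, 20.0, 30.0, 1000.0],
})

# For each country, get both the number of plants and their total capacity.
summary = (
    toy
    .groupby("country_long")["capacity_mw"]
    .agg(["count", "sum"])
    .sort_values("sum", ascending=False)
)

print(summary)
```

Here, the country with the fewest plants comes out on top when we sort by total capacity.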

If we want to compare the mean and median number of plants per country, we can use describe, which provides a full set of descriptive statistics:

(
    gdf
    ['country_long'].value_counts()
    .describe()
)

The result:

count     167.000000
mean      209.197605
std       896.913269
min         1.000000
25%         7.000000
50%        16.000000
75%        60.500000
max      9833.000000
Name: count, dtype: float64

Notice that the mean is 209 power plants, but the median is 16. The difference shows just how much a few outliers can skew the mean, potentially making us think that the "average" country has a very large number of power plants. We can see, though, that even countries at the 75th percentile have only about 60 power plants. The small number of countries with a huge number of power plants doesn't make the mean incorrect, but it certainly makes the mean less useful as a way to understand the data.
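The effect is easy to reproduce on a small scale. In this sketch, the numbers are invented: four modest values and one enormous one, standing in for a country with far more plants than the rest.

```python
import pandas as pd

# Invented counts: four small values and one large outlier.
counts = pd.Series([5, 10, 15, 20, 5000])

print(counts.mean())    # 1010.0, pulled upward by the single outlier
print(counts.median())  # 15.0, unaffected by how large the outlier is
```

No value in the series is anywhere near the mean of 1010, whereas the median of 15 is typical of four of the five values.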
