BW #87: Nuclear power (solution)
This week, we looked at power plants around the world, and specifically at nuclear power plants. My interest was piqued by a recent set of stories describing how Microsoft is investing in the re-activation of Three Mile Island's nuclear power plant, in order to generate enough electricity for OpenAI's data needs without generating too many carbon emissions. (More details are in this New York Times article: https://www.nytimes.com/2024/09/20/climate/three-mile-island-reopening.html?unlocked_article_code=1.Qk4.saFR.wp4WwwvU0CYJ&smid=url-share)
Moreover, this gave us a chance to work a bit more with GeoPandas – creating a GeoDataFrame from non-geographical data, and then creating interactive maps with information about nuclear power plants.
Data set and six questions
This week's data set comes from the World Resources Institute (https://wri.org). Specifically, the data comes from their Global Power Plant database:
https://datasets.wri.org/datasets/global-power-plant-database
You can use that Web page to explore the map of power plants around the world. But of course, we're looking to download the data and analyze it. On the main database page, make sure that the "data files" tab is selected. There are several versions of the database schema, as well as of the data itself. We're looking for version 1.3.0, which is at the end of the list. As of this writing, the data was last updated on September 25th. Click on the "download" button, which will download a zipfile from Amazon S3. That file, when opened, includes a CSV file that we'll use for our data.
Load the data into a GeoDataFrame, using the longitude and latitude columns as inputs for a Point in the geometry column.
For starters, we'll need to load some libraries. This week, because we'll be using not only Pandas, but also GeoPandas, we'll need to load both of them.
import pandas as pd
import geopandas as gpd
Now let's load the data from the downloaded CSV file into a Pandas data frame using read_csv. The only unusual thing here is that I passed the low_memory=False keyword argument. This tells Pandas to read the whole file in a single pass rather than in smaller chunks, which lets it examine all of the values in each column before deciding which dtype to assign. Without this keyword argument, read_csv complained to me that it had found mixed dtypes in some of the columns, because it was reading the data in smaller chunks and choosing a dtype for each chunk:
filename = 'global_power_plant_database.csv'
df = pd.read_csv(filename, low_memory=False)
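If you'd rather not rely on low_memory=False, another option is to tell read_csv the dtype of a troublesome column up front. Here's a minimal sketch using a tiny in-memory CSV; the choice of wepp_id as the mixed column, and the values in it, are my assumptions for illustration rather than something taken from the real file:

```python
import io

import pandas as pd

# Hypothetical sample: 'wepp_id' mixes numeric-looking and text values
csv_text = """name,capacity_mw,wepp_id
Plant A,33.0,1009793
Plant B,10.0,1009795|1009796
Plant C,66.0,
"""

# Passing dtype explicitly means Pandas doesn't have to guess for that column
sample = pd.read_csv(io.StringIO(csv_text), dtype={'wepp_id': str})

print(sample.dtypes)
print(sample['wepp_id'])
```

With an explicit dtype, the numeric-looking IDs stay as strings, so every value in the column ends up with the same type.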
For most queries against this data set, that would be enough. But I asked you to turn it into a GeoDataFrame. A GeoDataFrame is just like a regular DataFrame object, except that it has a geometry column containing the geometric data we want to work with. If we call a regular Pandas method, then a GeoDataFrame works just like a DataFrame. But if we call a special GeoPandas method, then it uses the data in the geometry column to perform its calculations.
We can create a GeoDataFrame based on an existing data frame, but we need to supply a geometry column. How can we do that?
One way would be to create a list of Point values, using the shapely library. We could iterate over the longitude and latitude columns in df, creating a Point for each pair, and using the resulting list to create our GeoDataFrame.
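Here's a minimal sketch of that manual approach, with made-up plant names and coordinates standing in for the real data:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Made-up rows standing in for the real df
sample = pd.DataFrame({
    'name': ['Plant A', 'Plant B'],
    'longitude': [-76.72, 2.35],
    'latitude': [40.15, 48.86],
})

# Build one Point per row; note that x is longitude and y is latitude
points = [
    Point(lon, lat)
    for lon, lat in zip(sample['longitude'], sample['latitude'])
]

sample_gdf = gpd.GeoDataFrame(sample, geometry=points, crs="EPSG:4326")
print(sample_gdf)
```

This works, but it runs a Python-level loop over every row, which gets slow as the number of rows grows.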
But GeoPandas comes with a convenience function, gpd.points_from_xy, which we can use to create our geometry column more easily and efficiently:
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs="EPSG:4326"
)
The result of this is that gdf is now a GeoDataFrame containing the same information as df did, but with an additional geometry column containing Point values, one for the location of each power plant.
It's important to remember, though, that geographic data can be described in a variety of ways, using a number of different coordinate systems. It's thus not enough for us to indicate the geometry column. We also need to indicate what "coordinate reference system," or CRS, we plan to use. In this case, because we got data in real-world longitude and latitude, we can use "EPSG:4326", the designation for such coordinates.
The result is a GeoDataFrame containing 34,936 rows and 37 columns. (This is one more column than df, thanks to the geometry column.) If we run gdf.dtypes, to get the dtype of each column, we find:
country object
country_long object
name object
gppd_idnr object
capacity_mw float64
latitude float64
longitude float64
primary_fuel object
other_fuel1 object
other_fuel2 object
other_fuel3 object
commissioning_year float64
owner object
source object
url object
geolocation_source object
wepp_id object
year_of_capacity_data float64
generation_gwh_2013 float64
generation_gwh_2014 float64
generation_gwh_2015 float64
generation_gwh_2016 float64
generation_gwh_2017 float64
generation_gwh_2018 float64
generation_gwh_2019 float64
generation_data_source object
estimated_generation_gwh_2013 float64
estimated_generation_gwh_2014 float64
estimated_generation_gwh_2015 float64
estimated_generation_gwh_2016 float64
estimated_generation_gwh_2017 float64
estimated_generation_note_2013 object
estimated_generation_note_2014 object
estimated_generation_note_2015 object
estimated_generation_note_2016 object
estimated_generation_note_2017 object
geometry geometry
dtype: object
Notice that the final column, geometry, has a dtype of geometry, which is exactly what we wanted.
How many power plants are there in the world? Which 10 countries have the most plants? If we compare the mean and median number per country, what do we see, and what can we learn from that?
To find out how many power plants there are, we can use the count method on a column – say, on the country_long column, which contains the full country names:
gdf['country_long'].count()
This returns 34,936, the number of non-NaN values in the series.
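Keep in mind that count ignores missing values, so it matches the number of rows only when the column has no NaN values. A small sketch with made-up data shows the difference between count and len:

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value
s = pd.Series(['France', 'Spain', np.nan, 'France'])

print(s.count())   # counts non-missing values only
print(len(s))      # counts all rows, including the missing one
```

Here count returns 3 and len returns 4. In our data set the two agree, because every row has a value for country_long.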
But beyond that, we want to know the number of plants per country. There are a few ways to do this, among them using groupby – but Pandas provides us with a more convenient method, value_counts, which gives us the number of times each value appears in country_long. The values are automatically sorted from most to least common, so we can get the top 10 with head:
(
    gdf
    ['country_long'].value_counts()
    .head(10)
)
Here's what I get:
country_long
United States of America 9833
China 4235
United Kingdom 2751
Brazil 2360
France 2155
India 1589
Germany 1309
Canada 1159
Spain 829
Russia 545
Name: count, dtype: int64
We see that the United States has more than twice as many power plants as China, which in turn has about one and a half times as many as the UK. India, which has roughly the same population as China, has fewer power plants than France, which I wouldn't have expected.
Of course, this doesn't necessarily mean anything, because a country could have fewer, larger plants (i.e., each generating more megawatts) rather than many smaller ones.
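One way to check would be to add up capacity_mw per country, rather than counting rows. Here's a rough sketch on invented numbers; the country names and capacities below are placeholders, not values from the data set:

```python
import pandas as pd

# Invented data: country A has many small plants, B has one large plant
sample = pd.DataFrame({
    'country_long': ['A', 'A', 'A', 'B'],
    'capacity_mw': [10.0, 20.0, 30.0, 500.0],
})

# Number of plants per country
plant_counts = sample['country_long'].value_counts()

# Total capacity per country, largest first
total_capacity = (
    sample
    .groupby('country_long')['capacity_mw'].sum()
    .sort_values(ascending=False)
)

print(plant_counts)
print(total_capacity)
```

In this toy example, country A comes first by number of plants, but country B comes first by total capacity. Running the same groupby on gdf would show whether the real rankings shift in a similar way.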
If we want to compare the mean and median number of plants per country, we can use describe, which provides a full set of descriptive statistics:
(
    gdf
    ['country_long'].value_counts()
    .describe()
)
The result:
count 167.000000
mean 209.197605
std 896.913269
min 1.000000
25% 7.000000
50% 16.000000
75% 60.500000
max 9833.000000
Name: count, dtype: float64
Notice that the mean is 209 power plants, but the median is 16. The difference shows just how much a few outliers can skew the mean, potentially making us think that the "average" country has a very large number of power plants. We can see, though, that even a country at the 75th percentile has only about 60 power plants. The small number of countries with a huge number of power plants doesn't make the mean incorrect, but it certainly makes the mean less useful as a way to understand the data.
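To see the effect in isolation, here's a small sketch with invented per-country counts, in which a single outlier drags the mean far away from the median:

```python
import pandas as pd

# Invented plant counts for five countries, with one very large outlier
counts = pd.Series([5, 10, 15, 20, 5000])

print(counts.mean())
print(counts.median())

# Or get both at once
print(counts.agg(['mean', 'median']))
```

The mean here is 1010, even though four of the five values are 20 or less, while the median is 15.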