BW #101: Los Angeles Fires (solution)
This week, we're looking at data about the wildfires raging in and near Los Angeles, California. The fires have caused astonishing harm to both people and property, and have been described as the largest natural disaster in US history.
Our questions this week will use data about the LA fires, collected and published by NASA. They, along with the US Forest Service, run FIRMS (https://firms.modaps.eosdis.nasa.gov/usfs/), which provides information about wildfires from its EOS (Earth Observing System) network of satellites. EOS, as the name indicates, looks back at Earth, rather than out into space. Using a variety of sensors, we can learn where fires are taking place, and how hot they are burning. Moreover, the data is frequently updated, giving us information about the current California fires, not just historical data.
We'll use this data to examine and visualize the fires. Along the way, we'll get a chance to explore some ideas and techniques with GeoPandas.
Data and six questions
This week's main data, as I indicated above, comes from FIRMS. The data files are available at
https://firms.modaps.eosdis.nasa.gov/usfs/active_fire/
This page contains links to lots of different data files, in many different formats. We're going to use the 7-day VIIRS data from NOAA-20, which you can download from that page.
This is a CSV file containing much of the data we want. However, we'll also be using some data about Los Angeles and the surrounding counties. We will get that from the Census Bureau's TIGER 2023 data, which includes everything that we need to work with counties:
https://www2.census.gov/geo/tiger/TIGER2023/COUNTY/tl_2023_us_county.zip
Because this is Census data, they don't use state names. Rather, they use "STATEFP" codes, which you can translate from here:
https://www2.census.gov/geo/docs/reference/codes2020/national_state2020.txt
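That file is pipe-delimited. Assuming it follows the usual Census layout (a header row of pipe-separated column names, including STATEFP and STATE_NAME), we can load it with read_csv, passing sep='|' and reading everything as strings so that leading zeroes in the FIPS codes survive. Here's a sketch using a tiny inline stand-in for the real file:

```python
from io import StringIO
import pandas as pd

# Tiny stand-in for national_state2020.txt; the real file uses the same
# pipe-delimited layout (column names assumed from the Census format).
data = '''STATE|STATEFP|STATE_NAME
CA|06|California
HI|15|Hawaii
'''

# dtype=str keeps the leading zero in codes such as '06'
states = pd.read_csv(StringIO(data), sep='|', dtype=str)
```

With the real file, you would pass the URL (or a downloaded copy) to read_csv instead of the StringIO object.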
The learning goals for this week mainly involve working with GeoPandas, including joining and plotting. But we'll also do some work with dates and times, non-geo joins, grouping, and pivot tables.
Here are my six tasks and questions. A link to the Jupyter notebook I used to solve these problems is at the bottom of this message.
Create a Pandas data frame from the VIIRS / NOAA-20 data that NASA provides. Include a date column, of dtype datetime, based on the acq_date and acq_time columns. The latter is in HHMM format, reflecting the time (GMT) at which the data was collected. Remove acq_date, acq_time, and satellite when you're done.
For starters, I'll load both Pandas and GeoPandas:
import pandas as pd
import geopandas
Next, I want to create a Pandas data frame from the downloaded CSV file. We can do that with read_csv:
filename = 'J1_VIIRS_C2_USA_contiguous_and_Hawaii_7d.csv'

df = (
    pd
    .read_csv(filename)
)
This is fine, but we want to combine the acq_date and acq_time columns into a single date column. We can do that if we have a string column in a format that pd.to_datetime recognizes – or we can pass a format string (as specified in such places as https://www.strfti.me/) to give it a further hint.
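For example, given a string in the combined "date plus four-digit time" format we're about to build, pd.to_datetime with a format string parses it directly (the sample timestamp here is made up):

```python
import pandas as pd

# '%H%M' matches a four-digit time with no separator between hours and minutes
ts = pd.to_datetime('2025-01-10 0342', format='%Y-%m-%d %H%M')
```

Without the format argument, pd.to_datetime would have no way to know that '0342' means 03:42.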
The problem is that the acq_time column is seen as integers by read_csv. Moreover, it's supposed to be a four-digit time in HHMM format, but sometimes it's just HMM.
What we'll do is use assign to create a new date column. Its contents will be the result of invoking pd.to_datetime on a combination of acq_date and acq_time. We use a lambda expression here, because pd.to_datetime isn't a method, and thus needs to be invoked in another context.
However, we can't use acq_time directly, because it's an integer column. Instead, we'll use astype to turn it into a string column. We'll then use str.zfill to pad our string with leading zeroes, ensuring that we end up with four characters total.
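To see why the padding matters, here's what that chain of calls does to a couple of sample acq_time values (the numbers are made up, but have the same HMM/HHMM shape as the real column):

```python
import pandas as pd

# read_csv gives us integers; 342 really means 03:42 GMT
times = pd.Series([342, 1215])

# astype(str) turns the integers into strings; zfill(4) pads with
# leading zeroes to a width of four characters
padded = times.astype(str).str.zfill(4)
# padded now contains '0342' and '1215'
```

Without zfill, the three-digit times would fail to match the '%H%M' format string.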
The result will be a string in the format of YYYY-MM-DD HHMM. We can tell pd.to_datetime to use this format by passing the format keyword argument '%Y-%m-%d %H%M'. In other words, we end up with:
filename = 'J1_VIIRS_C2_USA_contiguous_and_Hawaii_7d.csv'

df = (
    pd
    .read_csv(filename)
    .assign(date=lambda df_: pd.to_datetime(
        df_['acq_date'] + ' ' + df_['acq_time'].astype(str).str.zfill(4),
        format='%Y-%m-%d %H%M'))
)
Following this, we invoke drop on the three columns that we can remove:
filename = 'J1_VIIRS_C2_USA_contiguous_and_Hawaii_7d.csv'

df = (
    pd
    .read_csv(filename)
    .assign(date=lambda df_: pd.to_datetime(
        df_['acq_date'] + ' ' + df_['acq_time'].astype(str).str.zfill(4),
        format='%Y-%m-%d %H%M'))
    .drop(columns=['acq_date', 'acq_time', 'satellite'])
)
The result is a data frame with 7,895 rows and 11 columns.
Create a GeoDataFrame based on the data in the regular Pandas data frame you created. Use the latitude and longitude columns to create the special geometry column. Use the EPSG:4326 coordinate reference system (CRS).
GeoPandas defines a subclass of DataFrame known as GeoDataFrame. The big difference between the two is that every GeoDataFrame has a special geometry column, which we can use to perform geographic calculations. In all other ways, a GeoDataFrame is the same as a regular data frame.
To get a GeoDataFrame from what we've created in df, we need to invoke geopandas.GeoDataFrame, passing it df. But then we need to tell it how to define the geometry column. In this case, it's pretty simple – df has longitude and latitude columns, and GeoPandas has a special geopandas.points_from_xy function, designed for precisely these occasions.
We invoke the function on df, passing the keyword arguments that tell it what to use for longitude and latitude. We also have to indicate which coordinate reference system (CRS) we want to use; in this case, I asked you to choose EPSG:4326, which is often used in GPS systems (https://epsg.io/4326).
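Put together, the call looks something like this – sketched here with a tiny hand-made df standing in for the full FIRMS data frame (the coordinates are invented):

```python
import pandas as pd
import geopandas

# Two made-up rows with the same latitude/longitude columns as the real data
df = pd.DataFrame({'latitude': [34.05, 34.20],
                   'longitude': [-118.25, -118.60]})

# points_from_xy takes x (longitude) and y (latitude), and returns
# an array of POINT geometries; crs= sets the coordinate reference system
gdf = geopandas.GeoDataFrame(
    df,
    geometry=geopandas.points_from_xy(x=df['longitude'], y=df['latitude']),
    crs='EPSG:4326',
)
```

Note that points_from_xy takes x (longitude) first – an easy pair to swap by accident, since we usually say "latitude and longitude" in that order.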
We create gdf, the GeoDataFrame, and it's just like df was before it – but now it has a geometry column, one with POINT objects representing the locations where the satellite detected fire activity.