This week, we looked at US airport traffic. This topic has been in the news quite a bit lately, with numerous reports of people without US citizenship being jailed, expelled, or just refused entry into the country. This has led to a reported huge drop in air traffic between Canada and the United States (https://onemileatatime.com/news/airline-demand-canada-united-states-collapses/), and even to the Python Software Foundation's announcement that if people are uncomfortable traveling to the US from other countries, they can get a full refund on their conference ticket.
We'll thus be looking at air travel into the US over the last two years, at a number of US airports. We'll try to determine whether the number of non-US citizens entering the country has really dipped in recent weeks, and generally look at the distribution of passengers across various times and locations.
Data and six questions
This week, we're looking at data from the "airport wait time" site (https://awt.cbp.gov/) run by US Customs and Border Protection, part of the Department of Homeland Security. I downloaded Excel data from April 1, 2023 through April 7, 2025, for the following airports:
- Los Angeles International, Tom Bradley terminal
- Chicago O'Hare, terminal 5
- Atlanta airport
- Dallas / Fort Worth
- Newark Liberty airport
- San Francisco airport
Note that in two cases, I limited my search to only one terminal. This is because the site took a long time to generate those reports, and I grew tired of waiting. That's also why I limited my search to the last two years; for some airports, generating three years of data extended the wait time greatly.
Learning goals this week include working with multiple files, pivot tables, grouping, dates and times, filtering, and plotting.
If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.
Also: A video walk-through of me answering this week's first two questions is on YouTube at https://www.youtube.com/watch?v=uK1VqjQhndg. Check out the entire Bamboo Weekly playlist at https://www.youtube.com/playlist?list=PLbFHh-ZjYFwG34oZY24fSvFtOZT6OdIFm !
Meanwhile, here are the six tasks and questions:
Download the data for the six airports, and combine them into a single data frame. Turn the "Flight Date" and "HourRange" columns into a single datetime column (using the first time in "HourRange"). Set the index to combine the "IATA" column (i.e., the airport name) and the "datetime" column.
Before doing anything else, let's load Pandas:
import pandas as pd
I downloaded the files and put them into the data directory beneath my Jupyter notebook directory. It was clear that I would have to iterate over those Excel files, creating a data frame for each one with read_excel, and then somehow combine them.
Fortunately, Pandas comes with pd.concat, which takes a list of data frames and returns a single data frame based on them. By default, it assumes that each data frame in the list has the same column names, and stacks them vertically.
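To see what that means in practice, here's a minimal sketch with two tiny, made-up data frames; the column names and values here are invented purely for illustration:

import pandas as pd

df1 = pd.DataFrame({'IATA': ['ATL', 'ATL'],
                    'PassengerCount': [100, 200]})
df2 = pd.DataFrame({'IATA': ['SFO'],
                    'PassengerCount': [150]})

# Because both frames share the same column names, pd.concat
# stacks df2's rows beneath df1's, giving a three-row frame.
combined = pd.concat([df1, df2])

Note that pd.concat keeps each input frame's original index by default, which is why rows coming from different frames can end up sharing index values.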
Now, we could create an empty list, and then use a for loop to iterate over each filename, invoking list.append to add each new data frame to the list. But experienced Python developers know that this is a great opportunity to use a list comprehension, creating a list of data frames from a list of filenames.
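For comparison, here's a sketch of the loop-and-append approach just described, assuming that we already have a list of filenames to read; the filenames here are hypothetical:

# Hypothetical filenames, standing in for the real Excel files
filenames = ['data/one.xlsx', 'data/two.xlsx']

all_dfs = []                      # start with an empty list
for one_filename in filenames:
    # read each file into a data frame, and add it to the list
    all_dfs.append(pd.read_excel(one_filename))

The list comprehension we'll use below packs the same logic into a single expression.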
We could manually create the list of filenames. Or we could invoke glob.glob, from the "globbing" module in Python's standard library. It takes a string argument, describing a pattern of filenames that we want to see (not a regular expression, but with some similar ideas and syntax).
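For example, in a glob pattern, * matches any run of characters and ? matches exactly one character. These paths are made up, just to show the syntax:

import glob

xlsx_files = glob.glob('data/*.xlsx')       # every .xlsx file under data/
reports = glob.glob('data/report-?.xlsx')   # report-1.xlsx, report-2.xlsx, ...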
Combining the use of glob.glob and a list comprehension, we end up with the following code, which defines the list all_dfs:
import glob
all_dfs = [pd.read_excel(one_filename)
for one_filename in glob.glob('data/Awt.cbp.gov*xlsx')]
As I wrote above, we can then, in theory, get a single data frame with:
df = (pd
.concat(all_dfs)
)
But I also asked for the data frame's index to contain two levels: the outer one being the IATA column, with the airport's three-letter code, and the inner one containing a datetime value, created from a combination of the Flight Date column (containing a date value) and the HourRange column (containing a starting and ending hour).
We can create a new datetime value with pd.to_datetime, but that requires giving it a string with the date and time.
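For example, handing pd.to_datetime a string containing both a date and a time gives us back a Timestamp value:

pd.to_datetime('2023-04-01 06:00')    # Timestamp('2023-04-01 06:00:00')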
To extract the time from the HourRange column, I turned to regular expressions and the str.replace method. I tell my regular expression to search for:
- ^, anchoring the regexp to the start of the string
- (\d\d), looking for two digits – and capturing them as the first group
- (\d\d), looking for two more digits – and capturing them as the second group
- .*, zero or more additional characters
- $, anchoring the entire regexp to the end of the string
That regular expression should match every value in the HourRange column. But the important thing is the replacement string we'll hand str.replace, r'\1:\2'. That means we want \1, the first captured group (i.e., the first two digits), then a literal :, then \2, the second captured group (i.e., the second two digits). Because backslashes can cause trouble in Python strings, I put r before the opening quote, making this a "raw" string in which backslashes are treated literally, rather than as escape characters. I also added regex=True, to tell Pandas that I wanted to use regular expressions here.
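Here's that replacement run on its own, on a stand-alone series; the sample value assumes (for illustration) that HourRange strings look like '0600 - 0700':

s = pd.Series(['0600 - 0700'])    # hypothetical HourRange value

# The first two digits become the hour, the next two the minutes,
# and everything after them is dropped, yielding '06:00'.
s.str.replace(r'^(\d\d)(\d\d).*$',
              r'\1:\2',
              regex=True)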
Invoking str.replace on the HourRange column in this way works just great, but how can I integrate it into my invocation of pd.concat and the creation of a new data frame? Well, pd.concat returns a data frame. We can then invoke assign on that data frame, temporarily adding a new column via keyword arguments. The keyword argument's name is the name of the new column, and its value describes the new column's value. By using lambda, we can invoke str.replace on the original column, assigning the result to a new column, time:
df = (pd
      .concat(all_dfs)
      .assign(time = lambda df_: (df_['HourRange']
                                  .str.replace(r'^(\d\d)(\d\d).*$',
                                               r'\1:\2',
                                               regex=True)))
     )
Now that we have the time column, we can create a new datetime column, with a datetime dtype. We'll again use assign – indeed, we'll use a second keyword argument in the same invocation – whose lambda combines the string value of the date from Flight Date, along with a space and the value of the time column we just created:
df = (pd
.concat(all_dfs)
.assign(time = lambda df_: (df_['HourRange']
.str.replace(r'^(\d\d)(\d\d).*$',
r'\1:\2',
regex=True)),
datetime = lambda df_: pd.to_datetime(df_['Flight Date']
.dt.date.astype(str)
+ ' ' + df_['time']))
)
Finally, we invoke set_index to create our two-level multi-index with IATA and datetime, and then sort the data frame by its newly created index using sort_index. We assign the whole thing to df:
df = (pd
.concat(all_dfs)
.assign(time = lambda df_: (df_['HourRange']
.str.replace(r'^(\d\d)(\d\d).*$',
r'\1:\2',
regex=True)),
datetime = lambda df_: pd.to_datetime(df_['Flight Date']
.dt.date.astype(str)
+ ' ' + df_['time']))
.set_index(['IATA', 'datetime'])
.sort_index()
)
The result is a data frame with 102,152 rows and 25 columns.
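If you want to verify this against your own copy of the data, df.shape returns the row and column counts as a tuple:

df.shape    # (102152, 25)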
Create a line graph showing, on a monthly basis, how many non-US passengers entered the US at each airport in this data set. Do we see any recent downturn?
In order to create such a line graph, we'll need a data frame in which the rows represent the months, the columns represent the airports, and the values are the total number of passengers for that month in each airport.
Given that we have the airport names in the IATA column (currently in the index) and the dates in the datetime column (also in the index), we can start by creating a pivot table:
- The index will be datetime
- The columns will be IATA
- The values will be NonUsaPassengerCount
- The aggregation function will be sum
This won't solve our problem completely, but it's a good first step. We invoke pivot_table as follows:
(
df
.pivot_table(index='datetime',
columns='IATA',
values='NonUsaPassengerCount',
aggfunc='sum')
)
If we want a line graph based on hours, then we're doing just fine. But we actually want the total on a monthly basis. To do that, we'll use resample, which does a form of groupby on the data set, aggregating values across chunks of time that we determine. In this case, we want monthly data, so I specify 1ME, for "end of every 1-month period."
But before doing that, I first use loc to keep only those rows of the pivot table that fall on or before March 31, 2025. Because the pivot table's index contains datetime values, and because it's sorted by date, we can use a slice in this way:
(
df
.pivot_table(index='datetime',
columns='IATA',
values='NonUsaPassengerCount',
aggfunc='sum')
.loc[:'2025-March-31']
.resample('1ME').sum()
)
Finally, we invoke plot.line to get a plot:
(
df
.pivot_table(index='datetime',
columns='IATA',
values='NonUsaPassengerCount',
aggfunc='sum')
.loc[:'2025-March-31']
.resample('1ME').sum()
.plot.line()
)
Here's the plot we get:

We see some ups and downs at all of these airports over the past two years, with clear peaks during the summer months. I definitely see a steep drop in February 2025, but it also seems to me that the numbers grew again in March. Moreover, we can see that there was a similar drop from January to February 2024.
So while it's definitely possible that people from outside of the US are cancelling trips en masse, I don't see a dramatic dip beyond what we saw last year at this time.