BW #12: Tourism
This week, I'm traveling to the US to attend PyCon in Salt Lake City. What changes do we see in these numbers over time?
This week, I'm in Salt Lake City for PyCon US 2023! I was trying to think of an appropriate topic while packing for the long trip to from Israel to the US, and then saw a report in the Washington Post about how many tourists are now entering the US (https://www.washingtonpost.com/business/2023/04/14/foreign-toursts-america), and how trends have changed over the years.
So in honor of so many people making their way to Salt Lake City for PyCon, I decided to look at tourism numbers in the US. FYI, I don't believe that I'm included in these numbers, since I'm a US citizen.
The numbers come from the International Trade Administration, part of the US Commerce Department. They keep track of many trade-related numbers, and since tourism is a major part of US trade, they keep track of tourists, too. Their “arrivals program” has been tracking the number and origin of tourists every month for more than 20 years, as described here: https://www.trade.gov/i-94-arrivals-program
Data and questions
This week, we're going to look at monthly arrivals by plane from outside of the United States. I'm warning you now that we're going to be working hard to clean this data up, in order to analyze it. It might appear at first as though the data is nice and tidy, but bringing it into Pandas will be a bit painful.
Once we get over that hump, we'll be able to find out some interesting things about where people come from when they visit the US, and how that has changed over the years.
The data is in a single Excel spreadsheet, containing monthly arrivals by airplane into the US starting in the year 2000. You can get it from:
/content/files/sites/default/files/2022-02/monthly-arrivals-2000-present.xlsx
We'll start by taking this file and turning it into two data frames:
- In the smaller data frame (countries_df), there are only two columns. The index contains country names, and the one non-index data column (region) comes from Excel's column B. You'll have to ignore a bunch of blank/empty rows in the Excel spreadsheet.
- In the larger data frame (travel_df), there are many columns, once for each monthly report. Make the index the same as in countries_df, with country names from Excel's column B. The other columns will all be datetime values, with a year and month. (The day doesn’t matter, and the time will always be midnight.)
As promised, there will be some cleaning to do to create travel_df:
- The final columns in the file have the word “Preliminary”; that word should be stripped.
- All column names/headers should be turned into datetime objects. Remove any columns whose names that cannot be turned into datetimes.
- In the data, replace any instances of ' - ' (i.e., a string containing one space, a minus sign, and three spaces with 0.
- In the data, replace NaN with 0.
- Set the dtype of all columns to int.
Note: For reasons I still don't understand, the dates in our columns are being interpreted by Pandas as being of a certain year and month (as written), and then with increasing days, starting with 1. So column 20 will be listed as the 20th of the month, and column 30 will be listed as being on the 30th of the month. This shouldn't affect things.
Now let's ask some more general questions:
- In the most recent report, which 10 countries had the greatest number of tourists enter the US?
- In the first report, which 10 countries had the greatest number of tourists enter the US?