[Administrative note: I'll be holding live Pandas office hours, where you can ask any questions you have, on Monday, April 7th. This is open to all paid subscribers, including members of my LernerPython+Data membership plan. Full Zoom info will go out tomorrow, on Friday.]
The world economy is going through some weird and turbulent times. Indeed, things just got more turbulent after yesterday's issue went out, as Donald Trump announced massive tariffs on virtually every country in the world. (We analyzed historical tariff data back in BW 95, https://www.bambooweekly.com/bw-95-tariffs/ .)
The job market has also been a bit weird lately – but one of the weirdest things I've read recently was in the Washington Post, which asked whether a quarter of computer-programming jobs had disappeared over the last few years (https://wapo.st/42shmfm). Their conclusion, in case you're wondering, is that it's largely a matter of job definitions: perhaps "programmers," who are told what to implement, are now less common, while "software engineers," who make decisions and also implement the code, are more common.
This week, we looked at coding-related jobs in the US – how many people work in them, where they work, and how much that has changed in the last decade.
Data and five questions
This week's data comes from the Bureau of Labor Statistics at the Department of Labor, specifically their data on occupational employment and wage statistics (https://www.bls.gov/oes/tables.htm). They publish this data about once each year, although from what I can tell, more recent data is also available from the Census Bureau.
We'll mostly look at data from 2023, which you can download by clicking on "all data" from the "May 2023" section of the OES page, or by using this link:
https://www.bls.gov/oes/special-requests/oesm23all.zip
We'll also spend a bit of time looking at data from 2013 and comparing it with the 2023 data. You can download the 2013 data similarly, by clicking on the link next to "all data" in the May 2013 section, or by clicking here:
https://www.bls.gov/oes/special-requests/oesm13all.zip
This week's learning goals include: Working with Excel, regular expressions, optimizing speed, using PyArrow, grouping, joining, and plotting.
If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.
Here are this week's five questions and tasks:
Read in the data from Excel. We only need some of the columns (AREA_TITLE, AREA_TYPE, OCC_TITLE, I_GROUP, TOT_EMP, JOBS_1000, O_GROUP, PCT_TOTAL, and A_MEAN). Treat "*", "**", and "#" as NaN values. Does it take less time to read the data if we limit the columns? How much does limiting the columns change the size of the data (in memory)?
To start things off, we'll load up Pandas:
import pandas as pd
Next, we want to load the Excel file into a data frame with read_excel. The file contains more than one sheet, which means that we'll have to pass the sheet_name keyword argument, with either a string value (the sheet's name) or an integer indicating the index of the sheet. Because the sheet's name is long, and I'm always afraid of spelling and punctuation issues, I decided to just pass 0, the index of the first sheet.
Because I wanted to treat *, **, and # as NaN values, I also passed the na_values keyword argument, giving it a list of strings as a value:
df = pd.read_excel(filename, sheet_name=0,
                   na_values=['*', '#', '**'])
This worked just fine, but how long did it take to load the entire file? Rather than cook up my own way to time things, I used the %%timeit cell-level magic command in Jupyter. This runs the same code many times in a row, showing us the mean and standard deviation:
%%timeit
df = pd.read_excel(filename, sheet_name=0,
                   na_values=['*', '#', '**'])
On my machine, it took an average of 1 minute and 13 seconds to load the data from Excel, which is quite a while!
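If you're working outside of Jupyter, you can get a rough, single-run measurement with Python's standard time module – a minimal sketch, using the same filename variable as above; unlike %%timeit, it times the code only once:
import time

start = time.perf_counter()
df = pd.read_excel(filename, sheet_name=0,
                   na_values=['*', '#', '**'])
print(f'Loaded {len(df):,} rows in {time.perf_counter() - start:.1f} seconds')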
How much memory did this data frame take up? I ran the memory_usage method on the data frame, which returns a series of integers. The series index contains the columns from the original data frame, and the values represent the amount of memory each column takes. We must pass the deep=True keyword argument to ensure that we really count the memory correctly; if we don't, then we'll find out how much memory the pointers take up, not the Python strings to which they refer. Because we get a series back, we can just invoke sum on it to get the total memory usage:
df.memory_usage(deep=True).sum()
I got a result of 317,796,833, meaning that the data takes up about 318 MB in RAM. Just to note, the Excel file itself is about 80 MB on disk.
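If you want to see just how much of a difference deep=True makes, you can compare the two measurements side by side – a quick sketch using the data frame we just loaded:
shallow = df.memory_usage().sum()          # object columns counted as 8 bytes per pointer
deep = df.memory_usage(deep=True).sum()    # includes the Python strings those pointers reference
print(f'shallow: {shallow:,} bytes')
print(f'deep:    {deep:,} bytes')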
Next, I asked you to load the file again, but including only some of the columns. We can choose which columns to import using the usecols keyword argument:
%%timeit
df = pd.read_excel(filename, sheet_name=0,
                   na_values=['*', '#', '**'],
                   usecols=['AREA_TITLE', 'AREA_TYPE', 'OCC_TITLE',
                            'I_GROUP', 'TOT_EMP', 'JOBS_1000',
                            'O_GROUP', 'PCT_TOTAL', 'A_MEAN'])
I asked if loading things in this way took less time, or if it consumed less memory. Somewhat obviously, having fewer columns means having less data, and less data means less memory. Another call to memory_usage returned 126,157,950, about 126 MB. That's about 40% of the original size of the data set. This is why it's almost always a good idea to specify usecols, paring down the data set from its original size.
But what about the time it took to load the data? Somewhat surprisingly, reducing the number of columns barely changed it; the average was still 1 minute and 12 seconds.
Once we've read it into Pandas, we end up with a data frame that contains 413,327 rows and 8 columns.
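If you're curious which columns account for most of that memory, you can sort the per-column results – a short sketch, run against the smaller data frame we just loaded:
# per-column memory usage, largest columns first
df.memory_usage(deep=True).sort_values(ascending=False)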
Store the data in Feather and Parquet format, and read from these files back into Pandas. Does it take less time to read from these formats? Which is faster?
These load times are staggeringly long. I was especially surprised because I would have expected Excel, a binary format, to be relatively quick and easy to load into Pandas. I would expect CSV files, which contain text, and thus require Pandas to figure out what dtype to use for each column, to take a long time – but Excel?
Fortunately, there are several other formats we can use. In particular, the Apache Arrow project (and its Python bindings, known as PyArrow) gives us two binary formats, Feather and Parquet, which Pandas can both read and write. This has nothing to do with using PyArrow as a back end for data storage in Pandas; that functionality is still considered experimental. I asked you to write the data to both Feather and Parquet files, read it back in, and compare the timing for each of these actions.
I started with Feather:
%timeit df.to_feather('/tmp/data.feather')
Notice that I used the single-line %timeit magic command, which starts with one % symbol, not two %% symbols. Writing our data to a Feather file took 64.7 ms, on average.
I also tried Parquet:
%timeit df.to_parquet('/tmp/data.parquet')
This took about twice as long, for an average of 133 ms.
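By the way, to_parquet accepts a compression keyword argument (the default is 'snappy'), so you can trade write time against file size. A quick sketch – I haven't benchmarked these particular choices on this data set:
# 'gzip' typically compresses more but writes more slowly; None skips compression
df.to_parquet('/tmp/data_gzip.parquet', compression='gzip')
df.to_parquet('/tmp/data_none.parquet', compression=None)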
What if we try to read these files back into a data frame? How long will that take, and is it any faster than Excel? We can start, once again, with Feather:
%timeit pd.read_feather('/tmp/data.feather')
This took, on average, 35.1 ms. That's right – a fraction of a second to load the same data that previously took more than a minute. I then tried Parquet:
%timeit pd.read_parquet('/tmp/data.parquet')
This took, on average, 43.7 ms.
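To double-check that the round trip didn't change anything, we can compare the reloaded data frame with the original – a sketch using Pandas' own testing utility:
from pandas.testing import assert_frame_equal

feather_df = pd.read_feather('/tmp/data.feather')
assert_frame_equal(df, feather_df)    # raises an AssertionError if anything differs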
How big are these files?
-rw-r--r-- 1 reuven wheel 13M Apr 3 13:07 /tmp/data.feather
-rw-r--r-- 1 reuven wheel 2.8M Apr 2 16:01 /tmp/data.parquet
From all of this, we can see a few things:
- Just because a format is binary, such as Excel, doesn't mean that it'll be faster to read or write.
- Feather files are larger, but they're faster to read and write. Parquet files are smaller, but they're slower to read and write – presumably because of the compression that takes place. Both, however, are extremely speedy, in part because their binary types are a nearly one-for-one match with Pandas dtypes.
- If you'll be loading a large data set many times from either CSV or Excel files, you might be wise to read the data into Pandas once, write it out to Feather or Parquet, and then read from the Arrow-formatted files, rather than the original ones (see the sketch below).
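Here's how that last point might look in practice – a minimal sketch, assuming the same filename variable as above and a hypothetical cache location:
from pathlib import Path

cache_path = Path('/tmp/oes_2023.feather')    # hypothetical cache file

if cache_path.exists():
    df = pd.read_feather(cache_path)                    # fast path: tens of milliseconds
else:
    df = pd.read_excel(filename, sheet_name=0,          # slow path: about a minute
                       na_values=['*', '#', '**'])
    df.to_feather(cache_path)                           # cache for next time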
I'll just add that I'm giving a talk at PyCon US next month, "The PyArrow revolution in Pandas," which will include this sort of topic as well as many others. If you'll be at PyCon, please come and say "hi"!