BW #14: JOLTS (solution)
Is the US really experiencing a record wave of people quitting their jobs? Is that wave taking place equally across different US regions? This week, we analyze the government's JOLTS data.
This week, we looked at the JOLTS data produced by the Bureau of Labor Statistics. I should add that I was introduced to this data over the last year or two by Marketplace, a terrific daily public-radio (not NPR!) program about business and economics. They’ve often mentioned these statistics, and I was waiting for a good opportunity to talk about them here.
Data and questions
As I wrote yesterday, the data comes in a number of tab-separated CSV files, all available from https://download.bls.gov/pub/time.series/jt/. Sadly, you cannot easily download them en masse; I found myself downloading each of the files I wanted via my browser, manually, one by one. The file that explains the full contents and structure of the data is at https://download.bls.gov/pub/time.series/jt/jt.txt.
I realized after publishing yesterday’s newsletter that the text of question 6 didn’t reflect what I really wanted to ask. I updated the question on the Web site and added a comment; the new text is reflected here, as well.
Here’s what I asked you to do:
- Read the series data (https://download.bls.gov/pub/time.series/jt/jt.series) into a data frame. Make the first column ("series_id") into the index. Note that the column names might contain extra whitespace, causing some trouble. Remove the "footnote_codes" column.
- Read the JOLTS data (https://download.bls.gov/pub/time.series/jt/jt.data.1.AllItems) into a data frame. Again, make the first column ("series_id") into the index. Again, remove the "footnote_codes" column.
- Create a new data frame combining all of the columns from these two.
- Find the most recent quits rate, across the entire United States, for all nonfarm jobs. This means getting the `QU` code, the 0 `industry_code`, and the `'00'` value for `state_code`. We'll ask for seasonally adjusted numbers and will be asking for a rate, rather than the absolute number (i.e., `'R'` for `ratelevel_code` and `'S'` for `seasonal`).
- There are 12 JOLTS reports per year. In each year on record, how many reports showed the quits rate to be higher than this most recent value?
- The states-name data (https://download.bls.gov/pub/time.series/jt/jt.state) lists four regions of the United States. Show the most recent quits data for the most recent time period. In which region are people quitting the most? The least?
- Create a plot showing the quits level for each of these four regions over time. The x axis should be time (year + period) and the y axis should be the quits rate. What region's quits rate is consistently the highest? Which is consistently the lowest?
- Create a data frame in which the index contains the year and period, and with two columns -- the quits rate (QU) and the number of job openings (JO), both for the entire United States. How highly correlated are these values?
Let’s get to it!
Read the series data (https://download.bls.gov/pub/time.series/jt/jt.series) into a data frame. Make the first column ("series_id") into the index.
When I started to look at the JOLTS data, I was a bit confused. What is this weird “series_id” column in the main data files? I finally understood that the series_id is a combination of serial number (uniquely identifying the records), a two-letter code indicating what data it’s tracking (as described in https://download.bls.gov/pub/time.series/jt/jt.dataelement), and a one-letter code indicating whether it’s a an absolute numeric measurement (L) or a percentage rate (R).
The series data (in jt.series) gives us the information we need to get the data that’s of greatest interest to us.
I started my work with a simple line to load Pandas:
import pandas as pd
I then wanted to load the series data into Pandas. I did so via “read_csv”, specifying that the field separator is a tab:
jt_series_filename = 'jt.series'
jt_series_df = pd.read_csv(jt_series_filename, sep='\t')
This was fine, except for one problem that only came up as I continued working with this data: The “series_id” column name contained a huge number of extra spaces. How and why this happened, I’m not sure; it might just be a fluke of how the BLS creates its data files. But it was quite frustrating to work with, especially given that I wanted to use series_id as the index column.
I thus decided to use a Python list comprehension to process each of the column names, removing any surrounding whitespace with str.strip. That returned a new list of strings, which I assigned back to the “columns” attribute of my data frame:
jt_series_df.columns = [one_column.strip()
for one_column in jt_series_df.columns]
With my columns now in a reasonable format, I got rid of the “footnote_codes” column, simply because it was unnecessary for any of the analysis I was going to do. (The only code I saw is “P”, meaning that the data is preliminary. I do hope that you’ll forgive me for including preliminary labor-market data in this newsletter.) I removed it with the “drop” method:
jt_series_df = jt_series_df.drop('footnote_codes', axis='columns')
Notice two things about using “drop”: First, in order to remove a column, I need to pass the “axis” keyword argument. You could use a number (1, in the case of columns), but I much prefer to use the word “columns”, which is clearer to me.
The second thing to remember is that “drop”, like so many other methods in Pandas, can theoretically take an “inplace=True” keyword argument. In such a case, the original data frame is modified, and the method returns None. However, the core Pandas developers have been saying for a while that inplace=True doesn’t necessarily improve speed or memory use, and that we should avoid using it.
Next, I wanted to set the index of our data frame to the (in)famous “series_id” column. That’s easily done with “set_index”; note that here, as well, I don’t pass inplace=True, which means that we get a new data frame back, one which we assign back to the original variable “jt_series_df”:
jt_series_df = jt_series_df.set_index('series_id')
Read the JOLTS data (https://download.bls.gov/pub/time.series/jt/jt.data.1.AllItems) into a data frame. Again, make the first column ("series_id") into the index.
This is where we take the actual data file (AllItems) and turn it into a data frame. Note that the BLS provides a file containing just the current JOLTS values. Because some of the questions I’ve asked look back over time, we need the full data.
I basically repeated the same commands that I used with the series data here, so I’ll just present the code:
jt_allitems_filename = 'jt.data.1.AllItems'
jt_df = pd.read_csv(jt_allitems_filename, sep='\t')
jt_df.columns = [one_column.strip()
for one_column in jt_df.columns]
jt_df = jt_df.drop('footnote_codes', axis='columns')
jt_df = jt_df.set_index('series_id')
I now have a second data frame, jt_df, with the actual JOLTS values, and with series_id as the index.