BW #100: Sports betting (solution)
This week, we're looking at data about sports betting in the United States. My interest was piqued by the latest season of Michael Lewis's "Against the rules" podcast (https://www.pushkin.fm/podcasts/against-the-rules), where he's discussing the changes that happened in American law that made it possible for people to bet on sports via their phones, as well as the societal impacts of those changes.
I had no idea that people were spending so much money on betting — or how much the betting platforms were taking advantage of their users. I have since seen stories in a number of news publications, including the Washington Post (https://www.washingtonpost.com/opinions/2024/12/29/sports-gambling-sportsbooks-bradley-act-betting/) and the Economist (https://www.economist.com/finance-and-economics/2024/12/05/how-sports-gambling-became-ubiquitous).
I decided to look for data about sports betting, in an effort to better understand just how popular it is, and how much money states are earning from the taxes on these the bets.
Data and six questions
This week's data comes from Legal Sports Report (https://www.legalsportsreport.com/), a Web site that "covers the legal online sports wagering industry, including sports betting and daily fantasy sports."
Specifically, the data comes from a page (https://www.legalsportsreport.com/sports-betting/revenue) breaking down the amounts spent on sports betting across all states, across all months, and then for each month in each state. Each table has several columns:
- Market or month
- Handle, meaning the total amount of money that was bet.
- Revenue, meaning how much money the sports-better company kept after paying out winning bets. You can think of this as the company's gross income. (Note that revenue is only calculated after paying people who made winning bets.)
- Hold, the amount of bet money (handle) that the company kept (revenue), expressed as a percentage. So if all of the bets, on all sports, were \$1,000, and the company paid out \$900 to all of the people who won bets, then we would say that the handle was \$1,000, the revenue was \$100, and the hold was 10%.
- Taxes, how much the sports-betting company paid in taxes.
Note that the LSR site doesn't have data for all states. You'll only need to retrieve and work with data from the following state abbreviations:
AZ AK CO CT DE DC IL IN IA KS KY LA ME MD MA MI MS MT NV NH NJ NY NC OH OR PA RI SD TN VT VA WV WY
I gave you six tasks and questions for this week. Here are my solutions and explanations; my Jupyter notebook is downloadable from the end of the message.
You cannot use requests
or pd.read_html
to retrieve the page from LSR. Rather, you'll need to go there with your browser, save the HTML ,and then use pd.read_html
on that saved HTML file. Do so, creating a single data frame in which the columns are market
(i.e., the state's name or abbreviation), along with the month, handle, revenue, hold, and taxes from the LSM site. Treat ?
as a NaN
value. Also remove the $ from the start of dollar columns, commas (,) from integer columns, and % from the Hold column. Remove rows in which the Month
column contains the string "Total".
Before doing anything else, I'll first load up Pandas:
import pandas as pd
If you look at the LSR page, you'll see that it contains many HTML tables, many of which are hidden from the user via +
signs. But those just hide and unhide the data from the viewer using JavaScript; the tables are still there. My go-to in such situations is to use pd.read_html
, which returns a list of data frame, one for each HTML table in the page we pointed it to.
However, LSR clearly doesn't want people to be downloading its data; trying to do so from Pandas gave me a 503 error. I know that Pandas can sometimes be stymied by these sorts of things, so I tried requests
, which didn't work either.
I thus tried (and suggested) a separate approach: I went to the page with my browser, and then saved the HTML to a file with the "save" command. I then used read_html
on the file, rather than from the network. This worked like a charm:
filename = 'Sports Betting Revenue Tracker_ US Betting Revenue & Handle By State.html'
all_dfs = pd.read_html(filename)
However, LSR uses ?
to indicate where there is missing data. We can have read_html
(or other read_*
methods) use one or more other strings to indicate missing values (which we call NaN
) by passing the na_values
keyword argument:
all_dfs = pd.read_html(filename, na_values='?')
We then get a list of data frames – one describing the overall handle, revenue, hold, and taxes from each state, a second describing the handle, revenue, hold, and taxes for each month, and then a month-by-month breakdown for each of the states I mentioned above.
I created a markets
variable, containing a list of strings (i.e., two-letter state abbreviations), using str.split
:
markets = 'AZ AK CO CT DE DC IL IN IA KS KY LA ME MD MA MI MS MT NV NH NJ NY NC OH OR PA RI SD TN VT VA WV WY'.split()
I then wanted to take our existing list of data frames and return a new list of data frames. The returned list should be identical to the input list, except that it should have a new column, market
, with the two-letter state abbreviation.
I decided to do this with a combination of a list comprehension and the assign
method. assign
allows us to set a new column on any data frame. I thus iterated over each of the data frames in all_dfs[2:]
, so that we ignore the first two (which aren't for specific markets, anyway).
But wait: How can I know which market I should apply to each data frame? I decided to iterate not over all_dfs[2:]
, but rather over zip(markets, all_dfs[2:]
. The builtin zip
method takes two iterables of the same length, and converts them into what's essentially a list of tuples. With each iteration, we'll thus get two items, the state abbreviation and the current data frame.
Here's how that looks:
[(one_df
.assign(market=state_abbreviation)
)
for state_abbreviation, one_df in zip(markets, all_dfs[2:])
]
In other words: Iterate over output from zip
, giving us a state abbreviation and data frame with each iteration. Grab the data frame, use assign
to add the current state abbreviation to every row in the market
column, and go onto the next one.
The result, as I said, is a list of data frames. Because all of the data frames in that list have the same columns, we can create one new data frame from all of them using pd.concat
, passing it the output from the list comprehension:
df = pd.concat([(one_df
.assign(market=state_abbreviation)
)
for state_abbreviation, one_df in zip(markets, all_dfs[2:])
])
This works great, except that we still have a bunch of rows to clean up. For example, I asked you to:
- Remove the
$
- Remove the
,
- Remove the
%
We can do that with the replace
method on a data frame, which does a global search-and-replace. I decided to do it in three stages, rather than one, which certainly helped me with debugging. In two of the three cases, I took advantage of regular expressions, using parentheses to capture a character following one I didn't want, then using a backreference (e.g., \1
) to refer back the captured part:
df = pd.concat([(one_df
.assign(market=state_abbreviation)
.replace(r'\$(\d)', r'\1', regex=True)
.replace(r',', r'', regex=True)
.replace(r'(\d)%', r'\1', regex=True)
)
for state_abbreviation, one_df in zip(markets, all_dfs[2:])
])
This worked well, but I still had to remove the rows in which Month
had the string 'Total'
. I removed those by using loc
and lambda
to keep only those rows in which Month
was not equal to 'Total'
:
df = pd.concat([(one_df
.assign(market=state_abbreviation)
.replace(r'\$(\d)', r'\1', regex=True)
.replace(r',', r'', regex=True)
.replace(r'(\d)%', r'\1', regex=True)
.loc[lambda df_: df_['Month'] != 'Total']
)
for state_abbreviation, one_df in zip(markets, all_dfs[2:])
])
The result from this is a data frame with 1,640 rows and 6 columns.
Remove NaN
values. Turn the Month
column into a datetime, turn the handle, revenue, and taxes into integers, and turn the hold into a float.
Next, I asked you to take our new, unified data frame and clean it up a bit more.
First, I asked you to remove the NaN
values. We can do that with dropna
, which returns a new data frame identical to the input data frame, but without any rows containing NaN
values. We normally can't throw away all NaN
values, but I know that any row with NaN
is one we can't and won't use:
df = (
df
.dropna()
)
Next, I asked you to turn the Month
column into a datetime
type. We can do that by applying the pd.to_datetime
function. In theory, we could run the function on pd['Month']
, get back a new column of datetime
values, and assign it back to the data frame.
But there are two problems with that: First, the dates are all in MONTH YEAR format. We can tell pd.datetime
to handle that input format by passing the format
keyword argument, giving it the string '%B %Y'
, which means "full month name and four-digit year".
The other issue is that pd.to_datetime
is a function, not a method. But I can use the assign
method (again!), giving it a lambda
expression as an argument. That expression returns a function, one which will be invoked our data frame. The result of that invocation will be the datetime
column, which we'll assign back to Month
, overwriting the original text Month
values:
df = (
df
.dropna()
.assign(Month = lambda df_: pd.to_datetime(df_['Month'], format='%B %Y'))
)
We now have our data in a format that's acceptable, but the dtype
for all columns is still object
, which basically means "strings." We can use astype
to convert one or more columns to another type; by passing a dictionary, we can use a different type for each named column:
df = (
df
.dropna()
.assign(Month = lambda df_: pd.to_datetime(df_['Month'], format='%B %Y'))
.astype({'Handle':int, 'Revenue':int, 'Hold':float, 'Taxes':int})
)
After doing the above, our data frame now contains numeric (integer and float) data, as well as datetime
values in our 'Month'
column. We're ready to do some analysis!
I'll just add that people sometimes complain about the amount of cleanup you have to do on these Bamboo Weekly problems. The thing is, this is exactly the sort of thing that you have to do with real-world data. In fact, it'll often take you much longer, and will require making many more trade-offs. Like it or not, cleaning data is a necessary and important part of what we do in the world of data, and getting better at it is part of the game.