Methodology
I am not a data scientist by trade, so I’ve done my best to pull together data from multiple sources to generate this site. I’ll walk you through my process, but as always, please verify the data yourself.
Overall Strategy
This is a simple one-off app, so I opted for Astro for static site generation. Initially I was pulling data from Excel files, but this was slow and unwieldy, so I pulled all of that data into a SQLite database and then used Astro to query that. The database is larger than 1 GB, so I opted to keep it stored locally, do the site builds on my machine, and then push only the static assets. It’s not scalable or pretty, but it got things up and running very quickly without a huge need for infrastructure.
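To make that concrete, a build-time query helper might look roughly like the sketch below. The driver (better-sqlite3) and the table/column names are my guesses, not a documented schema:

```ts
// src/lib/db.ts: runs only during `astro build`, so the 1 GB+ SQLite
// file stays on my machine and never ships with the static output.
// Driver choice and schema names here are assumptions.
import Database from "better-sqlite3";

const db = new Database("data/h1b.sqlite", { readonly: true });

export function totalCertifiedApplications(): number {
  const row = db
    .prepare("SELECT COUNT(*) AS total FROM lca_disclosures")
    .get() as { total: number };
  return row.total;
}
```

Because this runs at build time, the generated pages are plain static HTML; no database ever touches the client.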
If this site draws good traffic and seems useful for subsequent years, I’d like to port the db to a managed Postgres instance and do the static generation and deploy with GH Actions (as God intended).
Data Sources
H-1B Data (filtered LCA Disclosure Data)
I pulled H-1B data straight from the government teat. As mentioned, these Excel files were huge and even Pandas struggled to load and parse through just one quarter’s worth of data. I wrote some Python scripts to yeet this data into my SQLite database for faster querying.
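The scripts themselves are Python, but to keep the examples on this page in one language, here’s the gist of the load step sketched in TypeScript (SheetJS for the spreadsheet, better-sqlite3 for the writes; file, sheet, and column names are guesses):

```ts
// One-time ETL sketch: read an LCA disclosure spreadsheet and bulk-insert
// rows into SQLite. File and column names are assumptions.
import * as XLSX from "xlsx";
import Database from "better-sqlite3";

const db = new Database("data/h1b.sqlite");
db.exec(`CREATE TABLE IF NOT EXISTS lca_disclosures (
  employer_name TEXT,
  job_title TEXT,
  case_status TEXT
)`);

const workbook = XLSX.readFile("LCA_Disclosure_Data_FY2024_Q1.xlsx");
const rows: Record<string, string>[] = XLSX.utils.sheet_to_json(
  workbook.Sheets[workbook.SheetNames[0]]
);

const insert = db.prepare(
  "INSERT INTO lca_disclosures (employer_name, job_title, case_status) VALUES (?, ?, ?)"
);
// Wrapping all inserts in a single transaction is what makes this fast.
const load = db.transaction((rs: Record<string, string>[]) => {
  for (const r of rs) insert.run(r.EMPLOYER_NAME, r.JOB_TITLE, r.CASE_STATUS);
});
load(rows);
```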
Please note: all data on this site is based on certified applications only. Most of those applications will not result in a visa being granted. This only serves to illustrate the appetite of American companies for H-1B visa workers.
Layoff Data
Layoff data comes from Layoffs.fyi, though I didn’t scrape it myself; I grabbed the dataset from Kaggle. Again, simple Python scripts got this data into my SQLite database.
Combined Data
It was really hard to associate all the data due to company naming. Company names are simple and title-cased in the layoff data, e.g. `Amazon`. In the LCA Disclosure Data, apparently anything goes, e.g. Amazon shows up as `Amazon.com Services LLC`. Compounding the problem is the fact that gigacorps like Amazon have a bunch of different names, e.g. `Amazon Web Services`, `Amazon.com Services LLC`, `Amazon Corporate LLC`, etc.
Approach
The leaderboard just counts the number of H-1B applications for each company. This provides a reasonably accurate picture despite the split between `Amazon` companies, for instance. I apply `UPPER()` to the company names in my SQL queries to make the comparison case-insensitive and then sum the counts, which handles cases like `Tesla Inc` vs `TESLA INC` (sorry Elon).
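In SQL terms, the leaderboard query amounts to something like this (run here through better-sqlite3; the table and column names are my guesses at the schema):

```ts
// Leaderboard sketch: case-insensitive grouping via UPPER(), which both
// deduplicates "Tesla Inc" vs "TESLA INC" and sums their counts.
// Table/column names are assumptions.
import Database from "better-sqlite3";

const db = new Database("data/h1b.sqlite", { readonly: true });

const leaderboard = db
  .prepare(
    `SELECT UPPER(employer_name) AS company,
            COUNT(*)             AS applications
     FROM lca_disclosures
     GROUP BY UPPER(employer_name)
     ORDER BY applications DESC`
  )
  .all() as { company: string; applications: number }[];
```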
On a `/company` page all of the topline stats are just for that company name (with `UPPER()` applied). The layoff data is a bit trickier. I’m doing a few things here:
- Create an array of `[{Company Name}, {First Word of Company Name}]` from the LCA Disclosure Data company name.
- Iterate through the array.
- On each entry, do a fuzzy search (Fuse.js, threshold 0.3) on the layoff data company names, filtered by year = 2024.
This gets us our layoff count and the URLs for Layoff Sources. I spot-checked these against the sources for errors (if the sources are right, the number is right) and it looked good, but this is probably the highest-risk part of the data.
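A minimal sketch of that matching step with Fuse.js, assuming the layoff rows are already loaded into memory (the row shape and field names are hypothetical, and this sketch stops at the first search term that produces hits):

```ts
// Fuzzy-match an LCA employer name against layoff entries with Fuse.js.
// The LayoffRow shape and field names are assumptions.
import Fuse from "fuse.js";

interface LayoffRow {
  company: string;
  laid_off: number;
  source_url: string;
  year: number;
}

function matchLayoffs(employerName: string, layoffs: LayoffRow[]): LayoffRow[] {
  // Only search 2024 layoff entries.
  const fuse = new Fuse(
    layoffs.filter((row) => row.year === 2024),
    { keys: ["company"], threshold: 0.3 }
  );

  // Try the full LCA name first, then fall back to its first word,
  // e.g. "AMAZON.COM SERVICES LLC" -> "AMAZON.COM".
  const terms = [employerName, employerName.split(" ")[0]];
  for (const term of terms) {
    const hits = fuse.search(term);
    if (hits.length > 0) return hits.map((hit) => hit.item);
  }
  return [];
}
```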