Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data

Authors:
Stephen Meisenbacher, Technical University of Munich
Svetlozar Nestorov, Loyola University Chicago
Peter Norlander, Loyola University Chicago

Abstract:

Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
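
To make "aggregated by monthly active jobs" concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each posting carries an occupation code, a state, and posting and expiration dates, and it counts a posting as active in every calendar month its date range touches. The field names (`soc`, `state`, `posted`, `expired`), the sample records, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def months_active(start: date, end: date):
    """Yield (year, month) for every calendar month that overlaps [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_active_counts(postings):
    """Count active postings per ((year, month), occupation code, state) cell."""
    counts = Counter()
    for p in postings:
        for ym in months_active(p["posted"], p["expired"]):
            counts[(ym, p["soc"], p["state"])] += 1
    return counts


# Hypothetical records for illustration only; not drawn from the NLx corpus.
postings = [
    {"soc": "15-1252", "state": "IL",
     "posted": date(2024, 1, 20), "expired": date(2024, 3, 5)},
    {"soc": "15-1252", "state": "IL",
     "posted": date(2024, 2, 1), "expired": date(2024, 2, 15)},
    {"soc": "29-1141", "state": "WI",
     "posted": date(2024, 12, 10), "expired": date(2025, 1, 9)},
]

counts = monthly_active_counts(postings)
for key in sorted(counts):
    print(key, counts[key])
```

Under this assumed definition, a posting that stays open across several months contributes to each of those months, so the counts measure the stock of open ads rather than the flow of new ones. The paper's own definition of an active job may differ.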
