Scalable Resume Processing for Enterprises

Introduction

Processing resumes in bulk is a tedious, time-consuming, and difficult task for human resources staff. Medium and large enterprises receive thousands of resumes every month for a variety of job openings, and job search portals such as jobs.com, Monster.com, and countless others also need to process resumes at scale. In this project, DreamAI implemented a scalable solution that automates bulk resume processing and data mining, reducing the time and manual labor needed to extract information from resumes. The solution is a critical component of office process automation for enterprises and human resources companies.

Customer Story

This solution was developed in response to requirements provided by two separate customers to one of DreamAI’s consulting partners on Google Cloud Platform. It is currently being demonstrated to both customers, and a software license agreement is being negotiated at the time of this writing; the names of both customers are therefore withheld for now.
Both customers are looking for ways to avoid the manual labor of processing and mining hundreds of thousands of resumes. Their combined requirements are as follows:

- Parse the resumes and store their important terms and entities in a datastore so that they are searchable across the complete resume database, making it possible to shortlist suitable candidates with a simple search.
- Extract the different sections of each resume, such as Education, Experience, Skills, and Contact Info, and store them in an appropriate form in a datastore.
- Identify named entities such as educational institutions and companies, and store them separately with each resume so that resumes mentioning them can be looked up quickly.
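
To make the last two requirements more concrete, here is a minimal Python sketch of how sections and named entities might be stored per resume and looked up by entity name. The `ResumeRecord` layout, its field names, the `build_entity_index` helper, and the sample data are all hypothetical and chosen only for illustration; they do not describe the datastore schema actually used in the project.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ResumeRecord:
    """Hypothetical per-resume record; field names are illustrative only."""
    resume_id: str
    # Section name (e.g. "Education") -> extracted text of that section.
    sections: Dict[str, str] = field(default_factory=dict)
    # Entity type (e.g. "Organization") -> entity names found in the resume.
    entities: Dict[str, List[str]] = field(default_factory=dict)


def build_entity_index(records: List[ResumeRecord]) -> Dict[str, Set[str]]:
    """Map each lower-cased entity name to the ids of resumes that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for names in record.entities.values():
            for name in names:
                index[name.lower()].add(record.resume_id)
    return index


# Made-up sample data, not taken from any real resume.
records = [
    ResumeRecord(
        resume_id="r1",
        sections={"Education": "BSc Computer Science, Example University"},
        entities={"Organization": ["Example University", "Acme Corp"]},
    ),
    ResumeRecord(
        resume_id="r2",
        sections={"Experience": "Data analyst at Acme Corp"},
        entities={"Organization": ["Acme Corp"]},
    ),
]

entity_index = build_entity_index(records)
matches = sorted(entity_index["acme corp"])
print(matches)
```

Keeping the entity lookup as a separate inverted index means a query such as "which resumes mention this company" does not need to scan every stored resume.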

Our Solution

We developed an end-to-end ML pipeline for resume processing. The pipeline involves the following steps:

- Detect text regions in resume files supplied in PDF or DOC format.
- Recognize and extract the text from each detected region.
- Create a text embedding for each sentence or paragraph using a Natural Language Processing (NLP) based vectorization model.
- Store the embedding vectors with each resume in a datastore.
- Cluster the stored embeddings with an unsupervised learning algorithm, assigning each piece of text to a section such as Education, Experience, or Contact Info.
- Extract named entities from each section, classifying words and phrases as Organization, Location, and other entity types.
- Store the processed information with each resume: the original extracted text, all extracted information, and the embedding vectors.
- Build a global search index from the embeddings to support similarity search by keywords and phrases.
- Build an intra-resume index for each resume to quickly retrieve information by keywords and phrases.
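
The embedding, clustering, and similarity-search steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's implementation: a normalized bag-of-words vector stands in for the NLP vectorization model, k-means is assumed as the unsupervised algorithm (the pipeline description does not name one), and a brute-force cosine search stands in for the search index. The function names and sample text chunks are hypothetical.

```python
import numpy as np


def build_vocabulary(texts):
    """Assign each distinct lower-cased token a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Toy stand-in for a transformer embedding: unit-length token counts."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def kmeans(vectors, k, iterations=10):
    """Plain k-means with farthest-point initialization (deterministic)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Distance of every vector to its closest already-chosen centroid.
        dists = np.min(
            [np.linalg.norm(vectors - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(vectors[int(np.argmax(dists))])
    centroids = np.array(centroids)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels


def search(query, vocab, index, top_k=2):
    """Brute-force cosine similarity search over unit-length row vectors."""
    scores = index @ embed(query, vocab)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Made-up text chunks standing in for text extracted from resumes.
chunks = [
    "bsc computer science example university",
    "msc data science example university",
    "software engineer at acme corp",
    "data engineer at acme corp",
]
vocab = build_vocabulary(chunks)
index = np.array([embed(chunk, vocab) for chunk in chunks])

labels = kmeans(index, k=2)  # groups education-like vs. experience-like chunks
hits = search("software engineer acme", vocab, index)
print(labels, hits)
```

In this toy data the two education-like chunks fall into one cluster and the two experience-like chunks into the other, and the query ranks the "software engineer" chunk first. With real resumes, the vectors would come from a trained embedding model, and a dedicated nearest-neighbor index would likely replace the brute-force matrix product once the collection grows large.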

Key Technologies Used

- Optical Character Recognition (OCR) for Text Detection
- OCR for Text Recognition
- Natural Language Processing
- Clustering

Key Frameworks

- PyTorch Lightning
- Hugging Face Transformers
- Jupyter Notebooks
- Pandas
- NumPy
- FastAPI
- Google Cloud Storage
- Google Kubernetes Engine

Customer Benefits

This solution is a critical component of office process automation for enterprises and human resources companies. It reduces human involvement in the tedious process of extracting useful information from potentially hundreds of resumes daily, which makes office operations more efficient and saves valuable time for human resources personnel.