Scoring Speaker Quality from Videos

Introduction

Evaluating the quality and effectiveness of speech from a speaker’s video recording is an important use case in many settings, including education, public speaking, politics and corporate training. Modern AI systems can extract key features from the audio and video streams and identify the timestamps of specific events such as long pauses in speech, stuttering, use of filler words, loss of eye contact with the camera, unusual tilting of the speaker to one side and unusual hand waving. These extracted features can be used to evaluate the recording and give the speaker an overall rating, which helps identify the areas of a person’s communication skills that most need improvement. This project uses Computer Vision, Speech Recognition, Speech-To-Text and Natural Language Processing to extract these features from a given video, along with the timestamp of each occurrence within the video.

Customer Story

Our customer is the Media Studies department of a well-reputed US university. The department wants an automated speech evaluation and scoring solution for students in courses and training programmes that require a video presentation as an assignment. It eventually plans to build a commercial product that could be marketed to other departments, universities, and any business or institution dealing with public speaking and the improvement of communication skills in general.
The customer approached us through one of our partner companies that works closely with us on AI/ML consultancy projects on Google Cloud Platform.

Our Solution

We have built a Proof-of-Concept and a working prototype that extracts key features from videos using Computer Vision, Speech-To-Text and NLP techniques built on multiple AI/ML models. The customer provided several videos of students giving presentations in front of a home camera for a given course. Our solution extracts the following features from each video:

  • Body Tilt
    • This detects the timestamps in the video where the speaker is tilting or leaning to the right or left in an unusual way. We trained a Deep Learning based body tilt detection model on labelled data that identifies these tilt events (see the video classification sketch after this list). To obtain enough examples of this kind of leaning and tilting, we augmented the provided video set with self-made videos in which the speaker was asked to perform tilting actions frequently, which gave us an effective dataset for model training.
  • Hand Waving
    • This detects the timestamps in the video where the speaker makes unusual hand waving movements that may look odd to the audience. We trained a Deep Learning based hand waving detection model on labelled data that identifies occurrences of such hand gestures, following the same video classification approach sketched after this list. As with body tilt, we augmented the provided video set with self-made videos in which the speaker was asked to perform frequent hand movements, giving a labelled dataset suitable for effective training.
  • No Eye Contact
    • This detects the timestamps in the video where the speaker loses eye contact with the front camera, with which eye contact should be maintained most of the time. This helps determine whether the speaker is looking away from the audience too often. We used a combination of OpenCV and Deep Learning techniques to identify no-eye-contact events in a given video: we first detect the face and then the eyes of the speaker, and a Deep Learning model then identifies the points where the speaker is looking to one side instead of straight ahead (see the sketch after this list). Several datasets and pretrained models are available for this use case, and we used them to create an efficient solution.
  • Long Pauses in Speech
    • This detects unusually long pauses in speech that may be irritating for the audience and may leave a boring or negative impression of the speaker. We used a set of Python libraries that internally use pre-trained Deep Learning models to detect silent periods within an audio file. We first extracted the audio from the video file, then applied the models to find the silent regions and marked the timestamps where the silence exceeded a pre-set threshold (see the sketch after this list).
  • Use of Filler Words
    • This detects the timestamps in the speech where the speaker uses filler words such as um, uh, oh, er, ah, you know, you see, I mean and I guess. We used a set of Python libraries to first extract the audio from the video, convert the audio to text using Speech-To-Text techniques, and then identify the filler words and their corresponding timestamps in the video (see the sketch after this list). These timestamps can then be used to judge whether the frequency of filler words is normal or unusually high and likely to irritate the audience.
  • Stutter in Speech
    • This detects the timestamps where the speaker was stuttering. We used a set of Python libraries that internally use pre-trained Deep Learning models to detect stutter in speech segments. We first extracted the audio from the video file and divided it into segments of a fixed number of seconds, then applied the models to each segment and marked the timestamps where stutter was observed (see the sketch after this list).
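
The body tilt and hand waving detectors follow the same pattern: a video classifier is fine-tuned on the labelled clips and then slid over the presentation video to score short windows of frames, mapping each flagged window back to a timestamp. The sketch below illustrates the idea for the tilt detector using plain PyTorch rather than the Lightning training loop; the r3d_18 backbone from torchvision, the checkpoint name and the 0.5 threshold are illustrative assumptions, not the project's actual artefacts.

    # Sketch: scoring 16-frame windows of a video with a fine-tuned tilt classifier.
    # The r3d_18 backbone, checkpoint name and 0.5 threshold are illustrative only.
    import cv2
    import torch
    from torchvision.models.video import r3d_18

    WINDOW, STRIDE = 16, 8                                    # frames per clip / hop size

    model = r3d_18(weights=None)
    model.fc = torch.nn.Linear(model.fc.in_features, 2)       # classes: [normal, tilt]
    model.load_state_dict(torch.load("tilt_classifier.pt", map_location="cpu"))  # hypothetical checkpoint
    model.eval()

    def tilt_timestamps(video_path):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        frames, events, start = [], [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.cvtColor(cv2.resize(frame, (112, 112)), cv2.COLOR_BGR2RGB)
            frames.append(torch.from_numpy(frame).float() / 255.0)
            if len(frames) == WINDOW:
                clip = torch.stack(frames).permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
                with torch.no_grad():
                    prob_tilt = torch.softmax(model(clip), dim=1)[0, 1].item()
                if prob_tilt > 0.5:
                    events.append(start / fps)                # window start time in seconds
                frames, start = frames[STRIDE:], start + STRIDE
        cap.release()
        return events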
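
For eye contact, a minimal sketch of the detection pipeline is shown below. Face and eye detection use OpenCV's bundled Haar cascades; gaze_is_frontal() is a hypothetical stand-in for the Deep Learning gaze model described above, reduced here to a crude both-eyes-visible check.

    # Sketch: flagging frames with no eye contact. Face and eye detection use
    # OpenCV Haar cascades; gaze_is_frontal() stands in for the gaze model.
    import cv2

    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

    def gaze_is_frontal(eye_regions):
        """Placeholder for the Deep Learning gaze model (hypothetical)."""
        return len(eye_regions) >= 2                          # crude proxy: both eyes visible

    def no_eye_contact_timestamps(video_path, sample_every=5):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        events, frame_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % sample_every == 0:                 # only check every Nth frame
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = face_cascade.detectMultiScale(gray, 1.3, 5)
                for (x, y, w, h) in faces[:1]:                # assume a single speaker
                    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
                    if not gaze_is_frontal(eyes):
                        events.append(frame_idx / fps)        # seconds into the video
            frame_idx += 1
        cap.release()
        return events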
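
For long pauses, the sketch below uses pydub as a stand-in for the audio libraries mentioned above: it reads the audio track from the video via ffmpeg and returns the silent ranges that exceed a threshold. The 2-second minimum pause and 16 dB loudness margin are illustrative values.

    # Sketch: marking unusually long pauses with pydub's silence detector.
    from pydub import AudioSegment
    from pydub.silence import detect_silence

    def long_pauses(video_path, min_pause_ms=2000, margin_db=16):
        audio = AudioSegment.from_file(video_path)            # ffmpeg extracts the audio track
        # Anything quieter than (average loudness - margin) counts as silence.
        ranges = detect_silence(audio,
                                min_silence_len=min_pause_ms,
                                silence_thresh=audio.dBFS - margin_db)
        return [(start / 1000.0, end / 1000.0) for start, end in ranges]  # seconds

    print(long_pauses("presentation.mp4"))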
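
For filler words, the sketch below assumes the Speech-To-Text step has already produced word-level timestamps; the (word, start, end) tuple format is an assumption, and any STT service that returns word time offsets would do. It then simply scans the transcript for single-word and two-word fillers.

    # Sketch: locating filler words in a transcript with word-level timestamps.
    FILLERS = {"um", "uh", "oh", "er", "ah"}
    FILLER_PHRASES = [("you", "know"), ("you", "see"), ("i", "mean"), ("i", "guess")]

    def filler_word_timestamps(words):
        """words: list of (word, start_sec, end_sec) tuples from the STT step (assumed format)."""
        hits = []
        tokens = [w.lower().strip(".,!?") for w, _, _ in words]
        for i, token in enumerate(tokens):
            if token in FILLERS:
                hits.append((words[i][0], words[i][1]))
            for phrase in FILLER_PHRASES:
                if tuple(tokens[i:i + len(phrase)]) == phrase:
                    hits.append((" ".join(phrase), words[i][1]))
        return hits

    # Hand-made transcript fragment for illustration:
    print(filler_word_timestamps([("So", 0.0, 0.3), ("um", 0.4, 0.6), ("I", 0.8, 0.9),
                                  ("mean", 0.9, 1.2), ("the", 1.3, 1.4), ("results", 1.4, 1.9)]))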
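
For stutter, the sketch below cuts the extracted audio into fixed-length segments and scores each one; detect_stutter() is a hypothetical placeholder for the pre-trained model, and the 5-second segment length is illustrative.

    # Sketch: scoring fixed-length audio segments with a stutter detector.
    from pydub import AudioSegment

    SEGMENT_SECONDS = 5

    def detect_stutter(segment):
        """Placeholder for the pre-trained stutter classifier (hypothetical)."""
        return False                                          # replace with model inference

    def stutter_timestamps(wav_path):
        audio = AudioSegment.from_wav(wav_path)
        step = SEGMENT_SECONDS * 1000                         # pydub slices in milliseconds
        events = []
        for start_ms in range(0, len(audio), step):
            if detect_stutter(audio[start_ms:start_ms + step]):
                events.append(start_ms / 1000.0)              # segment start in seconds
        return events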

Key Technologies Used

  • Video Classification and Action Recognition
  • Image Object Detection (for face and eyes)
  • Audio Extraction
  • Speech-To-Text
  • Natural Language Processing

Key Frameworks

  • Pytorch-Lightning
  • OpenCV
  • PyAudio
  • Jupyter Notebooks
  • FastAPI
  • Optuna
  • Kubeflow
  • Google Cloud Platform

Customer Benefits

This solution has given the customer an effective Proof-of-Concept and working prototype from which to embark on full product development based on the core feature extraction module we developed. The customer is actively engaged with us in taking the development further and in demonstrating it to potential customers and investors. The solution addresses a significant problem in scoring and evaluating speaker performance: it identifies the areas a speaker most needs to improve, so practice can be focused on refining those speaking skills.