A Robust Audio-Visual Speech Recognition using Improved Features
Ali Salih Mahmoud Saudi
Abstract
This research investigates the enhancement of a speech recognition system that uses both audio and visual speech information in noisy environments, presenting contributions in two main system stages: the front-end and the back-end. In the front-end stage, Gabor filters are employed as feature extractors for both modalities to capture robust spectro-temporal features. These Gabor features emulate the processing chains of the Primary Auditory Cortex (PAC) and the Primary Visual Cortex (PVC); we name them Gabor Audio Features (GAF) and Gabor Visual Features (GVF), respectively. The performance of GAF and GVF is compared with that of traditional features such as MFCC, PLP, and RASTA-PLP audio features and DCT2 visual features. The experimental results show that a system utilizing GAF and GVF attains recognition accuracies of 98.89% and 69.23%, respectively, significantly outperforming the traditional audio and visual features, especially in low Signal-to-Noise Ratio (SNR) scenarios.
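The idea behind spectro-temporal Gabor features can be illustrated with a minimal sketch: a log-spectrogram (time frames x frequency bands) is convolved with a small bank of 2-D Gabor kernels tuned to different temporal and spectral modulation rates, and the filter responses are stacked as feature channels. The kernel parameterization and bank size below are illustrative assumptions, not the exact filter bank used in the thesis.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel_2d(size, omega_t, omega_f, sigma=0.4):
    """Real 2-D Gabor kernel over (time, frequency): a cosine carrier
    at modulation rates (omega_t, omega_f) under a Gaussian envelope."""
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1), indexing="ij")
    envelope = np.exp(-(t**2 + f**2) / (2 * (sigma * size) ** 2))
    carrier = np.cos(omega_t * t + omega_f * f)
    return envelope * carrier

def gabor_features(log_spec, rates=(0.25, 0.5), scales=(0.25, 0.5), size=9):
    """Filter a log-spectrogram (frames x bands) with a Gabor bank and
    stack the responses, giving (frames, bands, n_filters) features."""
    feats = []
    for omega_t in rates:          # temporal modulation rates
        for omega_f in scales:     # spectral modulation scales
            kernel = gabor_kernel_2d(size, omega_t, omega_f)
            feats.append(convolve2d(log_spec, kernel, mode="same"))
    return np.stack(feats, axis=-1)
```

The same mechanism applies to both modalities: for GAF the input is an auditory log-spectrogram, while for GVF it is a comparable 2-D representation of the mouth region.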
To improve the back-end stage, a complete framework of the synchronous Multi-Stream Hidden Markov Model (MSHMM) is used to solve the dynamic stream-weight estimation problem for Audio-Visual Speech Recognition (AVSR). To demonstrate the usefulness of dynamic weighting in the overall performance of the AVSR system, we empirically show the advantage of Late Integration (LI) over Early Integration (EI), especially when one of the modalities is corrupted. The results confirm that the proposed AVSR-LI model, which utilizes the dynamic weighting scheme, clearly outperforms the AVSR-EI model, improving the average recognition accuracy from 90.65% to 92.83%, a relative error-rate reduction of approximately 23.33%.
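Late integration with a dynamic stream weight can be sketched as combining per-class audio and visual log-likelihoods, log p(x|c) = lambda * log p_a(x|c) + (1 - lambda) * log p_v(x|c), where lambda is re-estimated per frame or utterance from the reliability of the audio stream. The simple SNR-to-weight mapping below is an illustrative assumption, not the thesis's estimator.

```python
import numpy as np

def fuse_late(log_lik_audio, log_lik_video, snr_db,
              snr_min=-5.0, snr_max=20.0):
    """Late integration of per-class log-likelihoods with a dynamic
    stream weight lambda derived from the estimated audio SNR.
    NOTE: the linear SNR->lambda mapping here is a stand-in, not the
    weight estimator used in the thesis."""
    # The cleaner the audio, the more the audio stream is trusted.
    lam = np.clip((snr_db - snr_min) / (snr_max - snr_min), 0.0, 1.0)
    return lam * log_lik_audio + (1.0 - lam) * log_lik_video
```

With this scheme, a corrupted audio stream drives lambda toward 0 so the visual stream dominates the decision; Early Integration, by contrast, concatenates the feature vectors before modeling and cannot reweight the streams after the fact.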
Other data
| Field | Value |
|---|---|
| Title | A Robust Audio-Visual Speech Recognition using Improved Features |
| Other Titles | Robust Audio-Visual Speech Recognition via Feature Enhancement (translated from Arabic) |
| Authors | Ali Salih Mahmoud Saudi |
| Issue Date | 2019 |
Attached Files
| File | Size | Format |
|---|---|---|
| CC3470.pdf | 1.21 MB | Adobe PDF |
Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.