A Comparative Study on MFCC, GFCC, BFCC, and CQCC Spectral Speech Feature Performance in X-Vector Clustering
Date
2023-07-25
Access
2025-07-01
Authors
Abueg, Abelson
Publisher
East Carolina University
Abstract
Speaker diarization plays a crucial role in accurately identifying speakers in audio or video streams with multiple speakers. However, the use of Mel-frequency cepstral coefficients (MFCC) as the default speaker feature is a significant limitation in speech processing research, and the existing literature suggests that little research has addressed it. This thesis aims to fill that gap by exploring alternative speech features and investigating their performance in the clustering step of speaker diarization. It compares MFCC with three alternative spectral features: Gammatone Frequency Cepstral Coefficients (GFCC), Constant-Q Cepstral Coefficients (CQCC), and Bark Frequency Cepstral Coefficients (BFCC). To do so, it trains four distinct x-vector embedding deep neural networks (DNNs) and evaluates their effectiveness using four clustering algorithms. The results highlight the potential of the investigated alternative spectral features to outperform MFCC, emphasizing the need to move beyond the default MFCC approach and encouraging further exploration of alternative speech features for speaker diarization and related speech-processing tasks.