A Robust Ensemble Deep Learning Framework for Detecting Deepfake Audio Using Mel-Spectrograms
Kamatchy, B and N, Kalaichelvi (2025) A Robust Ensemble Deep Learning Framework for Detecting Deepfake Audio Using Mel-Spectrograms. In: NextGen Computing and Future Technologies (NCNCFT'25).
AKJ PAPER.pdf - Published Version
Restricted to Repository staff only until 31 December 2027.
Abstract
This paper presents a robust deepfake audio detection framework leveraging Mel spectrogram
representations combined with ensemble deep learning models. The input audio is first converted into Mel
spectrograms, capturing essential time-frequency characteristics crucial for distinguishing synthetic speech
from genuine audio. Our approach evaluates three classification strategies: (1) training custom deep learning
architectures (CNN, RNN, and CRNN) directly on Mel spectrograms; (2) applying transfer learning
with state-of-the-art computer vision models such as ResNet-18 and MobileNet-V3; and (3) classifying embeddings
extracted from advanced pre-trained audio models such as YAMNet, PANNs, ECAPA-TDNN, and PyAnnote
with a multilayer perceptron (MLP). By ensembling the top-performing models from
these strategies, our system achieves a highly competitive Minimum Detection Cost Function (minDCF) of
0.021 on the ASVspoof 2021 benchmark dataset. The experimental results demonstrate that combining Mel
spectrogram features with ensemble deep learning enhances the accuracy and robustness of deepfake audio
detection, making this framework suitable for real-world security applications.
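The front end the abstract describes — converting raw audio into a log-Mel spectrogram before any classifier sees it — can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' code: the FFT size, hop length, and number of mel bands below are assumptions chosen for a 16 kHz signal, and practical systems would typically use a library such as librosa or torchaudio instead.

```python
import numpy as np

def hz_to_mel(f):
    # standard HTK-style mel scale conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    # frame the signal and apply a Hann window to each frame
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    # apply filterbank and log-compress; result shape: (n_mels, n_frames)
    return np.log(power @ fb.T + 1e-10).T

# toy one-second sine tone standing in for an audio clip
sr = 16000
t = np.arange(sr) / sr
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)  # → (64, 61)
```

The resulting 2-D array is what the paper's CNN/CRNN models (or, resized to three channels, the ResNet-18 and MobileNet-V3 transfer-learning branch) would consume as an image-like input.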
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Subjects: | Computer Applications > Artificial Intelligence |
| Domains: | Computer Science |
| Depositing User: | Mr IR Admin |
| Date Deposited: | 07 May 2026 14:07 |
| Last Modified: | 07 May 2026 16:58 |
| URI: | https://ir.vistas.ac.in/id/eprint/13954 |
