A Robust Ensemble Deep Learning Framework for Detecting Deepfake Audio Using Mel-Spectrograms

Kamatchy, B. and Kalaichelvi, N. (2025) A Robust Ensemble Deep Learning Framework for Detecting Deepfake Audio Using Mel-Spectrograms. In: NextGen Computing and Future Technologies (NCNCFT'25).

AKJ PAPER.pdf - Published Version
Restricted to Repository staff only until 31 December 2027.


Abstract

This paper presents a robust deepfake audio detection framework that combines Mel-spectrogram representations with ensemble deep learning models. The input audio is first converted into Mel spectrograms, capturing the time-frequency characteristics essential for distinguishing synthetic speech from genuine audio. Our approach evaluates three classification strategies: (1) training custom deep learning architectures, including CNN, RNN, and CRNN models, directly on Mel spectrograms; (2) applying transfer learning with state-of-the-art computer vision models such as ResNet-18 and MobileNet-V3; and (3) classifying embeddings extracted from advanced pre-trained audio models, such as YAMNet, PANNs, ECAPA-TDNN, and PyAnnote, with a multilayer perceptron (MLP). By ensembling the top-performing models from these strategies, our system achieves a highly competitive minimum Detection Cost Function (minDCF) of 0.021 on the ASVspoof 2021 benchmark dataset. The experimental results demonstrate that combining Mel-spectrogram features with ensemble deep learning enhances the accuracy and robustness of deepfake audio detection, making the framework suitable for real-world security applications.
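
The abstract's front end converts raw audio into log-scaled Mel spectrograms before any classifier sees it. The following is a minimal sketch of that step using librosa; the sampling rate, FFT size, hop length, and Mel-band count below are illustrative assumptions, since the paper's exact parameters are not given in this abstract.

```python
# Hedged sketch of the Mel-spectrogram front end described in the abstract.
# n_fft, hop_length, and n_mels are assumed values, not the paper's settings.
import librosa
import numpy as np

def audio_to_log_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=128):
    """Load an audio file and return a log-scaled Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)  # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # dB scaling gives the 2-D "image" fed to CNN/CRNN or vision backbones
    return librosa.power_to_db(mel, ref=np.max)
```

The resulting 2-D array can be treated as a single-channel image, which is what makes strategy (2), transfer learning with ResNet-18 or MobileNet-V3, applicable without architectural changes beyond the input and output layers.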
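The reported metric, minDCF, is the detection cost function minimized over the decision threshold. Below is a hypothetical sketch of score-level ensembling and a minDCF sweep; the unweighted-mean fusion rule and the cost parameters (c_miss, c_fa, p_target) are assumptions for illustration, as ASVspoof evaluations define their own cost model and the paper's fusion weights are not stated in the abstract.

```python
# Hedged sketch: average per-model scores, then sweep thresholds for minDCF.
import numpy as np

def ensemble_scores(score_lists):
    """Unweighted mean of per-model scores (higher = more likely genuine)."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)

def min_dcf(scores, labels, c_miss=1.0, c_fa=10.0, p_target=0.05):
    """Return the detection cost minimized over all candidate thresholds.

    labels: 1 for genuine (bona fide) trials, 0 for spoofed trials.
    Cost parameters here are placeholders, not the ASVspoof 2021 values.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = float("inf")
    for t in np.unique(scores):
        p_miss = np.mean(scores[labels == 1] < t)   # genuine rejected
        p_fa = np.mean(scores[labels == 0] >= t)    # spoof accepted
        dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        best = min(best, dcf)
    return best
```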

Item Type: Conference or Workshop Item (Paper)
Subjects: Computer Applications > Artificial Intelligence
Domains: Computer Science
Depositing User: Mr IR Admin
Date Deposited: 07 May 2026 14:07
Last Modified: 07 May 2026 16:58
URI: https://ir.vistas.ac.in/id/eprint/13954
