SwinET-IoT: A Mask-Guided Multimodal Transformer Framework for Real-Time Emotion Prediction in Intelligent Learning Environments

P, Dinesh and Thailambal, G. (2026) SwinET-IoT: A Mask-Guided Multimodal Transformer Framework for Real-Time Emotion Prediction in Intelligent Learning Environments. In: 2026 International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India.

Full text not available from this repository.

Abstract

Real-time emotion understanding in intelligent learning environments is an increasingly important requirement in classrooms that integrate visual, audio, physiological, and ambient IoT data. However, existing unimodal or low-context models of affect struggle with temporal instability, low interpretability, and inefficient deployment. This work addresses robust multimodal emotion prediction across seven affective states from heterogeneous signals under real-time constraints, with the goal of producing accurate, low-latency, and explainable predictions that can be deployed at the edge. The study proposes SwinET-IoT, a multimodal emotion recognition framework that combines an extended Mask R-CNN with a Swin Transformer backbone, lightweight audio and physiological encoders, IoT telemetry processing, a temporal attention transformer, and cross-modal co-attention. Mask-guided spatial attention ensures that information from critical facial and posture regions reaches the fusion stage, while hierarchical multimodal fusion and temporal consistency losses improve robustness. Edge-oriented optimization via pruning, knowledge distillation, and 8-bit quantization reduces the model footprint with minimal loss of accuracy. Experiments on the CRAFT multimodal classroom dataset demonstrate clear improvements over five state-of-the-art baselines: 92% accuracy, 0.91 macro-F1, 0.96 AUC, and 42 ms/frame inference latency on an edge device. These results confirm that combining spatially guided multimodal fusion with temporal modeling and IoT context yields efficient, stable, and interpretable emotion prediction, suitable for real-world classroom analytics and adaptive learning systems.
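The 8-bit quantization step mentioned in the abstract can be illustrated with a minimal sketch. The paper's exact scheme is not available here, so the details below (symmetric, per-tensor, round-to-nearest int8 with a single scale factor) are assumptions for illustration only, not the authors' method:

```python
# Hedged sketch of symmetric per-tensor int8 weight quantization.
# The quantization scheme here is an assumption; the paper's actual
# edge-optimization pipeline (pruning + distillation + quantization)
# is not reproduced.

def quantize_int8(weights):
    """Map float weights to int8 values with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.30, -1.27, 0.64, 0.002]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# int8 storage needs 1 byte per weight instead of 4 for fp32 (~4x smaller),
# at the cost of rounding error bounded by scale / 2 per weight.
```

The roughly 4x memory reduction (fp32 to int8) is what makes the reported 42 ms/frame edge-device latency plausible in general, though the paper's measured accuracy trade-off cannot be reproduced from the abstract alone.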

Item Type: Conference or Workshop Item (Paper)
Subjects: Computer Applications > Intelligent Systems
Domains: Computer Science
Depositing User: Mr IR Admin
Date Deposited: 07 May 2026 17:05
Last Modified: 07 May 2026 17:07
URI: https://ir.vistas.ac.in/id/eprint/14032
