LLM-based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

Huang, Qing; Pol, Pooja; Zhang, Jianing

doi:10.25673/123837

Proceedings of International Conference on Applied Innovation in IT · 2026/04/22 · Vol. 14 · Issue 2 · pp. 155–162

Qing Huang, Pooja Pol and Jianing Zhang

Abstract

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space. Specifically, retrieval is guided by prosodic descriptors derived from speech signals, whereas transcript-based semantic information is incorporated directly within the prompt and interpreted by the LLM through its reasoning capabilities. This modality-aware strategy promotes consistent affective labeling and improved generalization across previously unseen speech segments. 
Evaluated on multi-player VR audio recordings, our methodology demonstrates potential as a scalable, data-efficient component for data-driven team-based decision support. By integrating acoustic similarity-based retrieval with LLM-based semantic reasoning, this work contributes to emerging interdisciplinary methodologies at the intersection of scientific machine learning, multi-modal systems, and AI-driven decision-making.
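The retrieval-then-prompt idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Demo` record, the three-element prosodic vector, cosine similarity as the distance measure, and the prompt layout are all assumptions made for illustration, and the audio clips that the paper pairs with each transcript are omitted, so only the text side of each demonstration appears.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """A labeled demonstration. All fields are hypothetical placeholders."""
    prosody: list[float]  # assumed normalized descriptors, e.g. pitch, energy, rate
    transcript: str
    label: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], pool: list[Demo], k: int = 2) -> list[Demo]:
    """Return the k demonstrations whose prosodic vectors best match the query."""
    return sorted(pool, key=lambda d: cosine(query, d.prosody), reverse=True)[:k]


def build_prompt(demos: list[Demo], transcript: str) -> str:
    """Few-shot prompt: retrieved demonstrations first, then the unlabeled segment."""
    shots = "\n".join(
        f'Transcript: "{d.transcript}"\nEmotion: {d.label}\n' for d in demos
    )
    return f'{shots}\nTranscript: "{transcript}"\nEmotion:'


# Toy pool with made-up values; retrieval uses prosody only, never the text.
pool = [
    Demo([0.9, 0.8, 0.7], "We are running out of time!", "tense"),
    Demo([0.2, 0.1, 0.3], "Okay, that part is done.", "calm"),
    Demo([0.8, 0.9, 0.6], "No, not that lever, the other one!", "tense"),
]
selected = retrieve([0.85, 0.85, 0.65], pool, k=2)
prompt = build_prompt(selected, "Hurry, the door is closing!")
print(prompt)
```

In this sketch the acoustic features decide which demonstrations are shown, while the transcript text reaches the model only through the prompt, mirroring the division of roles between retrieval and LLM reasoning that the abstract describes.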

Keywords

Large Language Models (LLMs); In-Context Learning (ICL); Synthetic Ground Truth; Affective Computing; Data-Driven Decision Support; Virtual Reality (VR)
