Enhancing Voice Activity Detection for an Elderly-Centric Self-Learning Conversational Robot Partner in Noisy Environments

Rajanayagam, Subashkumar; Ingrisch, Max Andreas; Müller, Pascal; Jahn, Patrick; Twieg, Stefan

doi:10.25673/119209

2025/04/26 · Vol. 13 · Issue 1 · pp. 1–7

Subashkumar Rajanayagam, Max Andreas Ingrisch, Pascal Müller, Patrick Jahn and Stefan Twieg

Abstract

Voice Activity Detection (VAD) is a core component of Human-Robot Interaction (HRI), especially for use cases such as a self-learning, personalized conversational robot partner designed to support elderly users and to be readily accepted by them. State-of-the-art lightweight deep-learning-based VAD models achieve high precision but often suffer from low recall in environments with significant background noise or music. Traditional lightweight rule-based VAD methods, in contrast, tend to yield higher recall at the expense of precision. These limitations can degrade the user experience, particularly for elderly individuals: missed spoken inputs cause frustration and reduce the overall usability and acceptance of conversational robot partners. This study investigates noise-suppressing preprocessing techniques to improve both the recall and the precision of existing VAD systems. Experimental results show that effective noise suppression prior to VAD processing substantially improves voice detection accuracy in noisy settings, which in turn promotes better interaction quality in elderly-centric robotic applications. In addition, the optimal sample rate, frame duration, thresholds and voice activity modes were identified for the Double3 robot, the conversational robot partner platform for seniors in a care home, which was developed co-creatively through reflection with the nursing staff. Robustness was benchmarked on an open-source dataset and on a dataset collected and annotated in-house with the Double3 robot.
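
The precision/recall trade-off described in the abstract can be illustrated with a small frame-level sketch. The function name, the label sequences and the two hypothetical detectors below are assumptions made purely for illustration; they are not the study's evaluation code, data or results.

```python
from typing import Sequence, Tuple


def frame_precision_recall(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall) for per-frame speech / non-speech labels."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)        # speech correctly detected
    fp = sum(1 for r, p in pairs if not r and p)    # noise flagged as speech
    fn = sum(1 for r, p in pairs if r and not p)    # speech that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground truth: True marks a frame that contains speech.
reference = [False, True, True, True, True, False, False, True]

# Hypothetical detector that rarely fires: no false alarms, many misses.
conservative = [False, False, True, True, False, False, False, False]

# Hypothetical detector that fires often: no misses, some false alarms.
permissive = [True, True, True, True, True, True, False, True]

print(frame_precision_recall(reference, conservative))
print(frame_precision_recall(reference, permissive))
```

In this toy example the conservative detector reaches perfect precision but misses three of the five speech frames, while the permissive one catches every speech frame at the cost of two false alarms. That mirrors the contrast the abstract draws between deep-learning-based and rule-based VAD in noise.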

Keywords

Voice Activity Detection, Human-Robot Interaction, Conversational Robot Partner, Elderly-Centric
