GTA-NarrativeTraj: Language-Aware Trajectory Prediction from GPS and Dialogue in an Open-World Simulator

Sapeha, Anastasiia; Sariiev, Eduard; Sapeha, Mykyta; Kovan, Ibrahim; Rajanayagam, Subashkumar; Karpov, Kirill; Gering, Maksim; Kachan, Dmitry; Siemens, Eduard

doi:10.25673/122853

Proceedings of International Conference on Applied Innovation in IT · 2025/12/22 · Vol. 13 · Issue 5 · pp. 193–199

Anastasiia Sapeha, Eduard Sariiev, Mykyta Sapeha, Ibrahim Kovan, Subashkumar Rajanayagam, Kirill Karpov, Maksim Gering, Dmitry Kachan and Eduard Siemens

Abstract

GTA-NarrativeTraj is presented as a simulation framework and dataset for Grand Theft Auto V (GTA V) that couples spatiotemporal trajectories with in-game narrative signals (speech audio, subtitles, speaker identity). A ScriptHookVDotNet-based logger records world coordinates and vehicle state at ≥ 1 Hz and captures dialogue events (subtitle text, speaker tags, soundbank IDs) during story-mode play. The released dataset provides tightly time-aligned GPS-like traces and the complete dialogue stream for full playthroughs, yielding a resource in which coordinates, audio, and text jointly form a narrative that constrains and explains agent motion. The task of narrative-grounded mobility prediction is introduced: given recent GPS and ongoing utterances, infer the agent’s near-term path and next waypoint while recovering salient context such as interlocutors (who is speaking to whom), scene-level locations, and dialogue-implicated points of interest. The dataset serves as ground truth for these tasks by pairing GPS histories with contemporaneous narrative cues and future motion outcomes, enabling models that reason simultaneously over movement, interlocutors, and places. Reproducibility, offset stability, and licensing are discussed; the release includes code, logs, transcripts, and time-aligned audio features, while excluding raw copyrighted assets.
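
The pairing of GPS histories with contemporaneous utterances and future motion outcomes can be illustrated with a minimal Python sketch. It is not the released schema or the authors' pipeline: the record types `GpsSample` and `Utterance`, the function `build_example`, and the window lengths `history_s` and `horizon_s` are all hypothetical names and values chosen for illustration, and timestamps are assumed to be seconds on a shared session clock.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class GpsSample:
    t: float  # seconds on the shared session clock (assumed)
    x: float  # world coordinate (assumed planar)
    y: float


@dataclass
class Utterance:
    t: float      # time at which the subtitle was shown (assumed)
    speaker: str  # speaker tag
    text: str     # subtitle text


def build_example(gps, utterances, t_now, history_s=10.0, horizon_s=5.0):
    """Build one (history, dialogue, target) triple for the instant t_now.

    gps must be sorted by time. The history holds the samples in
    [t_now - history_s, t_now], the dialogue holds the utterances in the
    same window, and the target is the first sample at or after
    t_now + horizon_s (None if the trace ends earlier).
    """
    times = [s.t for s in gps]
    lo = bisect_left(times, t_now - history_s)
    hi = bisect_right(times, t_now)
    history = gps[lo:hi]

    k = bisect_left(times, t_now + horizon_s)
    target = gps[k] if k < len(gps) else None

    dialogue = [u for u in utterances if t_now - history_s <= u.t <= t_now]
    return history, dialogue, target
```

A model for the task would consume `history` and `dialogue` as input and be scored against `target`; the window lengths here are placeholders, not values taken from the paper.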

Keywords

Narrative Trajectory Prediction; Language-Aware Forecasting; Next Location Prediction; Multimodal GPS; Audio-to-Text; Speech-to-Text; Subtitles Alignment; Dialogue Grounding; Spatio-Temporal Knowledge Graph (ST-NKG); Map Matching; Road Graph; Synthetic Dataset; Urban Environment Simulation; Intent Extraction; Named-Entity Recognition (NER); Ontology; Grand Theft Auto V (GTA V).
