Authors: Costa, Nuno; Oliveira, António; Lobo, Armindo; Teixeira, Ricardo; Fernandes, Duarte; Rodrigues, Ricardo; Gouveia, Emanuel
Dates: 2026-04-23; 2026-04-23; 2026-01-01
ISBN: 9789897587962
UUID: bf873a90-e1ab-4836-888b-96dddaaa5f59
Handle: http://hdl.handle.net/10400.14/57581

Abstract: The detection of highlights in broadcast streams is essential for enhancing User Experience (UX) through automated summaries and efficient content retrieval. This is particularly relevant for live streaming environments common in sports and eSports, where audiences demand near real-time analysis. This paper presents a benchmark of models for highlight detection in broadcast audio, validated on the SoccerNet dataset but applicable to general competitive gaming streams. We propose a novel multi-modal architecture combining high-level semantic audio features (YAMNet) with Natural Language Processing (NLP) of transcribed commentary (analogous to eSports shoutcasting). Results show that fusing audio event detection with semantic text analysis significantly outperforms uni-modal baselines. The proposed framework offers a computationally efficient solution for AI-based broadcasting technologies, enabling scalable automation for content creators and improved viewer experiences.

Language: eng
Keywords: AI-based sports technologies; Audio event detection; Broadcast stream automation; Machine learning for real-time analysis; Multi-modal deep learning
Title: Multi-modal highlight detection in broadcast audio: a deep learning approach for event recognition in sports and eSports
Type: conference proceedings
DOI: 10.5220/0014585200004052
Identifier: 105035626420