Abstract: As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new ...
Outperforms Qwen2.5-Omni-7B, Kimi-Audio-Instruct-7B on multiple key audio understanding tasks. Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to ...