arXiv 2410.00037
Moshi: a speech-text foundation model for real-time dialogue
By Alexandre Défossez, Laurent Mazaré, et al.
Published 2024-09-17
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Se…