arXiv 2410.00037

Moshi: a speech-text foundation model for real-time dialogue

By Alexandre Défossez, Laurent Mazaré, et al.

Published 2024-09-17

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Se…
