arXiv 2410.00037
Moshi: a speech-text foundation model for real-time dialogue
By Alexandre Défossez, Laurent Mazaré, et al.
Published 2024-09-17
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Se…