NVIDIA PersonaPlex-7B: Full-Duplex Speech AI with Role and Voice Control
Category: Artificial Intelligence
Published on 17.02.2026
NVIDIA PersonaPlex: The End of Robotic Pauses?
Voice assistants still sound like walkie-talkies. You speak. Pause. The AI thinks. Pause. Answer. That’s exactly what PersonaPlex-7B from NVIDIA is going after — with real full-duplex dialogue.
I didn’t just skim press coverage. I read the full preprint. Here’s the complete, technically clean breakdown — without the marketing filter.
TL;DR
PersonaPlex is the first open full-duplex speech model that can do all of this at once:
• listen and speak simultaneously
• control roles via a text prompt
• adopt voices via zero-shot voice cloning
• respond in under 300 ms
• run locally on NVIDIA GPUs
What the problem used to be
Modern voice systems typically work in three stages:
1. ASR (speech-to-text)
2. LLM (text processing)
3. TTS (text-to-speech)
This inevitably adds latency. Even if each stage is optimized, the conversation flow still feels artificial. Interrupting? Hard. Overlapping speech? Mostly impossible.
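The additive-latency problem above can be made concrete with a toy calculation. The stage timings here are hypothetical round numbers for illustration, not measurements from the paper:

```python
# Toy illustration of why a cascaded ASR -> LLM -> TTS pipeline is slow.
# All millisecond values are made-up round numbers, not benchmarks.

def cascaded_latency_ms(asr_ms: float, llm_ms: float, tts_ms: float) -> float:
    """In a sequential pipeline, the stage latencies simply add up."""
    return asr_ms + llm_ms + tts_ms

def duplex_latency_ms(joint_model_ms: float) -> float:
    """A full-duplex model emits audio directly, so one delay applies."""
    return joint_model_ms

pipeline = cascaded_latency_ms(asr_ms=300, llm_ms=500, tts_ms=200)
duplex = duplex_latency_ms(joint_model_ms=300)
print(f"cascaded: {pipeline:.0f} ms, duplex: {duplex:.0f} ms")
```

Even with each stage well optimized, the cascade sums to a full second in this sketch, while an integrated model only pays one delay.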
What PersonaPlex does differently
PersonaPlex is based on the Moshi architecture (a speech-text foundation model) and works with three parallel streams:
• User audio
• Agent text
• Agent audio
The model generates text and audio autoregressively — while still receiving live user audio. No more rigid turn-taking. No more forced waiting.
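The three-stream setup can be sketched as a minimal state machine. This is pure Python with placeholder tokens, only meant to show how all streams advance in the same tick; a real model would condition each step on the full token history:

```python
# Minimal sketch of three parallel streams advancing in lockstep:
# user audio in, agent text and agent audio out, frame by frame.
from dataclasses import dataclass, field

@dataclass
class DuplexState:
    user_audio: list = field(default_factory=list)   # incoming user frames
    agent_text: list = field(default_factory=list)   # generated text tokens
    agent_audio: list = field(default_factory=list)  # generated audio tokens

def step(state: DuplexState, incoming_frame: int) -> None:
    """One autoregressive tick: ingest a live user frame and emit one
    agent text token and one agent audio token in the same tick.
    Placeholder tokens stand in for real model outputs."""
    state.user_audio.append(incoming_frame)
    state.agent_text.append(f"t{len(state.agent_text)}")
    state.agent_audio.append(f"a{len(state.agent_audio)}")

state = DuplexState()
for frame in range(3):
    step(state, frame)

# All three streams have advanced together; there is no turn-taking barrier.
print(state.user_audio, state.agent_text, state.agent_audio)
```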
Response time
Under 300 milliseconds.
That’s below the human perception threshold for conversational pauses.
The real breakthrough: Hybrid System Prompt
Full-duplex alone isn’t new. What’s new is the combination with a hybrid prompting system.
1. Text role conditioning
Roles are defined like in instruction-following LLMs: “You are a customer service agent at an insurance provider…” The model follows context, company rules, and product constraints.
2. Voice prompting
A short audio sample is enough for zero-shot voice cloning. Voice, timbre, and prosody are carried over.
Both are combined up front, text prompt plus voice prompt, into a single conditioning prefix. Then the live dialogue starts.
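Conceptually, the hybrid prompt is one conditioning sequence built from both inputs. The sketch below is illustrative only; the function names and token shapes are assumptions, not the paper's API:

```python
# Hypothetical sketch: a role text and a short voice sample become one
# conditioning prefix before the live dialogue begins.

def tokenize(text: str) -> list:
    # Stand-in for the model's text tokenizer.
    return [f"txt:{w}" for w in text.split()]

def embed_voice(audio_clip_s: float, n_tokens: int = 4) -> list:
    # Stand-in for a speaker encoder applied to a short audio clip.
    return [f"voice:{i}" for i in range(n_tokens)]

def build_hybrid_prompt(role: str, voice_clip_s: float) -> list:
    # Role and voice are independent inputs concatenated into one prefix.
    return tokenize(role) + embed_voice(voice_clip_s)

prefix = build_hybrid_prompt("You are an insurance service agent.",
                             voice_clip_s=5.0)
```

Because the two halves are independent, swapping the voice clip changes the speaker while the role text, and everything it constrains, stays fixed.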
What that means
PersonaPlex cleanly separates role and voice for the first time.
You can run the same service agent with 50 different voices — without retraining.
Training and data foundation
• 1,840 hours of synthetic customer service dialogues
• 410 hours of QA dialogues
• 105,410 customer service dialogues
• 39,322 QA dialogues
• additional real Fisher English data in the released checkpoint
Training: 24,576 steps on 8× A100 GPUs, around 6 hours.
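A quick back-of-envelope check of that training budget (treating the "around 6 hours" as exact, which it is not):

```python
# Rough per-step cost implied by the reported numbers:
# 24,576 steps in roughly 6 hours on 8x A100.
steps = 24_576
wall_clock_s = 6 * 3600            # ~6 hours, as stated
sec_per_step = wall_clock_s / steps
print(f"{sec_per_step:.2f} s/step")  # roughly 0.88 s per optimizer step
```

That is a modest budget by foundation-model standards, which fits the picture of fine-tuning on top of an existing speech-text backbone rather than training from scratch.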
Benchmarks
It was tested against:
• Gemini Live
• GPT-4o
• Qwen-2.5-Omni
• Freeze-Omni
• Moshi
PersonaPlex achieves:
• the highest speaker similarity among open models
• strong role adherence
• very natural turn-taking dynamics
• high interruption stability
Important limitation
Gemini Live performs slightly better in some individual service roles.
So PersonaPlex is strong — but not the absolute leader.
Released checkpoint — improvements
• more real conversational data
• unified TTS engine (ChatterboxTTS)
• higher speaker similarity (0.65 instead of 0.57)
• more natural backchannels (“mhm”, “yeah”)
• better pause handling
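Speaker-similarity scores like 0.57 and 0.65 are commonly computed as cosine similarity between speaker embeddings of the prompt voice and the generated audio (the exact embedding model used in the preprint is not assumed here). A toy version with hand-made vectors:

```python
# Toy cosine similarity between two speaker embeddings. Real scores use
# embeddings from a speaker-verification model, not hand-made vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

reference = [0.9, 0.1, 0.4]   # embedding of the voice prompt (toy values)
generated = [0.8, 0.3, 0.5]   # embedding of the cloned output (toy values)
score = cosine(reference, generated)  # 1.0 would mean identical direction
```

On this scale, moving from 0.57 to 0.65 means the cloned audio sits measurably closer to the prompt speaker in embedding space.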
Hardware & reality check
PersonaPlex is a 7B model. That’s a deliberate choice:
• big enough for solid dialogue quality
• small enough for local inference
• GPU-optimized (RTX, A100, H100)
But: the hardware demand is real. This isn’t a Raspberry Pi toy.
Privacy
On-premise operation is possible.
No forced cloud dependency.
No per-minute API billing model for spoken audio.
What PersonaPlex is NOT
• not a GPT-4 replacement
• not a tool-calling ecosystem
• not a multimodal all-rounder
• not a universal reasoning monster
It’s a specialized real-time voice system.
My honest assessment
The paper is technically solid. No fluff.
PersonaPlex is the first open model that combines full-duplex + role conditioning + zero-shot voice cloning in one system.
Is it a death blow to OpenAI? No. Is it a serious architectural shift in voice AI? Yes.
The real impact is in enterprise use cases: automotive, call centers, gaming NPCs, compliance-heavy industries.
Conclusion
PersonaPlex marks the start of a new generation of speech models. Away from sequential pipelines. Toward integrated, low-latency duplex systems.
Right now it’s still English-centric, hardware-hungry, and not globally scaled. But the direction is obvious.
Voice systems won’t be three separate components in the future. Voice will be an integrated, multimodal real-time model.
Source: NVIDIA Research Preprint “PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models”
research.nvidia.com
https://research.nvidia.com/labs/adlr/files/personaplex/personaplex_preprint.pdf