February 18, 2026 | Engineering
Under the Hood: Building Low-Latency AI Voice Interviews
As we build AI-powered voice interviews at Enabl, we've learned that latency is what separates a useful tool from a frustrating experience. In human conversation, response delays beyond 400ms feel unnatural. Our goal is to hit that threshold consistently, regardless of network conditions.
The Challenge
A typical voice AI pipeline involves multiple sequential steps: audio capture, speech-to-text transcription, language model inference, text-to-speech synthesis, and audio playback. Each step adds latency. In a naive implementation, you might see total round-trip times of 2-3 seconds — far too slow for natural conversation.
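To make the cost of strict sequencing concrete, here's a minimal sketch of how per-stage latencies add up in a naive pipeline. The stage names and the numbers are illustrative assumptions, not measurements from our system:

```python
# Illustrative per-stage latencies (ms) for a naive sequential pipeline.
# These figures are assumptions for illustration, not measured values.
STAGE_LATENCY_MS = {
    "audio_capture": 100,    # wait for end-of-utterance detection
    "speech_to_text": 600,   # batch transcription of the full utterance
    "llm_inference": 900,    # generate the complete response text
    "text_to_speech": 500,   # synthesize the full audio clip
    "playback_start": 100,   # network transfer + playback buffering
}

def sequential_round_trip_ms(stages: dict[str, int]) -> int:
    """In a naive pipeline each stage waits for the previous one,
    so total round-trip latency is the plain sum of stage latencies."""
    return sum(stages.values())

print(sequential_round_trip_ms(STAGE_LATENCY_MS))  # 2200
```

With numbers in this ballpark, the sum lands squarely in the 2-3 second range, which is why overlapping stages matters more than shaving any single one.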
Our Approach: Parallel Processing & Speculative Execution
The key insight we're exploring is that many of these steps don't need to be strictly sequential. We're building our pipeline around three principles:
- Streaming transcription — Processing audio in small chunks, allowing the language model to start "thinking" before the candidate has finished speaking.
- Speculative response generation — Based on partial transcripts, pre-generating likely follow-up questions and discarding them if the conversation takes a different direction.
- Edge-deployed TTS — Running text-to-speech models on edge nodes close to the user, eliminating cross-region latency for the final audio synthesis.
What We're Targeting
With this architecture, we're aiming for:
- P50 latency: under 300ms
- P95 latency: under 400ms
- P99 latency: under 500ms
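Targets like these only mean something if they're continuously checked against measured round trips. Here's a small sketch of that check using a nearest-rank percentile; the sample data is synthetic and the function names are ours for illustration:

```python
import math

# Latency targets from the list above, in milliseconds.
TARGETS_MS = {50: 300, 95: 400, 99: 500}

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_targets(samples: list[float]) -> dict[int, bool]:
    """Map each percentile to whether measured latency beats its target."""
    return {p: percentile(samples, p) < limit for p, limit in TARGETS_MS.items()}

# Synthetic round-trip samples: 100 measurements from 200ms to 299ms.
samples = [float(ms) for ms in range(200, 300)]
print(meets_targets(samples))  # {50: True, 95: True, 99: True}
```

Tracking all three percentiles matters because a pipeline can look fast at P50 while tail effects (cold TTS caches, retried network hops) blow past the P99 budget.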
If we can hit these targets, the AI interviewer would respond faster than most humans in natural conversation.
Technical Trade-offs We're Navigating
This approach isn't without costs. Speculative execution means we sometimes generate responses that are never used, increasing compute spend. Streaming transcription can occasionally produce less accurate transcripts than batch processing. And edge deployment adds operational complexity.
We believe these trade-offs will be worth it. Early internal testing suggests that low latency makes the voice experience feel "surprisingly natural" — the highest compliment for a system designed to disappear into the background.
What We're Learning
Building real-time voice AI is teaching us that user experience is fundamentally about perception. A 300ms response feels instant. A 600ms response feels sluggish. The engineering challenge isn't just making things faster — it's understanding which milliseconds matter most.
We'll share more technical details as we progress. This is one of the most challenging and exciting problems we've worked on.
Interested in the technical details? We're hiring engineers who want to push the boundaries of real-time AI systems. Check out our open positions.
