Building a Sub-Second Voice Agent on a Consumer GPU

Voice agents that lag feel robotic. That half-second pause between your question and the response breaks the conversational flow—you stop thinking of it as a conversation and start thinking of it as a command line with extra steps.

Sub-second response time changes everything. The interaction feels natural. You stay engaged. This is why I set out to build a real-time voice agent with the lowest latency I could manage, running entirely on local hardware.

Here’s what I learned.

The Stack

I recently built a PC with an RTX 5070 Ti, which gave me enough GPU headroom to prototype quickly without immediately reaching for cloud APIs. The full pipeline:

  • Audio input: FastAPI with WebSockets, with sounddevice capturing audio continuously from the mic
  • Speech-to-text: FasterWhisper
  • LLM: vLLM serving Llama 3.2 3B
  • Text-to-speech: Piper

Each component streams to the next. The WebSocket connection stays open, audio chunks flow in, transcriptions flow to the LLM, and synthesized speech flows back out. No waiting for complete sentences. No batch processing. Just continuous flow.
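The streaming handoff can be sketched with asyncio queues. This is a simplified stand-in for the real WebSocket plumbing, and the stage bodies are placeholders for the FasterWhisper, vLLM, and Piper calls:

```python
import asyncio

# Each stage consumes from one queue and produces into the next, so nothing
# waits for a complete utterance. A None sentinel marks end of stream.

async def stt_stage(audio_q, text_q):
    while (chunk := await audio_q.get()) is not None:
        await text_q.put(f"transcript({chunk})")   # placeholder for STT
    await text_q.put(None)

async def llm_stage(text_q, reply_q):
    while (text := await text_q.get()) is not None:
        await reply_q.put(f"reply({text})")        # placeholder for the LLM
    await reply_q.put(None)

async def tts_stage(reply_q, out):
    while (reply := await reply_q.get()) is not None:
        out.append(f"audio({reply})")              # placeholder for TTS

async def run_pipeline(chunks):
    audio_q, text_q, reply_q = (asyncio.Queue() for _ in range(3))
    out = []
    for c in chunks:
        audio_q.put_nowait(c)
    audio_q.put_nowait(None)  # end-of-stream sentinel
    await asyncio.gather(
        stt_stage(audio_q, text_q),
        llm_stage(text_q, reply_q),
        tts_stage(reply_q, out),
    )
    return out
```

Because all three stages run concurrently, the LLM can start on the first transcript while the mic is still feeding audio into the first queue.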

Key Decisions

Manual WebSockets Over Pipecat

I looked at Pipecat for orchestrating the pipeline, but ended up sticking with manual WebSocket handling via FastAPI. For my use case, it simply wasn’t necessary.

FastAPI’s WebSocket support is clean and well-documented. The additional abstraction layer didn’t buy me anything I needed, and keeping things simple meant fewer debugging sessions when latency spiked unexpectedly.

If you’re building something production-grade with complex state management, Pipecat might make sense. For prototyping and straightforward pipelines, vanilla WebSockets work fine.

vLLM Over Ollama

Ollama is great for getting started quickly—single command, model running, done. But when you’re chasing milliseconds, vLLM is the better choice.

vLLM is built for high-throughput, low-latency inference. It uses PagedAttention for efficient memory management and continuous batching to maximize GPU utilization. For a real-time voice agent where every 50ms matters, these optimizations add up.

The setup is slightly more involved, but the performance difference is noticeable. If latency is your priority, use vLLM.

Llama 3.2 3B Without Quantization

I started with Llama 3.2 3B because it fits comfortably on my GPU without quantization. Running the model at full precision means no quality loss from compression—one less variable when debugging response quality.

For basic conversational exchanges, this worked well. The model responds quickly and coherently for simple back-and-forth.

The limitation: complex, multi-turn conversations or tasks requiring deeper reasoning will push past what a 3B model can handle gracefully. At that point, you’re looking at one of three options:

  • Quantizing a larger model (8-bit or 4-bit)
  • Using cloud APIs for the LLM step
  • Renting cloud GPU infrastructure

But for prototyping and proving out the pipeline? 3B gets you moving fast without infrastructure headaches.
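The back-of-the-envelope math behind "fits comfortably": weights alone for a 3B model at different precisions. This is a rough estimate that ignores KV cache and activation overhead:

```python
def weight_gib(params_billions, bytes_per_param):
    # Rough weight footprint: parameter count x bytes per parameter.
    return params_billions * 1e9 * bytes_per_param / 2**30

for label, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"3B @ {label}: ~{weight_gib(3, nbytes):.1f} GiB")
```

At fp16 that's roughly 5.6 GiB of weights, well under the 5070 Ti's 16 GB even with KV cache on top, which is why quantization isn't needed at this size. The same arithmetic shows why an 8B model starts to get tight once you add cache and overhead.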

The Quality/Speed Tradeoff

Here’s the non-obvious lesson: not every component in the pipeline deserves the same optimization attention.

Invest in speech-to-text quality. A larger Whisper model takes more time, but accurate transcription matters. If the LLM receives garbled input, it produces garbled output, and no amount of downstream optimization saves you. I found that bumping up the Whisper model size was worth the latency cost.
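Swapping model sizes in FasterWhisper is a one-line change, which makes this tradeoff cheap to test. A sketch, where the model size and settings are illustrative rather than my exact config:

```python
def load_model(size="medium", device="cuda"):
    # Imported here so the text-assembly helper below has no heavy dependencies.
    from faster_whisper import WhisperModel
    # compute_type="float16" keeps inference fast on a consumer GPU;
    # bumping size ("small" -> "medium") trades latency for accuracy.
    return WhisperModel(size, device=device, compute_type="float16")

def transcribe(model, audio):
    # faster-whisper returns a generator of segments plus metadata.
    segments, _info = model.transcribe(audio, beam_size=1, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)
```

Time `transcribe` at each size against a few recorded utterances and pick the largest model that stays inside your latency budget.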

Save cycles on text-to-speech. Piper is fast and sounds decent. You likely don’t need studio-quality voice synthesis if the rest of your pipeline is solid. Users are more forgiving of slightly robotic speech than they are of an agent that mishears them or responds nonsensically.

Measure everything. Add latency metrics at each stage of the pipeline. Time the STT step. Time the LLM inference. Time the TTS synthesis. When you see exactly where the milliseconds go, you can make informed tradeoffs instead of guessing.

I spent an afternoon tweaking vLLM parameters and FasterWhisper settings with a stopwatch running. It’s tedious, but it’s how you find the configuration that actually works for your hardware and use case.
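That stopwatch can live in the code itself. A minimal stdlib harness for per-stage timings (the stage names are illustrative):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)

@contextmanager
def timed(stage):
    # Record wall-clock time for one pass through a pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append((time.perf_counter() - start) * 1000)  # ms

def report():
    # Mean latency per stage, in milliseconds.
    return {stage: sum(ms) / len(ms) for stage, ms in timings.items()}
```

Wrap each call site (`with timed("stt"): ...`, `with timed("llm"): ...`, `with timed("tts"): ...`) and log `report()` periodically; the stage that dominates the total is the one worth tuning first.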

What’s Next

This setup is a prototype, not a product. Complex conversations, long context windows, and production reliability all require more than a 3B model on a single consumer GPU.

But for getting started? For proving that local, low-latency voice agents are possible without a cloud bill? This stack works.

Build it, measure it, talk to it. Then figure out what you actually need to scale.


Got questions or built something similar? I’d love to hear about your setup.