Vivritt Engineering

How we built sub-second transcription for 10,000 concurrent call streams

Asmit Gupta · 12 min read

Building a real-time call analysis system is not primarily a machine learning problem. It is a distributed systems problem with a machine learning layer on top. The ML model that scores a disclosure miss or detects sentiment is the last 15% of the engineering challenge. The other 85% is making sure transcripts arrive fast enough, accurately enough, and in the right format to act on.

This post describes how we built a streaming transcription and analysis pipeline that handles 10,000 concurrent call streams at under 800ms end-to-end latency — from speech to flag in the supervisor dashboard.

Why latency is the product

A compliance flag that fires three seconds after a disclosure checkpoint passes is informative. A compliance flag that fires while the agent can still deliver the disclosure — and that a supervisor can act on — is valuable. The difference between informative and valuable in real-time call monitoring is latency.

Our design target from the start was under 800ms from speech to dashboard flag. That is fast enough for a supervisor whisper prompt to land before the natural end of the agent's current sentence. It leaves room for human reaction time — typically 2 to 4 seconds for a supervisor to see a flag, assess it, and respond.

Meeting this target at scale required rethinking the conventional approach to call analysis, which processes a complete call recording after call end. Post-call processing has no latency requirement — you just need the result before the next coaching session. Real-time processing has a hard latency requirement determined by human reaction time.

The streaming pipeline architecture

Calls enter Vivritt through two paths: SIP trunk integration (for on-premise telephony and CCaaS platforms like Five9 and Genesys) and API-based audio streaming (for cloud telephony platforms and custom integrations). Both paths deliver audio as a continuous stream of 200-millisecond chunks to our ingestion layer.
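Both ingestion paths normalise to the same unit of work: a per-call sequence of 200ms audio chunks. A minimal sketch of what that unit might look like (the `AudioChunk` name and fields are illustrative, not Vivritt's actual schema):

```python
from dataclasses import dataclass

CHUNK_MS = 200  # chunk duration used throughout the pipeline


@dataclass(frozen=True)
class AudioChunk:
    call_id: str   # stable identifier, also used for call-affinity routing
    seq: int       # monotonically increasing chunk index within the call
    pcm: bytes     # raw audio payload for this 200ms slice

    @property
    def start_ms(self) -> int:
        # offset of this chunk from the start of the call
        return self.seq * CHUNK_MS


# e.g. the sixth chunk of a call starts 1,000ms in
chunk = AudioChunk(call_id="call-42", seq=5, pcm=b"\x00" * 3200)
```

Keeping the chunk immutable and self-describing means downstream stages (diarisation, ASR, scoring) need no shared state beyond the chunk itself.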

The ingestion layer performs speaker separation before transcription. Separating the agent channel from the customer channel is critical for accurate compliance rule evaluation — a rule that checks whether the agent mentioned the APR cannot fire correctly if the agent and customer speech are mixed in a single channel. Where channel-separated audio is not available from the telephony platform, we apply a diarisation model that identifies speaker turns from a mixed stream with approximately 94% accuracy across standard BPO call types.

Each 200ms audio chunk is passed to our ASR layer. We run a fine-tuned streaming ASR model optimised for the accent profiles of our primary markets: Philippine English, Indian English (across Hindi, Tamil, and Bengali-influenced variants), and Singapore English. Standard commercial ASR models — including the major cloud providers — perform measurably worse on these accent profiles than on North American and British English. We cover the fine-tuning story in a separate post.

The latency budget

Our 800ms target breaks into four components: audio ingestion and speaker separation (≤ 80ms), ASR transcription of the 200ms chunk (≤ 240ms), context window assembly and model inference (≤ 350ms), and dashboard update delivery via WebSocket (≤ 130ms). Each component has a p99 budget, not just a mean — a flag that fires within 800ms 50% of the time but takes 3 seconds on the 99th percentile is not a reliable real-time product.
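Because each stage is held to a p99 rather than a mean, budget checks have to look at the tail of the latency distribution. A simple nearest-rank sketch of that check (the stage names mirror the budget breakdown above; the measurement plumbing is assumed):

```python
# Per-stage p99 budgets in milliseconds, summing to the 800ms target
BUDGETS_MS = {
    "ingest_separation": 80,
    "asr": 240,
    "context_inference": 350,
    "websocket_delivery": 130,
}
assert sum(BUDGETS_MS.values()) == 800


def p99(samples):
    # nearest-rank p99: value at the ceil(0.99 * n)-th position
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceiling division without floats
    return ordered[rank - 1]


def over_budget(stage_samples):
    # Return each stage whose observed p99 exceeds its budget
    return {
        stage: p99(latencies)
        for stage, latencies in stage_samples.items()
        if p99(latencies) > BUDGETS_MS[stage]
    }
```

A stage that is fast on average but slow at the 99th percentile still shows up here, which is exactly the failure mode the mean would hide.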

The hardest latency challenge is the context window. Compliance rules and sentiment models need more than the most recent 200ms chunk — they need to see whether a phrase was completed, whether a topic was introduced earlier in the call, and whether a sentiment trajectory is improving or worsening. We use a sliding window of the last 120 seconds of transcript, which covers most disclosure requirement windows (typically the first 90 seconds of a regulated financial call). Assembling and scoring this window fast enough to meet the 350ms budget required significant work on the inference layer.
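The sliding window itself is conceptually simple: keep only transcript segments from the last 120 seconds and evict older ones as new speech arrives. A minimal in-memory sketch (Vivritt's actual context assembly is distributed and cached; this shows only the eviction logic):

```python
from collections import deque

WINDOW_S = 120.0  # sliding context window, covering typical disclosure windows


class TranscriptWindow:
    """Keeps only transcript segments from the last WINDOW_S seconds."""

    def __init__(self):
        self._segments = deque()  # (timestamp_s, speaker, text), time-ordered

    def add(self, ts, speaker, text):
        self._segments.append((ts, speaker, text))
        self._evict(ts)

    def _evict(self, now):
        # Drop segments that have aged out of the window
        while self._segments and self._segments[0][0] < now - WINDOW_S:
            self._segments.popleft()

    def context(self):
        # Concatenated, speaker-tagged text handed to the scoring models
        return " ".join(f"[{spk}] {txt}" for _, spk, txt in self._segments)
```

Using a deque makes both append and eviction O(1) per segment, which matters when every one of 50,000 chunks per second can trigger a window update.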

Handling multilingual calls

A substantial proportion of calls handled by Philippine BPOs involve code-switching — agents and customers moving between English and Filipino (Tagalog) mid-sentence. This is not a failure of English proficiency. It is a natural communication pattern that improves rapport, particularly for sensitive topics. A pure English-only ASR model produces frequent errors on code-switched utterances because it attempts to force non-English words into the nearest English phoneme sequence.

Our ASR model is trained on a dataset that includes code-switched Philippine English, which allows it to handle common Tagalog insertions accurately. Similarly, Indian call centre environments involve agent speech that includes Hindi, Tamil, and other regional language words within otherwise English sentences — a pattern we handle through similar multi-language training data.

For operations in Vietnam and Malaysia, where BPO is growing rapidly, we are building fine-tuned models on Vietnamese-accented English and Malay-accented English. These are in beta and available to early access customers.

Scaling to 10,000 concurrent streams

At 10,000 concurrent calls, each with a 200ms chunk arriving every 200ms, the pipeline processes 50,000 audio chunks per second. The ASR layer alone consumes a significant compute budget — streaming models are less efficient than batch models in throughput per compute unit, because they cannot amortise fixed per-inference overhead across a large batch.
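The arithmetic behind that figure, as a quick sanity check (the per-node capacity is a purely illustrative assumption, not a measured Vivritt number):

```python
CONCURRENT_CALLS = 10_000
CHUNK_MS = 200

# Each call emits one chunk every 200ms -> 5 chunks per second per call
chunks_per_call_per_s = 1000 / CHUNK_MS
total_chunks_per_s = CONCURRENT_CALLS * chunks_per_call_per_s  # 50,000

# Illustrative only: if one ASR node sustained 500 chunks/s,
# the fleet would need 100 nodes at this load
NODE_CAPACITY = 500
nodes_needed = total_chunks_per_s / NODE_CAPACITY
```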

We addressed this through three approaches: horizontal scaling of the ASR inference nodes behind a load balancer with call-affinity routing (so each call is consistently processed by the same ASR node, maintaining context); a shared context cache that prevents duplicate context assembly work for calls handled by multiple nodes; and a priority queue that gives latency guarantees to calls with active compliance windows — calls in the first 90 seconds — while allowing slightly more latency for post-window scoring.
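The third approach, the compliance-window priority queue, can be sketched with a standard binary heap. This is a simplified model under the assumption that priority is binary (in-window vs post-window) and that ordering within a priority level stays first-in, first-out:

```python
import heapq
import itertools

COMPLIANCE_WINDOW_S = 90  # first 90 seconds of a call get latency priority


class ChunkQueue:
    """Chunks from calls still inside the compliance window are
    dequeued before chunks from calls past it."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves FIFO order

    def push(self, call_age_s, chunk):
        priority = 0 if call_age_s < COMPLIANCE_WINDOW_S else 1
        heapq.heappush(self._heap, (priority, next(self._seq), chunk))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

The monotonic sequence number matters: without it, two chunks at the same priority would be compared by payload, and FIFO ordering within a priority level would be lost.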

The result is a system that scales linearly with compute allocation, with no architectural bottleneck below 25,000 concurrent streams on current hardware.

What this enables for BPO operations

The engineering work described here is invisible to the average supervisor using the Vivritt dashboard. What they see is a flag that fires on time, a transcript snippet that confirms why it fired, and an action they can take. The pipeline exists to make that experience reliable — not just in the demo environment, but at 02:00 on the night shift when 600 agents are simultaneously on calls.

For BPO clients in Manila, Bengaluru, and Singapore evaluating call intelligence platforms, the questions worth asking any vendor are: what is your p99 latency from speech to flag? How do you handle Filipino or Indian English code-switching? What is your architecture for 100% of calls — not just a sample? The answers reveal whether a platform is built for your reality.

Related: Real-time escalation flags in the Vivritt dashboard and New integrations: Zendesk, Freshdesk, and Five9.
