Performance
Latency budgets per endpoint, when to use sync vs streaming vs async, the warmUp pattern, HTTP/2 multiplexing, and bulk extraction.
This page covers how to make the SDK fast, not just reliable. Everything here is opt-in: the defaults are tuned for "good enough", but a handful of changes can roughly halve perceived latency in real clinical apps.
Latency budgets per endpoint
Steady-state, single-call, warm-cache numbers from production traffic.
Cold-start adds ~300 ms (TLS handshake + first-LLM warmup) to whichever
call you make first — see warmUp() below.
| Endpoint | p50 | p95 | Notes |
|---|---|---|---|
| ohm.extract({ text }) | 1.8 s | 4.0 s | Pure LLM call. Scales linearly with output token count. |
| ohm.summarize({ text }) | 0.9 s | 2.0 s | Shorter output → faster than extract. |
| ohm.audio.transcribe({ file }) | 1.5 s / min audio | 2.5 s / min audio | STT only. Indian-English code-mix is the slow case. |
| ohm.audio.extract({ file }) | transcribe + extract | transcribe + extract | Same as the two above combined. |
| ohm.audio.extractStream({ file }) | first chunk in 1.5 s | full in 4.0 s | Transcript appears before extraction completes. |
| ohm.apis.list() | 80 ms | 150 ms | Cached server-side. |
| ohm.audio.jobs.create() | ~100 ms | 300 ms | Only enqueues; actual work is async. |
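To compare your own traffic against these budgets, it can help to time calls and compute p50 / p95 locally. The sketch below is one way to do that. The helpers `percentile` and `timeCall` are hypothetical names, not part of the SDK, and the nearest-rank definition used here is only one of several common percentile conventions.

```typescript
// Hypothetical helpers (not part of the SDK) for checking your own latency
// against the budgets in the table above.

/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

/** Runs `fn`, records its wall-clock duration in `sink`, and returns its result. */
async function timeCall<T>(sink: number[], fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Recorded on success and on failure, so slow errors still show up.
    sink.push(Date.now() - start);
  }
}
```

In use, you would wrap each SDK call, for example `timeCall(samples, () => ohm.extract({ text }))`, and report `percentile(samples, 50)` and `percentile(samples, 95)` once enough samples have accumulated.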
When to use sync vs streaming vs async
| Audio length | Best mode |
|---|---|
| 0 – 30 s | Sync — ohm.audio.extract({ file }). One round-trip, simplest UI. |
| 30 s – 5 min, UI present | Streaming — ohm.audio.extractStream({ file }). Transcript renders first; user reads while extraction runs. |
| 5 min – 30 min | Async polling — ohm.audio.extractAsync({ apiSlug, file }). Survives network blips. |
| > 30 min, mobile background, bulk replay | Async webhook — ohm.audio.jobs.create({ apiSlug, file, webhookUrl }). Device fires-and-forgets; your backend gets a signed callback. |
For the full async write-up, see /sdk/async-extraction.
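The table above can be encoded as a small decision function. This is a sketch rather than SDK code: `pickMode`, `ModeContext`, and the mode labels are hypothetical, the handling of exact boundaries (30 s, 5 min, 30 min) is an assumption, and so is falling back to sync for mid-length audio when no UI is present, since the table does not cover that case.

```typescript
// Hypothetical helper (not part of the SDK) that encodes the mode table above.
type ExtractionMode = "sync" | "streaming" | "async-polling" | "async-webhook";

interface ModeContext {
  durationSec: number;  // length of the recording in seconds
  hasUi?: boolean;      // is a user watching the result render?
  background?: boolean; // mobile background task or bulk replay
}

function pickMode({ durationSec, hasUi = false, background = false }: ModeContext): ExtractionMode {
  // Background work and very long audio go to the webhook flow.
  if (background || durationSec > 30 * 60) return "async-webhook";
  // 5 to 30 minutes: poll an async job so network blips do not lose the work.
  if (durationSec > 5 * 60) return "async-polling";
  // 30 s to 5 min: stream when someone is watching (assumption: sync otherwise).
  if (durationSec > 30) return hasUi ? "streaming" : "sync";
  // Short clips: a single round-trip is simplest.
  return "sync";
}
```

The returned label would then select between `ohm.audio.extract`, `ohm.audio.extractStream`, `ohm.audio.extractAsync`, and `ohm.audio.jobs.create`.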
Warm up the connection
The first OHM call from a fresh process pays:
- DNS lookup (~30 ms)
- TCP handshake (~30 ms)
- TLS handshake (~150 ms)
- First-LLM cold-start (~50–100 ms server-side)
Total cold cost: ~300–500 ms added to the first call. Subsequent calls reuse the TCP socket and feel snappy.
Fix: call warmUp() once at app boot. It does a cheap GET /api/health to
establish TCP + TLS without waiting for an LLM.
const ohm = new OHM({ apiKey: process.env.OHM_API_KEY! });
// Fire-and-forget at boot.
void ohm.warmUp();
// ... later, when the doctor taps "Record":
await ohm.audio.extract({ apiSlug, file }); // socket already warm
warmUp() is safe to call multiple times and is a no-op in mock: true.
On React Native, call warmUp() from an effect in your root <App /> component:
useEffect(() => { void ohm.warmUp(); }, []). For Next.js server actions, call it
once in your server's startup hook (or at import time, since modules are
cached).
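Because `void ohm.warmUp()` discards the promise, it is worth knowing what happens if warm-up fails, for example when the device is offline at boot. Whether the SDK swallows that error is not stated here, so the sketch below shows a defensive wrapper. `Warmable` and `warmUpOnce` are hypothetical names; `Warmable` is only a stand-in for the client, declaring the one method used.

```typescript
// Minimal stand-in for the client; only the method used here is declared.
interface Warmable {
  warmUp(): Promise<unknown>;
}

// Hypothetical helper: starts warm-up at most once per client and never rejects,
// so it is safe to call from boot code without awaiting.
const inFlight = new WeakMap<Warmable, Promise<boolean>>();

function warmUpOnce(client: Warmable): Promise<boolean> {
  let p = inFlight.get(client);
  if (!p) {
    p = client.warmUp().then(
      () => true,  // connection established
      () => false, // warm-up failed; the first real call simply pays the cold cost
    );
    inFlight.set(client, p);
  }
  return p;
}
```

Calling `void warmUpOnce(ohm)` at boot then behaves like the snippet above, with the difference that a failed warm-up resolves to `false` instead of surfacing as an unhandled rejection.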
HTTP/2 multiplexing (Node only)
Node 18+ uses undici for fetch. By default it speaks HTTP/1.1 with
connection pooling, which is fine for serial calls. HTTP/1.1 cannot
multiplex, though: each concurrent request needs its own connection, so a
burst of parallel calls opens fresh sockets.
When your Node app makes many concurrent OHM calls (server proxies, bulk replays, background workers), opt in to HTTP/2:
import { enableHttp2 } from "@ohm_studio/sdk/http2";
enableHttp2(); // call once at process start
// ... all `new OHM(...)` clients now multiplex over a single H2 socket.
Saves 50–100 ms per parallel call. Silently no-ops on browsers, RN, Cloudflare Workers, and Deno.
Bulk extraction with concurrency cap
For batch replays (10 000 historical transcripts), don't loop one-by-one
or Promise.all the lot. Use extractBulk:
const results = await ohm.extractBulk(
  transcripts.map((text) => ({ apiSlug: "opd-clinic", text })),
  {
    concurrency: 8,
    onProgress: (done, total) => console.log(`${done} / ${total}`),
  },
);
const succeeded = results.filter((r) => r.ok);
const failed = results.filter((r) => !r.ok);
Default concurrency is 4, enough to amortise round-trips without
blowing your per-key rate limit. Partial failures don't fail the batch:
each result is a discriminated union ({ ok: true, data } or
{ ok: false, error, input }).
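To make the concurrency cap and the per-item result shape concrete, here is a generic worker-pool mapper. It is an illustration of the technique only, not the SDK's implementation of extractBulk; `mapWithConcurrency` and `BulkResult` are hypothetical names that mirror the discriminated union described above.

```typescript
// Illustrative only: a generic concurrency-capped mapper. Not the SDK's
// extractBulk; it only mirrors the result shape described in the text.
type BulkResult<I, O> =
  | { ok: true; data: O }
  | { ok: false; error: unknown; input: I };

async function mapWithConcurrency<I, O>(
  inputs: I[],
  concurrency: number,
  fn: (input: I) => Promise<O>,
  onProgress?: (done: number, total: number) => void,
): Promise<BulkResult<I, O>[]> {
  const results: BulkResult<I, O>[] = new Array(inputs.length);
  let next = 0; // index of the next unclaimed input
  let done = 0; // number of inputs finished, successfully or not

  // Each worker pulls the next unclaimed index until the queue is empty.
  async function worker(): Promise<void> {
    while (next < inputs.length) {
      const i = next++;
      const input = inputs[i];
      try {
        results[i] = { ok: true, data: await fn(input) };
      } catch (error) {
        // A failure is recorded for this item only; the batch keeps going.
        results[i] = { ok: false, error, input };
      }
      onProgress?.(++done, inputs.length);
    }
  }

  const workers = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results; // same order as `inputs`
}
```

Results come back in input order, so failed items can be retried by filtering on `!r.ok` and reading `r.input`, which is the reason the failure branch carries the original input.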
keepalive for short POSTs
The SDK automatically passes keepalive: true to fetch on JSON bodies
under 60 KB (extract, summarize, insights). Saves ~30 ms on every
call after the first by reusing the TCP / TLS socket — no opt-in needed.
Multipart audio uploads skip keepalive (browser cap is 64 KB total
body).
Per-call timeout / retry overrides
Don't construct a second client for one slow call. Use withOverrides:
const slow = ohm.withOverrides({
  timeoutMs: 5 * 60_000,
  totalTimeoutMs: 6 * 60_000,
  maxRetries: 1,
});
await slow.audio.extract({ apiSlug, file: hourLongAudio });
The returned client is a thin clone (same auth, same baseUrl, same hooks), ready to use immediately.
Performance anti-patterns
Things customers do that hurt latency.
| Anti-pattern | Better |
|---|---|
| await ohm.audio.extract(...) from a click handler with no warmUp() | Call void ohm.warmUp() at app boot |
| for (const t of texts) await ohm.extract({ text: t }) | ohm.extractBulk(texts.map(...)) |
| Setting timeoutMs: 5_000 to "fail fast" | Pair with maxRetries: 0; otherwise you'll just retry 3× and wait 15 s |
| Reconstructing new OHM(...) per request | Construct once and reuse; pooled connections matter |
| Loose Promise.all of 100 calls | extractBulk(items, { concurrency: 8 }) or enableHttp2() |
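The "fail fast" row is simple arithmetic, and a rough model makes it easy to check your own settings. The function below is a back-of-envelope sketch, not the SDK's deadline logic. `worstCaseWaitMs` and its `attempts` parameter are hypothetical, and it assumes that `attempts` counts every try including the first, that backoff delays are ignored, and that totalTimeoutMs caps the whole sequence.

```typescript
// Back-of-envelope model, not the SDK's actual deadline logic. Assumptions:
// `attempts` counts every try including the first, backoff delays are ignored,
// and totalTimeoutMs (when set) caps the whole sequence.
interface WaitModel {
  timeoutMs: number;       // per-attempt timeout
  attempts: number;        // total tries, including the first
  totalTimeoutMs?: number; // optional overall deadline
}

function worstCaseWaitMs({ timeoutMs, attempts, totalTimeoutMs }: WaitModel): number {
  const uncapped = timeoutMs * Math.max(1, attempts);
  return totalTimeoutMs === undefined ? uncapped : Math.min(uncapped, totalTimeoutMs);
}

worstCaseWaitMs({ timeoutMs: 5_000, attempts: 3 });                        // 15 000 ms
worstCaseWaitMs({ timeoutMs: 5_000, attempts: 1 });                        // 5 000 ms
worstCaseWaitMs({ timeoutMs: 5_000, attempts: 3, totalTimeoutMs: 8_000 }); // 8 000 ms
```

Under these assumptions a 5 s timeout only fails fast when there is a single attempt, which is what pairing it with maxRetries: 0 achieves; an overall deadline bounds the wait regardless of the retry count.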
Speed checklist for production
If you only do five things to make your OHM integration fast:
- ✅ Call void ohm.warmUp() at boot
- ✅ Use extractStream() whenever a UI is rendering (halves perceived latency)
- ✅ Use extractAsync() for audio > 5 min (never holds an HTTP connection open)
- ✅ Set totalTimeoutMs so a runaway upstream can't hang your UI
- ✅ Call enableHttp2() once on Node servers that fan out 10+ parallel calls
See also
- /sdk/reliability: what retries, deadline math, error classes
- /sdk/async-extraction: long-running job pattern
- /sdk/streaming: SSE chunk-by-chunk consumption
- /sdk/api-reference: endpoint reference