Performance

Latency budgets per endpoint, when to use sync vs streaming vs async, the warmUp pattern, HTTP/2 multiplexing, and bulk extraction.

This page covers how to make the SDK fast, not just reliable. Almost everything here is opt-in: the defaults are tuned for "good enough", but a handful of changes can halve perceived latency in real clinical apps.

Latency budgets per endpoint

Steady-state, single-call, warm-cache numbers from production traffic. Cold-start adds ~300 ms (TLS handshake + first-LLM warmup) to whichever call you make first — see warmUp() below.

Endpoint	p50	p95	Notes
`ohm.extract({ text })`	1.8 s	4.0 s	Pure LLM call. Scales linearly with output token count.
`ohm.summarize({ text })`	0.9 s	2.0 s	Shorter output → faster than `extract`.
`ohm.audio.transcribe({ file })`	1.5 s / min audio	2.5 s / min	STT only. Indian-English code-mix is the slow case.
`ohm.audio.extract({ file })`	transcribe + extract	+ extract	Same as the two above combined.
`ohm.audio.extractStream({ file })`	first chunk in 1.5 s	full in 4.0 s	Transcript appears before extraction completes.
`ohm.apis.list()`	80 ms	150 ms	Cached server-side.
`ohm.audio.jobs.create()`	~100 ms	300 ms	Only enqueues — actual work async.
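To turn the table into a rough budget for a given recording, the per-minute transcription figure and the extract figure can be added together. The sketch below is a back-of-the-envelope estimate, not a true percentile (summing percentiles is only an approximation), and the helper name `estimateAudioExtractSeconds` is hypothetical, not part of the SDK.

```typescript
// Figures copied from the table above, in seconds.
const TRANSCRIBE_PER_MIN = { p50: 1.5, p95: 2.5 };
const EXTRACT = { p50: 1.8, p95: 4.0 };

interface LatencyEstimate {
  p50: number;
  p95: number;
}

// Hypothetical helper: rough latency of ohm.audio.extract for `audioMinutes` of audio.
function estimateAudioExtractSeconds(audioMinutes: number): LatencyEstimate {
  if (!Number.isFinite(audioMinutes) || audioMinutes < 0) {
    throw new RangeError("audioMinutes must be a non-negative number");
  }
  return {
    // Transcription scales with audio length; extraction is treated as a fixed cost.
    p50: TRANSCRIBE_PER_MIN.p50 * audioMinutes + EXTRACT.p50,
    p95: TRANSCRIBE_PER_MIN.p95 * audioMinutes + EXTRACT.p95,
  };
}

// A two-minute consultation: about 4.8 s at p50 and about 9 s at p95.
console.log(estimateAudioExtractSeconds(2));
```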

When to use sync vs streaming vs async

Audio length	Best mode
0 – 30 s	Sync — `ohm.audio.extract({ file })`. One round-trip, simplest UI.
30 s – 5 min, UI present	Streaming — `ohm.audio.extractStream({ file })`. Transcript renders first; user reads while extraction runs.
5 min – 30 min	Async polling — `ohm.audio.extractAsync({ apiSlug, file })`. Survives network blips.
> 30 min, mobile background, bulk replay	Async webhook — `ohm.audio.jobs.create({ apiSlug, file, webhookUrl })`. Device fires-and-forgets; your backend gets a signed callback.

For the full async write-up, see /sdk/async-extraction.
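The mode table above can be encoded as a small decision function. This is a sketch only: `chooseExtractMode` and its input shape are hypothetical names, the boundaries (exactly 30 s, exactly 5 min) are resolved toward the simpler mode, and the case the table leaves open (30 s to 5 min with no UI) falls back to sync as an assumption.

```typescript
type ExtractMode = "sync" | "streaming" | "async-polling" | "async-webhook";

interface ModeInput {
  durationSec: number;  // length of the recording in seconds
  hasUi: boolean;       // is a user watching the result render?
  background?: boolean; // mobile background task or bulk replay
}

// Hypothetical helper mirroring the table; not an SDK export.
function chooseExtractMode({ durationSec, hasUi, background = false }: ModeInput): ExtractMode {
  if (background || durationSec > 30 * 60) return "async-webhook";
  if (durationSec > 5 * 60) return "async-polling";
  if (durationSec > 30 && hasUi) return "streaming";
  // Assumption: the table does not cover 30 s to 5 min without a UI, so default to sync.
  return "sync";
}

console.log(chooseExtractMode({ durationSec: 120, hasUi: true })); // "streaming"
```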

Warm up the connection

The first OHM call from a fresh process pays:

DNS lookup (~30 ms)
TCP handshake (~30 ms)
TLS handshake (~150 ms)
First-LLM cold-start (~50–100 ms server-side)

Total cold cost: roughly 260–310 ms (call it ~300 ms) added to the first call. Subsequent calls reuse the TCP socket and feel snappy.

Fix: call warmUp() once at app boot. It does a cheap GET /api/health to establish TCP + TLS without waiting for an LLM.

const ohm = new OHM({ apiKey: process.env.OHM_API_KEY! });

// Fire-and-forget at boot.
void ohm.warmUp();

// ... later, when the doctor taps "Record":
await ohm.audio.extract({ apiSlug, file });   // socket already warm

warmUp() is safe to call multiple times and is a no-op when the client is created with mock: true.

On React Native, call warmUp() from a useEffect in your root <App /> component: useEffect(() => { void ohm.warmUp() }, []). With Next.js server actions, call it once from your server's startup hook (or at import time, since modules are cached).
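If warm-up is triggered from several places (a root component, a startup hook, a hot-reloaded module), a small wrapper can share one in-flight promise and keep a failed warm-up from surfacing as an unhandled rejection. This is a sketch under assumptions: `warmOnce` is a hypothetical helper, and it assumes warmUp() returns a promise that may reject when the network is unavailable, which this page does not confirm.

```typescript
// Minimal shape this sketch relies on; the real client exposes far more.
interface WarmableClient {
  warmUp(): Promise<unknown>;
}

// One warm-up promise per client instance, so repeat calls share it.
const warmUps = new WeakMap<WarmableClient, Promise<void>>();

// Hypothetical helper (not an SDK export): warm at most once, never reject.
function warmOnce(client: WarmableClient): Promise<void> {
  let pending = warmUps.get(client);
  if (!pending) {
    pending = (async () => {
      try {
        await client.warmUp();
      } catch (err) {
        // A failed warm-up only means the first real call pays the cold cost.
        console.warn("warmUp failed; first call will be cold", err);
      }
    })();
    warmUps.set(client, pending);
  }
  return pending;
}
```

At boot this would be called as `void warmOnce(ohm)`; later callers get the same promise back rather than starting a second health check.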

HTTP/2 multiplexing (Node only)

Node 18+ uses undici for fetch. By default it speaks HTTP/1.1 with connection pooling. That is fine for serial calls, but each additional in-flight request needs its own connection, so a burst of parallel calls pays for new TCP + TLS handshakes.

When your Node app makes many concurrent OHM calls (server proxies, bulk replays, background workers), opt in to HTTP/2:

import { enableHttp2 } from "@ohm_studio/sdk/http2";

enableHttp2();  // call once at process start

// ... all `new OHM(...)` clients now multiplex over a single H2 socket.

Saves 50–100 ms per parallel call. Silently no-ops on browsers, RN, Cloudflare Workers, and Deno.

Bulk extraction with concurrency cap

For batch replays (for example, 10 000 historical transcripts), don't loop one by one or Promise.all the lot. Use extractBulk:

const results = await ohm.extractBulk(
  transcripts.map((text) => ({ apiSlug: "opd-clinic", text })),
  {
    concurrency: 8,
    onProgress: (done, total) => console.log(`${done} / ${total}`),
  },
);

const succeeded = results.filter((r) => r.ok);
const failed = results.filter((r) => !r.ok);

Default concurrency is 4 — enough to amortise round-trips without blowing your per-key rate limit. Partial failures don't fail the batch — each result is a discriminated union ({ ok: true, data } or { ok: false, error, input }).
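To make the behaviour described above concrete, here is a generic worker-pool sketch that caps concurrency and returns the same kind of discriminated union. It is not the SDK's implementation: `mapWithConcurrency` and `BulkResult` are hypothetical names, and `task` stands in for a single extract call.

```typescript
type BulkResult<I, O> =
  | { ok: true; data: O }
  | { ok: false; error: unknown; input: I };

// Runs `task` over `inputs` with at most `concurrency` calls in flight.
async function mapWithConcurrency<I, O>(
  inputs: readonly I[],
  task: (input: I) => Promise<O>,
  concurrency = 4,
  onProgress?: (done: number, total: number) => void,
): Promise<BulkResult<I, O>[]> {
  const results = new Array<BulkResult<I, O>>(inputs.length);
  let next = 0; // index of the next unclaimed input
  let done = 0; // number of settled inputs

  async function worker(): Promise<void> {
    // Each worker claims the next index until the queue is empty.
    while (next < inputs.length) {
      const i = next++;
      const input = inputs[i];
      try {
        results[i] = { ok: true, data: await task(input) };
      } catch (error) {
        // A failure is recorded against its input; the batch keeps going.
        results[i] = { ok: false, error, input };
      }
      onProgress?.(++done, inputs.length);
    }
  }

  const poolSize = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results; // same order as `inputs`
}
```

Because each failed entry carries its original input, a second pass over `results.filter((r) => !r.ok)` can retry just those items at a lower concurrency.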

`keepalive` for short POSTs

The SDK automatically passes keepalive: true to fetch on JSON bodies under 60 KB (extract, summarize, insights). Saves ~30 ms on every call after the first by reusing the TCP / TLS socket — no opt-in needed.

Multipart audio uploads skip keepalive, because browsers cap keepalive request bodies at 64 KB in total.
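The size rule above is about bytes on the wire, not string length, which matters for clinical text in non-Latin scripts. A minimal sketch of such a check follows; `canUseKeepalive` is a hypothetical helper, and reading "60 KB" as 60 × 1024 bytes is an assumption. The SDK's actual check may differ.

```typescript
// Assumption: "60 KB" is read as 60 * 1024 bytes.
const KEEPALIVE_LIMIT_BYTES = 60 * 1024;

// Hypothetical helper: true when a JSON body is small enough for keepalive.
function canUseKeepalive(jsonBody: string): boolean {
  // Measure the UTF-8 encoded size; multi-byte characters count more than once.
  return new TextEncoder().encode(jsonBody).byteLength < KEEPALIVE_LIMIT_BYTES;
}

console.log(canUseKeepalive(JSON.stringify({ text: "BP 120/80, no fever" }))); // true
```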

Per-call timeout / retry overrides

Don't construct a second client for one slow call. Use withOverrides:

const slow = ohm.withOverrides({
  timeoutMs: 5 * 60_000,
  totalTimeoutMs: 6 * 60_000,
  maxRetries: 1,
});

await slow.audio.extract({ apiSlug, file: hourLongAudio });

The returned client is a thin clone — same auth, same baseUrl, same hooks — ready to use immediately.
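When picking override values, it helps to work out the longest a caller could wait. The sketch below assumes that timeoutMs applies per attempt, that maxRetries counts retries after the first attempt, that totalTimeoutMs caps the whole sequence, and that backoff delays are negligible. None of these semantics are confirmed on this page, and `worstCaseWaitMs` is a hypothetical helper.

```typescript
interface TimeoutConfig {
  timeoutMs: number;       // budget per attempt
  totalTimeoutMs?: number; // budget across all attempts (assumed semantics)
  maxRetries: number;      // retries after the first attempt (assumed semantics)
}

// Hypothetical helper: upper bound on how long one call can block.
function worstCaseWaitMs({ timeoutMs, totalTimeoutMs, maxRetries }: TimeoutConfig): number {
  const attempts = maxRetries + 1; // if the SDK counts total attempts instead, drop the + 1
  const uncapped = timeoutMs * attempts; // ignores backoff delays between attempts
  return totalTimeoutMs === undefined ? uncapped : Math.min(uncapped, totalTimeoutMs);
}

// The overrides above: two 5-minute attempts, capped at 6 minutes overall.
console.log(worstCaseWaitMs({ timeoutMs: 5 * 60_000, totalTimeoutMs: 6 * 60_000, maxRetries: 1 })); // 360000
```

Under the same assumptions, a 5 s timeout with two retries still blocks for up to 15 s, which is why a "fail fast" timeout needs maxRetries: 0 alongside it.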

Performance anti-patterns

Things customers do that hurt latency.

Anti-pattern	Better
`await ohm.audio.extract(...)` from a click handler with no `warmUp()`	Call `void ohm.warmUp()` at app boot
`for (const t of texts) await ohm.extract({ text: t })`	`ohm.extractBulk(texts.map(...))`
Setting `timeoutMs: 5_000` to "fail fast"	Pair with `maxRetries: 0` — otherwise you'll just retry 3× and wait 15 s
Reconstructing `new OHM(...)` per request	Construct once and reuse — pooled connections matter
Loose `Promise.all` of 100 calls	`ohm.extractBulk(items, { concurrency: 8 })` or `enableHttp2()`

Speed checklist for production

If you only do five things to make your OHM integration fast:

✅ Call void ohm.warmUp() at boot
✅ Use extractStream() whenever a UI is rendering (halves perceived latency)
✅ Use extractAsync() for audio > 5 min (never holds an HTTP connection open)
✅ Set totalTimeoutMs so a runaway upstream can't hang your UI
✅ Call enableHttp2() once on Node servers that fan out 10+ parallel calls