OHM Studio

Performance

Latency budgets per endpoint, when to use sync vs streaming vs async, the warmUp pattern, HTTP/2 multiplexing, and bulk extraction.

This page covers how to make the SDK fast, not just reliable. Everything here is opt-in: the defaults are tuned for "good enough", but a handful of changes can roughly halve perceived latency in real clinical apps.

Latency budgets per endpoint

Steady-state, single-call, warm-cache numbers from production traffic. Cold-start adds ~300 ms (TLS handshake + first-LLM warmup) to whichever call you make first — see warmUp() below.

| Endpoint | p50 | p95 | Notes |
|---|---|---|---|
| ohm.extract({ text }) | 1.8 s | 4.0 s | Pure LLM call. Scales linearly with output token count. |
| ohm.summarize({ text }) | 0.9 s | 2.0 s | Shorter output → faster than extract. |
| ohm.audio.transcribe({ file }) | 1.5 s / min audio | 2.5 s / min audio | STT only. Indian-English code-mix is the slow case. |
| ohm.audio.extract({ file }) | transcribe + extract | transcribe + extract | Sum of the transcribe and extract rows. |
| ohm.audio.extractStream({ file }) | first chunk in 1.5 s | full in 4.0 s | Transcript appears before extraction completes. |
| ohm.apis.list() | 80 ms | 150 ms | Cached server-side. |
| ohm.audio.jobs.create() | ~100 ms | 300 ms | Only enqueues; the actual work runs asynchronously. |
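
To budget a combined audio call, the per-minute transcription figure and the extract figure above can be added together. The sketch below is a rough estimate only: `estimateAudioExtractMs` is a hypothetical helper (not part of the SDK), it assumes transcription time scales linearly with audio length, and adding two p95 values gives a conservative upper estimate rather than a true p95.

```typescript
// Latency figures copied from the table above, in milliseconds.
const EXTRACT_MS = { p50: 1800, p95: 4000 };
// Transcription cost per minute of audio, in milliseconds.
const TRANSCRIBE_MS_PER_MIN = { p50: 1500, p95: 2500 };

type Percentile = "p50" | "p95";

// Hypothetical helper: rough budget for ohm.audio.extract (transcribe + extract).
// Assumes linear scaling with audio length and an already-warm connection.
function estimateAudioExtractMs(audioSeconds: number, pct: Percentile): number {
  const minutes = audioSeconds / 60;
  return Math.round(TRANSCRIBE_MS_PER_MIN[pct] * minutes + EXTRACT_MS[pct]);
}
```

For a two-minute recording this gives about 4.8 s at p50. Add the cold-start cost on top if it is the first call from a fresh process.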

When to use sync vs streaming vs async

| Audio length | Best mode | Call and rationale |
|---|---|---|
| 0 – 30 s | Sync | ohm.audio.extract({ file }). One round-trip, simplest UI. |
| 30 s – 5 min, UI present | Streaming | ohm.audio.extractStream({ file }). Transcript renders first; user reads while extraction runs. |
| 5 min – 30 min | Async polling | ohm.audio.extractAsync({ apiSlug, file }). Survives network blips. |
| > 30 min, mobile background, bulk replay | Async webhook | ohm.audio.jobs.create({ apiSlug, file, webhookUrl }). Device fires-and-forgets; your backend gets a signed callback. |
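
The table above can be encoded as a small decision function. This is a sketch under stated assumptions: `chooseMode`, `ModeInput`, and the mode names are hypothetical and not SDK exports, and the fallback for 30 s to 5 min clips with no UI (sync) is a guess, since the table does not cover that case.

```typescript
type Mode = "sync" | "streaming" | "async-polling" | "async-webhook";

interface ModeInput {
  audioSeconds: number;
  hasUi: boolean;       // a user is watching the screen while the call runs
  background?: boolean; // mobile background upload or bulk replay
}

// Hypothetical helper mirroring the rows of the table, checked longest-first.
function chooseMode({ audioSeconds, hasUi, background = false }: ModeInput): Mode {
  if (background || audioSeconds > 30 * 60) return "async-webhook";
  if (audioSeconds > 5 * 60) return "async-polling";
  if (audioSeconds > 30 && hasUi) return "streaming";
  // Assumption: short clips, and mid-length clips with no UI, use the sync call.
  return "sync";
}
```

Keeping the choice in one function makes it easy to adjust the thresholds once you have your own latency measurements.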

For the full async write-up, see /sdk/async-extraction.

Warm up the connection

The first OHM call from a fresh process pays:

  • DNS lookup (~30 ms)
  • TCP handshake (~30 ms)
  • TLS handshake (~150 ms)
  • First-LLM cold-start (~50–100 ms server-side)

Total cold cost: roughly 260–310 ms (call it ~300 ms) added to the first call. Subsequent calls reuse the TCP socket and feel snappy.

Fix: call warmUp() once at app boot. It does a cheap GET /api/health to establish TCP + TLS without waiting for an LLM.

const ohm = new OHM({ apiKey: process.env.OHM_API_KEY! });

// Fire-and-forget at boot.
void ohm.warmUp();

// ... later, when the doctor taps "Record":
await ohm.audio.extract({ apiSlug, file });   // socket already warm

warmUp() is safe to call multiple times and is a no-op when mock: true is set.

On React Native, call warmUp() inside your root <App /> component's useEffect(() => { ohm.warmUp() }, []). In Next.js server actions, call it once in your server's startup hook (or at import time, since modules are cached).
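
To illustrate why repeated warmUp() calls are harmless, here is a minimal sketch of an idempotent, best-effort warm-up. It is not the SDK's implementation: `createWarmUp` and `HealthCheck` are hypothetical names, and the health check is passed in as a function so the sketch does not depend on any particular runtime.

```typescript
// Any cheap request that forces the connection open, e.g. a GET to a health endpoint.
type HealthCheck = () => Promise<unknown>;

// Hypothetical sketch: the first call starts the ping, later calls reuse its promise.
function createWarmUp(ping: HealthCheck): () => Promise<void> {
  let inFlight: Promise<void> | undefined;
  return () => {
    if (!inFlight) {
      inFlight = ping().then(
        () => undefined,
        () => undefined, // best effort: a failed warm-up must never surface as an error
      );
    }
    return inFlight;
  };
}
```

In an app, `ping` would be something like `() => fetch(baseUrl + "/api/health")`. Because the returned promise never rejects, calling it with `void` at boot cannot produce an unhandled rejection.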

HTTP/2 multiplexing (Node only)

Node 18+ uses undici for fetch. By default it speaks HTTP/1.1 with connection pooling. That is fine for serial calls, but HTTP/1.1 carries only one in-flight request per connection, so every additional parallel call still needs its own connection.

When your Node app makes many concurrent OHM calls (server proxies, bulk replays, background workers), opt in to HTTP/2:

import { enableHttp2 } from "@ohm_studio/sdk/http2";

enableHttp2();  // call once at process start

// ... all `new OHM(...)` clients now multiplex over a single H2 socket.

Saves 50–100 ms per parallel call. Silently no-ops on browsers, RN, Cloudflare Workers, and Deno.

Bulk extraction with concurrency cap

For batch replays (10 000 historical transcripts), don't loop one-by-one or Promise.all the lot. Use extractBulk:

const results = await ohm.extractBulk(
  transcripts.map((text) => ({ apiSlug: "opd-clinic", text })),
  {
    concurrency: 8,
    onProgress: (done, total) => console.log(`${done} / ${total}`),
  },
);

const succeeded = results.filter((r) => r.ok);
const failed = results.filter((r) => !r.ok);

Default concurrency is 4 — enough to amortise round-trips without blowing your per-key rate limit. Partial failures don't fail the batch — each result is a discriminated union ({ ok: true, data } or { ok: false, error, input }).
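
For intuition, a concurrency cap with per-item results can be built from a small worker pool. The sketch below is not the SDK's extractBulk: `mapWithConcurrency`, `BulkResult`, and `BulkOptions` are hypothetical names, and the result shape simply mirrors the discriminated union described above.

```typescript
type BulkResult<I, O> =
  | { ok: true; data: O }
  | { ok: false; error: unknown; input: I };

interface BulkOptions {
  concurrency?: number;
  onProgress?: (done: number, total: number) => void;
}

// Hypothetical worker pool: at most `concurrency` calls in flight, results in input order.
async function mapWithConcurrency<I, O>(
  inputs: readonly I[],
  worker: (input: I) => Promise<O>,
  { concurrency = 4, onProgress }: BulkOptions = {},
): Promise<BulkResult<I, O>[]> {
  const results: BulkResult<I, O>[] = new Array(inputs.length);
  let next = 0; // index of the next unclaimed input
  let done = 0; // number of finished items, successful or not

  // Each runner keeps claiming the next index until the queue is empty.
  const runner = async (): Promise<void> => {
    while (next < inputs.length) {
      const i = next++;
      const input = inputs[i];
      try {
        results[i] = { ok: true, data: await worker(input) };
      } catch (error) {
        // A failure is recorded, not thrown, so the rest of the batch keeps going.
        results[i] = { ok: false, error, input };
      }
      done++;
      onProgress?.(done, inputs.length);
    }
  };

  const runnerCount = Math.max(1, Math.min(concurrency, inputs.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}
```

Because each result carries its own `ok` flag, checking `r.ok` narrows the type: the success branch exposes `data`, and the failure branch exposes `error` plus the original `input`, which makes retrying only the failed items straightforward.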

keepalive for short POSTs

The SDK automatically passes keepalive: true to fetch on JSON bodies under 60 KB (extract, summarize, insights). Saves ~30 ms on every call after the first by reusing the TCP / TLS socket — no opt-in needed.

Multipart audio uploads skip keepalive (browser cap is 64 KB total body).

Per-call timeout / retry overrides

Don't construct a second client for one slow call. Use withOverrides:

const slow = ohm.withOverrides({
  timeoutMs: 5 * 60_000,
  totalTimeoutMs: 6 * 60_000,
  maxRetries: 1,
});

await slow.audio.extract({ apiSlug, file: hourLongAudio });

The returned client is a thin clone — same auth, same baseUrl, same hooks — ready to use immediately.
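
As an illustration of what a "thin clone" can mean, the sketch below copies the configuration and applies the overrides on top, leaving the original client untouched. `SketchClient`, `ClientConfig`, and `Overrides` are hypothetical stand-ins; the real SDK's internals and config fields may differ.

```typescript
interface ClientConfig {
  apiKey: string;
  baseUrl: string;
  timeoutMs: number;
  totalTimeoutMs: number;
  maxRetries: number;
}

type Overrides = Partial<Pick<ClientConfig, "timeoutMs" | "totalTimeoutMs" | "maxRetries">>;

// Hypothetical stand-in for the SDK client, showing only the cloning behaviour.
class SketchClient {
  constructor(readonly config: Readonly<ClientConfig>) {}

  // Returns a new client; the original keeps its own settings.
  withOverrides(overrides: Overrides): SketchClient {
    return new SketchClient({ ...this.config, ...overrides });
  }
}
```

The point of the pattern is that the shared client's timeouts never change, so one slow call cannot loosen the limits for every other caller.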

Performance anti-patterns

Things customers do that hurt latency.

| Anti-pattern | Better |
|---|---|
| await ohm.audio.extract(...) from a click handler with no warmUp() | Call void ohm.warmUp() at app boot |
| for (const t of texts) await ohm.extract({ text: t }) | ohm.extractBulk(texts.map(...)) |
| Setting timeoutMs: 5_000 to "fail fast" | Pair with maxRetries: 0; otherwise you'll just retry 3× and wait 15 s |
| Reconstructing new OHM(...) per request | Construct once and reuse; pooled connections matter |
| Loose Promise.all of 100 calls | extractBulk(items, { concurrency: 8 }) or enableHttp2() |
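
The arithmetic behind the timeoutMs row: each attempt can run for the full timeout, so the worst case is the timeout multiplied by the number of attempts. The sketch assumes three attempts in total, which is what the 15 s figure implies, and ignores any backoff delay between attempts; `worstCaseWaitMs` is a hypothetical helper, not an SDK function.

```typescript
// Worst-case wall-clock wait if every attempt runs into the timeout.
// Assumption: no backoff delay between attempts.
function worstCaseWaitMs(timeoutMs: number, totalAttempts: number): number {
  return timeoutMs * Math.max(1, totalAttempts);
}

// Three attempts at 5 s each: the "fail fast" setting still waits 15 s.
const withRetries = worstCaseWaitMs(5_000, 3);
// With maxRetries: 0 only the first attempt runs, so the wait is capped at 5 s.
const failFast = worstCaseWaitMs(5_000, 1);
```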

Speed checklist for production

If you only do five things to make your OHM integration fast:

  1. ✅ Call void ohm.warmUp() at boot
  2. ✅ Use extractStream() whenever a UI is rendering (halves perceived latency)
  3. ✅ Use extractAsync() for audio > 5 min (never holds an HTTP connection open)
  4. ✅ Set totalTimeoutMs so a runaway upstream can't hang your UI
  5. ✅ Call enableHttp2() once on Node servers that fan out 10+ parallel calls

See also