OHM Studio

Scale & throughput

Enterprise patterns for running OHM under heavy load.

OHM Studio handles the operational concerns most clinical AI integrations get wrong. This page documents what you can rely on, and the few patterns you should adopt on your side.

What OHM does for you

| Concern | How it's handled |
| --- | --- |
| API resolution (look up /extract/:slug) | Cached in Redis with 5-minute TTL; invalidated on Publish / Update / Archive. Hit rate >99% at production volumes. |
| API key validation | bcrypt-hashed at rest, validated at the edge, last-used timestamp written async. |
| Rate limiting | Redis-backed sliding window per key, with optional per-API override. 429 responses include retry-after. |
| Org suspension | Instant lockout: every Studio call returns 401 the moment the org is suspended; SDK auto-rotates JWT. |
| Large audio uploads | Multer streams to disk, then to STT; the file is never buffered fully in memory. SDK rejects files > 500 MB synchronously (OHMValidationError); server-side cap configurable via STUDIO_MAX_AUDIO_BYTES. Recordings > 55 min are split into chunks server-side, transcribed in parallel, merged. |
| LLM retries | Three attempts with temperature sweep + model fallback. Customer never sees provider names in the error. |
| Telemetry | Every call writes a StudioInvocation row (status, latency, tokens, error). Surface in Studio Logs tab or query directly via the management API. |
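
To make the rate-limiting row concrete, here is a minimal in-memory sketch of a per-key sliding window. It illustrates the general technique only: OHM's limiter is Redis-backed and its internals are not documented on this page, so the class name, method, and return convention below are hypothetical.

```typescript
// In-memory sliding-window limiter: a sketch of the technique, not OHM's code.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns 0 when the call is allowed, otherwise the number of seconds to
  // wait, which is the kind of value a 429 would carry as retry-after.
  check(key: string, nowMs: number = Date.now()): number {
    const windowStart = nowMs - this.windowMs;
    // Keep only the hits that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      // The oldest remaining hit leaves the window at oldest + windowMs.
      const oldest = recent[0] ?? nowMs;
      return Math.ceil((oldest + this.windowMs - nowMs) / 1000);
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return 0;
  }
}
```

A shared store such as Redis is what makes the window hold across multiple API instances; a per-process map like this one only limits a single process.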

SDK patterns for high RPS

1 · Reuse one client per process

// ✅ correct: one client, many calls
const ohm = new OHM(process.env.OHM_API_KEY!);

export async function handler(req: Request) {
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

// ❌ wrong: new client per request, blows your connection pool
export async function handler(req: Request) {
  const ohm = new OHM(process.env.OHM_API_KEY!);
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

2 · Use streaming when you have a UI

audio.extractStream halves the perceived latency for users. See Streaming.

3 · Tune maxRetries + timeoutMs to your SLO

const ohm = new OHM({
  apiKey: process.env.OHM_API_KEY!,
  timeoutMs: 30_000,        // tighter than the 60s default
  maxRetries: 1,            // fewer retries for user-facing latency
});
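
When picking these two values, it can help to work out the worst-case time a single call may take. The sketch below assumes that each attempt gets the full timeoutMs and that maxRetries counts retries after the first attempt; the SDK's actual retry and backoff behavior is not documented on this page, and backoffMs is a hypothetical fixed delay between attempts.

```typescript
// Worst-case wall-clock time for one call, under the assumptions stated above.
function worstCaseLatencyMs(timeoutMs: number, maxRetries: number, backoffMs = 0): number {
  const attempts = maxRetries + 1; // the first attempt plus each retry
  return attempts * timeoutMs + maxRetries * backoffMs;
}

// With the settings shown above: two attempts of 30 s each.
const budgetMs = worstCaseLatencyMs(30_000, 1); // 60000
```

If that figure exceeds your SLO, lower timeoutMs or maxRetries until it fits.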

Patterns for very high RPS (>50/s sustained)

| Need | What to do |
| --- | --- |
| Cache identical extractions | Hash the transcript, key the cache yourself, check before calling. OHM doesn't dedupe across requests. |
| Batch ingestion | Use a queue (BullMQ, Pub/Sub) on your side; process in the background, store results, push results to the user via your existing channel. |
| Backpressure on the customer side | Surface the SDK's OHMRateLimitError.retryAfterSec and pause your worker until then; don't hot-loop. |
| Custom rate limit per key | Set rateLimitPerMinute on the key (Studio → Keys → edit) so a runaway worker can't drain your monthly budget. |
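
The caching and backpressure rows can be combined in one small wrapper on your side. This is a minimal sketch, not SDK code: ExtractFn stands in for whatever function calls ohm.extract in your app, the helper names are hypothetical, and the only assumption about the SDK is that a rate-limit error carries a numeric retryAfterSec field, as named in the table.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for your own function that calls ohm.extract.
type ExtractFn<T> = (text: string) => Promise<T>;

// Structural check, so the sketch does not depend on the SDK's error class.
function retryAfterSecOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "retryAfterSec" in err) {
    const value = (err as { retryAfterSec: unknown }).retryAfterSec;
    return typeof value === "number" ? value : undefined;
  }
  return undefined;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wait for the server-advised delay before retrying instead of hot-looping.
async function withBackpressure<T>(call: () => Promise<T>, maxWaits = 3): Promise<T> {
  for (let waits = 0; ; waits++) {
    try {
      return await call();
    } catch (err) {
      const retryAfterSec = retryAfterSecOf(err);
      // Rethrow anything that is not a rate limit, or once we have waited enough.
      if (retryAfterSec === undefined || waits >= maxWaits) throw err;
      await sleep(retryAfterSec * 1000);
    }
  }
}

// Cache results by the SHA-256 of the transcript; concurrent identical
// requests share one in-flight promise.
function makeCachedExtract<T>(extract: ExtractFn<T>): ExtractFn<T> {
  const cache = new Map<string, Promise<T>>();
  return (text) => {
    const key = createHash("sha256").update(text, "utf8").digest("hex");
    const hit = cache.get(key);
    if (hit) return hit;
    const pending = withBackpressure(() => extract(text));
    cache.set(key, pending);
    // Do not cache failures: drop the entry so a later call can retry.
    pending.catch(() => cache.delete(key));
    return pending;
  };
}
```

The Map here is unbounded and per-process; at the volumes this section targets you would more likely key a shared store with a TTL, using the same hash.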

Big audio (multi-minute consults)

OHM's transcribe path streams the file from disk straight to STT and back. The full multipart body never sits in memory. Practical limits:

  • Per-request size cap: 500 MB at the SDK (OHMValidationError thrown synchronously before upload starts). Server cap configurable via STUDIO_MAX_AUDIO_BYTES.
  • Duration: any length. The STT provider has a documented ~1-hour per-file ceiling, so recordings longer than 55 min are split server-side (ffmpeg stream-copy), submitted in parallel, and the transcripts merged. Responses (sync + streaming + async-job) surface chunked: true + chunkCount so your UI can flag the boundary.
  • Per-request wall-clock: 30 minutes. Caddy + the API are configured to hold streaming uploads that long, so for audio uploads keep the client timeout at 5 minutes or more, even if you tighten it for text-only calls.
  • Compression: WebM-Opus at 32-48 kbps is plenty for clinical speech and keeps files small. Avoid uncompressed WAV.

// MediaRecorder with WebM-Opus, already well suited to long consults:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const rec = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
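
The size guidance above is easy to sanity-check with arithmetic. The sketch below estimates file size from bitrate and duration and compares it with the 500 MB SDK cap; the function and constant names are illustrative, and reading "500 MB" as decimal megabytes is an assumption.

```typescript
// Estimate the encoded size of a constant-bitrate recording.
function estimateAudioBytes(bitsPerSecond: number, durationMinutes: number): number {
  return (bitsPerSecond / 8) * durationMinutes * 60;
}

// Assumption: "500 MB" is read as decimal megabytes; the SDK may count differently.
const SDK_CAP_BYTES = 500 * 1000 * 1000;

// One hour of Opus at 48 kbps, the top of the range suggested above.
const opusHourBytes = estimateAudioBytes(48_000, 60); // 21_600_000, about 21.6 MB

// One hour of uncompressed 16-bit stereo PCM at 44.1 kHz.
const wavHourBytes = estimateAudioBytes(44_100 * 16 * 2, 60); // 635_040_000, about 635 MB

const opusFits = opusHourBytes <= SDK_CAP_BYTES; // true
const wavFits = wavHourBytes <= SDK_CAP_BYTES; // false
```

Under these assumptions an hour of Opus is a small fraction of the cap, while an hour of CD-quality stereo WAV would be rejected before upload starts.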

Webhooks vs polling

For long-running pipelines, set up a webhook: OHM POSTs invocation.success / invocation.failed events to your URL, each signed with HMAC-SHA256, so no polling is required.

// In Studio: API → Settings → Webhooks → Add
//   url:    https://your-backend/ohm-events
//   events: invocation.success, invocation.failed
//
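// On your server, verify the signature against the raw request body:
import { createHmac, timingSafeEqual } from "node:crypto";
import express from "express";

const app = express();

app.post("/ohm-events", express.raw({ type: "application/json" }), (req, res) => {
  const sig = req.header("OHM-Signature")?.replace("sha256=", "") ?? "";
  const expected = createHmac("sha256", process.env.OHM_WEBHOOK_SECRET!)
    .update(req.body) // a Buffer, because express.raw() skips JSON parsing
    .digest("hex");
  // Compare in constant time; timingSafeEqual throws on unequal lengths, so check first.
  const sigBuf = Buffer.from(sig, "utf8");
  const expectedBuf = Buffer.from(expected, "utf8");
  if (sigBuf.length !== expectedBuf.length || !timingSafeEqual(sigBuf, expectedBuf)) {
    res.status(401).end();
    return;
  }
  const event = JSON.parse(req.body.toString("utf8"));
  // event: { event, organizationId, projectId, apiId, apiSlug, timestamp, data }
  res.status(204).end();
});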
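
To exercise the handler in tests without waiting for a real delivery, you can sign a sample payload yourself. The helpers below are a sketch with hypothetical names; they assume the header value is "sha256=" followed by the hex HMAC-SHA256 of the raw body, which is the format the handler above strips and compares.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Build the header value for a given raw body (assumed format: "sha256=<hex>").
function signPayload(secret: string, rawBody: string | Buffer): string {
  const digest = createHmac("sha256", secret).update(rawBody).digest("hex");
  return `sha256=${digest}`;
}

// Check a received header against the raw body in constant time.
function verifySignature(
  secret: string,
  rawBody: string | Buffer,
  header: string | undefined,
): boolean {
  if (!header) return false;
  const received = Buffer.from(header, "utf8");
  const expected = Buffer.from(signPayload(secret, rawBody), "utf8");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Example: sign a sample event, then confirm that tampering is detected.
const sampleBody = JSON.stringify({ event: "invocation.success", apiSlug: "opd" });
const sampleHeader = signPayload("test-secret", sampleBody);
const accepted = verifySignature("test-secret", sampleBody, sampleHeader);
const rejected = verifySignature("test-secret", sampleBody + " ", sampleHeader);
```

Signing the raw bytes matters: re-serializing the parsed JSON can change whitespace or key order and produce a different digest.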

Self-host scaling

If you self-host the OHM API (apps/api), the only services you need to scale horizontally are the API itself and Redis. Postgres is pooled via Prisma; replication is recommended once you cross 10k extractions/day.

Caching is a server-side detail

OHM does the heavy lifting. Your SDK code stays the same: same calls, same response shapes. The only knobs you tune are maxRetries and timeoutMs, set to match your own SLO.