Scale & throughput

OHM Studio handles the operational concerns most clinical AI integrations get wrong. This page documents what you can rely on, and the few patterns you should adopt on your side.

What OHM does for you

Concern	How it's handled
API resolution (look up `/extract/:slug`)	Cached in Redis with 5-minute TTL; invalidated on Publish / Update / Archive. Hit rate >99% at production volumes.
API key validation	bcrypt-hashed at rest, validated at the edge, last-used timestamp written async.
Rate limiting	Redis-backed sliding window per key, with optional per-API override. 429 includes `retry-after`.
Org suspension	Instant lockout — every Studio call returns 401 the moment the org is suspended; SDK auto-rotates JWT.
Large audio uploads	Multer streams to disk, then to STT — never buffered fully in memory. SDK rejects files > 500 MB synchronously (`OHMValidationError`); server-side cap configurable via `STUDIO_MAX_AUDIO_BYTES`. Recordings > 55 min are split into chunks server-side, transcribed in parallel, merged.
LLM retries	Three attempts with temperature sweep + model fallback. Customer never sees provider names in the error.
Telemetry	Every call writes a `StudioInvocation` row (status, latency, tokens, error). View these in the Studio Logs tab, or query them directly via the management API.
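To make the rate-limiting row concrete, here is a minimal in-memory sketch of a per-key sliding window. It is not OHM's implementation (the table only says the real one is Redis-backed); the `SlidingWindowLimiter` class and its `check` method are hypothetical names, used only to show how a window produces a `retry-after` value.

```typescript
// Illustrative only: an in-memory sliding-window limiter, one window per key.
export class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records a request at `now` if the key is under its limit.
  // When rejected, retryAfterSec says when the oldest hit leaves the window.
  check(key: string, now: number = Date.now()): { allowed: boolean; retryAfterSec: number } {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      const oldest = recent[0] ?? now;
      const retryAfterSec = Math.ceil((oldest + this.windowMs - now) / 1000);
      return { allowed: false, retryAfterSec };
    }
    recent.push(now);
    this.hits.set(key, recent);
    return { allowed: true, retryAfterSec: 0 };
  }
}
```

With a limit of 2 per 60 seconds, a third call two seconds after the first is rejected with `retryAfterSec` of 58, which is the kind of value a client should wait on before retrying.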

SDK patterns for high RPS

1 · Reuse one client per process

// ✅ correct — one client, many calls
const ohm = new OHM(process.env.OHM_API_KEY!);

export async function handler(req: Request) {
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

// ❌ wrong — new client per request, blows your connection pool
export async function handler(req: Request) {
  const ohm = new OHM(process.env.OHM_API_KEY!);
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

2 · Use streaming when you have a UI

`audio.extractStream` roughly halves perceived latency for users, because partial results reach the UI while the extraction is still running. See Streaming.

3 · Tune `maxRetries` + `timeoutMs` to your SLO

const ohm = new OHM({
  apiKey: process.env.OHM_API_KEY!,
  timeoutMs: 30_000,        // tighter than the 60s default
  maxRetries: 1,            // fewer retries for user-facing latency
});
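When picking these two values, it can help to work out the worst case a caller might wait. The sketch below assumes attempts run one after another and each may take up to `timeoutMs`; the SDK's actual backoff schedule is not documented here, so `backoffMs` is a hypothetical parameter you would set from your own measurements.

```typescript
// Upper bound on wall-clock time for one logical call, under the assumption
// that retries are sequential and each attempt can use the full timeout.
export function worstCaseLatencyMs(
  timeoutMs: number,
  maxRetries: number,
  backoffMs = 0, // hypothetical fixed pause between attempts
): number {
  const attempts = maxRetries + 1; // the first try plus each retry
  return attempts * timeoutMs + maxRetries * backoffMs;
}

// The configuration above: 30 s timeout, 1 retry.
console.log(worstCaseLatencyMs(30_000, 1)); // 60000
```

If that bound is above your SLO, lower `timeoutMs` or `maxRetries` until it fits.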

Patterns for very high RPS (>50/s sustained)

Need	What to do
Cache identical extractions	Hash the transcript, key the cache yourself, check before calling. OHM doesn't dedupe across requests.
Batch ingestion	Use a queue (BullMQ, Pub/Sub) on your side; process in the background, store results, push results to the user via your existing channel.
Backpressure on the customer side	Surface the SDK's `OHMRateLimitError.retryAfterSec` and pause your worker until then; don't hot-loop.
Custom rate limit per key	Set `rateLimitPerMinute` on the key (Studio → Keys → edit) so a runaway worker can't drain your monthly budget.
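The caching and backpressure rows can be sketched as follows. This is one possible shape, not part of the SDK: `transcriptKey`, `withCache`, and `retryDelayMs` are hypothetical helpers, the cache is a plain in-process `Map` (swap in Redis or similar for multiple workers), and the only SDK detail assumed is that the rate-limit error exposes a numeric `retryAfterSec`, as the table states.

```typescript
import { createHash } from "node:crypto";

// Stand-in for a bound SDK call such as (text) => ohm.extract({ apiSlug, text }).
type ExtractFn<T> = (text: string) => Promise<T>;

// Stable cache key: SHA-256 over the API slug and the transcript text.
export function transcriptKey(apiSlug: string, text: string): string {
  return createHash("sha256").update(apiSlug).update("\0").update(text).digest("hex");
}

// Wraps an extract function with an in-memory cache. The promise itself is
// cached, so concurrent identical requests share a single upstream call.
export function withCache<T>(apiSlug: string, extract: ExtractFn<T>): ExtractFn<T> {
  const cache = new Map<string, Promise<T>>();
  return (text) => {
    const key = transcriptKey(apiSlug, text);
    let hit = cache.get(key);
    if (!hit) {
      hit = extract(text).catch((err) => {
        cache.delete(key); // failures are not cached
        throw err;
      });
      cache.set(key, hit);
    }
    return hit;
  };
}

// Returns how long a worker should pause after an error, in milliseconds,
// or null when the error carries no usable retryAfterSec.
export function retryDelayMs(err: unknown): number | null {
  if (typeof err !== "object" || err === null || !("retryAfterSec" in err)) return null;
  const sec = (err as { retryAfterSec: unknown }).retryAfterSec;
  return typeof sec === "number" && Number.isFinite(sec) && sec >= 0 ? sec * 1000 : null;
}
```

In a queue worker, you would catch the error from the wrapped call, pass it to `retryDelayMs`, and sleep for the returned duration before taking the next job, rather than retrying immediately.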

Big audio (multi-minute consults)

OHM's transcribe path streams the file from disk straight to STT and back. The full multipart body never sits in memory. Practical limits:

Per-request size cap: 500 MB at the SDK (`OHMValidationError` thrown synchronously before the upload starts). The server-side cap is configurable via `STUDIO_MAX_AUDIO_BYTES`.
Duration: any length. The STT provider has a documented ~1-hour per-file ceiling, so recordings longer than 55 min are split server-side (ffmpeg stream-copy), submitted in parallel, and the transcripts merged. Responses (sync, streaming, and async-job) include `chunked: true` and `chunkCount` so your UI can flag the boundary.
Per-request wall-clock: 30 minutes. Caddy and the API are configured to hold streaming uploads that long; for audio requests, don't set the client timeout below 5 minutes.
Compression: WebM-Opus at 32-48 kbps is plenty for clinical speech and keeps files small. Avoid uncompressed WAV.

// MediaRecorder with WebM-Opus, already a good fit for long consults:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const rec = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
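As a rough sanity check on the limits above, file size follows directly from bitrate and duration. The sketch below is an estimate only: `estimateUpload` is a hypothetical helper, it ignores container overhead, and it assumes "500 MB" means decimal megabytes (the source does not say which).

```typescript
const SDK_MAX_BYTES = 500 * 1_000_000; // assumption: decimal megabytes
const CHUNK_THRESHOLD_MIN = 55;        // recordings longer than this are split server-side

// Estimates upload size from audio bitrate (kilobits per second) and duration.
export function estimateUpload(kbps: number, minutes: number) {
  const bytes = ((kbps * 1000) / 8) * minutes * 60;
  return {
    bytes,
    megabytes: bytes / 1_000_000,
    exceedsSdkCap: bytes > SDK_MAX_BYTES,
    willBeChunked: minutes > CHUNK_THRESHOLD_MIN,
  };
}

// One hour of Opus at 48 kbps: 6,000 bytes/s for 3,600 s.
console.log(estimateUpload(48, 60).megabytes); // 21.6

// One hour of 16-bit 44.1 kHz stereo WAV (about 1411 kbps) is over the cap.
console.log(estimateUpload(1411.2, 60).exceedsSdkCap); // true
```

The gap between those two results is the practical reason to prefer Opus over uncompressed WAV for long consults.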

Webhooks vs polling

For long-running pipelines, set up a webhook: OHM POSTs `invocation.success` / `invocation.failed` events to your URL, signed with HMAC-SHA256. No polling required.

// In Studio: API → Settings → Webhooks → Add
//   url:    https://your-backend/ohm-events
//   events: invocation.success, invocation.failed
//
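// On your server, verify the signature against the raw body:
import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";

const app = express();

app.post("/ohm-events", express.raw({ type: "application/json" }), (req, res) => {
  // express.raw leaves req.body as a Buffer, which is what the HMAC must cover
  const sig = req.header("OHM-Signature")?.replace("sha256=", "") ?? "";
  const expected = createHmac("sha256", process.env.OHM_WEBHOOK_SECRET!)
    .update(req.body)
    .digest("hex");
  // Constant-time comparison; the length check keeps timingSafeEqual from throwing
  const a = Buffer.from(sig, "utf8");
  const b = Buffer.from(expected, "utf8");
  if (a.length !== b.length || !timingSafeEqual(a, b)) {
    res.status(401).end();
    return;
  }
  const event = JSON.parse(req.body.toString("utf8"));
  // event: { event, organizationId, projectId, apiId, apiSlug, timestamp, data }
  res.status(204).end();
});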
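To exercise the endpoint locally before real events arrive, you can sign a test payload yourself. This sketch assumes the header value is `sha256=` followed by the hex HMAC-SHA256 of the raw request body, which is what the verification code above implies; `signBody` is a hypothetical helper, not an SDK function.

```typescript
import { createHmac } from "node:crypto";

// Produces a header value in the same format the handler above expects.
export function signBody(rawBody: string | Buffer, secret: string): string {
  const hex = createHmac("sha256", secret).update(rawBody).digest("hex");
  return `sha256=${hex}`;
}

// Example: build a signed test request body and header.
const body = JSON.stringify({ event: "invocation.success", apiSlug: "opd", data: {} });
const header = signBody(body, "test-secret");
console.log(header.startsWith("sha256=")); // true
```

Send `body` unchanged as the request payload with `header` in `OHM-Signature`; re-serializing the JSON on the way would change the bytes and invalidate the signature.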

If you self-host the OHM API (apps/api), the only services you need to scale horizontally are the API itself and Redis. Postgres is pooled via Prisma; replication is recommended once you cross 10k extractions/day.

Caching is a server-side detail

OHM does the heavy lifting. Your SDK code stays the same: same calls, same response shapes. The only knobs you tune are `maxRetries` and `timeoutMs`, to match your own SLO.