Scale & throughput
Enterprise patterns for running OHM under heavy load.
OHM Studio handles the operational concerns most clinical AI integrations get wrong. This page documents what you can rely on, and the few patterns you should adopt on your side.
What OHM does for you
| Concern | How it's handled |
|---|---|
| API resolution (look up /extract/:slug) | Cached in Redis with 5-minute TTL; invalidated on Publish / Update / Archive. Hit rate >99% at production volumes. |
| API key validation | bcrypt-hashed at rest, validated at the edge, last-used timestamp written async. |
| Rate limiting | Redis-backed sliding window per key, with optional per-API override. 429 includes retry-after. |
| Org suspension | Instant lockout — every Studio call returns 401 the moment the org is suspended; SDK auto-rotates JWT. |
| Large audio uploads | Multer streams to disk, then to STT — never buffered fully in memory. SDK rejects files > 500 MB synchronously (OHMValidationError); server-side cap configurable via STUDIO_MAX_AUDIO_BYTES. Recordings > 55 min are split into chunks server-side, transcribed in parallel, merged. |
| LLM retries | Three attempts with temperature sweep + model fallback. Customer never sees provider names in the error. |
| Telemetry | Every call writes a StudioInvocation row (status, latency, tokens, error). Surface in Studio Logs tab or query directly via the management API. |
SDK patterns for high RPS
1 · Reuse one client per process
// ✅ correct: one client, many calls
const ohm = new OHM(process.env.OHM_API_KEY!);
export async function handler(req: Request) {
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

// ❌ wrong: new client per request, blows your connection pool
export async function handler(req: Request) {
  const ohm = new OHM(process.env.OHM_API_KEY!);
  return ohm.extract({ apiSlug: "opd", text: req.body.text });
}

2 · Use streaming when you have a UI
audio.extractStream halves the perceived latency for users. See
Streaming.
3 · Tune maxRetries + timeoutMs to your SLO
const ohm = new OHM({
  apiKey: process.env.OHM_API_KEY!,
  timeoutMs: 30_000, // tighter than the 60s default
  maxRetries: 1, // fewer retries for user-facing latency
});

Patterns for very high RPS (>50/s sustained)
| Need | What to do |
|---|---|
| Cache identical extractions | Hash the transcript, key the cache yourself, check before calling. OHM doesn't dedupe across requests. |
| Batch ingestion | Use a queue (BullMQ, Pub/Sub) on your side; process in the background, store results, push results to the user via your existing channel. |
| Backpressure on the customer side | Surface the SDK's OHMRateLimitError.retryAfterSec and pause your worker until then; don't hot-loop. |
| Custom rate limit per key | Set rateLimitPerMinute on the key (Studio → Keys → edit) so a runaway worker can't drain your monthly budget. |
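The caching and backpressure rows above can be sketched roughly as follows. This is a minimal illustration, not SDK code: `cacheKey`, `withCache`, `withBackpressure`, and the `RateLimited` shape are hypothetical helpers, and the only thing taken from this page is that a rate-limit error carries `retryAfterSec`. The in-memory `Map` stands in for whatever cache you already run (Redis, for example).

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration; the real SDK types may differ.
type ExtractFn<T> = (text: string) => Promise<T>;
interface RateLimited { retryAfterSec: number }

// Stable cache key: SHA-256 over the API slug and the transcript.
function cacheKey(apiSlug: string, text: string): string {
  return createHash("sha256").update(apiSlug).update("\0").update(text).digest("hex");
}

// Wraps an extract call so identical transcripts share one result.
// Concurrent identical requests share the same in-flight promise;
// failed calls are evicted so they can be retried later.
function withCache<T>(
  apiSlug: string,
  extract: ExtractFn<T>,
  cache: Map<string, Promise<T>> = new Map(),
): ExtractFn<T> {
  return (text) => {
    const key = cacheKey(apiSlug, text);
    let hit = cache.get(key);
    if (!hit) {
      hit = extract(text).catch((err) => {
        cache.delete(key);
        throw err;
      });
      cache.set(key, hit);
    }
    return hit;
  };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Duck-typed check so the sketch does not depend on the SDK's error class.
function isRateLimited(err: unknown): err is RateLimited {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { retryAfterSec?: unknown }).retryAfterSec === "number"
  );
}

// Pauses for the advertised interval instead of hot-looping.
// Gives up after maxWaits pauses and rethrows the last error.
async function withBackpressure<T>(
  call: () => Promise<T>,
  maxWaits = 3,
  pause: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRateLimited(err) || attempt >= maxWaits) throw err;
      await pause(err.retryAfterSec * 1000);
    }
  }
}
```

In a worker, the two would compose as something like `withBackpressure(() => cachedExtract(transcript))`, where `cachedExtract` is built once per process with `withCache`. Hashing the slug together with the transcript keeps results from different APIs from colliding.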
Big audio (multi-minute consults)
OHM's transcribe path streams the file from disk straight to STT and back. The full multipart body never sits in memory. Practical limits:
- Per-request size cap: 500 MB at the SDK (OHMValidationError thrown synchronously before upload starts). Server cap configurable via STUDIO_MAX_AUDIO_BYTES.
- Duration: any length. The STT provider has a documented ~1-hour per-file ceiling, so recordings longer than 55 min are split server-side (ffmpeg stream-copy), submitted in parallel, and the transcripts merged. Responses (sync + streaming + async-job) surface chunked: true + chunkCount so your UI can flag the boundary.
- Per-request wall-clock: 30 minutes. Caddy + the API are configured to hold streaming uploads that long; don't reduce client timeout below 5 minutes.
- Compression: WebM-Opus at 32-48 kbps is plenty for clinical speech and keeps files small. Avoid uncompressed WAV.
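To see why the compression advice matters against the 500 MB cap, here is a rough size estimate for a one-hour consult. The helper names are hypothetical, the bitrates are treated as constant, and the cap is assumed to be decimal megabytes; container overhead is ignored.

```typescript
// Approximate size in bytes of a constant-bitrate recording.
function estimateBytes(bitsPerSecond: number, seconds: number): number {
  return (bitsPerSecond / 8) * seconds;
}

// Bitrate of uncompressed PCM audio (what a WAV file carries).
function pcmBitsPerSecond(sampleRateHz: number, bitDepth: number, channels: number): number {
  return sampleRateHz * bitDepth * channels;
}

// Assumption: the 500 MB SDK cap is counted in decimal megabytes.
const SDK_CAP_BYTES = 500 * 1000 * 1000;

function fitsSdkCap(bytes: number): boolean {
  return bytes <= SDK_CAP_BYTES;
}

const ONE_HOUR = 60 * 60;

// Opus at 48 kbps: 21_600_000 bytes, about 21.6 MB.
const opusHour = estimateBytes(48_000, ONE_HOUR);

// CD-quality WAV (44.1 kHz, 16-bit, stereo): 635_040_000 bytes, about 635 MB.
const wavHour = estimateBytes(pcmBitsPerSecond(44_100, 16, 2), ONE_HOUR);

console.log(fitsSdkCap(opusHour)); // true
console.log(fitsSdkCap(wavHour)); // false
```

Under these assumptions an hour of Opus is a small fraction of the cap, while an hour of CD-quality WAV exceeds it before the upload even starts.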
// MediaRecorder default — already optimal for long consults:
const rec = new Recorder({ mimeType: "audio/webm;codecs=opus" });

Webhooks vs polling
For long-running pipelines, set up a webhook — OHM
POSTs invocation.success / invocation.failed events to your URL with
HMAC-SHA256 signatures. No polling required.
// In Studio: API → Settings → Webhooks → Add
// url: https://your-backend/ohm-events
// events: invocation.success, invocation.failed
//
// On your server, verify:
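import { createHmac, timingSafeEqual } from "node:crypto";
import express from "express";

const app = express();

// express.raw keeps req.body as a Buffer, so the HMAC covers the exact bytes sent.
app.post("/ohm-events", express.raw({ type: "application/json" }), (req, res) => {
  const sig = Buffer.from(req.header("OHM-Signature")?.replace("sha256=", "") ?? "", "hex");
  const expected = createHmac("sha256", process.env.OHM_WEBHOOK_SECRET!)
    .update(req.body)
    .digest();
  // Constant-time compare; the length check stops timingSafeEqual from throwing.
  if (sig.length !== expected.length || !timingSafeEqual(sig, expected)) {
    res.status(401).end();
    return;
  }
  const event = JSON.parse(req.body.toString("utf8"));
  // event: { event, organizationId, projectId, apiId, apiSlug, timestamp, data }
  res.status(204).end();
});

Self-host scaling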
If you self-host the OHM API (apps/api), the only services you need to
scale horizontally are the API itself and Redis. Postgres is pooled via
Prisma; replication is recommended once you cross 10k extractions/day.
Caching is a server-side detail
OHM does the heavy lifting. Your SDK code stays the same — same calls,
same response shapes. The only knob you tune is maxRetries /
timeoutMs for your own SLO.