Deploy in your cluster
Install once in your environment, on-prem or VPC. Your data and workloads stay entirely within your infrastructure.
- Full control over your hardware
- No vendor lock-in
- Works with heterogeneous GPU fleets
Intelligent inference platform
Tandemn is an intelligent inference platform for your infrastructure. Koi orchestrates. Tuna minimizes cost. Orca maximizes throughput. Deploy once, and let the platform handle the rest.
GPU time is expensive. Idle GPU time is a waste.
Cold starts: you pay to stay ready
Complex MLOps overhead
Traffic spikes force overprovisioning
Poor batching leaves GPUs idle
Tuna keeps serverless warm while spot provisions, so your first request is instant
We handle model sharding, KV cache, and orchestration; you just call the API (see the sketch below)
Dynamic routing across spot and serverless handles any traffic pattern
Intelligent batching and scheduling keep your GPUs busy, not idle
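A rough sketch of what "just call the API" could look like. The endpoint, payload shape, and model name below are illustrative assumptions, not the actual Tandemn API.

```python
import requests

# Hypothetical endpoint and payload; the real API may differ.
TANDEMN_URL = "https://your-cluster.example.com/v1/completions"

resp = requests.post(
    TANDEMN_URL,
    headers={"Authorization": "Bearer <your-api-key>"},
    json={
        "model": "llama-3-70b",   # illustrative model name
        "prompt": "Summarize our Q3 GPU spend in one sentence.",
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # sharding, KV cache, and routing happen behind this one call
```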
One platform. One brain. Two specialized engines.
Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, routes workloads, forecasts SLOs, and scales your cluster automatically.
Once Koi decides the plan, the engines take over. Tuna handles cost-optimized online inference. Orca handles high-throughput batch workloads.
Koi orchestrates. Tuna and Orca execute. The right engine for every workload, automatically.
Koi automatically selects the optimal engine for your workload, or you can run Tuna and Orca independently as open source engines.
Each layer tackles a different cost lever.
Koi selects the right GPUs for each workload and prevents overprovisioning. No more paying for capacity you don't need: Koi right-sizes your cluster in real time.
Spot instances are 67–80% cheaper than on-demand. Tuna intelligently routes between spot and serverless, getting spot economics with serverless reliability for cost-optimized online inference.
Continuous batching and prefill/decode optimization extract more work from each GPU. Higher utilization means serving the same load with fewer GPUs, reducing the total GPU-hours you need.
You tell us what model you need and your SLO. Koi deploys in your VPC or on-prem cluster, selects the optimal GPUs, and routes workloads to the right engine. Smart provisioning (Koi) plus lower rates (Tuna) plus higher efficiency (Orca) compound into dramatic cost reductions while maintaining performance.
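To make that concrete, a deployment request might look roughly like the sketch below. The field names, values, and SLO targets are assumptions for illustration, not the real schema.

```python
# Illustrative deployment spec; real field names and options may differ.
deployment = {
    "model": "llama-3-70b",          # the model you want served
    "slo": {
        "p95_latency_ms": 500,       # latency target Koi plans against
        "min_throughput_tps": 2000,  # tokens/sec floor for batch work
    },
    "placement": "vpc",              # or "on-prem"; workloads stay in your infrastructure
    "engine": "auto",                # let Koi pick Tuna (cost) or Orca (throughput)
}
```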
Production-ready inference infrastructure without the complexity.
Mixed GPUs (A100s, H100s, L40s) act as one unified runtime. Tandemn handles model sharding, KV cache management, and spot preemptions automatically.
Koi selects GPUs, forecasts SLOs, and manages your cluster. It finds the optimal configuration for your throughput and cost targets automatically.
Run batch jobs or production traffic through the same API. Koi routes to the right engine: Tuna for cost, Orca for throughput.
Tuna and Orca are open source with transparent benchmarks. Your infrastructure, your control, no vendor lock-in.
Spot economics with serverless reliability
When Koi detects cost-sensitive online inference workloads, it routes them to the Tuna engine, intelligently splitting traffic between spot and serverless for optimal cost and availability.
Spot GPUs are 67–80% cheaper than on-demand, but they're slow to provision and can be interrupted without warning. Most teams avoid spot because the operational complexity isn't worth the savings.
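The core idea can be sketched in a few lines: prefer cheap spot capacity when it's healthy, and fall back to warm serverless when it isn't. The health check and backend names below are stand-ins, not Tuna's actual implementation.

```python
import random

def spot_is_healthy() -> bool:
    # Stand-in for a real preemption/health signal.
    return random.random() > 0.1  # pretend spot capacity is up ~90% of the time

def route(request: str) -> str:
    """Prefer cheap spot capacity; fall back to warm serverless on preemption."""
    backend = "spot" if spot_is_healthy() else "serverless"
    return f"{backend} handled: {request!r}"

print(route("summarize this document"))
```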
See Tuna in action: one-command deployment, intelligent routing, and transparent cost savings.
Instant cold starts
Intelligent routing
Cost savings
Integrates with Modal, RunPod, and Cloud Run for serverless, and provisions spot capacity on AWS via SkyPilot.
Stop overprovisioning serverless. Stop avoiding spot because it's complex. Let Koi route your cost-sensitive workloads to Tuna; hybrid routing handles the rest.
Orchestration that makes everything intelligent
Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, manages your cluster, forecasts SLOs, and routes workloads, so the engines can focus purely on execution.
Tuna and Orca are execution engines: they're great at running inference fast and cheap. But someone needs to decide which GPUs to use, when to scale, and where to route each workload. That's Koi. By separating intelligence from execution, each layer can be optimized independently. Run the engines open source, and connect Koi for the orchestration brain.
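To make the split concrete, here is a toy version of that routing decision: given a workload's requirements, pick an engine. The thresholds and field names are assumptions for illustration, not Koi's actual logic.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    interactive: bool        # online traffic vs. batch job
    p95_latency_ms: int      # latency SLO
    est_tokens: int          # rough job size

def pick_engine(w: Workload) -> str:
    """Toy routing rule: Tuna for cost-sensitive online traffic,
    Orca for throughput-critical batch work."""
    if w.interactive and w.p95_latency_ms <= 1_000:
        return "tuna"  # cost-optimized online inference
    return "orca"      # high-throughput batch execution

print(pick_engine(Workload(interactive=True, p95_latency_ms=500, est_tokens=2_000)))
print(pick_engine(Workload(interactive=False, p95_latency_ms=60_000, est_tokens=5_000_000)))
```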
Watch Koi manage GPU selection, SLO forecasting, and workload routing in real time.
GPU selection + routing
SLO forecasting
Dynamic rebalancing
Scans your VPC quota and on-prem availability to choose the optimal GPU configuration for each workload.
Continuously predicts whether SLOs will be met and proactively scales before deadlines are at risk.
Automatically routes workloads to Tuna (cost-sensitive) or Orca (throughput-critical) based on requirements.
Manages GPU capacity, preempts low-priority jobs, and rebalances resources across your fleet in real time.
Koi runs as a hosted SaaS: connect your engines via API key and let Tandemn handle the intelligence. For teams that need full control, Koi can also be self-hosted.
Koi is coming soon. Get early access to the intelligence layer that makes Tuna and Orca work together seamlessly.
High-performance batch inference engine
Orca is the execution engine for throughput-critical workloads. It receives assignments from Koi, loads models, runs continuous batching with prefill/decode optimization, and maximizes GPU utilization.
Orca focuses on one thing: getting the most tokens per second out of your GPUs. Continuous batching, prefill/decode splitting, and KV cache optimization ensure every GPU cycle counts. Koi handles the intelligence; Orca handles the execution.
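A stripped-down sketch of the continuous-batching idea: instead of waiting for a full batch to finish, new requests join the in-flight batch as slots free up. This is a teaching example with made-up numbers, not Orca's scheduler.

```python
from collections import deque

MAX_BATCH = 4
queue = deque([f"req-{i}" for i in range(8)])  # pending requests
batch = {}                                     # request -> decode steps remaining

def step():
    """One decode step: every in-flight request emits a token; finished ones leave."""
    for req in list(batch):
        batch[req] -= 1
        if batch[req] == 0:
            del batch[req]                     # slot frees up immediately
    while queue and len(batch) < MAX_BATCH:
        batch[queue.popleft()] = 3             # new request joins mid-flight

while queue or batch:
    step()
    print(sorted(batch))
```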
Watch Orca receive work from Koi, process batches, and report throughput metrics.
Receives and executes
Throughput metrics
Progress and completion
Dynamic batching that adds new requests to in-flight batches, maximizing throughput without waiting for batch boundaries.
Assigns compute-bound prefill and memory-bound decode to different GPU types for optimal hardware utilization.
LMCache integration for intelligent KV cache management, reducing redundant computation across similar prompts.
Works across A100s, H100s, MI300X, and mixed fleets, extracting maximum performance from whatever hardware you have.
Large-scale inference jobs, data processing pipelines, offline evaluations: Orca executes with maximum GPU efficiency while Koi handles the orchestration. Coming soon.
Built for teams that won't accept GPU costs as an unavoidable tax.
Handle spiky traffic without overprovisioning. Koi routes cost-sensitive traffic to Tuna, which keeps endpoints responsive while routing to cheaper compute automatically.
Large-scale workloads with SLO deadlines. Koi selects optimal GPUs and forecasts completion, while Orca executes with maximum throughput.
Not every cluster is pristine. Tandemn unifies mixed hardware into a cohesive runtime: A100s, H100s, and MI300X all working together.
Offline evaluations and large data jobs with predictable cost and completion times. Submit your workload and forget about infrastructure.
Pay for inference, not for idle capacity.
Tuna + Orca engines, free and self-hosted. Engine-level savings only; manual GPU selection and orchestration.
Hosted orchestration with API key connection. 20–40% additional savings from Koi's intelligent GPU selection and rightsizing.
Private Koi deployments with SLA support. Maximum savings with full Koi capabilities, custom tuning, and dedicated optimization.
Massive GPU capacity sits underutilized while teams struggle with access and cost. We're unifying that capacity into infrastructure that actually works, open-source, transparent, and built for production.
Meet the Team
Let's make your inference costs predictable.
When using the Tuna engine, Tandemn automatically falls back to serverless if spot instances are preempted or unhealthy.
Increasing batch size increases GPU utilization, so fewer GPUs are needed for the same workload.
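A back-of-the-envelope example of why this works; the throughput figures below are assumed for illustration, not measured benchmarks.

```python
# Illustrative numbers only: aggregate tokens/sec per GPU at different batch sizes.
throughput_per_gpu = {1: 60, 8: 350, 32: 900}  # tokens/sec (assumed, not measured)
demand = 9_000                                 # tokens/sec the service must sustain

for batch_size, tps in throughput_per_gpu.items():
    gpus_needed = -(-demand // tps)            # ceiling division
    print(f"batch={batch_size:>2}: ~{gpus_needed} GPUs")
# batch= 1: ~150 GPUs, batch= 8: ~26 GPUs, batch=32: ~10 GPUs
```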
Koi is the orchestration layer that manages GPU selection, SLO forecasting, and workload routing. You can run Tuna and Orca independently without Koi, but Koi makes them work together intelligently and handles scaling automatically.
Koi is primarily offered as a hosted SaaS: connect your engines via API key. For teams that need full control, self-hosted Koi deployments are available under the Enterprise plan.