The Technology
How Tandemn Works Under the Hood
Three layers working together. An intelligence brain, a cost-optimized engine, and a throughput engine. All open source, all on your infrastructure.
One Brain. Two Engines.
Koi orchestrates. Tuna and Orca execute. The right engine for every workload, automatically.
Koi automatically selects the optimal engine for your workload, or you can run Tuna and Orca independently as open source engines.
Three Layers, Compounding Savings
Each layer tackles a different cost lever.
Automated Orchestration
Koi selects the right GPUs for each workload and prevents overprovisioning. No more paying for capacity you don't need: Koi right-sizes your cluster in real time.
Lower compute rates
Spot instances are 67–80% cheaper than on-demand. Tuna intelligently routes between spot and serverless, delivering spot economics with serverless reliability for cost-optimized online inference.
Maximum GPU utilization
Continuous batching and prefill/decode optimization extract more work from each GPU. Higher utilization means serving the same load with fewer GPUs, reducing the total GPU-hours you need.
All three layers compound automatically
You tell us what model you need and your SLO. Koi deploys in your VPC or on-prem cluster, selects the optimal GPUs, and routes workloads to the right engine. Smart provisioning (Koi) plus lower rates (Tuna) plus higher efficiency (Orca) compound into dramatic cost reductions while maintaining performance.
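To make the compounding concrete, here is a rough back-of-the-envelope sketch. Every number below is an illustrative placeholder (only the 67–80% spot discount comes from the figures above); these are not measured Tandemn results.

```python
# Illustrative only: how right-sizing, cheaper rates, and higher utilization
# compound multiplicatively. All inputs are hypothetical placeholders, not
# measured Tandemn results (only the spot discount range comes from the text).
baseline_cost = 100_000      # hypothetical monthly on-demand GPU spend ($)

right_sizing  = 0.80         # Koi: provision 80% of what you used to over-provision
spot_fraction = 0.70         # Tuna: share of traffic served on spot (hypothetical)
spot_discount = 0.70         # spot is ~67-80% cheaper than on-demand
blended_rate  = spot_fraction * (1 - spot_discount) + (1 - spot_fraction)
utilization   = 0.75         # Orca: same load served with 75% of the GPU-hours

optimized_cost = baseline_cost * right_sizing * blended_rate * utilization
print(f"blended rate vs on-demand: {blended_rate:.2f}")       # 0.51
print(f"optimized monthly cost:    ${optimized_cost:,.0f}")   # $30,600
print(f"total savings:             {1 - optimized_cost / baseline_cost:.0%}")  # 69%
```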
What You Get
Production-ready inference infrastructure without the complexity.
Distributed Runtime
Mixed GPUs (A100s, H100s, L40s) act as one unified runtime. Tandemn handles model sharding, KV cache management, and spot preemptions automatically.
Koi Orchestration
Koi selects GPUs, forecasts SLOs, and manages your cluster. It finds the optimal configuration for your throughput and cost targets automatically.
Flexible Workloads
Run batch jobs or production traffic through the same API. Koi routes to the right engine: Tuna for cost, Orca for throughput.
Open Source
Tuna and Orca are open source with transparent benchmarks. Your infrastructure, your control, no vendor lock-in.
Minimum Cost
Spot economics with serverless reliability
When Koi detects cost-sensitive online inference workloads, it routes them to the Tuna engine, intelligently splitting traffic between spot and serverless for optimal cost and availability.
The problem with spot
Spot GPUs are 67–80% cheaper than on-demand, but they're slow to provision and can be interrupted without warning. Most teams avoid spot because the operational complexity isn't worth the savings.
How Tuna solves it
See Tuna in action: one-command deployment, intelligent routing, and transparent cost savings.
Instant cold starts
Intelligent routing
Cost savings
Works with your stack
Integrates with Modal, RunPod, and Cloud Run for serverless, and uses AWS via SkyPilot for spot capacity.
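As a mental model for the hybrid routing, the sketch below shows a spot-first dispatcher that fails over to serverless when capacity is preempted. All names, shares, and provider stand-ins here are hypothetical, written for illustration; this is not Tuna's actual API.

```python
# Hypothetical sketch of spot-first routing with a serverless fallback.
# Names, shares, and providers are illustrative, not Tuna's real interface.
import random

SPOT_TARGET_SHARE = 0.7        # hypothetical: aim to serve ~70% of traffic on spot

class SpotPoolUnavailable(Exception):
    """Raised when spot capacity was preempted or is still warming up."""

def run_on_spot(request: str) -> dict:
    # Stand-in for dispatching to a spot-backed replica (e.g. launched via SkyPilot).
    if random.random() < 0.05:                     # simulate an occasional preemption
        raise SpotPoolUnavailable()
    return {"backend": "spot", "output": f"result for {request}"}

def run_on_serverless(request: str) -> dict:
    # Stand-in for a serverless endpoint (e.g. Modal, RunPod, or Cloud Run).
    return {"backend": "serverless", "output": f"result for {request}"}

def route(request: str) -> dict:
    """Prefer cheap spot capacity; fall back to serverless for reliability."""
    if random.random() < SPOT_TARGET_SHARE:
        try:
            return run_on_spot(request)
        except SpotPoolUnavailable:
            pass                                   # preempted: fail over transparently
    return run_on_serverless(request)

results = [route(f"req-{i}") for i in range(1000)]
spot_share = sum(r["backend"] == "spot" for r in results) / len(results)
print(f"served on spot: {spot_share:.0%}")
```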
Full cost transparency
- Real-time spend tracking
- Routing split (% spot vs serverless)
- Savings vs baseline scenarios
- Per-component cost breakdown
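The breakdown this surfaces can be pictured as a simple report record like the following. Field names and figures are hypothetical placeholders, not the real dashboard schema.

```python
# Hypothetical shape of a cost-transparency report; field names and numbers
# are illustrative, not the actual dashboard schema.
from dataclasses import dataclass, field

@dataclass
class CostReport:
    spend_usd: float                       # real-time spend so far
    spot_share: float                      # fraction of requests served on spot
    serverless_share: float                # fraction served on serverless
    baseline_usd: float                    # what the same traffic would cost on-demand
    per_component_usd: dict = field(default_factory=dict)

    @property
    def savings_pct(self) -> float:
        return 1 - self.spend_usd / self.baseline_usd

report = CostReport(
    spend_usd=412.0, spot_share=0.72, serverless_share=0.28, baseline_usd=1180.0,
    per_component_usd={"spot GPUs": 265.0, "serverless": 131.0, "egress": 16.0},
)
print(f"savings vs baseline: {report.savings_pct:.0%}")
```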
Best for cost-sensitive online inference workloads
Stop overprovisioning serverless. Stop avoiding spot because it's complex. Let Koi route your cost-sensitive workloads to Tuna; hybrid routing handles the rest.
The Brain Behind Your Inference
Orchestration that makes everything intelligent
Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, manages your cluster, forecasts SLOs, and routes workloads so the engines can focus purely on execution.
Why a separate intelligence layer?
Tuna and Orca are execution engines: they're great at running inference fast and cheap. But someone needs to decide which GPUs to use, when to scale, and where to route each workload. That's Koi. By separating intelligence from execution, each layer can be optimized independently. Run the engines as open source, and connect Koi as the orchestration brain.
How Koi orchestrates
Watch Koi manage GPU selection, SLO forecasting, and workload routing in real time.
GPU selection + routing
SLO forecasting
Dynamic rebalancing
GPU Selection
Scans your VPC quota and on-prem availability to choose the optimal GPU configuration for each workload.
SLO Forecasting
Continuously predicts whether SLOs will be met and proactively scales before deadlines are at risk.
Workload Routing
Automatically routes workloads to Tuna (cost-sensitive) or Orca (throughput-critical) based on requirements.
Cluster Management
Manages GPU capacity, preempts low-priority jobs, and rebalances resources across your fleet in real time.
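Put together, Koi's control loop can be pictured roughly like this. The names, thresholds, and data shapes are hypothetical, chosen only to illustrate the decision flow; this is not Koi's implementation.

```python
# Hypothetical sketch of Koi's control loop: pick GPUs, check the SLO forecast,
# and route each workload to Tuna or Orca. Illustrative only.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str                   # "online" (cost-sensitive) or "batch" (throughput-critical)
    p95_latency_slo_ms: float
    forecast_p95_ms: float

def select_gpus(workload: Workload, available: list) -> dict | None:
    """Pick the cheapest GPU pool whose forecast still meets the SLO."""
    viable = [g for g in available if g["forecast_p95_ms"] <= workload.p95_latency_slo_ms]
    return min(viable, key=lambda g: g["hourly_usd"]) if viable else None

def route(workload: Workload) -> str:
    """Cost-sensitive online traffic goes to Tuna; batch work goes to Orca."""
    return "tuna" if workload.kind == "online" else "orca"

def scale_out(workload: Workload) -> None:
    print(f"scaling out for {workload.name} before its SLO is breached")

def orchestrate(workload: Workload, available_gpus: list) -> dict:
    pool = select_gpus(workload, available_gpus)
    if pool is None or workload.forecast_p95_ms > workload.p95_latency_slo_ms:
        scale_out(workload)     # proactive scaling before the deadline is at risk
    return {"engine": route(workload), "gpu_pool": pool}

job = Workload("chat-api", kind="online", p95_latency_slo_ms=800, forecast_p95_ms=920)
pools = [{"gpu": "L40S", "hourly_usd": 1.1, "forecast_p95_ms": 950},
         {"gpu": "H100", "hourly_usd": 4.2, "forecast_p95_ms": 430}]
print(orchestrate(job, pools))
```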
Hosted SaaS or self-managed
Koi runs as a hosted SaaS: connect your engines via API key and let Tandemn handle the intelligence. For teams that need full control, Koi can also be self-hosted.
What Koi controls
- GPU selection and allocation
- SLO forecasting and autoscaling
- Workload routing to engines
- Cluster rebalancing and preemption
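The input that drives those controls can be as small as a model name plus an SLO. A hypothetical spec (keys and values are illustrative, not Koi's real schema) might look like:

```python
# Hypothetical workload spec handed to Koi; keys and values are illustrative
# placeholders, not Koi's real schema.
workload_spec = {
    "model": "llama-3.1-70b-instruct",
    "slo": {"p95_latency_ms": 800, "availability": 0.999},
    "optimize_for": "cost",                 # "throughput" would route to Orca instead
    "deployment": {"target": "vpc", "regions": ["us-east-1"]},
}
```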
Join the waitlist
Koi is coming soon. Get early access to the intelligence layer that makes Tuna and Orca work together seamlessly.
Maximum Throughput Execution
High-performance batch inference engine
Orca is the execution engine for throughput-critical workloads. It receives assignments from Koi, loads models, runs continuous batching with prefill/decode optimization, and maximizes GPU utilization.
Pure execution, maximum throughput
Orca focuses on one thing: getting the most tokens per second out of your GPUs. Continuous batching, prefill/decode splitting, and KV cache optimization ensure every GPU cycle counts. Koi handles the intelligence; Orca handles the execution.
How Orca executes
Watch Orca receive work from Koi, process batches, and report throughput metrics.
Receives and executes
Throughput metrics
Progress and completion
Continuous Batching
Dynamic batching that adds new requests to in-flight batches, maximizing throughput without waiting for batch boundaries.
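For intuition, a minimal sketch of the continuous-batching idea: new requests join the in-flight batch between decode steps instead of waiting for the current batch to drain. This is a simplified illustration, not Orca's scheduler.

```python
# Simplified illustration of continuous batching: new requests join the in-flight
# batch between decode steps instead of waiting for the whole batch to finish.
from collections import deque

def continuous_batching(queue: deque, max_batch: int = 8):
    in_flight = []                                  # sequences currently decoding
    while queue or in_flight:
        # Admit waiting requests whenever a slot in the running batch frees up.
        while queue and len(in_flight) < max_batch:
            in_flight.append({"req": queue.popleft(), "tokens_left": 4})
        # One decode step for every in-flight sequence (stand-in for a model call).
        for seq in in_flight:
            seq["tokens_left"] -= 1
        # Finished sequences leave immediately, freeing slots for new arrivals.
        for seq in [s for s in in_flight if s["tokens_left"] == 0]:
            in_flight.remove(seq)
            yield seq["req"]

finished = list(continuous_batching(deque(f"req-{i}" for i in range(20))))
print(f"completed {len(finished)} requests")
```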
Prefill/Decode Split
Assigns compute-bound prefill and memory-bound decode to different GPU types for optimal hardware utilization.
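A hedged sketch of the disaggregation idea: the compute-bound prefill pass runs on one GPU pool, the memory-bound decode loop on another, with the KV cache as the handoff between them. Pool choices and function names are hypothetical, not Orca's interface.

```python
# Illustrative prefill/decode disaggregation: compute-bound prefill on one GPU
# pool, memory-bound decode on another, with the KV cache as the handoff.
# Pool names and function signatures are hypothetical.
def prefill(prompt: str, gpu_pool: str = "H100") -> dict:
    """Compute-bound phase: one large forward pass over the whole prompt."""
    return {"tokens": prompt.split(), "pool": gpu_pool}   # stand-in for KV tensors

def decode(kv_cache: dict, max_new_tokens: int, gpu_pool: str = "A100") -> str:
    """Memory-bound phase: one token per step, dominated by KV-cache bandwidth."""
    generated = []
    for step in range(max_new_tokens):
        generated.append(f"tok{step}")                    # stand-in for a model step
        kv_cache["tokens"].append(generated[-1])
    return " ".join(generated)

kv = prefill("Summarize the quarterly report in two sentences.")   # on the prefill pool
print(decode(kv, max_new_tokens=5))                                # on the decode pool
```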
KV Cache Optimization
LMCache integration for intelligent KV cache management, reducing redundant computation across similar prompts.
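The intuition behind KV cache reuse across similar prompts can be shown with a generic block-based prefix cache: shared prompt prefixes hit the cache, so only the new suffix needs prefill. This is a generic illustration, not LMCache's interface.

```python
# Generic block-based prefix cache, for intuition only: KV blocks for a shared
# prompt prefix are reused, so only the new suffix needs prefill.
# This is an illustration, not LMCache's actual interface.
BLOCK = 4                        # tokens per cached KV block (hypothetical size)
kv_store: dict[tuple, str] = {}  # token-prefix -> cached KV (stand-in string)

def prefill_with_reuse(prompt: str) -> None:
    toks = tuple(prompt.split())
    reused = 0
    # Walk the prompt in fixed-size blocks, reusing any block prefix seen before.
    for end in range(BLOCK, len(toks) + 1, BLOCK):
        prefix = toks[:end]
        if prefix in kv_store:
            reused = end                              # cache hit: skip this block
        else:
            kv_store[prefix] = f"kv({end} tokens)"    # prefill and cache the block
    print(f"reused {reused} tokens, prefilled {len(toks) - reused} new ones")

prefill_with_reuse("You are a support agent. Summarize this ticket: printer jam")
prefill_with_reuse("You are a support agent. Summarize this ticket: login failure")
```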
Heterogeneous GPU Support
Works across A100s, H100s, MI300X, and mixed fleets, extracting maximum performance from whatever hardware you have.
Built for throughput-critical batch workloads
Large-scale inference jobs, data processing pipelines, offline evaluations: Orca executes them all with maximum GPU efficiency while Koi handles the orchestration. Coming soon.