The Technology

How Tandemn Works Under the Hood

Three layers working together: an intelligence layer, a cost-optimized engine, and a throughput engine. The engines are open source, and everything runs on your infrastructure.

One Brain. Two Engines.

Koi orchestrates. Tuna and Orca execute. The right engine for every workload, automatically.

Tuna Engine

Minimum Cost

Hybrid routing between spot and serverless. Get spot economics with serverless reliability for cost-optimized online inference.

Learn about Tuna →
Koi Intelligence (Coming Soon)

The Brain

The orchestration layer. Selects GPUs, manages clusters, forecasts SLOs, and routes workloads to Tuna or Orca.

Learn about Koi →
Orca Engine (Coming Soon)

Maximum Throughput

Production-ready batched inference with continuous batching and prefill/decode optimization. Maximum GPU utilization.

Learn about Orca →

Koi automatically selects the optimal engine for your workload, or you can run Tuna and Orca independently as open source engines.

Three Layers, Compounding Savings

Each layer tackles a different cost lever.

Koi - The Intelligence Layer

Automated Orchestration

Koi selects the right GPUs for each workload and prevents overprovisioning. No more paying for capacity you don't need: Koi right-sizes your cluster in real time.

Tuna Engine

Lower compute rates

Spot instances are 67–80% cheaper than on-demand. Tuna intelligently routes between spot and serverless, getting spot economics with serverless reliability for cost-optimized online inference.

Orca Engine

Maximum GPU utilization

Continuous batching and prefill/decode optimization extract more work from each GPU. Higher utilization means serving the same load with fewer GPUs, reducing the total GPU-hours you need.

All three layers compound automatically

You tell us what model you need and your SLO. Koi deploys in your VPC or on-prem cluster, selects the optimal GPUs, and routes workloads to the right engine. Smart provisioning (Koi) plus lower rates (Tuna) plus higher efficiency (Orca) compound into dramatic cost reductions while maintaining performance.
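
For illustration, the input you provide can be as small as a model plus an SLO. The spec below is a hypothetical sketch; the field names, values, and the koi.deploy call are assumptions for illustration, not Tandemn's actual API.

# Hypothetical deployment spec: you declare the model and SLO, the
# orchestration layer sizes the cluster and picks an engine.
# Field names and values are illustrative only, not Tandemn's API.
deployment_spec = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",  # the model you need
    "workload": "batch",                           # or "online"
    "slo": {"deadline_hours": 2},                  # target Koi forecasts against
    "optimize_for": "cost",                        # hints routing toward Tuna vs Orca
    "placement": "vpc",                            # your VPC or on-prem cluster
}

# Conceptually, Koi would consume this spec, select GPUs, and dispatch:
#   koi.deploy(deployment_spec)   # hypothetical call, shown for shape only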

What You Get

Production-ready inference infrastructure without the complexity.

Distributed Runtime

Mixed GPUs (A100s, H100s, L40s) act as one unified runtime. Tandemn handles model sharding, KV cache management, and spot preemptions automatically.

Koi Orchestration

Koi selects GPUs, forecasts SLOs, and manages your cluster. It finds the optimal configuration for your throughput and cost targets automatically.

Flexible Workloads

Run batch jobs or production traffic through the same API. Koi routes to the right engine: Tuna for cost, Orca for throughput.

Open Source

Tuna and Orca are open source with transparent benchmarks. Your infrastructure, your control: no vendor lock-in.

Tuna Engine

Minimum Cost

Spot economics with serverless reliability

When Koi detects cost-sensitive online inference workloads, it routes them to the Tuna engine, intelligently splitting traffic between spot and serverless for optimal cost and availability.
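
As a rough mental model (not Tuna's actual algorithm), the hybrid split can be sketched as: serve what healthy spot capacity can absorb, and spill the rest to serverless so requests never wait on slow spot provisioning. The names and numbers below are illustrative assumptions.

# Illustrative sketch of hybrid spot/serverless routing, not Tuna's real logic:
# spot takes traffic while it is healthy, serverless absorbs overflow and
# covers preemptions and cold starts.
from dataclasses import dataclass


@dataclass
class PoolState:
    spot_replicas_ready: int   # spot GPUs currently provisioned and healthy
    spot_capacity_rps: float   # requests/sec each spot replica can absorb


def split_traffic(incoming_rps: float, pool: PoolState) -> dict:
    """Return the share of traffic sent to spot vs serverless."""
    spot_rps = min(incoming_rps, pool.spot_replicas_ready * pool.spot_capacity_rps)
    serverless_rps = incoming_rps - spot_rps   # overflow and preemption fallback
    return {"spot": spot_rps, "serverless": serverless_rps}


# Example: 30 rps of traffic, 2 healthy spot replicas at 10 rps each
print(split_traffic(30.0, PoolState(spot_replicas_ready=2, spot_capacity_rps=10.0)))
# -> {'spot': 20.0, 'serverless': 10.0}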

The problem with spot

Spot GPUs are 67–80% cheaper than on-demand, but they're slow to provision and can be interrupted without warning. Most teams avoid spot because the operational complexity isn't worth the savings.

How Tuna solves it

See Tuna in action: one-command deployment, intelligent routing, and transparent cost savings.

tuna-deploy
$
⚡ Spinning up serverless
☁️ Provisioning spots
✓ Ready! 172.18.43.91

Instant cold starts

tuna-status
$
Traffic Dashboard
Serverless:
Spot:

Intelligent routing

tuna-cost
$
Cost (5 min)
Tuna: $0.65
100s serverless + 200s spot
 
On-demand always-on: $3.25
Serverless always-on: $2.17
Saved: $2.60 (80%)

Cost savings

Works with your stack

Integrates with Modal, RunPod, and Cloud Run for serverless, and uses AWS via SkyPilot for spot capacity.

Full cost transparency

  • Real-time spend tracking
  • Routing split (% spot vs serverless)
  • Savings vs baseline scenarios
  • Per-component cost breakdown
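
The savings-vs-baseline figures shown in the demo above reduce to simple arithmetic. This sketch just reproduces that calculation from the demo's numbers; it is illustrative, not a pricing quote.

# Savings vs baseline, using the figures from the 5-minute demo window above.
tuna_cost = 0.65          # actual spend over the window ($)
baselines = {
    "on-demand always-on": 3.25,
    "serverless always-on": 2.17,
}

for name, baseline in baselines.items():
    saved = baseline - tuna_cost
    pct = 100 * saved / baseline
    print(f"vs {name}: saved ${saved:.2f} ({pct:.0f}%)")

# vs on-demand always-on: saved $2.60 (80%)
# vs serverless always-on: saved $1.52 (70%)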

Best for cost-sensitive online inference workloads

Stop overprovisioning serverless. Stop avoiding spot because it's complex. Let Koi route your cost-sensitive workloads to Tuna and let hybrid routing handle the rest.

Koi Intelligence (Coming Soon)

The Brain Behind Your Inference

Orchestration that makes everything intelligent

Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, manages your cluster, forecasts SLOs, and routes workloads, so the engines can focus purely on execution.

Why a separate intelligence layer?

Tuna and Orca are execution engines: they're great at running inference fast and cheap. But someone needs to decide which GPUs to use, when to scale, and where to route each workload. That's Koi. By separating intelligence from execution, each layer can be optimized independently. Run the engines as open source, and connect Koi as the orchestration brain.
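
A minimal sketch of that separation, with hypothetical class and function names (not the real Koi, Tuna, or Orca APIs): the routing decision lives in one place, and the engines only execute.

from typing import Protocol


class Engine(Protocol):
    def execute(self, workload: dict) -> None: ...


class TunaEngine:
    def execute(self, workload: dict) -> None:
        print("Tuna: hybrid spot/serverless execution for", workload["name"])


class OrcaEngine:
    def execute(self, workload: dict) -> None:
        print("Orca: high-throughput batched execution for", workload["name"])


def route(workload: dict) -> Engine:
    """The 'intelligence' decision: pick an engine from workload requirements."""
    if workload.get("optimize_for") == "throughput":
        return OrcaEngine()
    return TunaEngine()


# Koi decides, the engine executes.
job = {"name": "nightly-eval", "optimize_for": "throughput"}
route(job).execute(job)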

How Koi orchestrates

Watch Koi manage GPU selection, SLO forecasting, and workload routing in real time.

koi-submit
$
📊 Analyzing workload...
10,000 prompts, Llama-70B
🔍 Scanning cluster GPUs
✓ Selected: 4x A100-80GB
Routing to Orca (batch)
🚀 Dispatched to Orca

GPU selection + routing

koi-monitor
$
📈 SLO Forecast
Target: 2hr
Predicted: 2hr 45m ⚠️
🔄 Scaling: +2 GPUs
New forecast: 1hr 48m ✓
SLO: Safe

SLO forecasting

koi-rebalance
$
🚨 Urgent job incoming
⏸️ Preempting low-priority
(SLO: 24hr, safe to pause)
🎯 Reassigning GPUs:
Prefill: 4x H100
Decode: 4x MI300X

Dynamic rebalancing
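
The SLO check in the koi-monitor demo above boils down to comparing a predicted completion time against the target and adding GPUs when the prediction drifts past it. The throughput numbers and scaling rule below are illustrative assumptions, not Koi's actual policy.

import math


def forecast_hours(remaining_prompts: int, prompts_per_gpu_hour: float, gpus: int) -> float:
    """Predicted time to finish the remaining work at current throughput."""
    return remaining_prompts / (prompts_per_gpu_hour * gpus)


def gpus_needed(remaining_prompts: int, prompts_per_gpu_hour: float, target_hours: float) -> int:
    """Smallest GPU count that finishes the remaining work within the target."""
    return math.ceil(remaining_prompts / (prompts_per_gpu_hour * target_hours))


remaining, rate, gpus, target = 11_000, 1_000, 4, 2.0

predicted = forecast_hours(remaining, rate, gpus)       # 2.75h -> SLO at risk
if predicted > target:
    gpus = gpus_needed(remaining, rate, target)         # 6 GPUs (+2)
    predicted = forecast_hours(remaining, rate, gpus)   # ~1.83h -> safe again

print(f"{gpus} GPUs, predicted {predicted:.2f}h vs {target}h target")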

GPU Selection

Scans your VPC quota and on-prem availability to choose the optimal GPU configuration for each workload.

SLO Forecasting

Continuously predicts whether SLOs will be met and proactively scales before deadlines are at risk.

Workload Routing

Automatically routes workloads to Tuna (cost-sensitive) or Orca (throughput-critical) based on requirements.

Cluster Management

Manages GPU capacity, preempts low-priority jobs, and rebalances resources across your fleet in real time.

Hosted SaaS or self-managed

Koi runs as a hosted SaaS: connect your engines via API key and let Tandemn handle the intelligence. For teams that need full control, Koi can also be self-hosted.

What Koi controls

  • GPU selection and allocation
  • SLO forecasting and autoscaling
  • Workload routing to engines
  • Cluster rebalancing and preemption

Join the waitlist

Koi is coming soon. Get early access to the intelligence layer that makes Tuna and Orca work together seamlessly.

Orca Engine (Coming Soon)

Maximum Throughput Execution

High-performance batch inference engine

Orca is the execution engine for throughput-critical workloads. It receives assignments from Koi, loads models, runs continuous batching with prefill/decode optimization, and maximizes GPU utilization.

Pure execution, maximum throughput

Orca focuses on one thing: getting the most tokens per second out of your GPUs. Continuous batching, prefill/decode splitting, and KV cache optimization ensure every GPU cycle counts. Koi handles the intelligence, Orca handles the execution.
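
As a conceptual sketch (not Orca's implementation), continuous batching means new requests join the running batch at every decode step instead of waiting for the current batch to drain. The loop below illustrates the idea with toy request records.

from collections import deque


def continuous_batching(queue: deque, max_batch: int = 128) -> None:
    in_flight: list = []
    while queue or in_flight:
        # Admit new requests into the running batch whenever slots free up.
        while queue and len(in_flight) < max_batch:
            in_flight.append(queue.popleft())

        # One decode step for every sequence currently in the batch.
        for request in in_flight:
            request["generated"] += 1

        # Finished sequences leave immediately, freeing slots mid-batch.
        in_flight = [r for r in in_flight if r["generated"] < r["max_tokens"]]


jobs = deque({"generated": 0, "max_tokens": n} for n in (4, 8, 16))
continuous_batching(jobs, max_batch=2)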

How Orca executes

Watch Orca receive work from Koi, process batches, and report throughput metrics.

orca-run
$
📥 Assignment from Koi:
Model: Llama-70B
GPUs: 4x A100-80GB
⚡ Loading model
✓ Model loaded, TP=4
🚀 Running batch...

Receives and executes

orca-throughput
$
GPU Utilization
Util: 94%
Batch size: 128
KV cache: 87% hit
 
⚡ 1,820 tok/s

Throughput metrics

orca-complete
$
📊 Batch Progress
[████████████░░] 85%
8,500 / 10,000 prompts
 
✓ Batch complete!
Reporting to Koi...

Progress and completion

Continuous Batching

Dynamic batching that adds new requests to in-flight batches, maximizing throughput without waiting for batch boundaries.

Prefill/Decode Split

Assigns compute-bound prefill and memory-bound decode to different GPU types for optimal hardware utilization.

KV Cache Optimization

LMCache integration for intelligent KV cache management, reducing redundant computation across similar prompts.

Heterogeneous GPU Support

Works across A100s, H100s, MI300X, and mixed fleets, extracting maximum performance from whatever hardware you have.
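
To illustrate the prefill/decode split on a mixed fleet (echoing the koi-rebalance demo above, with prefill on H100s and decode on MI300X), a scheduler can map each phase to the pool whose strength matches it. The pool table and rule below are assumptions for illustration, not Orca's scheduler.

# Illustrative prefill/decode placement: compute-bound prefill goes to the
# compute-heavy pool, memory-bound decode to the memory-heavy pool.
GPU_POOLS = {
    "H100":   {"strength": "compute"},   # compute-bound prefill
    "MI300X": {"strength": "memory"},    # memory-bound decode, large KV cache
}


def assign_phase(phase: str) -> str:
    """Map an inference phase to the GPU pool whose strength matches it."""
    want = "compute" if phase == "prefill" else "memory"
    return next(name for name, pool in GPU_POOLS.items() if pool["strength"] == want)


print(assign_phase("prefill"))  # H100
print(assign_phase("decode"))   # MI300X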

Built for throughput-critical batch workloads

Large-scale inference jobs, data processing pipelines, offline evaluations: Orca executes them with maximum GPU efficiency while Koi handles the orchestration. Coming soon.