Intelligent inference platform

Maximum Performance. Minimum Cost. Your Hardware.

Tandemn is an intelligent inference platform for your infrastructure. Koi orchestrates. Tuna minimizes cost. Orca maximizes throughput. Deploy once, and let the platform handle the rest.

Intelligent orchestration · Open source engines · Your infrastructure

The Problem

GPU time is expensive. Idle GPU time is a waste.

Where inference costs spiral

cold-start.sh
$ deploy-model llama-70b
Waiting for model to load
Allocating GPU memory
Loading weights (23GB)
⏱️ 240 seconds elapsed...

Cold starts force you to pay to stay ready

config.py
from vllm import LLM, SamplingParams
import os

# Pin the process to four GPUs and quiet tokenizer warnings
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Shard the 70B model across the four visible GPUs
llm = LLM(
  model="meta-llama/Llama-2-70b-hf",
  tensor_parallel_size=4,
  dtype="float16",
  gpu_memory_utilization=0.9,
  max_num_seqs=256,
  max_model_len=4096
)

sampling_params = SamplingParams(
  temperature=0.8,
  top_p=0.95,
  max_tokens=2048,
  presence_penalty=0.0,
  frequency_penalty=0.0
)

# KV cache sizing to tune by hand
cache_config = {
  "block_size": 16,
  "num_gpu_blocks": 2048,
  "num_cpu_blocks": 512
}

# Scheduler limits to tune by hand
scheduler_config = {
  "max_num_batched_tokens": 8192,
  "max_num_seqs": 256,
  "max_paddings": 512
}

Complex MLOps overhead

production.log
⚠️ Traffic spike detected
ERROR: Queue length: 847 requests
ERROR: All instances saturated
Scaling up... (ETA: 6 min)

Traffic spikes force overprovisioning

gpu-metrics
GPU Utilization Dashboard
GPUs 0-3: ⚠️ low average utilization

Poor batching leaves GPUs idle

How Tandemn eliminates these problems

No more cold start delays

Tuna keeps serverless capacity warm while spot instances provision, so your first request is served instantly

Zero MLOps complexity

We handle model sharding, KV cache, and orchestration; you just call the API

Auto-scaling that actually works

Dynamic routing across spot and serverless handles any traffic pattern

Maximum GPU utilization

Intelligent batching and scheduling keep your GPUs busy, not idle

How It Works

One platform. One brain. Two specialized engines.

Deploy in your cluster

Install once in your environment, on-prem or VPC. Your data and workloads stay entirely within your infrastructure.

  • Full control over your hardware
  • No vendor lock-in
  • Works with heterogeneous GPU fleets

Koi orchestrates your workloads

Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, routes workloads, forecasts SLOs, and scales your cluster automatically.

  • Automatic GPU selection and cluster management
  • SLO forecasting and proactive scaling
  • Routes to Tuna or Orca based on workload type

Tuna and Orca execute

Once Koi decides the plan, the engines take over. Tuna handles cost-optimized online inference. Orca handles high-throughput batch workloads.

  • Tuna: Spot + serverless hybrid routing
  • Orca: Continuous batching and prefill/decode split
  • Open source engines you can run independently
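
For a concrete feel of that hand-off, here is a minimal sketch of the routing decision described above. The Workload fields, the route function, and the engine labels are hypothetical illustrations, not Tandemn's actual API.

koi_routing_sketch.py
# Hypothetical sketch only, not Tandemn's real API
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str  # "online" (latency-sensitive) or "batch" (throughput-critical)

def route(workload: Workload) -> str:
    """Mirror the split described above: Orca for throughput-critical batch
    work, Tuna for cost-optimized online inference."""
    if workload.kind == "batch":
        return "orca"   # continuous batching, prefill/decode optimization
    return "tuna"       # spot + serverless hybrid routing

for job in (Workload("chat-api", "online"), Workload("nightly-eval", "batch")):
    print(f"{job.name} -> {route(job)}")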

One Brain. Two Engines.

Koi orchestrates. Tuna and Orca execute. The right engine for every workload, automatically.

Tuna Engine

Minimum Cost

Hybrid routing between spot and serverless. Get spot economics with serverless reliability for cost-optimized online inference.

Learn about Tuna →

Koi Intelligence (Coming Soon)

The Brain

The orchestration layer. Selects GPUs, manages clusters, forecasts SLOs, and routes workloads to Tuna or Orca.

Learn about Koi →

Orca Engine (Coming Soon)

Maximum Throughput

Production-ready batched inference with continuous batching and prefill/decode optimization. Maximum GPU utilization.

Learn about Orca →

Koi automatically selects the optimal engine for your workload, or you can run Tuna and Orca independently as open source engines.

Three Layers, Compounding Savings

Each layer tackles a different cost lever.

Koi - The Intelligence Layer

Automated Orchestration

Koi selects the right GPUs for each workload and prevents overprovisioning. No more paying for capacity you don't need: Koi right-sizes your cluster in real time.

Tuna Engine

Lower compute rates

Spot instances are 67–80% cheaper than on-demand. Tuna intelligently routes between spot and serverless, getting spot economics with serverless reliability for cost-optimized online inference.

Orca Engine

Maximum GPU utilization

Continuous batching and prefill/decode optimization extract more work from each GPU. Higher utilization means serving the same load with fewer GPUs, reducing your total GPU-hours needed.

All three layers compound automatically

You tell us what model you need and your SLO. Koi deploys in your VPC or on-prem cluster, selects the optimal GPUs, and routes workloads to the right engine. Smart provisioning (Koi) plus lower rates (Tuna) plus higher efficiency (Orca) compound into dramatic cost reductions while maintaining performance.
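
As a back-of-the-envelope illustration of how the layers compound, the sketch below multiplies three independent savings. The percentages are assumptions chosen for the example (only the spot discount range comes from this page), not measured results.

compounding_savings_sketch.py
# Illustrative arithmetic with assumed numbers
baseline = 100.0           # $/day on always-on, overprovisioned on-demand GPUs (assumed)

rightsizing_cut = 0.30     # assume Koi trims 30% of overprovisioned capacity
spot_discount = 0.70       # Tuna: spot runs 67-80% below on-demand; take 70% for the example
utilization_gain = 0.40    # assume Orca's batching serves the same load with 40% fewer GPU-hours

cost = baseline * (1 - rightsizing_cut) * (1 - spot_discount) * (1 - utilization_gain)
print(f"${baseline:.0f}/day -> ${cost:.2f}/day "
      f"({100 * (1 - cost / baseline):.0f}% lower under these assumptions)")
# Savings multiply rather than add: 0.70 * 0.30 * 0.60 = 0.126 of the baseline.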

What You Get

Production-ready inference infrastructure without the complexity.

Distributed Runtime

Mixed GPUs (A100s, H100s, L40s) act as one unified runtime. Tandemn handles model sharding, KV cache management, and spot preemptions automatically.

Koi Orchestration

Koi selects GPUs, forecasts SLOs, and manages your cluster. It finds the optimal configuration for your throughput and cost targets automatically.

Flexible Workloads

Run batch jobs or production traffic through the same API. Koi routes to the right engine: Tuna for cost, Orca for throughput.

Open Source

Tuna and Orca are open source with transparent benchmarks. Your infrastructure, your control, no vendor lock-in.

Tuna Engine

Minimum Cost

Spot economics with serverless reliability

When Koi detects cost-sensitive online inference workloads, it routes them to the Tuna engine, intelligently splitting traffic between spot and serverless for optimal cost and availability.

The problem with spot

Spot GPUs are 67–80% cheaper than on-demand, but they're slow to provision and can be interrupted without warning. Most teams avoid spot because the operational complexity isn't worth the savings.

How Tuna solves it

See Tuna in action: one-command deployment, intelligent routing, and transparent cost savings.

tuna-deploy
$
⚡ Spinning up serverless
☁️ Provisioning spots
✓ Ready! 172.18.43.91

Instant cold starts

tuna-status
$
Traffic Dashboard
Serverless / Spot traffic split (live)

Intelligent routing

tuna-cost
$
Cost (5 min)
Tuna: $0.65
100s serverless + 200s spot
 
On-demand always-on: $3.25
Serverless always-on: $2.17
Saved: $2.60 (80%)

Cost savings
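
The routing idea behind these demos, reduced to a sketch: serve on warm serverless while a cheaper spot replica provisions, then shift traffic once spot is healthy, and fall back if it is preempted. The helper names and the readiness flag are stand-ins, not Tuna's real interface.

tuna_routing_sketch.py
# Hypothetical sketch of spot + serverless hybrid routing
def hybrid_route(spot_healthy: bool, serve_spot, serve_serverless):
    """Prefer the cheap spot path when it is up; serverless absorbs cold
    starts and spot preemptions."""
    return serve_spot if spot_healthy else serve_serverless

def spot(req):        # stand-in backend
    return f"spot handled {req}"

def serverless(req):  # stand-in backend
    return f"serverless handled {req}"

for healthy, req in [(False, "r1"), (True, "r2")]:  # spot still provisioning, then ready
    print(hybrid_route(healthy, spot, serverless)(req))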

Works with your stack

Integrates with Modal, RunPod, and Cloud Run for serverless, and uses AWS via SkyPilot for spot capacity.

Full cost transparency

  • Real-time spend tracking
  • Routing split (% spot vs serverless)
  • Savings vs baseline scenarios
  • Per-component cost breakdown

Best for cost-sensitive online inference workloads

Stop overprovisioning serverless. Stop avoiding spot because it's complex. Let Koi route your cost-sensitive workloads to Tuna; hybrid routing handles the rest.

Koi Intelligence (Coming Soon)

The Brain Behind Your Inference

Orchestration that makes everything intelligent

Koi is the intelligence layer that sits above Tuna and Orca. It selects GPUs, manages your cluster, forecasts SLOs, and routes workloads, so the engines can focus purely on execution.

Why a separate intelligence layer?

Tuna and Orca are execution engines: they're great at running inference fast and cheap. But someone needs to decide which GPUs to use, when to scale, and where to route each workload. That's Koi. By separating intelligence from execution, each layer can be optimized independently. Run the engines as open source, and connect Koi as the orchestration brain.

How Koi orchestrates

Watch Koi manage GPU selection, SLO forecasting, and workload routing in real time.

koi-submit
$
📊 Analyzing workload...
10,000 prompts, Llama-70B
🔍 Scanning cluster GPUs
✓ Selected: 4x A100-80GB
Routing to Orca (batch)
🚀 Dispatched to Orca

GPU selection + routing

koi-monitor
$
📈 SLO Forecast
Target: 2hr
Predicted: 2hr 45m ⚠️
🔄 Scaling: +2 GPUs
New forecast: 1hr 48m ✓
SLO: Safe

SLO forecasting

koi-rebalance
$
🚨 Urgent job incoming
⏸️ Preempting low-priority
(SLO: 24hr, safe to pause)
🎯 Reassigning GPUs:
Prefill: 4x H100
Decode: 4x MI300X

Dynamic rebalancing

GPU Selection

Scans your VPC quota and on-prem availability to choose the optimal GPU configuration for each workload.

SLO Forecasting

Continuously predicts whether SLOs will be met and proactively scales before deadlines are at risk.
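
A minimal sketch of the forecasting loop this card describes, with assumed throughput numbers shaped loosely like the koi-monitor demo above; the function names are illustrative, not Koi's internals.

slo_forecast_sketch.py
# Illustrative SLO forecasting and proactive scaling
import math

def forecast_seconds(remaining_tokens: float, tok_per_sec_per_gpu: float, gpus: int) -> float:
    """Predict time to finish the remaining work at current throughput."""
    return remaining_tokens / (tok_per_sec_per_gpu * gpus)

def gpus_for_slo(remaining_tokens: float, tok_per_sec_per_gpu: float, slo_seconds: float) -> int:
    """Smallest GPU count whose forecast lands inside the SLO."""
    return max(1, math.ceil(remaining_tokens / (tok_per_sec_per_gpu * slo_seconds)))

remaining, rate, gpus, slo = 4_800_000, 150.0, 4, 2 * 3600   # assumed numbers
eta = forecast_seconds(remaining, rate, gpus)
if eta > slo:
    print(f"Forecast {eta/3600:.1f}h breaches the {slo/3600:.0f}h SLO "
          f"-> scale to {gpus_for_slo(remaining, rate, slo)} GPUs")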

Workload Routing

Automatically routes workloads to Tuna (cost-sensitive) or Orca (throughput-critical) based on requirements.

Cluster Management

Manages GPU capacity, preempts low-priority jobs, and rebalances resources across your fleet in real time.

Hosted SaaS or self-managed

Koi runs as a hosted SaaS: connect your engines via an API key and let Tandemn handle the intelligence. For teams that need full control, Koi can also be self-hosted.

What Koi controls

  • GPU selection and allocation
  • SLO forecasting and autoscaling
  • Workload routing to engines
  • Cluster rebalancing and preemption

Join the waitlist

Koi is coming soon. Get early access to the intelligence layer that makes Tuna and Orca work together seamlessly.

Orca Engine (Coming Soon)

Maximum Throughput Execution

High-performance batch inference engine

Orca is the execution engine for throughput-critical workloads. It receives assignments from Koi, loads models, runs continuous batching with prefill/decode optimization, and maximizes GPU utilization.

Pure execution, maximum throughput

Orca focuses on one thing: getting the most tokens per second out of your GPUs. Continuous batching, prefill/decode splitting, and KV cache optimization ensure every GPU cycle counts. Koi handles the intelligence; Orca handles the execution.
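
A toy sketch of the continuous-batching loop this describes: requests join the in-flight batch at every step and leave as soon as they finish, so the GPU never waits for a batch boundary. It illustrates the technique, not Orca's actual scheduler.

continuous_batching_sketch.py
# Illustrative continuous batching loop (requests modeled as tokens left to decode)
from collections import deque

def serve(waiting: deque, max_batch: int = 4):
    """Keep the batch full: admit new requests every step and retire
    finished ones immediately, never waiting for a batch boundary."""
    in_flight = []
    steps = 0
    while waiting or in_flight:
        # Admit new requests whenever the running batch has room
        while waiting and len(in_flight) < max_batch:
            in_flight.append(waiting.popleft())
        # One decode step for the whole batch; requests with one token left finish here
        in_flight = [left - 1 for left in in_flight if left > 1]
        steps += 1
    return steps

# Each request is a countdown of tokens still to decode
print("decode steps:", serve(deque([8, 3, 5, 13]), max_batch=2))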

How Orca executes

Watch Orca receive work from Koi, process batches, and report throughput metrics.

orca-run
$
📥 Assignment from Koi:
Model: Llama-70B
GPUs: 4x A100-80GB
⚡ Loading model
✓ Model loaded, TP=4
🚀 Running batch...

Receives and executes

orca-throughput
$
GPU Utilization
Util: 94%
Batch size: 128
KV cache: 87% hit
 
⚡ 1,820 tok/s

Throughput metrics

orca-complete
$
📊 Batch Progress
[████████████░░] 85%
8,500 / 10,000 prompts
 
✓ Batch complete!
Reporting to Koi...

Progress and completion

Continuous Batching

Dynamic batching that adds new requests to in-flight batches, maximizing throughput without waiting for batch boundaries.

Prefill/Decode Split

Assigns compute-bound prefill and memory-bound decode to different GPU types for optimal hardware utilization.
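
A small sketch of the phase split this card describes: prefill is compute-bound, decode is memory-bandwidth-bound, so each phase is pointed at the pool that matches its bottleneck. The pool contents below are assumptions, echoing the earlier koi-rebalance example.

prefill_decode_sketch.py
# Hypothetical phase-aware GPU assignment
GPU_POOLS = {
    "compute_heavy": ["H100-0", "H100-1"],        # prefill: large matmuls over the full prompt
    "bandwidth_heavy": ["MI300X-0", "MI300X-1"],  # decode: one token at a time, KV-cache reads dominate
}

def pool_for(phase: str) -> list:
    """Route each phase of a request to the pool matching its bottleneck."""
    return GPU_POOLS["compute_heavy" if phase == "prefill" else "bandwidth_heavy"]

print("prefill ->", pool_for("prefill"))
print("decode  ->", pool_for("decode"))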

KV Cache Optimization

LMCache integration for intelligent KV cache management, reducing redundant computation across similar prompts.

Heterogeneous GPU Support

Works across A100s, H100s, MI300X, and mixed fleets, extracting maximum performance from whatever hardware you have.

Built for throughput-critical batch workloads

Large-scale inference jobs, data processing pipelines, offline evaluations: Orca executes them with maximum GPU efficiency while Koi handles the orchestration. Coming soon.

Use Cases

Built for teams that won't accept GPU costs as an unavoidable tax.

AI Product Teams

Handle spiky traffic without overprovisioning. Koi routes cost-sensitive traffic to Tuna, which keeps endpoints responsive while routing to cheaper compute automatically.

Batch Inference

Large-scale workloads with SLO deadlines. Koi selects optimal GPUs and forecasts completion, while Orca executes with maximum throughput.

Research Platforms

Not every cluster is pristine. Tandemn unifies mixed hardware into a cohesive runtime: A100s, H100s, and MI300X all working together.

Data Processing

Offline evaluations and large data jobs with predictable cost and completion times. Submit your workload and forget about infrastructure.

Pricing

Pay for inference, not for idle capacity.

Open Source

Tuna + Orca engines, free and self-hosted. Engine-level savings only; GPU selection and orchestration are manual.

  • Full engine source code
  • Community support
  • Self-hosting friendly
  • Run engines independently
  • Manual cluster management

Koi Managed (Coming Soon)

Hosted orchestration with API key connection. 20-40% additional savings from Koi's intelligent GPU selection and rightsizing.

  • Koi hosted SaaS
  • Connect engines via API key
  • Automatic GPU selection + SLO forecasting
  • Cluster management dashboard
  • Eliminates overprovisioning waste

Enterprise (Coming Soon)

Private Koi deployments with SLA support. Maximum savings with full Koi capabilities, custom tuning, and dedicated optimization.

  • Private Koi deployment
  • SLA support
  • Custom orchestration integration
  • Dedicated engineering support
  • Advanced cost optimization tuning

About Tandemn

We're abstracting the hardware layer from AI software.

Massive GPU capacity sits underutilized while teams struggle with access and cost. We're unifying that capacity into infrastructure that actually works: open source, transparent, and built for production.

Meet the Team

  • Open Source: core repos, community support
  • Built in Public: transparent benchmarks, audit-ready
  • Production Ready: designed for real workloads

Get in Touch

Let's make your inference costs predictable.

FAQ

What happens if a spot instance is preempted?
When using the Tuna engine, Tandemn automatically falls back to serverless if spot instances are preempted or unhealthy.

How does batching reduce GPU costs?
Increasing batch size increases GPU utilization, so fewer GPUs are needed for the same workload.
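
A quick illustration with made-up numbers of why that matters for cost; the throughput figures below are assumptions, not benchmarks.

batching_arithmetic.py
# Illustrative only: assumed per-GPU throughput at two batch sizes
import math

demand = 9_000                 # tokens/sec the service must sustain (assumed)
tok_per_sec_batch_1 = 60       # assumed throughput serving one request at a time
tok_per_sec_batch_32 = 900     # assumed throughput with 32 requests batched

print("GPUs without batching:", math.ceil(demand / tok_per_sec_batch_1))   # 150
print("GPUs with batching:   ", math.ceil(demand / tok_per_sec_batch_32))  # 10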

What is Koi, and do I need it?
Koi is the orchestration layer that manages GPU selection, SLO forecasting, and workload routing. You can run Tuna and Orca independently without Koi, but Koi makes them work together intelligently and handles scaling automatically.

Is Koi hosted or self-hosted?
Koi is primarily offered as a hosted SaaS: connect your engines via an API key. For teams that need full control, self-hosted Koi deployments are available under the Enterprise plan.