Intelligent inference platform

Maximum Performance. Minimum Cost. Your Hardware.

Running AI on your own hardware means cold starts, idle GPUs, and endless MLOps complexity. Tandemn is the orchestration layer that manages your infra for cost and throughput, so you stop paying for waste.

Automatic GPU selection · Open source engines · Zero MLOps overhead

The Problem

GPU time is expensive. Idle GPU time is a waste.

Where inference costs spiral

cold-start.sh
$ deploy-model llama-70b
Waiting for model to load
Allocating GPU memory
Loading weights (23GB)
⏱️ 240 seconds elapsed...

Cold starts mean you pay to stay ready

config.py
from vllm import LLM, SamplingParams
import torch
import os
 
# Configure GPU memory fraction
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"]="false"
 
llm = LLM(
  model="meta-llama/Llama-2-70b-hf",
  tensor_parallel_size=4,  # must match the 4 GPUs made visible via CUDA_VISIBLE_DEVICES
  dtype="float16",
  gpu_memory_utilization=0.9,
  max_num_seqs=256,
  max_model_len=4096
)
 
sampling_params = SamplingParams(
  temperature=0.8,
  top_p=0.95,
  max_tokens=2048,
  presence_penalty=0.0,
  frequency_penalty=0.0
)
 
# Set KV cache config
cache_config = {
  "block_size": 16,
  "num_gpu_blocks": 2048,
  "num_cpu_blocks": 512
}
 
# Configure scheduler
scheduler_config = {
  "max_num_batched_tokens": 8192,
  "max_num_seqs": 256,
  "max_paddings": 512
}

Complex MLOps overhead

production.log
⚠️ Traffic spike detected
ERROR: Queue length: 847 requests
ERROR: All instances saturated
Scaling up... (ETA: 6 min)

Traffic spikes force overprovisioning

gpu-metrics
GPU Utilization Dashboard: per-GPU utilization bars for GPU 0–3, with a low average utilization warning
Poor batching leaves GPUs idle

How Tandemn eliminates these problems

No more cold start delays

Tuna keeps serverless capacity warm while spot instances provision, so your first request is instant

Zero MLOps complexity

We handle model sharding, KV cache management, and orchestration; you just call the API
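
As a rough sketch of what that looks like from the caller's side, the request below targets a hypothetical Tandemn-managed endpoint. The URL, route, auth header, and payload fields are illustrative assumptions, not a documented API.

import requests

# Hypothetical client call to a Tandemn-managed endpoint (illustrative only).
TANDEMN_ENDPOINT = "https://your-deployment.example.com/v1/completions"  # assumed URL
API_KEY = "YOUR_API_KEY"  # assumed auth scheme

response = requests.post(
    TANDEMN_ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-2-70b-hf",
        "prompt": "Summarize the quarterly report in three bullet points.",
        "max_tokens": 256,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())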

Auto-scaling that actually works

Dynamic routing across spot and serverless handles any traffic pattern

Maximum GPU utilization

Intelligent batching and scheduling keep your GPUs busy, not idle

Tell Tandemn what you need. It handles the rest.

No GPU selection. No infrastructure management. Just your model and your intent.

You Specify

model
task

That's it. No GPUs. No infra config.
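
A minimal sketch of what that intent could look like in code, assuming hypothetical field names; this is illustrative, not Tandemn's actual schema.

# Hypothetical intent spec -- field names are illustrative assumptions.
deployment_intent = {
    "model": "meta-llama/Llama-2-70b-hf",
    "task": "online-inference",        # or "batch-inference"
    "slo": {"p95_latency_ms": 500},    # optional target the orchestrator can plan against
}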

Tandemn Orchestrates

Selected 4x A100-80GB
Adding optimized defaults to vllm
Selected 3 replicas
Launching on Modal
Launching in your VPC

Lowest Possible Cost

Your online workload, optimized automatically

$3.25 · 80% saved · Started inference

  • Picks the right GPUs from your fleet
  • Figures out the cheapest or fastest way to run it
  • Rebalances on the fly if something changes
  • Autoscales in real time based on your SLO

How It Works

Deploy once. Tandemn handles the rest.

Deploy in your cluster

Install Tandemn once in your VPC or on-prem environment. Your data never leaves your infrastructure.

  • Full control over your hardware
  • No vendor lock-in
  • Works with heterogeneous GPU fleets

The intelligence layer

Tandemn's brain (Koi) figures out the best way to run every workload. It picks GPUs, forecasts SLOs, and rebalances resources across your fleet automatically.

  • Automatic GPU selection and cluster management
  • SLO forecasting and proactive scaling
  • Shifts workloads between resources to hit your targets
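
To make the idea concrete, here is a simplified, illustrative sketch of SLO-aware GPU selection: pick the cheapest configuration whose forecast throughput meets the target. The fleet, prices, and throughput numbers are invented; Koi's actual policy is not shown here.

from dataclasses import dataclass

@dataclass
class GpuOption:
    name: str
    count: int
    cost_per_hour: float        # $ for the whole configuration (assumed)
    tokens_per_second: float    # forecast throughput for the target model (assumed)

def pick_config(options, required_tps):
    """Return the cheapest configuration forecast to meet the throughput target."""
    feasible = [o for o in options if o.tokens_per_second >= required_tps]
    return min(feasible, key=lambda o: o.cost_per_hour) if feasible else None

fleet = [
    GpuOption("A100-80GB", 4, 14.0, 2400.0),
    GpuOption("H100-80GB", 2, 16.0, 2600.0),
    GpuOption("A100-80GB", 8, 28.0, 4800.0),
]

print(pick_config(fleet, required_tps=2000.0))  # picks the 4x A100-80GB option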

Open source engines

Your workloads run through Tuna and Orca, two open source engines that launch and manage instances on your infrastructure.

  • Tuna: cost-optimized online inference via spot + serverless
  • Orca: max-throughput batch inference via continuous batching (see the sketch below)
  • Run them independently or let Koi orchestrate both
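
As a rough picture of the continuous batching idea behind Orca, the toy loop below admits waiting requests into the in-flight batch as soon as slots free up, so the GPU never drains between requests. The batch size and token counts are arbitrary; this is not Orca's actual scheduler.

from collections import deque

MAX_BATCH = 8  # assumed per-step batch capacity

def run(requests):
    waiting = deque(requests)   # (request_id, tokens_remaining) pairs
    in_flight = []
    steps = 0
    while waiting or in_flight:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain (the core idea of continuous batching).
        while waiting and len(in_flight) < MAX_BATCH:
            req_id, tokens_left = waiting.popleft()
            in_flight.append([req_id, tokens_left])
        # One decode step: every in-flight request generates one token.
        for req in in_flight:
            req[1] -= 1
        in_flight = [req for req in in_flight if req[1] > 0]
        steps += 1
    return steps

print(run([("a", 5), ("b", 3), ("c", 8), ("d", 2)]))  # 8 steps, GPU busy throughout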

Use Cases

Built for teams that won't accept GPU costs as an unavoidable tax.

AI Product Teams

Handle spiky traffic without overprovisioning. Koi routes cost-sensitive traffic to Tuna, which keeps endpoints responsive while routing to cheaper compute automatically.

Batch Inference

Large-scale workloads with SLO deadlines. Koi selects optimal GPUs and forecasts completion, while Orca executes with maximum throughput.

Research Platforms

Not every cluster is pristine. Tandemn unifies mixed hardware into a cohesive runtime: A100s, H100s, and MI300Xs all working together.

Data Processing

Offline evaluations and large data jobs with predictable cost and completion times. Submit your workload and forget about infrastructure.

Pricing

Pay for inference, not for idle capacity.

Open Source

Tuna + Orca engines, free and self-hosted. Engine-level savings only; GPU selection and orchestration are manual.

  • Full engine source code
  • Community support
  • Self-hosting friendly
  • Run engines independently
  • Manual cluster management

Managed Orchestration (Coming Soon)

Hosted orchestration with API key connection. 20-40% additional savings from Koi's intelligent GPU selection and rightsizing.

  • Koi hosted SaaS
  • Connect engines via API key
  • Automatic GPU selection + SLO forecasting
  • Cluster management dashboard
  • Eliminates overprovisioning waste

Enterprise (Coming Soon)

Private Koi deployments with SLA support. Maximum savings with full Koi capabilities, custom tuning, and dedicated optimization.

  • Private Koi deployment
  • SLA support
  • Custom orchestration integration
  • Dedicated engineering support
  • Advanced cost optimization tuning

FAQ

What happens if spot instances are preempted?

When using the Tuna engine, Tandemn automatically falls back to serverless if spot instances are preempted or unhealthy.
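
In rough pseudocode terms, the fallback pattern looks like this; the backend classes are stand-ins for real spot and serverless clients, not Tandemn's API.

class SpotBackend:
    healthy = True
    def infer(self, prompt):
        if not self.healthy:
            raise ConnectionError("spot instance preempted")
        return f"[spot] {prompt}"

class ServerlessBackend:
    def infer(self, prompt):
        return f"[serverless] {prompt}"

def route(prompt, spot, serverless):
    """Prefer the cheaper spot-backed instance; fall back to serverless on failure."""
    try:
        return spot.infer(prompt)
    except ConnectionError:
        return serverless.infer(prompt)

spot, serverless = SpotBackend(), ServerlessBackend()
print(route("hello", spot, serverless))   # served from spot
spot.healthy = False                      # simulate a preemption
print(route("hello", spot, serverless))   # transparently falls back to serverless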

How does better batching reduce costs?

Increasing batch size increases GPU utilization, so fewer GPUs are needed to serve the same workload.
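
A back-of-the-envelope example with made-up throughput numbers shows why:

import math

# Assumed numbers, for illustration only.
requests_per_second = 400
throughput_batch_1 = 30     # requests/s per GPU at batch size 1
throughput_batch_32 = 110   # requests/s per GPU at batch size 32

gpus_small_batch = math.ceil(requests_per_second / throughput_batch_1)   # 14 GPUs
gpus_large_batch = math.ceil(requests_per_second / throughput_batch_32)  # 4 GPUs
print(gpus_small_batch, gpus_large_batch)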

What is Koi, and do I need it to use Tuna and Orca?

Koi is the orchestration layer that manages GPU selection, SLO forecasting, and workload routing. You can run Tuna and Orca independently without Koi, but Koi makes them work together intelligently and handles scaling automatically.

Is Koi hosted or self-hosted?

Koi is primarily offered as a hosted SaaS; connect your engines via an API key. For teams that need full control, self-hosted Koi deployments are available under the Enterprise plan.