Intelligent inference platform

Maximum Performance. Minimum Cost. Your Hardware.

Running AI on your own hardware means cold starts, idle GPUs, and endless MLOps complexity. Tandemn is the orchestration layer that manages your infra for cost and throughput, so you stop paying for waste.

Automatic GPU selection · Open source engines · Zero MLOps overhead

The Problem

GPU time is expensive. Idle GPU time is a waste.

Where inference costs spiral

cold-start.sh
$ deploy-model llama-70b
Waiting for model to load
Allocating GPU memory
Loading weights (23GB)
⏱️ 240 seconds elapsed...

Cold starts mean you pay to stay ready

config.py
from vllm import LLM, SamplingParams
import torch
import os
 
# Configure GPU memory fraction
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"]="false"
 
llm = LLM(
  model="meta-llama/Llama-2-70b-hf",
  tensor_parallel_size=4,  # must match the 4 GPUs made visible via CUDA_VISIBLE_DEVICES
  dtype="float16",
  gpu_memory_utilization=0.9,
  max_num_seqs=256,
  max_model_len=4096
)
 
sampling_params = SamplingParams(
  temperature=0.8,
  top_p=0.95,
  max_tokens=2048,
  presence_penalty=0.0,
  frequency_penalty=0.0
)
 
# Set KV cache config
cache_config = {
  "block_size": 16,
  "num_gpu_blocks": 2048,
  "num_cpu_blocks": 512
}
 
# Configure scheduler
scheduler_config = {
  "max_num_batched_tokens": 8192,
  "max_num_seqs": 256,
  "max_paddings": 512
}

Complex MLOps overhead

production.log
⚠️ Traffic spike detected
ERROR: Queue length: 847 requests
ERROR: All instances saturated
Scaling up... (ETA: 6 min)

Traffic spikes force overprovisioning

gpu-metrics
GPU Utilization Dashboard: per-GPU utilization bars for GPU 0–3, with a low average utilization warning
Poor batching leaves GPUs idle

How Tandemn eliminates these problems

No more cold start delays

Tuna keeps serverless capacity warm while spot instances provision, so your first request is instant

Zero MLOps complexity

We handle model sharding, KV cache management, and orchestration; you just call the API
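
As a rough sketch of what that looks like from the caller's side, the request below targets a hypothetical Tandemn-managed endpoint. The URL, route, auth header, and payload fields are illustrative assumptions, not a documented API.

import requests

# Hypothetical client call to a Tandemn-managed endpoint (illustrative only).
TANDEMN_ENDPOINT = "https://your-deployment.example.com/v1/completions"  # assumed URL
API_KEY = "YOUR_API_KEY"  # assumed auth scheme

response = requests.post(
    TANDEMN_ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-2-70b-hf",
        "prompt": "Summarize the quarterly report in three bullet points.",
        "max_tokens": 256,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())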

Auto-scaling that actually works

Dynamic routing across spot and serverless handles any traffic pattern

Maximum GPU utilization

Intelligent batching and scheduling keep your GPUs busy, not idle

Tell Tandemn what you need. It handles the rest.

No GPU selection. No infrastructure management. Just your model and your intent.

You Specify

model
task

That's it. No GPUs. No infra config.
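
A minimal sketch of what that intent could look like in code, assuming hypothetical field names; this is illustrative, not Tandemn's actual schema.

# Hypothetical intent spec -- field names are illustrative assumptions.
deployment_intent = {
    "model": "meta-llama/Llama-2-70b-hf",
    "task": "online-inference",        # or "batch-inference"
    "slo": {"p95_latency_ms": 500},    # optional target the orchestrator can plan against
}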

Tandemn Orchestrates

Selected 4x A100-80GB
Adding optimized defaults to vllm
Selected 3 replicas
Launching on Modal
Launching in your VPC

Lowest Possible Cost

Your online workload, optimized automatically

$3.25 · 80% saved · Started inference

  • Picks the right GPUs from your fleet
  • Figures out the cheapest or fastest way to run it
  • Rebalances on the fly if something changes
  • Autoscales in real time based on your SLO

How It Works

Deploy once. Tandemn handles the rest.

Deploy in your cluster

Install Tandemn once in your VPC or on-prem environment. Your data never leaves your infrastructure.

  • Full control over your hardware
  • No vendor lock-in
  • Works with heterogeneous GPU fleets

The intelligence layer

Tandemn's brain (Koi) figures out the best way to run every workload. It picks GPUs, forecasts SLOs, and rebalances resources across your fleet automatically.

  • Automatic GPU selection and cluster management
  • SLO forecasting and proactive scaling
  • Shifts workloads between resources to hit your targets
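
To make the idea concrete, here is a simplified, illustrative sketch of SLO-aware GPU selection: pick the cheapest configuration whose forecast throughput meets the target. The fleet, prices, and throughput numbers are invented; Koi's actual policy is not shown here.

from dataclasses import dataclass

@dataclass
class GpuOption:
    name: str
    count: int
    cost_per_hour: float        # $ for the whole configuration (assumed)
    tokens_per_second: float    # forecast throughput for the target model (assumed)

def pick_config(options, required_tps):
    """Return the cheapest configuration forecast to meet the throughput target."""
    feasible = [o for o in options if o.tokens_per_second >= required_tps]
    return min(feasible, key=lambda o: o.cost_per_hour) if feasible else None

fleet = [
    GpuOption("A100-80GB", 4, 14.0, 2400.0),
    GpuOption("H100-80GB", 2, 16.0, 2600.0),
    GpuOption("A100-80GB", 8, 28.0, 4800.0),
]

print(pick_config(fleet, required_tps=2000.0))  # picks the 4x A100-80GB option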

Open source engines

Your workloads run through Tuna and Orca, two open source engines that launch and manage instances on your infrastructure.

  • Tuna: cost-optimized online inference via spot + serverless
  • Orca: max-throughput batch inference via continuous batching (see the sketch below)
  • Run them independently or let Koi orchestrate both
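
As a rough picture of the continuous batching idea behind Orca, the toy loop below admits waiting requests into the in-flight batch as soon as slots free up, so the GPU never drains between requests. The batch size and token counts are arbitrary; this is not Orca's actual scheduler.

from collections import deque

MAX_BATCH = 8  # assumed per-step batch capacity

def run(requests):
    waiting = deque(requests)   # (request_id, tokens_remaining) pairs
    in_flight = []
    steps = 0
    while waiting or in_flight:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain (the core idea of continuous batching).
        while waiting and len(in_flight) < MAX_BATCH:
            req_id, tokens_left = waiting.popleft()
            in_flight.append([req_id, tokens_left])
        # One decode step: every in-flight request generates one token.
        for req in in_flight:
            req[1] -= 1
        in_flight = [req for req in in_flight if req[1] > 0]
        steps += 1
    return steps

print(run([("a", 5), ("b", 3), ("c", 8), ("d", 2)]))  # 8 steps, GPU busy throughout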

Use Cases

Built for teams that won't accept GPU costs as an unavoidable tax.

AI Product Teams

Handle spiky traffic without overprovisioning. Koi routes cost-sensitive traffic to Tuna, which keeps endpoints responsive while routing to cheaper compute automatically.

Batch Inference

Large-scale workloads with SLO deadlines. Koi selects optimal GPUs and forecasts completion, while Orca executes with maximum throughput.

Research Platforms

Not every cluster is pristine. Tandemn unifies mixed hardware into a cohesive runtime: A100s, H100s, and MI300Xs all working together.

Data Processing

Offline evaluations and large data jobs with predictable cost and completion times. Submit your workload and forget about infrastructure.

Pricing

Pay for inference, not for idle capacity.

Open Source

Tuna + Orca engines, free and self-hosted. Engine-level savings only; GPU selection and orchestration are manual.

  • Full engine source code
  • Community support
  • Self-hosting friendly
  • Run engines independently
  • Manual cluster management

Managed Orchestration (Coming Soon)

Hosted orchestration with API key connection. 20-40% additional savings from Koi's intelligent GPU selection and rightsizing.

  • Koi hosted SaaS
  • Connect engines via API key
  • Automatic GPU selection + SLO forecasting
  • Cluster management dashboard
  • Eliminates overprovisioning waste

Enterprise (Coming Soon)

Private Koi deployments with SLA support. Maximum savings with full Koi capabilities, custom tuning, and dedicated optimization.

  • Private Koi deployment
  • SLA support
  • Custom orchestration integration
  • Dedicated engineering support
  • Advanced cost optimization tuning

FAQ

What happens if spot instances are preempted?

When using the Tuna engine, Tandemn automatically falls back to serverless if spot instances are preempted or unhealthy.
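
In rough pseudocode terms, the fallback pattern looks like this; the backend classes are stand-ins for real spot and serverless clients, not Tandemn's API.

class SpotBackend:
    healthy = True
    def infer(self, prompt):
        if not self.healthy:
            raise ConnectionError("spot instance preempted")
        return f"[spot] {prompt}"

class ServerlessBackend:
    def infer(self, prompt):
        return f"[serverless] {prompt}"

def route(prompt, spot, serverless):
    """Prefer the cheaper spot-backed instance; fall back to serverless on failure."""
    try:
        return spot.infer(prompt)
    except ConnectionError:
        return serverless.infer(prompt)

spot, serverless = SpotBackend(), ServerlessBackend()
print(route("hello", spot, serverless))   # served from spot
spot.healthy = False                      # simulate a preemption
print(route("hello", spot, serverless))   # transparently falls back to serverless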

How does better batching reduce costs?

Increasing batch size increases GPU utilization, so fewer GPUs are needed to serve the same workload.
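
A back-of-the-envelope example with made-up throughput numbers shows why:

import math

# Assumed numbers, for illustration only.
requests_per_second = 400
throughput_batch_1 = 30     # requests/s per GPU at batch size 1
throughput_batch_32 = 110   # requests/s per GPU at batch size 32

gpus_small_batch = math.ceil(requests_per_second / throughput_batch_1)   # 14 GPUs
gpus_large_batch = math.ceil(requests_per_second / throughput_batch_32)  # 4 GPUs
print(gpus_small_batch, gpus_large_batch)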

What is Koi, and do I need it to use Tuna and Orca?

Koi is the orchestration layer that manages GPU selection, SLO forecasting, and workload routing. You can run Tuna and Orca independently without Koi, but Koi makes them work together intelligently and handles scaling automatically.

Is Koi hosted or self-hosted?

Koi is primarily offered as a hosted SaaS; connect your engines via an API key. For teams that need full control, self-hosted Koi deployments are available under the Enterprise plan.