GPUs · Kubernetes · Bare metal · Architecture

Bare-metal GPUs vs managed Kubernetes: the case for both

We sell both. Customers ask which is right for them. The honest answer is that most workloads benefit from a clean line between training and serving, each run on its own stack.

Marcus Hale, Field CTO · February 25, 2026 · 7 min read

About a quarter of our customer conversations involve some version of: 'Should we be on bare-metal GPUs or on a managed Kubernetes platform?' The question itself is wrong. Most non-trivial GPU workloads have at least two distinct shapes, training/fine-tuning and serving, and the two want different things.

What training wants

Predictable hardware. Direct access to NCCL. Custom networking. Long-running, stateful processes that should not be evicted by a scheduler optimizing for fairness. The ability to checkpoint to fast local storage. Bare metal is the right answer for training. The orchestration overhead of Kubernetes is a tax you do not want to pay during a training run.
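As a rough illustration of checkpointing to fast local storage, here is a minimal sketch of an atomic checkpoint write. The function names, the file naming scheme, and the use of `pickle` are assumptions made for this example, not part of any particular training framework; a real job would normally use its framework's own serializer.

```python
import os
import pickle
import tempfile


def save_checkpoint(state: dict, directory: str, step: int) -> str:
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step_{step:08d}.ckpt")
    # The temp file lives in the same directory, hence the same filesystem,
    # which is what makes the final rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def latest_checkpoint(directory: str):
    """Return (step, state) for the newest complete checkpoint, or None."""
    if not os.path.isdir(directory):
        return None
    # Zero-padded step numbers sort correctly as plain strings.
    names = sorted(n for n in os.listdir(directory) if n.endswith(".ckpt"))
    if not names:
        return None
    newest = names[-1]
    with open(os.path.join(directory, newest), "rb") as f:
        state = pickle.load(f)
    step = int(newest[len("step_"):-len(".ckpt")])
    return step, state
```

On restart, a job would call `latest_checkpoint` and resume from the returned step. Because only fully written files ever carry the `.ckpt` suffix, an interrupted write cannot be mistaken for a valid checkpoint.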

What serving wants

Elasticity. Healthchecks. Rolling deploys. The ability to spin up a new replica when traffic spikes and tear one down when it drops. Strong observability and alerting. Managed Kubernetes, or something Kubernetes-shaped such as an inference-specific scheduler, is the right answer for serving.
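To make the elasticity point concrete, the sketch below follows the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance. The function name, the default bounds, and the choice of metric are assumptions for illustration, and the real autoscaler adds behaviour (stabilization windows, pod readiness handling) that is left out here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average requests/sec per replica (assumed metric)
    target_metric: float,    # the per-replica value we want to hold
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count under an HPA-style proportional scaling rule."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    # Within tolerance: leave the deployment alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, four replicas each handling 900 requests per second against a target of 600 would scale to six, while a reading of 620 sits inside the 10% tolerance and leaves the count at four.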

The exception is the largest serving workloads, where you want bare metal so you can run a custom scheduler that knows about your model and your traffic shape. If you are above 50,000 tokens per second of sustained throughput, you have probably outgrown generic serving infrastructure.

What goes wrong when you pick one

Teams that try to do training on a managed Kubernetes platform fight the platform. The scheduler does not understand long-running training jobs. Networking is a layer too far from the hardware. Checkpointing is fragile. We have watched teams burn three months trying to make this work before giving up and renting bare metal.

Teams that try to do serving on bare metal end up rebuilding 60% of Kubernetes: service discovery, healthchecks, rolling deploys, autoscaling. They get a worse version of an existing system. The savings on the abstraction tax do not pay for the engineering cost.

Our recommendation

Bare metal for training. Kubernetes (or an inference scheduler) for serving. A clean line between them. Most customers we talk to end up here within their first year on the platform, and the ones who arrive at this layout faster are the ones who do not waste a quarter trying to make one stack do both.
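The recommendation, together with the large-serving exception described earlier, can be written down as a small decision rule. This is only a sketch: the `Workload` type, the `recommend_stack` function, and the workload labels are hypothetical names for this example, and the 50,000 tokens per second threshold is the rule of thumb from this post, not a hard limit.

```python
from dataclasses import dataclass
from enum import Enum


class Stack(Enum):
    BARE_METAL = "bare metal"
    KUBERNETES = "managed Kubernetes or inference scheduler"


# Rule-of-thumb threshold from the post for "largest serving workloads".
LARGE_SERVING_TOKENS_PER_SEC = 50_000


@dataclass(frozen=True)
class Workload:
    kind: str  # "training", "fine-tuning", or "serving" (assumed labels)
    sustained_tokens_per_sec: float = 0.0  # only meaningful for serving


def recommend_stack(workload: Workload) -> Stack:
    """Map a workload to the stack this post recommends for it."""
    if workload.kind in ("training", "fine-tuning"):
        return Stack.BARE_METAL
    if workload.kind == "serving":
        # Very large serving workloads are the exception: custom scheduler on bare metal.
        if workload.sustained_tokens_per_sec > LARGE_SERVING_TOKENS_PER_SEC:
            return Stack.BARE_METAL
        return Stack.KUBERNETES
    raise ValueError(f"unknown workload kind: {workload.kind!r}")
```

A team running one fine-tuning job and one modest inference endpoint would get two different answers from this function, which is the point: the decision is per workload, not per company.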
