Bare-metal GPUs vs managed Kubernetes: the case for both
We sell both. Customers ask which is right for them. The honest answer is most workloads benefit from a clean line between training and serving, run on different stacks.
About a quarter of our customer conversations involve some version of: 'should we be on bare-metal GPUs or on a managed Kubernetes platform?' The honest answer is that the question is wrong. Most non-trivial GPU workloads have at least two distinct shapes — training/fine-tuning and serving — and they want different things.
What training wants
Predictable hardware. Direct access to NCCL. Custom networking. Long-running, stateful processes that should not be evicted by a scheduler optimizing for fairness. The ability to checkpoint to fast local storage. Bare metal is the right answer for training. The orchestration overhead of Kubernetes is a tax you do not want to pay during a training run.
What serving wants
Elasticity. Healthchecks. Rolling deploys. The ability to spin a new replica when traffic spikes and tear one down when it drops. Strong observability and alerting. Managed Kubernetes — or something Kubernetes-shaped, like an inference-specific scheduler — is the right answer for serving.
The exception is the largest serving workloads, where you want bare metal so you can run a custom scheduler that knows about your model and your traffic shape. If you are above 50,000 tokens per second of sustained throughput, you have probably outgrown generic serving infrastructure.
What goes wrong when you pick one
Teams that try to do training on a managed Kubernetes platform fight the platform. The scheduler does not understand long-running training jobs. Networking is a layer too far from the hardware. Checkpointing is fragile. We have watched teams burn three months trying to make this work before giving up and renting bare metal.
Teams that try to do serving on bare metal end up rebuilding 60% of Kubernetes — service discovery, healthchecks, rolling deploys, autoscaling. They get a worse version of an existing system. The savings on the abstraction tax do not pay for the engineering cost.
Our recommendation
Bare metal for training. Kubernetes (or an inference scheduler) for serving. A clean line between them. Most customers we talk to end up here within their first year on the platform, and the ones who arrive at this layout faster are the ones who do not waste a quarter trying to make one stack do both.