Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre
Miro Nikolov
Staff Software Engineering Manager, Google Cloud Managed Lustre
Barak Epstein
Senior Product Manager, Google Cloud Managed Lustre
Significant contributors to this article include Sneha Aradhey, Software Engineer, Google Kubernetes Engine, and Michael MacDonald, Sr Software Engineer, Google Cloud Managed Lustre.
Enterprise production environments are shifting to distributed, multi-node architectures to serve long-context window lengths and agentic AI. As these workloads scale, KVCaches often outgrow local CPU RAM and host SSD cache tiers.
To handle this, some setups attempt to pool node-local storage into a distributed layer (such as multi-node pooled NVMe arrays). Pooling SSDs aggregates raw capacity and often leverages spare local drives, presenting clear advantages. However, there are some limitations: the approach requires the compute cluster to manage its own complex data distribution and cross-node replication.
An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We utilize Google Cloud...
Copyright of this story solely belongs to google.com. To see the full text click HERE