Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

https://storage.googleapis.com/gweb-cloudblog-publish/images/11_-_Developers__Practitioners_a4Y5EGr.max-2600x2600.jpg

Miro Nikolov

Staff Software Engineering Manager, Google Cloud Managed Lustre

Barak Epstein

Senior Product Manager, Google Cloud Managed Lustre

Significant contributors to this article include Sneha Aradhey, Software Engineer, Google Kubernetes Engine, and Michael MacDonald, Sr Software Engineer, Google Cloud Managed Lustre.

Enterprise production environments are shifting to distributed, multi-node architectures to serve long-context window lengths and agentic AI. As these workloads scale, KVCaches often outgrow local CPU RAM and host SSD cache tiers.

To handle this, some setups attempt to pool node-local storage into a distributed layer (such as multi-node pooled NVMe arrays). Pooling SSDs aggregates raw capacity and often leverages spare local drives, presenting clear advantages. However, there are some limitations: the approach requires the compute cluster to manage its own complex data distribution and cross-node replication.

An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We utilize Google Cloud...

Copyright of this story solely belongs to google.com. To see the full text click HERE

Read more