How Good Is PagedAttention at Memory Sharing?

2 days, 23 hours ago hackernoon.com

by @textmodels

We evaluate the effectiveness of memory sharing in PagedAttention with two popular sampling methods: parallel sampling and beam search.
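To make "memory sharing" concrete for the parallel sampling case, here is a minimal sketch of how a saving could be estimated. It is not taken from the vLLM code base. It assumes the KV cache is split into fixed-size blocks, that all samples share the full prompt blocks, that a partially filled last prompt block is copied per sample once generation writes to it (so `output_len` is at least 1), and that the saving is measured as blocks saved divided by blocks needed without sharing. The function names, the default block size of 16 tokens, and the example lengths are illustrative assumptions.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size KV blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


def parallel_sampling_saving(
    prompt_len: int,
    output_len: int,
    num_samples: int,
    block_size: int = 16,  # assumed block size, in tokens
) -> float:
    """Estimated fraction of KV blocks saved by sharing the prompt.

    Hypothetical model: every sample generates output_len (>= 1) tokens
    from the same prompt of prompt_len tokens.
    """
    # Without sharing, each sample stores its own prompt + output.
    unshared = num_samples * blocks_needed(prompt_len + output_len, block_size)

    # With sharing, completely filled prompt blocks are stored once.
    full_prompt_blocks = prompt_len // block_size
    # Tokens in a partially filled last prompt block are copied per sample
    # (copy-on-write), then followed by that sample's own output tokens.
    private_tokens = (prompt_len % block_size) + output_len
    shared = full_prompt_blocks + num_samples * blocks_needed(
        private_tokens, block_size
    )

    return (unshared - shared) / unshared


if __name__ == "__main__":
    # 64-token prompt, 32 output tokens, 4 parallel samples.
    print(parallel_sampling_saving(64, 32, 4))
```

Under these assumptions the saving grows with the number of samples and with the ratio of prompt length to output length, and it is zero for a single sample. Beam search can share more than the prompt, since candidates may also share generated blocks, which this sketch does not model.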

Table of Links

Abstract and 1 Introduction

2 Background and 2.1 Transformer-Based Large Language Models

2.2 LLM Service & Autoregressive Generation

2.3 Batching Techniques for LLMs

3 Memory Challenges in LLM Serving

3.1 Memory Management in Existing Systems

4 Method and 4.1 PagedAttention

4.2 KV Cache Manager

4.3 Decoding with PagedAttention and vLLM

4.4 Application to Other Decoding Scenarios

4.5 Scheduling and Preemption

4.6 Distributed Execution

5 Implementation

6 Evaluation and 6.1 Experimental Setup

6.2 Basic Sampling

6.3 Parallel Sampling and Beam Search

6.4 Shared Prefix

7 Ablation Studies

10 Conclusion, Acknowledgement and References

6.3 Parallel Sampling and Beam ...

Copyright of this story solely belongs to hackernoon.com.