How Good Is PagedAttention at Memory Sharing?
hackernoon.com
We evaluate the effectiveness of memory sharing in PagedAttention with two popular sampling methods: parallel sampling and beam search.
Table of Links
2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Service & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
6 Evaluation and 6.1 Experimental Setup
6.3 Parallel Sampling and Beam Search
10 Conclusion, Acknowledgement and References
6.3 Parallel Sampling and Beam ...
Copyright of this story solely belongs to hackernoon.com.