
Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips


The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful, publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.

In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the Llama 3.2 1B version, but other sizes can be deployed using these steps, along with other popular LLMs.
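As a preview of what the walkthrough builds toward, here is a minimal offline-inference sketch using vLLM's Python API with its Neuron device support. It assumes an Inf2 instance with the AWS Neuron SDK and a Neuron-compatible vLLM build already installed; the model ID, sequence limits, and tensor parallel degree below are illustrative assumptions, not values from this post.

```python
from vllm import LLM, SamplingParams

# Load Llama 3.2 1B on AWS Inferentia (NeuronCores) via vLLM's Neuron backend.
# All values below are illustrative; tune them for your instance size.
llm = LLM(
    model="meta-llama/Llama-3.2-1B",  # assumes access to the model (HF token if gated)
    device="neuron",                  # target AWS Neuron devices instead of GPUs
    tensor_parallel_size=2,           # shard across 2 NeuronCores (e.g., inf2.xlarge)
    max_num_seqs=8,                   # maximum concurrent sequences to compile for
    max_model_len=2048,               # maximum context length to compile for
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["The key benefits of AWS Inferentia are"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Note that the Neuron backend typically compiles the model ahead of time for fixed sequence and batch sizes, so the first load can take several minutes and `max_model_len` and `max_num_seqs` are set at load time.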

Deploy vLLM on AWS Trainium and Inferentia EC2 instances

In the following sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to ...
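For serving rather than offline inference, vLLM exposes an OpenAI-compatible HTTP endpoint. Assuming a server has been started on the instance (for example, with `vllm serve meta-llama/Llama-3.2-1B --device neuron`; these flags are an assumption, so check your vLLM/Neuron build), a client can query it with the standard OpenAI Python SDK:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# The base URL and port are vLLM's defaults; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",  # must match the model the server was started with
    prompt="AWS Trainium and Inferentia are",
    max_tokens=64,
    temperature=0.8,
)
print(completion.choices[0].text)
```

Because the endpoint speaks the OpenAI API, existing applications built against that API can be pointed at the Inf2-hosted model by changing only the base URL.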

