This example deploys a text generation API using Llama-3.1-8B and vLLM. It illustrates how VESSL AI handles the common steps of model deployment, from launching a GPU-accelerated service workload to establishing an API server.
After deployment, VESSL AI also takes over the operational work of managing production models while ensuring availability, scalability, and reliability.
VESSL AI supports this with the following features:
- Autoscaling the model to handle peak loads and scaling to zero when it’s not being used.
- Routing traffic efficiently across different model versions.
- Providing real-time monitoring of predictions and performance metrics through comprehensive dashboards and logs.
Read our announcement post for more details.
What you will do
- Define a text generation API and create a model endpoint
- Define service specifications
- Deploy model to VESSL AI managed GPU cloud
Set up your environment
We’ll start with the Llama 3.1 example, which demonstrates how to deploy an AI service using a single YAML file. Follow these steps to prepare:
# Clone the example repository
git clone https://github.com/vessl-ai/examples.git
# Install and configure vessl
pip install vessl
vessl configure
Deploy a vLLM Llama 3.1 Server with VESSL Service
Configure the resources and environment for the vLLM Llama 3.1 server in a YAML file as follows.
# quickstart.yaml
name: vllm-llama-3-1-server
message: Quickstart to serve Llama 3.1 model with vllm.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-oci-sanjose
  preset: gpu-l4-small-spot
run:
  - command: |
      apt update && apt install -y libgl1
      pip install --upgrade vllm fastapi pydantic
      vllm serve $MODEL_NAME --max-model-len 65536
env:
  MODEL_NAME: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
ports:
  - port: 8000
service:
  autoscaling:
    max: 2
    metric: cpu
    min: 1
    target: 50
  monitoring:
    - port: 8000
      path: /metrics
  expose: 8000
For YAML manifest details, refer to the YAML schema reference.
Deploy your server easily using the YAML configuration and VESSL CLI with the following command:
cd examples/services/service-quickstart
vessl service create -f quickstart.yaml
Upon activation, access your model via the provided endpoint, as depicted below:
Due to compatibility issues between Python and VESSL CLI, executing the command (vessl service create -f quickstart.yaml) may temporarily result in unexpected errors. If this occurs, please use VESSL CLI with Python 3.12 for the time being. We are working on it.
Explore the API Documentation
Access the API documentation by appending /docs to your endpoint URL.
Test the API with an OpenAI Client
For compatibility with OpenAI clients, install the OpenAI Python package:
pip install openai
Test your deployed API using the api-test.py script. Replace YOUR-SERVICE-ENDPOINT with your actual endpoint and execute the command below:
python api-test.py \
--base-url "https://{YOUR-SERVICE-ENDPOINT}" \
--prompt "Can you explain the background concept of LLM?"
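For orientation, a client along these lines could be written with the OpenAI Python package. This is a minimal sketch, not the api-test.py script from the examples repository: the helper names, the optional --model flag, the "EMPTY" placeholder API key, and the default model name (taken from MODEL_NAME in quickstart.yaml) are all assumptions.

```python
import argparse

# Assumption: the served model name equals MODEL_NAME in quickstart.yaml.
DEFAULT_MODEL = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"


def parse_args(argv=None):
    """Parse the two flags used in the command above, plus a hypothetical --model."""
    parser = argparse.ArgumentParser(description="Query an OpenAI-compatible endpoint")
    parser.add_argument("--base-url", required=True, help="Service endpoint URL")
    parser.add_argument("--prompt", required=True, help="User prompt to send")
    parser.add_argument("--model", default=DEFAULT_MODEL, help="Served model name")
    return parser.parse_args(argv)


def build_messages(prompt):
    """Wrap a prompt in the chat-completions message format."""
    return [{"role": "user", "content": prompt}]


def api_base(base_url):
    """Return the OpenAI-style API root: the endpoint URL plus /v1."""
    return base_url.rstrip("/") + "/v1"


def main(argv=None):
    # Imported lazily so the helpers above work without the openai package.
    from openai import OpenAI

    args = parse_args(argv)
    # Assumption: a placeholder key is accepted; pass a real key if your endpoint needs one.
    client = OpenAI(base_url=api_base(args.base_url), api_key="EMPTY")
    response = client.chat.completions.create(
        model=args.model,
        messages=build_messages(args.prompt),
    )
    print(response.choices[0].message.content)


# Example call (requires a live endpoint and `pip install openai`):
# main(["--base-url", "https://YOUR-SERVICE-ENDPOINT", "--prompt", "Hello"])
```

The request is only sent when main() is called, so the parsing and message-building helpers can be checked without a running service.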
Troubleshooting
- NotFound (404): Requested entity not found error while creating Revisions or Gateways via CLI:
  - Use the vessl whoami command to confirm if the default organization matches the one where Service exists.
  - You can use the vessl configure --reset command to change the default organization.
  - Ensure that Service is properly created within the selected default organization.
- What’s the difference between Gateway and Endpoint?
  - There is no difference between the two terms; they refer to the same concept.
  - To prevent confusion, these terms will be unified under “Endpoint” in the future.
- HPA Scale-in/Scale-out Approach:
  - Currently, VESSL Service operates based on Kubernetes’ Horizontal Pod Autoscaler (HPA) and uses its algorithms as is. For detailed information, refer to the Kubernetes documentation.
  - As an example of how it works based on CPU metrics:
    - Desired replicas = ceil[current replicas * ( current CPU metric value / desired CPU metric value )]
    - The HPA constantly monitors this metric and adjusts the current replicas within the [min, max] range.
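The replica formula above can be worked through in a few lines of Python. This is a simplified sketch of the quoted formula only: the function name is hypothetical, the autoscaling target of 50 is assumed to mean percent CPU utilization, and the real HPA also applies a tolerance band and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Apply the quoted HPA formula, then clamp the result to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# Values from quickstart.yaml: min 1, max 2, target 50.
print(desired_replicas(1, 90, 50, 1, 2))   # 1 * 90/50 = 1.8, ceil gives 2
print(desired_replicas(2, 20, 50, 1, 2))   # 2 * 20/50 = 0.8, ceil gives 1
print(desired_replicas(2, 200, 50, 1, 2))  # 2 * 200/50 = 8, clamped to max 2
```

With max set to 2, a sustained load above the target can add at most one extra replica, no matter how high the CPU metric climbs.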
Let’s go ahead and deploy the model.
vessl service create -f quickstart.yaml -a
Once deployed, you can check the status of the model, including the endpoint, logs, and metrics under Services.
To use the Llama 3.1 text generation service, execute the following curl command. Be sure to replace ENDPOINT_URL and API_KEY with your own values. You can find your API_KEY on the FastAPI page under the /v1/chat/completions section.
# The "model" value must match the model the server is serving (MODEL_NAME in quickstart.yaml)
curl -X POST "${ENDPOINT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "X-AUTH-KEY: ${API_KEY}" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    "messages": [
      {
        "role": "system",
        "content": "You are a pirate chatbot who always responds in pirate speak!"
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
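The same request can be issued from Python using only the standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the endpoint and key are placeholders, the model name is assumed to be the MODEL_NAME from quickstart.yaml, and the response is assumed to follow the OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(endpoint_url, api_key, model, messages):
    """Build (but do not send) the POST request that the curl command issues."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "X-AUTH-KEY": api_key},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


request = build_chat_request(
    endpoint_url="https://YOUR-SERVICE-ENDPOINT",  # placeholder
    api_key="YOUR-API-KEY",  # placeholder
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system",
         "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
)
# Uncomment once the placeholders are replaced with real values:
# print(send_chat_request(request))
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and JSON body before any traffic reaches the service.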
What’s next?
Next, let’s see how you can serve your model in serverless mode with Text Generation Inference (TGI).