This solution implements a comprehensive, scalable ML inference architecture using Amazon EKS, leveraging both Graviton processors for cost-effective CPU-based inference and GPU instances for accelerated inference. The system provides a complete end-to-end platform for deploying large language models with agentic AI capabilities, including RAG (Retrieval Augmented Generation) and intelligent document processing.
The architecture diagram illustrates our scalable ML inference solution with the following components:
- Amazon EKS Cluster: The foundation of our architecture, providing a managed Kubernetes environment with automated provisioning and configuration.
- Karpenter Auto-scaling: Dynamically provisions and scales compute resources based on workload demands across multiple node pools.
- Node Pools:
  - Graviton-based nodes (ARM64): Cost-effective CPU inference using m8g/c8g instances
  - GPU-based nodes (x86_64): High-performance inference using NVIDIA GPU instances (g5, g6 families)
  - x86-based nodes: General-purpose compute for compatibility requirements
- Model Hosting Services:
  - Ray Serve: Distributed model serving with automatic scaling
  - Standalone Services: Direct model deployment for specific use cases
  - Multi-modal Support: Text, vision, and reasoning model capabilities
- Model Gateway:
  - LiteLLM Proxy: Unified OpenAI-compatible API gateway with load balancing and routing
  - Ingress Controller: External access management with SSL termination
- Agentic AI Applications:
  - RAG with OpenSearch: Intelligent document retrieval and question answering
  - Intelligent Document Processing (IDP): Automated document analysis and extraction
  - Multi-Agent Systems: Coordinated AI workflows with specialized agents
- Observability & Monitoring:
  - Langfuse: LLM observability and performance tracking
  - Prometheus & Grafana: Infrastructure monitoring and alerting
This architecture provides flexibility to choose between cost-optimized CPU inference on Graviton processors or high-throughput GPU inference based on your specific requirements, all while maintaining elastic scalability through Kubernetes and Karpenter.
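For example, you can see which capacity each inference path would land on by inspecting standard node labels and GPU capacity; the commands below are illustrative and assume the NVIDIA GPU Operator is already advertising nvidia.com/gpu resources:
# Graviton (arm64) vs x86 nodes, and their instance types
kubectl get nodes -L kubernetes.io/arch,node.kubernetes.io/instance-type
# GPU nodes advertise nvidia.com/gpu capacity once the GPU Operator is running
kubectl describe nodes | grep 'nvidia.com/gpu'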
We provide two approaches to set up the complete solution: an automated Makefile-based installation and a manual step-by-step installation.
For the fastest and most reliable setup, use our automated Makefile:
Prerequisites:
- EKS cluster set up following the AWS Solutions Guidance: Automated Provisioning of Application-Ready Amazon EKS Clusters
- `kubectl` configured to access your cluster
- Required CLI tools installed (`pnpm`, `pip`, etc.)
# Install all core components (base infrastructure, models, gateway, observability)
make install
# Or for development setup only (base + models + gateway)
make dev-setup
# Install specific components as needed
make setup-base # Base infrastructure
make setup-models # Model hosting services
make setup-gateway # LiteLLM proxy gateway
make setup-observability # Langfuse monitoring
make setup-milvus # Milvus vector database
make setup-idp # Intelligent Document Processing
make setup-rag # RAG with OpenSearch
make help # Show all available targets
make status # Check deployment status
make verify-cluster # Verify EKS cluster access
make clean # Remove all deployments
If you prefer manual control or need to customize the installation:
Set up your EKS cluster using the AWS Solutions Guidance for automated provisioning:
Follow the AWS Solutions Guidance: Automated Provisioning of Application-Ready Amazon EKS Clusters
This guidance provides:
- Automated EKS cluster provisioning with best practices
- Pre-configured add-ons and operators
- Security and networking configurations
- Monitoring and observability setup
After completing the guidance, ensure your `kubectl` is configured to access the cluster:
# Verify cluster access
kubectl cluster-info
kubectl get nodes
Navigate to the base setup directory and run the installation script:
cd base_eks_setup
chmod +x install_operators.sh
./install_operators.sh
This script installs:
- KubeRay Operator for distributed model serving
- NVIDIA GPU Operator for GPU workloads
- GP3 storage class for optimized storage
- All Karpenter node pools for different workload types
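You can optionally verify these components before continuing. The namespaces and resource names below are the typical defaults for these operators; adjust them if your install script uses different ones:
kubectl get pods -n kuberay-operator   # KubeRay Operator
kubectl get pods -n gpu-operator       # NVIDIA GPU Operator
kubectl get storageclass               # look for the GP3 storage class
kubectl get nodepools                  # Karpenter node pools (NodePool CRD)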
Set up the model hosting infrastructure:
cd model-hosting
chmod +x setup.sh
./setup.sh
This deploys:
- Ray service with LlamaCPP and embedding capabilities
- Standalone vLLM reasoning service
- Standalone vLLM vision service
- All necessary Kubernetes resources and configurations
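A quick way to confirm the model services are coming up (exact resource names depend on the manifests in model-hosting):
kubectl get rayservices -A                  # RayService objects managed by KubeRay
kubectl get pods -A | grep -Ei 'ray|vllm'   # serving pods
kubectl get svc -A | grep -Ei 'ray|vllm'    # serving endpoints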
Set up the unified API gateway:
cd model-gateway
chmod +x setup.sh
./setup.sh
This deploys:
- LiteLLM proxy deployment
- Load balancer and ingress configuration
- Readiness checks that wait for services to be ready before proceeding
Important: After deployment, configure LiteLLM:
- Access the LiteLLM web interface
- Log in with username "admin" and password "sk-123456"
- Go to "Virtual Keys" on the sidebar and create a new key
- Mark "All Team Models" for the models field
- Store the generated secret key - you'll need it for the agentic applications
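Once the virtual key exists, you can smoke-test the gateway with a standard OpenAI-compatible request. The URL and key placeholders below come from your deployment, and the model name must match one registered in LiteLLM:
export LITELLM_URL=<your-litellm-gateway-url>
export LITELLM_KEY=<your-virtual-key>
curl -s "${LITELLM_URL}/v1/chat/completions" \
  -H "Authorization: Bearer ${LITELLM_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"model": "<registered-model-name>", "messages": [{"role": "user", "content": "Hello"}]}'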
Deploy monitoring and observability tools:
cd model-observability
chmod +x setup.sh
./setup.sh
This installs:
- Langfuse for LLM observability
- Web ingress for external access
- Service monitoring and logging
Important: After deployment, configure Langfuse:
- Access Langfuse web interface
- Create an organization named "test"
- Create a project inside it named "demo"
- Go to "Tracing" menu and set up tracing
- Record the Public Key (PK) and Secret Key (SK) - you'll need these for the agentic applications
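It can help to keep the Langfuse connection details in environment variables so they are ready for the agentic applications; the pk-lf-/sk-lf- prefixes are the usual Langfuse key format, and the values below are placeholders:
kubectl get ingress -A | grep -i langfuse    # locate the Langfuse endpoint (assuming the ingress name contains "langfuse")
export LANGFUSE_HOST=<your-langfuse-endpoint>
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...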
Set up the IDP application for automated document analysis:
cd agentic-apps/agentic-idp
# Create and configure environment file
cp .env.example .env
# Edit .env with your configuration:
# - LLAMA_VISION_MODEL_KEY=your-litellm-virtual-key (from Step 4)
# - API_GATEWAY_URL=your-litellm-gateway-url
# - LANGFUSE_HOST=your-langfuse-endpoint
# - LANGFUSE_PUBLIC_KEY=your-langfuse-public-key (from Step 5)
# - LANGFUSE_SECRET_KEY=your-langfuse-secret-key (from Step 5)
# Install dependencies
pip install -r requirements.txt
# Run the IDP application
python agentic_idp.py
Set up the RAG application with OpenSearch vector database:
cd agentic-apps/agentic_rag_opensearch
# Set up OpenSearch cluster
chmod +x setup-opensearch.sh
./setup-opensearch.sh
# Create and configure environment file
cp .env.example .env
# Edit .env with your configuration:
# - OPENAI_API_KEY=your-litellm-virtual-key (from Step 4)
# - OPENAI_BASE_URL=your-model-endpoint-url
# - LANGFUSE_HOST=your-langfuse-endpoint
# - LANGFUSE_PUBLIC_KEY=your-langfuse-public-key (from Step 5)
# - LANGFUSE_SECRET_KEY=your-langfuse-secret-key (from Step 5)
# Note: OPENSEARCH_ENDPOINT and AWS_REGION are automatically set by setup-opensearch.sh
# Install dependencies
pnpm install
# Embed knowledge documents
pnpm embed-knowledge
# Run the multi-agent RAG application
pnpm dev
After running either installation method, you'll need to configure the following:
- Access the LiteLLM web interface
- Log in with username "admin" and password "sk-123456"
- Go to "Virtual Keys" on the sidebar and create a new key
- Mark "All Team Models" for the models field
- Store the generated secret key for use in agentic applications
- Access Langfuse web interface
- Create an organization named "test"
- Create a project inside it named "demo"
- Go to "Tracing" menu and set up tracing
- Record the Public Key (PK) and Secret Key (SK) for use in agentic applications
The Makefile provides comprehensive automation for the entire deployment process. Here's a detailed breakdown of all available targets:
| Target | Description | Dependencies |
|---|---|---|
| `make install` | Complete installation of all core components | `verify-cluster`, `setup-base`, `setup-models`, `setup-gateway`, `setup-observability` |
| `make dev-setup` | Quick development setup (core components only) | `verify-cluster`, `setup-base`, `setup-models`, `setup-gateway` |
| Target | Description | What it installs |
|---|---|---|
| `make setup-base` | Base infrastructure components | KubeRay Operator, NVIDIA GPU Operator, GP3 storage class, Karpenter node pools |
| `make setup-models` | Model hosting services | Ray service with LlamaCPP, vLLM reasoning service, vLLM vision service |
| `make setup-gateway` | Model gateway | LiteLLM proxy deployment, load balancer, ingress configuration |
| `make setup-observability` | Monitoring and observability | Langfuse for LLM observability, web ingress |
| `make setup-milvus` | Vector database | Milvus standalone deployment with cert-manager and EBS storage |
| `make setup-idp` | Intelligent Document Processing | Environment setup and dependency installation |
| `make setup-rag` | RAG with OpenSearch | OpenSearch cluster setup and Node.js dependencies |
| Target | Description | Use case |
|---|---|---|
| `make help` | Show all available targets and prerequisites | Getting started, reference |
| `make verify-cluster` | Verify EKS cluster access | Troubleshooting, pre-installation check |
| `make status` | Check deployment status across all namespaces | Monitoring, troubleshooting |
| `make clean` | Remove all deployments | Cleanup, fresh start |
| `make setup-function-calling` | Deploy function calling service | Agentic AI with external tool integration |
| `make setup-benchmark` | Performance benchmarking setup instructions | Performance testing |
The Makefile includes several advanced features for better user experience:
- Sequential Dependencies: Each target automatically runs its prerequisites
- Environment File Management: Automatically creates `.env` templates with configuration instructions
- Error Handling: Graceful handling of missing files and failed operations
- Status Feedback: Clear progress indicators and next-step instructions
- Configuration Reminders: Important post-deployment configuration steps for LiteLLM and Langfuse
# One command to set up everything
make install
# Quick setup for development
make dev-setup
# Add specific components as needed
make setup-milvus
make setup-idp
# Check cluster connectivity
make verify-cluster
# Check deployment status
make status
# Clean up and start fresh
make clean
make install
# Install components individually with full control
make setup-base
make setup-models
make setup-gateway
make setup-observability
Deploy an elastic Ray service hosting a Llama 3.2 model on Graviton:
1. Set your Hugging Face token in the `HUGGING_FACE_HUB_TOKEN` env entry in the secret section of `ray-service-llamacpp-with-function-calling.yaml`.
2. Adjust the following environment variables in the same manifest:
   - `MODEL_ID`: Hugging Face model repository
   - `MODEL_FILENAME`: Model file name in the Hugging Face repo
   - `N_THREADS`: Number of threads for inference (recommended: match the host EC2 instance vCPU count)
   - `CMAKE_ARGS`: C/C++ compile flags for llama.cpp on Graviton
Note: The example model uses the GGUF format, which is optimized for llama.cpp; see the GGUF documentation for details. Different quantization versions of the model are available in these Hugging Face repos: https://huggingface.co/bartowski and https://huggingface.co/unsloth
Note: For function calling, a reasoning model works best, such as the Qwen QwQ-32B model used in this example.
kubectl create -f ray-service-llamacpp-with-function-calling.yaml
kubectl get svc
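To smoke-test the service once the pods are running, port-forward the Serve service (the same service name is used in the benchmarking section below) and send an OpenAI-compatible request; the model name depends on how the Serve application registers it, so treat it as a placeholder:
kubectl port-forward service/ray-service-llamacpp-serve-svc 8000:8000 &
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "Hello from Graviton"}]}'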
Deploy an elastic Ray service hosting models on GPU instances using vLLM:
1. Set your Hugging Face token in the `HUGGING_FACE_HUB_TOKEN` env entry in the secret section of `ray-service-vllm-with-function-calling.yaml`.
2. Adjust the following environment variables in the same manifest:
   - `MODEL_ID`: Hugging Face model repository (default: mistralai/Mistral-7B-Instruct-v0.2)
   - `GPU_MEMORY_UTILIZATION`: Fraction of GPU memory to utilize (default: 0.9)
   - `MAX_MODEL_LEN`: Maximum sequence length for the model (default: 8192)
   - `MAX_NUM_SEQ`: Maximum number of sequences to process in parallel (default: 4)
   - `MAX_NUM_BATCHED_TOKENS`: Maximum number of tokens in a batch (default: 32768)
   - `ENABLE_FUNCTION_CALLING`: Set to "true" to enable function calling support
kubectl create namespace rayserve-vllm
kubectl create -f ray-service-vllm-with-function-calling.yaml
kubectl get svc -n rayserve-vllm
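To test the GPU-backed service, pick the Serve service name from the command above, port-forward it, and send an OpenAI-compatible request. The service name is a placeholder, and the model name shown is the default MODEL_ID, which may differ from the name the Serve application registers:
kubectl port-forward -n rayserve-vllm service/<vllm-serve-svc-name> 8000:8000 &
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'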
This solution supports building agentic AI applications that can leverage either CPU-based (llama.cpp) or GPU-based (vLLM) model inference backends. The agent architecture enables models to call external functions and services.
Agentic AI refers to AI systems that can act autonomously to achieve specific goals by making decisions and taking actions. In this solution, we implement agentic capabilities through function calling, which allows language models to:
- Recognize when to use tools: The model identifies when external data or capabilities are needed to fulfill a user request
- Structure function calls: The model generates properly formatted function calls with appropriate parameters
- Process function results: The model incorporates returned data into its responses
Function calling enables models to bridge the gap between natural language understanding and external systems, allowing them to:
- Access real-time information (like weather data in our example)
- Perform calculations or data transformations
- Interact with external APIs and services
- Execute specific actions based on user requests
Our implementation provides a framework where the model:
- Parses user intent from natural language
- Determines which function to call and with what parameters
- Makes the API call through a dedicated service
- Processes the returned information to generate a coherent response
This approach significantly extends the capabilities of language models beyond their pre-trained knowledge, making them more useful for real-world applications.
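As a concrete illustration of this flow, the request below uses the OpenAI-style "tools" format that OpenAI-compatible endpoints (the LiteLLM gateway, or a vLLM deployment with function calling enabled) accept. The get_weather function, endpoint, and model name are hypothetical placeholders; the deployed weather agent exposes its own /chat endpoint, shown later in this section:
curl -s "<your-openai-compatible-endpoint>/v1/chat/completions" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<registered-model-name>",
        "messages": [{"role": "user", "content": "What is the current weather in London?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
# A function-calling-capable model responds with a tool_calls entry naming get_weather
# and its JSON arguments; the agent executes the call and feeds the result back to the model.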
The function service is defined in `agent/kubernetes/combined.yaml` and includes:
- A Kubernetes Secret for API credentials
- A Deployment for the function service (weather service example)
- A LoadBalancer Service to expose the function API
kubectl apply -f agent/kubernetes/combined.yaml
- For CPU-based inference: use `ray-service-llamacpp-with-function-calling.yaml`
- For GPU-based inference: use `ray-service-vllm-with-function-calling.yaml` with `ENABLE_FUNCTION_CALLING: "true"`
Once deployed, you can test the weather function service using a simple curl command:
curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is the current weather in London?"}'
curl -X POST http://<YOUR-LOAD-BALANCER-URL>/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is the future 2 days weather in London?"}'
The service will:
- Process your natural language query
- Identify the need to call the weather function
- Make the appropriate API call
- Return the weather information in a conversational format
Milvus is an open-source vector database that powers embedding similarity search and AI applications. This section guides you through deploying Milvus on your EKS cluster with Graviton processors.
- Your EKS cluster is already set up with Graviton (ARM64) nodes
- Cert-manager is installed on the cluster
- AWS EBS CSI driver is configured for persistent storage
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml
kubectl get pods -n cert-manager
kubectl apply -f https://raw.githubusercontent.com/zilliztech/milvus-operator/main/deploy/manifests/deployment.yaml
kubectl get pods -n milvus-operator
kubectl apply -f milvus/ebs-storage-class.yaml
kubectl apply -f milvus/milvus-standalone.yaml
kubectl apply -f milvus/milvus-nlb-service.yaml
You can access Milvus using port-forwarding:
kubectl port-forward service/my-release-milvus 19530:19530
Or through the Network Load Balancer if you deployed the NLB service.
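To confirm the deployment is healthy, you can also probe Milvus's HTTP health endpoint; Milvus standalone normally exposes it on port 9091, assuming the service publishes the metrics port:
kubectl port-forward service/my-release-milvus 9091:9091 &
curl -s http://localhost:9091/healthz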
The MCP service enables augmented LLM capabilities by combining tool usage with Retrieval Augmented Generation (RAG) for enhanced context awareness. This implementation is framework-independent, not relying on LangChain or LlamaIndex.
The MCP service consists of several modular components:
- Agent: Coordinates workflow and manages tool usage
- ChatOpenAI: Handles interactions with the language model and tool calling
- MCPClient: Connects to MCP servers and manages tool calls
- EmbeddingRetriever: Creates and searches vector embeddings for relevant context
- VectorStore: Interfaces with Milvus for storing and retrieving embeddings
1. Knowledge Embedding
   - Documents from the `knowledge` directory are converted to vector embeddings
   - Embeddings and source documents are stored in the Milvus vector database
2. Context Retrieval (RAG)
   - User queries are converted to embeddings
   - The system finds relevant documents by calculating similarity between embeddings
   - Top matching documents form the context for the LLM
3. MCP Tool Setup
   - MCP clients connect to tool servers (e.g., filesystem operations)
   - Tools are registered with the agent
4. Task Execution
   - User tasks are processed by the LLM with retrieved context
   - The LLM may use tools via MCP clients
   - Tool results are fed back to the LLM to continue the conversation
Create a `.env` file in the `mcp` directory with:
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=your_openai_model_inference_endpoint
EMBEDDING_BASE_URL=https://bedrock-runtime.us-west-2.amazonaws.com
EMBEDDING_KEY=not_needed_for_aws_credentials
AWS_REGION=us-west-2
MILVUS_ADDRESS=your_milvus_service_address
cd mcp
pnpm install
pnpm dev
This modular architecture can be extended by:
- Adding more MCP servers for additional tool capabilities
- Implementing advanced Milvus features like filtering and hybrid search
- Adding more sophisticated RAG techniques
- Implementing conversation history for multi-turn interactions
- Deploying as a service with API endpoints
Our client program generates prompts at different concurrency levels for each run. Every run uses common GenAI-related prompts assembled into standard HTTP requests, and the number of concurrent calls keeps increasing until CPU usage on the worker pods reaches nearly 100%. We capture the total time from when an HTTP request is initiated to when the response is received as the latency metric, and the number of output tokens generated per second as the throughput metric. The test aims to drive the worker pods to maximum CPU utilization in order to assess concurrency performance.
Follow these steps if you want to set it up and replicate the experiment:
Launch an EC2 instance as the client in the same AZ as the Ray cluster. For optimal performance testing, use a compute-optimized instance such as c6i.16xlarge so the client can generate sufficient load. If worker node CPU utilization remains flat despite increasing concurrent requests, the test client may be reaching its own capacity limits; in that case, launch additional EC2 instances to generate higher concurrent load.
kubectl port-forward service/ray-service-llamacpp-serve-svc 8000:8000
Install a Go environment on the client EC2 instance (refer to this guidance for Go installation). Then set the following environment variables as the test configuration:
export URL=http://localhost:8000/v1/chat/completions
export REQUESTS_PER_PROMPT=<The_number_of_concurrent_calls>
export NUM_WARMUP_REQUESTS=<The_number_of_warmup_requests>
Run the Go benchmark script; the results are printed in its output.
go run perf_benchmark.go
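While the benchmark runs, it helps to confirm the worker pods are actually approaching full CPU utilization. Assuming metrics-server is installed and the default KubeRay ray.io/node-type labels are present, you can watch utilization with:
kubectl top pods -A -l ray.io/node-type=worker   # Ray worker pod CPU/memory
kubectl top nodes                                # node-level utilization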
Please contact wangaws@ or fmamazon@ if you want to know more and/or contribute.