TraceMyPods - DevOps & Cloud Architecture Guide
Table of Contents
- Infrastructure Overview
- AWS Resources
- Kubernetes Architecture
- CI/CD Pipeline
- Observability Stack
- Security Considerations
- Scaling Strategy
- Disaster Recovery
- Cost Optimization
- Operational Procedures
Infrastructure Overview
TraceMyPods is a production-grade AI platform deployed on AWS EKS with GPU-powered nodes. The architecture follows cloud-native principles with microservices, service mesh, and comprehensive observability.
Key Components:
Internet
│
▼
Route 53 DNS
│
▼
AWS Application Load Balancer
│
▼
Istio Ingress Gateway
│
┌───────────┴───────────┐
│ │
▼ ▼
AI Frontend Service TraceMyPods Paid Service
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
askapi Service tokenapi Service Other API Services
│ │ │
└─────────┬───────────┘ │
│ │
▼ ▼
Redis Cache <───────────────────> MongoDB
│
▼
┌─────────────────────┐
│ │
▼ ▼
TinyLlama (GPU Node)    Mistral (GPU Node)
AWS Resources
EKS Cluster Configuration
The application runs on an AWS EKS cluster with two node groups:
- Standard Node Group:
- Instance type: t3.large (recommended)
- Min/Max nodes: 2/5
- Auto-scaling enabled
- Used for: Frontend, backend APIs, Redis, monitoring
- GPU Node Group:
- Instance type: g4dn.xlarge (NVIDIA T4 GPU)
- Min/Max nodes: 1/3
- Auto-scaling enabled with longer cooldown periods
- Used for: AI model inference (TinyLlama, Mistral)
- Node taints: gpu=true:NoSchedule
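The taint above keeps general workloads off the GPU nodes, so the model Deployments need a matching toleration and an explicit GPU request. A minimal sketch, assuming the namespace, node-group label, and public ollama/ollama image (the real manifests in EKS-Deploy/app-deploy-K/llmmodels.yaml are authoritative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-tinyllama
  namespace: ai-assistant            # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-tinyllama
  template:
    metadata:
      labels:
        app: ollama-tinyllama
    spec:
      tolerations:
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        node-group: gpu              # assumed label on the GPU node group
      containers:
        - name: ollama
          image: ollama/ollama:latest
          resources:
            limits:
              nvidia.com/gpu: 1      # requires the NVIDIA device plugin on the node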
Networking
- VPC: Dedicated VPC with public and private subnets
- ALB: Application Load Balancer with TLS termination
- Route 53: DNS management for domain routing
- Security Groups:
- EKS control plane: 443 inbound from worker nodes
- Worker nodes: Allow cluster internal communication
- ALB: 80/443 inbound from internet
Storage
- EBS: For Redis persistent storage via StorageClass
- S3: For model artifacts and backups (optional)
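The EBS-backed Redis volume is typically wired up through a StorageClass and a PVC. A rough sketch, assuming the EBS CSI driver is installed; names and the 10Gi size are illustrative:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com         # AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: ai-assistant
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 10Gi                  # illustrative size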
Kubernetes Architecture
Namespace Structure
ai-assistant/
├── frontend/
│ └── ai-frontend deployment & service
├── backend/
│ ├── tracemypods-paid deployment & service
│ ├── askapi, tokenapi, orderapi, etc.
│ └── redis deployment & service with PVC
├── ai-models/
│ ├── ollama-tinyllama deployment & service
│ └── ollama-mistral deployment & service
└── monitoring/
├── prometheus, grafana, loki
├── jaeger, kiali
└── service monitors
Service Mesh (Istio)
- Gateway: Routes external traffic to internal services
- VirtualService: Path-based routing (/api → backend, / → frontend)
- mTLS: Enabled for service-to-service communication
- Traffic Management: Supports canary deployments and A/B testing
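The path-based routing described above could be expressed roughly as follows; the host wildcard, port numbers, and default API destination are assumptions, and EKS-Deploy/Istiod/istio-gateway.yaml remains the authoritative version:
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: tracemypods-gateway
  namespace: ai-assistant
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts: ["*"]
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: tracemypods-routes
  namespace: ai-assistant
spec:
  hosts: ["*"]
  gateways: ["tracemypods-gateway"]
  http:
    - match:
        - uri:
            prefix: /api
      route:
        - destination:
            host: askapi             # assumed default backend destination
            port:
              number: 8080           # assumed service port
    - route:
        - destination:
            host: ai-frontend
            port:
              number: 80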
Resource Management
- Resource Requests/Limits:
- Frontend: 100m/200m CPU, 256Mi/512Mi memory
- Backend APIs: 200m/500m CPU, 512Mi/1Gi memory
- Redis: 100m/200m CPU, 512Mi/1Gi memory
- AI Models: 500m/2000m CPU, 2Gi/4Gi memory, 1 GPU
- HPA (Horizontal Pod Autoscaler):
- Frontend: Scale based on CPU utilization (target: 70%)
- Backend: Scale based on CPU utilization (target: 70%)
- PDB (Pod Disruption Budget):
- Ensures high availability during voluntary disruptions
- minAvailable: 1 for each critical service
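As a concrete illustration of the HPA and PDB settings above, the backend manifests likely contain something close to the following sketch; the askapi names and the autoscaling/v2 API are assumptions mirroring the services listed earlier:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: askapi
  namespace: ai-assistant
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: askapi
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # matches the 70% CPU target above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: askapi-pdb
  namespace: ai-assistant
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: askapi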
CI/CD Pipeline
GitHub Actions Workflow
name: TraceMyPods CI/CD
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push frontend
        uses: docker/build-push-action@v2
        with:
          context: ./appcode/frontend
          push: true
          tags: noscopev6/tracemypods-frontend:latest,noscopev6/tracemypods-frontend:${{ github.sha }}
      # Similar steps for other components
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-south-1
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name tracemypods-cluster --region ap-south-1
      - name: Deploy to EKS
        run: |
          # Update image tags in Kubernetes manifests
          sed -i "s|noscopev6/tracemypods-frontend:.*|noscopev6/tracemypods-frontend:${{ github.sha }}|g" EKS-Deploy/app-deploy-K/frontend.yaml
          # Apply Kubernetes manifests
          kubectl apply -f EKS-Deploy/app-deploy-K/namespace.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/storage-class.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/redis-pod.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/backend.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/frontend.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/llmmodels.yaml
          kubectl apply -f EKS-Deploy/Istiod/istio-gateway.yaml
Observability Stack
Monitoring
- Prometheus (see the ServiceMonitor sketch after this list):
- Scrapes metrics from all services
- Custom metrics for token usage and AI model performance
- AlertManager for critical alerts
- Grafana:
- Dashboards for:
- Cluster health and resource utilization
- API performance and error rates
- Token usage and expiration metrics
- AI model inference times and GPU utilization
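One way the custom token-usage and model-performance metrics could be picked up is through a Prometheus Operator ServiceMonitor. A sketch, assuming the port name, labels, and release selector (none of these are taken from the actual manifests):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: askapi-metrics
  namespace: monitoring
  labels:
    release: prometheus              # assumed label the Prometheus instance selects on
spec:
  namespaceSelector:
    matchNames: ["ai-assistant"]
  selector:
    matchLabels:
      app: askapi
  endpoints:
    - port: http                     # assumed service port name exposing /metrics
      path: /metrics
      interval: 30s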
Logging
- Loki:
- Centralized log aggregation
- Log retention policy: 14 days
- Structured logging with service, pod, and namespace labels
Tracing
- Jaeger (see the sampling sketch after this list):
- Distributed tracing across microservices
- Trace sampling rate: 10%
- Retention: 7 days
- Kiali:
- Service mesh visualization
- Traffic flow monitoring
- Health status of services
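The 10% trace sampling rate mentioned under Jaeger can be set mesh-wide with Istio's Telemetry API. A sketch, assuming a tracing provider named "jaeger" is registered in the mesh config (the provider name is an assumption):
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: jaeger               # assumed provider name from the mesh config
      randomSamplingPercentage: 10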
Security Considerations
Authentication & Authorization
- Token-based Access:
- Short-lived tokens (1 hour) for standard users
- Persistent tokens in MongoDB for premium users
- Token validation middleware in all API services
- Kubernetes RBAC:
- Namespace-scoped service accounts
- Principle of least privilege
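Namespace-scoped, least-privilege access usually boils down to small Roles bound to dedicated service accounts. An illustrative example; the service account name, resources, and verbs are assumptions:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backend-readonly
  namespace: ai-assistant
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-readonly-binding
  namespace: ai-assistant
subjects:
  - kind: ServiceAccount
    name: askapi-sa                  # assumed service account used by the backend pods
    namespace: ai-assistant
roleRef:
  kind: Role
  name: backend-readonly
  apiGroup: rbac.authorization.k8s.io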
Network Security
- Istio mTLS (see the sketch after this list):
- Encrypted service-to-service communication
- Certificate rotation every 24 hours
- Network Policies:
- Default deny all ingress/egress
- Explicit allow rules for required communication paths
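Together, the mTLS and network-policy items above might translate into a strict PeerAuthentication plus a default-deny NetworkPolicy with explicit allows. A sketch under assumed label selectors:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ai-assistant
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-assistant
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-redis
  namespace: ai-assistant
spec:
  podSelector:
    matchLabels:
      app: redis
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: backend          # assumed label on the API pods
      ports:
        - protocol: TCP
          port: 6379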
Vulnerability Management
- Container Scanning (see the Trivy step sketch after this list):
- Trivy for image vulnerability scanning in CI pipeline
- Block deployment of images with critical vulnerabilities
- Kube-Bench:
- CIS Kubernetes benchmark compliance checking
- Weekly automated scans
- Falco:
- Runtime security monitoring
- Alerts on suspicious container activity
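The Trivy gate described above can sit in the build job right after the image is pushed. A hedged sketch using the public aquasecurity/trivy-action; exact placement and how the failing tag is handled depend on the real workflow:
      - name: Scan frontend image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: noscopev6/tracemypods-frontend:${{ github.sha }}
          format: table
          exit-code: '1'             # fail the job when findings match the severity filter
          severity: CRITICAL         # block only critical vulnerabilities
          ignore-unfixed: true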
Scaling Strategy
Horizontal Scaling
- Frontend/Backend:
- HPA based on CPU utilization (target: 70%)
- Min/Max replicas: 2/10
- AI Models:
- Scale based on GPU utilization and queue depth
- Min/Max replicas: 1/3 per model
Vertical Scaling
- Node Groups:
- Standard nodes can be upgraded from t3.large to t3.xlarge
- GPU nodes can be upgraded from g4dn.xlarge to g4dn.2xlarge
Cluster Scaling
- Cluster Autoscaler:
- Automatically adjusts node count based on pod scheduling demands
- Scale-up threshold: Unable to schedule pods
- Scale-down threshold: Node utilization < 50% for 10 minutes
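The scale-down thresholds above correspond to standard Cluster Autoscaler flags. A fragment of the autoscaler container spec illustrating them; the image version and the auto-discovery tag are assumptions:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0   # illustrative version
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/tracemypods-cluster
            - --scale-down-utilization-threshold=0.5   # node utilization < 50%
            - --scale-down-unneeded-time=10m           # for 10 minutes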
Disaster Recovery
Backup Strategy
- Redis (see the backup CronJob sketch after this list):
- Automated snapshots to S3 every 6 hours
- Retention: 7 days
- MongoDB:
- Daily backups
- Point-in-time recovery enabled
- Retention: 30 days
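The 6-hourly Redis snapshot could be implemented as a simple CronJob that streams an RDB dump and ships it to S3. A sketch under the assumption that a helper image with redis-cli and the AWS CLI is available; the image name and bucket are placeholders:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
  namespace: ai-assistant
spec:
  schedule: "0 */6 * * *"            # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: redis-backup-helper:latest   # placeholder: any image with redis-cli + aws cli
              command: ["/bin/sh", "-c"]
              args:
                - >
                  redis-cli -h redis --rdb /tmp/dump.rdb &&
                  aws s3 cp /tmp/dump.rdb s3://tracemypods-backups/redis/dump-$(date +%s).rdb
The job's service account would also need S3 write access, for example via an IAM role for service accounts (IRSA).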
Recovery Procedures
- Service Disruption:
- Automatic pod rescheduling via Deployments
- Readiness/liveness probes ensure healthy services
- Node Failure:
- Pods automatically rescheduled to healthy nodes
- PVs remounted to new pods
- Cluster Failure:
- Infrastructure as Code (Terraform) for quick cluster recreation
- Automated deployment pipeline to restore services
- Redis and MongoDB data restored from backups
Cost Optimization
Resource Optimization
- Right-sizing:
- Regular review of resource requests/limits
- Adjust based on actual usage patterns
- Spot Instances:
- Consider using spot instances for non-critical workloads
- Not recommended for GPU nodes due to potential interruptions
GPU Optimization
- GPU Sharing (see the time-slicing sketch after this list):
- Multiple models can share a single GPU using time-slicing
- Consider NVIDIA MPS for improved utilization
- Auto-scaling:
- Scale down to zero GPU nodes during periods of low usage
- Implement warm-up procedures for cold starts
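GPU time-slicing as described above is configured through the NVIDIA device plugin. A sketch of the relevant ConfigMap; the replica count is illustrative, and the plugin must be deployed with this config referenced:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2              # advertise each T4 as two schedulable GPUs (illustrative)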
Cost Monitoring
- AWS Cost Explorer:
- Regular review of EKS and EC2 costs
- Tag-based cost allocation for different components
- Kubecost:
- Namespace and workload level cost visibility
- Recommendations for resource optimization
Operational Procedures
Deployment
- Infrastructure Provisioning:
  cd terraform
  terraform init
  terraform apply -var-file=prod.tfvars
- Cluster Configuration:
  aws eks update-kubeconfig --name tracemypods-cluster --region ap-south-1
  kubectl apply -f EKS-Deploy/Istiod/
- Application Deployment:
  kubectl apply -f EKS-Deploy/app-deploy-K/
Monitoring & Troubleshooting
- Access Grafana:
  kubectl port-forward svc/grafana 3000:3000 -n monitoring
  # Open http://localhost:3000 in browser
- Check Pod Logs:
  kubectl logs -f deployment/askapi -n ai-assistant
- Istio Service Mesh Visualization:
  kubectl port-forward svc/kiali 20001:20001 -n istio-system
  # Open http://localhost:20001 in browser
- Trace Requests:
  kubectl port-forward svc/jaeger-query 16686:16686 -n istio-system
  # Open http://localhost:16686 in browser
Maintenance
- Kubernetes Version Upgrades:
- Test in staging environment first
- Use managed EKS upgrades
- Plan for 1-2 hour maintenance window
- Application Updates:
- Use rolling updates (default deployment strategy)
- Monitor for errors during and after deployment
- Have rollback plan ready
- Certificate Rotation:
- ACM certificates auto-renewed
- Istio certificates rotated automatically
- Monitor certificate expiration alerts
References
- AWS EKS Documentation
- Istio Documentation
- Prometheus Operator
- Ollama Documentation
- Kubernetes Best Practices
Created by: Ahmad Raza - DevOps Engineer | Cloud Infra Specialist
Last Updated: May 24, 2025