
TraceMyPods - DevOps & Cloud Architecture Guide

Architecture diagram: AWS, Istio, Prometheus, Grafana

Table of Contents

  • Infrastructure Overview
  • AWS Resources
  • Kubernetes Architecture
  • CI/CD Pipeline
  • Observability Stack
  • Security Considerations
  • Scaling Strategy
  • Disaster Recovery
  • Cost Optimization
  • Operational Procedures

Infrastructure Overview

TraceMyPods is a production-grade AI platform deployed on AWS EKS with GPU-powered nodes. The architecture follows cloud-native principles with microservices, service mesh, and comprehensive observability.

Key components and request flow:

                                  Internet


                               Route 53 DNS


                          AWS Application Load Balancer


                          Istio Ingress Gateway

                         ┌───────────┴───────────┐
                         │                       │
                         ▼                       ▼
                  AI Frontend Service     TraceMyPods Paid Service
                         │                       │
                         ▼                       ▼
              ┌─────────────────────┐    ┌─────────────────────┐
              │                     │    │                     │
              ▼                     ▼    ▼                     ▼
         askapi Service      tokenapi Service      Other API Services
              │                     │                     │
              └─────────┬───────────┘                     │
                        │                                 │
                        ▼                                 ▼
                    Redis Cache <───────────────────> MongoDB


              ┌─────────────────────┐
              │                     │
              ▼                     ▼
      TinyLlama (GPU Node)    Mistral (GPU Node)

AWS Resources

EKS Cluster Configuration

The application runs on an AWS EKS cluster with two node groups:

  1. Standard Node Group:

    • Instance type: t3.large (recommended)
    • Min/Max nodes: 2/5
    • Auto-scaling enabled
    • Used for: Frontend, backend APIs, Redis, monitoring
  2. GPU Node Group:

    • Instance type: g4dn.xlarge (NVIDIA T4 GPU)
    • Min/Max nodes: 1/3
    • Auto-scaling enabled with longer cooldown periods
    • Used for: AI model inference (TinyLlama, Mistral)
    • Node taints: gpu=true:NoSchedule
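
To land on the tainted GPU nodes, the model workloads need a matching toleration plus a GPU resource request. A minimal pod-spec sketch (the node label, container name, and image are illustrative assumptions, not taken from the repo):

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.xlarge   # well-known node label; a custom gpu=true label would also work
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule            # matches the gpu=true:NoSchedule taint above
  containers:
    - name: ollama-tinyllama
      image: ollama/ollama          # placeholder image reference
      resources:
        limits:
          nvidia.com/gpu: 1         # requires the NVIDIA device plugin on the GPU nodes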

Networking

  • VPC: Dedicated VPC with public and private subnets
  • ALB: Application Load Balancer with TLS termination
  • Route 53: DNS management for domain routing
  • Security Groups:
    • EKS control plane: 443 inbound from worker nodes
    • Worker nodes: Allow cluster internal communication
    • ALB: 80/443 inbound from internet

Storage

  • EBS: For Redis persistent storage via StorageClass
  • S3: For model artifacts and backups (optional)
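
The repository ships its own storage-class.yaml (applied by the CI/CD pipeline below); a typical EBS-backed StorageClass for the Redis PVC would look roughly like this, assuming the EBS CSI driver is installed (name and parameters are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com              # EBS CSI driver
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer   # bind once the pod is scheduled, in the right AZ
reclaimPolicy: Retain                     # keep Redis data if the PVC is deleted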

Kubernetes Architecture

Namespace Structure

ai-assistant/
├── frontend/
│   └── ai-frontend deployment & service
├── backend/
│   ├── tracemypods-paid deployment & service
│   ├── askapi, tokenapi, orderapi, etc.
│   └── redis deployment & service with PVC
├── ai-models/
│   ├── ollama-tinyllama deployment & service
│   └── ollama-mistral deployment & service
└── monitoring/
    ├── prometheus, grafana, loki
    ├── jaeger, kiali
    └── service monitors

Service Mesh (Istio)

  • Gateway: Routes external traffic to internal services
  • VirtualService: Path-based routing (/api → backend, / → frontend)
  • mTLS: Enabled for service-to-service communication
  • Traffic Management: Supports canary deployments and A/B testing
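
A rough sketch of the path-based routing described above (host, gateway, service names, and ports are placeholders rather than the actual manifests under EKS-Deploy/Istiod/):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: tracemypods
  namespace: ai-assistant
spec:
  hosts:
    - "*"
  gateways:
    - tracemypods-gateway
  http:
    - match:
        - uri:
            prefix: /api
      route:
        - destination:
            host: askapi              # backend service (placeholder)
            port:
              number: 8080
    - route:
        - destination:
            host: ai-frontend         # default route to the frontend
            port:
              number: 80

Canary releases and A/B tests would be expressed as weighted destinations on these same routes.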

Resource Management

  • Resource Requests/Limits:

    • Frontend: 100m/200m CPU, 256Mi/512Mi memory
    • Backend APIs: 200m/500m CPU, 512Mi/1Gi memory
    • Redis: 100m/200m CPU, 512Mi/1Gi memory
    • AI Models: 500m/2000m CPU, 2Gi/4Gi memory, 1 GPU
  • HPA (Horizontal Pod Autoscaler):

    • Frontend: Scale based on CPU utilization (target: 70%)
    • Backend: Scale based on CPU utilization (target: 70%)
  • PDB (Pod Disruption Budget):

    • Ensures high availability during voluntary disruptions
    • minAvailable: 1 for each critical service
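
As a hedged example, the frontend's HPA and PDB could look like this (the 2/10 replica bounds follow the Scaling Strategy section below; object names and labels are assumptions):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-frontend
  namespace: ai-assistant
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # matches the 70% CPU target above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-frontend
  namespace: ai-assistant
spec:
  minAvailable: 1                     # keep at least one replica during voluntary disruptions
  selector:
    matchLabels:
      app: ai-frontend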

CI/CD Pipeline

GitHub Actions Workflow

name: TraceMyPods CI/CD
 
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
 
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      
      - name: Build and push frontend
        uses: docker/build-push-action@v2
        with:
          context: ./appcode/frontend
          push: true
          tags: noscopev6/tracemypods-frontend:latest,noscopev6/tracemypods-frontend:${{ github.sha }}
      
      # Similar steps for other components
      
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-south-1
      
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name tracemypods-cluster --region ap-south-1
      
      - name: Deploy to EKS
        run: |
          # Update image tags in Kubernetes manifests
          sed -i "s|noscopev6/tracemypods-frontend:.*|noscopev6/tracemypods-frontend:${{ github.sha }}|g" EKS-Deploy/app-deploy-K/frontend.yaml
          # Apply Kubernetes manifests
          kubectl apply -f EKS-Deploy/app-deploy-K/namespace.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/storage-class.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/redis-pod.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/backend.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/frontend.yaml
          kubectl apply -f EKS-Deploy/app-deploy-K/llmmodels.yaml
          kubectl apply -f EKS-Deploy/Istiod/istio-gateway.yaml

Observability Stack

Monitoring

  • Prometheus:

    • Scrapes metrics from all services
    • Custom metrics for token usage and AI model performance
    • AlertManager for critical alerts
  • Grafana:

    • Dashboards for:
      • Cluster health and resource utilization
      • API performance and error rates
      • Token usage and expiration metrics
      • AI model inference times and GPU utilization

Logging

  • Loki:
    • Centralized log aggregation
    • Log retention policy: 14 days
    • Structured logging with service, pod, and namespace labels

Tracing

  • Jaeger:

    • Distributed tracing across microservices
    • Trace sampling rate: 10% (see the mesh-config sketch after this list)
    • Retention: 7 days
  • Kiali:

    • Service mesh visualization
    • Traffic flow monitoring
    • Health status of services
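
The 10% sampling rate referenced above is typically set in the mesh configuration; a minimal IstioOperator sketch (resource name is an assumption):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: tracemypods-istio
  namespace: istio-system
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 10.0      # percentage of requests exported to Jaeger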

Security Considerations

Authentication & Authorization

  • Token-based Access:

    • Short-lived tokens (1 hour) for standard users
    • Persistent tokens in MongoDB for premium users
    • Token validation middleware in all API services
  • Kubernetes RBAC:

    • Namespace-scoped service accounts
    • Principle of least privilege
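
A namespace-scoped Role and RoleBinding illustrating the least-privilege pattern (resources, verbs, and the askapi service account are assumptions, not taken from the repo):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: askapi-reader
  namespace: ai-assistant
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list"]            # read-only access, nothing cluster-wide
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: askapi-reader
  namespace: ai-assistant
subjects:
  - kind: ServiceAccount
    name: askapi
    namespace: ai-assistant
roleRef:
  kind: Role
  name: askapi-reader
  apiGroup: rbac.authorization.k8s.io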

Network Security

  • Istio mTLS:

    • Encrypted service-to-service communication
    • Certificate rotation every 24 hours
  • Network Policies:

    • Default deny all ingress/egress
    • Explicit allow rules for required communication paths
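
The default-deny baseline can be expressed as a single NetworkPolicy with an empty pod selector; explicit allow policies (for example backend-to-Redis) are then layered on top:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-assistant
spec:
  podSelector: {}         # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress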

Vulnerability Management

  • Container Scanning:

    • Trivy for image vulnerability scanning in CI pipeline
    • Block deployment of images with critical vulnerabilities (a pipeline step is sketched after this list)
  • Kube-Bench:

    • CIS Kubernetes benchmark compliance checking
    • Weekly automated scans
  • Falco:

    • Runtime security monitoring
    • Alerts on suspicious container activity
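
A hedged sketch of the Trivy gate referenced above, as an extra step in the build job of the CI/CD workflow (this step is not part of the workflow shown earlier):

- name: Scan frontend image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: noscopev6/tracemypods-frontend:${{ github.sha }}
    severity: CRITICAL
    ignore-unfixed: true
    exit-code: '1'        # a non-zero exit fails the job, blocking deployment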

Scaling Strategy

Horizontal Scaling

  • Frontend/Backend:

    • HPA based on CPU utilization (target: 70%)
    • Min/Max replicas: 2/10
  • AI Models:

    • Scale based on GPU utilization and queue depth
    • Min/Max replicas: 1/3 per model

Vertical Scaling

  • Node Groups:
    • Standard nodes can be upgraded from t3.large to t3.xlarge
    • GPU nodes can be upgraded from g4dn.xlarge to g4dn.2xlarge

Cluster Scaling

  • Cluster Autoscaler:
    • Automatically adjusts node count based on pod scheduling demands
    • Scale-up threshold: Unable to schedule pods
    • Scale-down threshold: Node utilization < 50% for 10 minutes
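
Those thresholds map onto the upstream cluster-autoscaler flags roughly as follows (Deployment args excerpt; the full manifest is an assumption):

command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-utilization-threshold=0.5    # node utilization < 50%
  - --scale-down-unneeded-time=10m            # ...sustained for 10 minutes
  - --balance-similar-node-groups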

Disaster Recovery

Backup Strategy

  • Redis:

    • Automated snapshots to S3 every 6 hours (a CronJob sketch follows this list)
    • Retention: 7 days
  • MongoDB:

    • Daily backups
    • Point-in-time recovery enabled
    • Retention: 30 days
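
One way to drive the 6-hourly Redis snapshot mentioned above is a CronJob in the application namespace; the schedule matches the stated policy, while the backup image and the S3 upload step are assumptions:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
  namespace: ai-assistant
spec:
  schedule: "0 */6 * * *"               # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: redis:7            # provides redis-cli; an image with the AWS CLI is needed for the upload
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  redis-cli -h redis --rdb /tmp/dump.rdb &&
                  echo "upload /tmp/dump.rdb to the backup S3 bucket here"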

Recovery Procedures

  1. Service Disruption:

    • Automatic pod rescheduling via Deployments
    • Readiness/liveness probes ensure healthy services
  2. Node Failure:

    • Pods automatically rescheduled to healthy nodes
    • PVs remounted to new pods
  3. Cluster Failure:

    • Infrastructure as Code (Terraform) for quick cluster recreation
    • Automated deployment pipeline to restore services
    • Redis and MongoDB data restored from backups

Cost Optimization

Resource Optimization

  • Right-sizing:

    • Regular review of resource requests/limits
    • Adjust based on actual usage patterns
  • Spot Instances:

    • Consider using spot instances for non-critical workloads
    • Not recommended for GPU nodes due to potential interruptions

GPU Optimization

  • GPU Sharing:

    • Multiple models can share a single GPU using time-slicing (device-plugin config sketched after this list)
    • Consider NVIDIA MPS for improved utilization
  • Auto-scaling:

    • Scale down to zero GPU nodes during periods of low usage
    • Implement warm-up procedures for cold starts
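
Time-slicing is configured through the NVIDIA device plugin; a sketch of the config it would be pointed at (the replica count is an assumption):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2       # each physical GPU is advertised as 2 schedulable GPUs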

Cost Monitoring

  • AWS Cost Explorer:

    • Regular review of EKS and EC2 costs
    • Tag-based cost allocation for different components
  • Kubecost:

    • Namespace and workload level cost visibility
    • Recommendations for resource optimization

Operational Procedures

Deployment

  1. Infrastructure Provisioning:

    cd terraform
    terraform init
    terraform apply -var-file=prod.tfvars
  2. Cluster Configuration:

    aws eks update-kubeconfig --name tracemypods-cluster --region ap-south-1
    kubectl apply -f EKS-Deploy/Istiod/
  3. Application Deployment:

    kubectl apply -f EKS-Deploy/app-deploy-K/

Monitoring & Troubleshooting

  1. Access Grafana:

    kubectl port-forward svc/grafana 3000:3000 -n monitoring
    # Open http://localhost:3000 in browser
  2. Check Pod Logs:

    kubectl logs -f deployment/askapi -n ai-assistant
  3. Istio Service Mesh Visualization:

    kubectl port-forward svc/kiali 20001:20001 -n istio-system
    # Open http://localhost:20001 in browser
  4. Trace Requests:

    kubectl port-forward svc/jaeger-query 16686:16686 -n istio-system
    # Open http://localhost:16686 in browser

Maintenance

  1. Kubernetes Version Upgrades:

    • Test in staging environment first
    • Use managed EKS upgrades
    • Plan for 1-2 hour maintenance window
  2. Application Updates:

    • Use rolling updates (default deployment strategy)
    • Monitor for errors during and after deployment
    • Have rollback plan ready
  3. Certificate Rotation:

    • ACM certificates auto-renewed
    • Istio certificates rotated automatically
    • Monitor certificate expiration alerts


Created by: Ahmad Raza - DevOps Engineer | Cloud Infra Specialist
Last Updated: May 24, 2025

