
Enable GPU-level metrics on EKS using CloudWatch Container Insights and visualize them in Grafana, without the complexity of operator/runtime troubleshooting.


📦 Step 1: Enable Container Insights with GPU Support

AWS now supports GPU observability natively through the Amazon CloudWatch Observability add-on (enhanced Container Insights).

✅ Automatic Setup (recommended):

  1. Ensure your EKS cluster has OIDC enabled:

    eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve
  2. Attach the required IAM policy to the node instance role:

    aws iam attach-role-policy \
      --role-name <EKSNodeInstanceRole> \
      --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  3. Enable the Container Insights add-on with GPU support:

    aws eks create-addon \
      --cluster-name my-cluster \
      --addon-name amazon-cloudwatch-observability \
      --service-account-role-arn <CloudWatchAgentServiceAccountRoleArn> \
      --resolve-conflicts OVERWRITE

This deploys the CloudWatch Agent, the DCGM exporter, and the log agents across your EKS nodes automatically.
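
If you want to confirm the rollout, the add-on's components run in the amazon-cloudwatch namespace (a quick check; exact pod names vary by add-on version):

kubectl get pods -n amazon-cloudwatch
# Expect cloudwatch-agent, dcgm-exporter (GPU nodes only), and fluent-bit pods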


🧩 Step 2: Verify GPU Metrics in CloudWatch

After a few minutes, go to the CloudWatch console → Container Insights → EKS. You'll see multiple built-in dashboards. Look for GPU-specific panels showing:

  • GPU Utilization
  • GPU Memory Usage
  • GPU Temperature
  • GPU Power Consumption

These metrics are now collected without hand-deploying nvidia-device-plugin or heaps of additional setup.
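
You can also confirm the metrics exist from the CLI (a quick check, assuming the my-cluster name used in the commands above):

aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster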


📈 Step 3: Visualize in Grafana

You can use Amazon Managed Grafana or your own self-managed Grafana.

If using Managed Grafana:

  1. Create or use an existing Grafana workspace.

  2. Add CloudWatch as a data source (supports Container Insights).

  3. Import a GPU dashboard:

    • Browse dashboards on Grafana.com, or import a custom one.
    • Alternatively, build your own panel from the ContainerInsights namespace using metrics such as node_gpu_usage_total, split by ClusterName and NodeName ([aws.amazon.com][5], [grafana.com][6], [docs.aws.amazon.com][7]).

If using self-hosted Grafana:

  1. Configure the CloudWatch data source (built into Grafana 6.5 and later); a provisioning sketch follows this list.
  2. Follow the same import or build dashboard steps.
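
A minimal provisioning sketch for a self-hosted instance; the file path and region here are assumptions, so adjust them to your deployment:

# Hypothetical target path -- the standard Grafana provisioning directory
cat <<'EOF' > /etc/grafana/provisioning/datasources/cloudwatch.yaml
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default      # use the instance/task role credentials
      defaultRegion: us-east-1
EOF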

🧭 Dashboard & Metrics to Use

Metrics collected under ContainerInsights:

  • node_gpu_limit, node_gpu_usage_total, node_gpu_reserved_capacity: node-level GPU capacity and usage.
  • pod_gpu_usage_total, pod_gpu_request: pod-level GPU consumption.

Use these to build graphs or alerts in Grafana.
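
As a sanity check before wiring up panels or alarms, you can pull one of these series with the CLI (a sketch; my-cluster is an assumed name, and the available dimension sets vary, so run list-metrics first if this returns nothing):

aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# GNU date shown; on macOS use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ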


🎯 Why This Approach Works

  • Fully automated: no manual device plugin, driver, or containerd tweaks
  • Managed and supported by AWS
  • Integrates seamlessly with Grafana
  • Includes logs + metrics for end-to-end observability




Terraform-based solution: set up GPU-level monitoring on your existing EKS cluster using CloudWatch Container Insights and visualize it in Grafana.


🚀 Step 1: Enable Container Insights with GPU support

Add this to your Terraform (assuming you already have an EKS cluster managed via Terraform):

data "aws_eks_cluster" "this" {
  name = var.eks_cluster_name
}
 
resource "aws_eks_addon" "cw_observability" {
  cluster_name             = data.aws_eks_cluster.this.name
  addon_name               = "amazon-cloudwatch-observability"
  addon_version            = "v2.1.2-eksbuild.1"  # or latest
  resolve_conflicts_on_update = "OVERWRITE"
  service_account_role_arn = aws_iam_role.cloudwatch_sa.arn
 
  configuration_values = jsonencode({
    agent = {
      config = {
        logs = {
          metrics_collected = {
            kubernetes = {
              enhanced_container_insights      = true
              accelerated_compute_metrics       = true
            }
          }
        }
      }
    }
  })
}

And define the cloudwatch_sa role:

resource "aws_iam_role" "cloudwatch_sa" {
  name = "${var.eks_cluster_name}-cw-agent"
 
  assume_role_policy = data.aws_iam_policy_document.eks_assume_sa.json
}
 
resource "aws_iam_role_policy_attachment" "cw_agent_attachment" {
  role       = aws_iam_role.cloudwatch_sa.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

This deploys the CloudWatch Agent + DCGM exporter, enabling GPU metrics collection to CloudWatch ([docs.aws.amazon.com][1], [stackoverflow.com][2], [dev.to][3]).


🛠 Step 2: Grant the IAM Role to the Service Account

data "aws_iam_policy_document" "eks_assume_sa" {
  statement {
    effect = "Allow"
    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.oidc.arn]
    }
    actions = ["sts:AssumeRoleWithWebIdentity"]
    condition {
      test     = "StringEquals"
      variable = "${data.aws_iam_openid_connect_provider.oidc.url}:sub"
      values   = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
    }
  }
}
 
data "aws_iam_openid_connect_provider" "oidc" {
  url = data.aws_eks_cluster.this.identity[0].oidc[0].issuer
}
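
After apply, you can verify that EKS annotated the agent's service account with the role (a quick check; assumes kubectl access to the cluster):

kubectl -n amazon-cloudwatch get serviceaccount cloudwatch-agent \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'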

πŸ” Step 3: Validate in CloudWatch

After terraform apply, wait a few minutes then check:

  • CloudWatch → Container Insights → EKS
  • You should see panels for GPU usage, memory, temperature, and power ([docs.aws.amazon.com][4], [blog.devops.dev][5]).
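
The add-on status is also visible from the CLI (my-cluster is an example; substitute your var.eks_cluster_name value):

aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.status' --output text   # expect ACTIVE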

📊 Step 4: Visualize GPU Metrics in Grafana

Option A: Amazon Managed Grafana

A minimal sketch using the AWS provider's native aws_grafana_workspace resource (the workspace and role names are illustrative):

# Managed Grafana workspace with CloudWatch pre-enabled as a data source
resource "aws_grafana_workspace" "this" {
  name                     = "eks-gpu-monitoring"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  role_arn                 = aws_iam_role.grafana.arn
  data_sources             = ["CLOUDWATCH"]
}

# Role the workspace assumes to read CloudWatch metrics
resource "aws_iam_role" "grafana" {
  name = "${var.eks_cluster_name}-grafana"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "grafana.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "grafana_cw" {
  role       = aws_iam_role.grafana.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonGrafanaCloudWatchAccess"
}

Then in Grafana UI:

  1. Add CloudWatch data source (select Container Insights).

  2. Import or build a dashboard tracking:

    • ContainerInsights/node_gpu_usage_total
    • ContainerInsights/node_gpu_limit
    • ContainerInsights/pod_gpu_usage_total

You can also import a premade GPU dashboard or build one from scratch using these metrics ([aws-observability.github.io][6]).

✅ Summary of the Full Flow

Stage | Tool | Outcome
Terraform setup | AWS EKS add-on | Deploys CloudWatch Agent + DCGM exporter across GPU nodes
CloudWatch | Container Insights | GPU metrics available (utilization, temperature, memory, power)
Grafana visualization | AWS Managed or open-source | Visualize GPU metrics using the CloudWatch data source

🔧 What You Should Do Next

  1. Copy the Terraform snippets above into your existing config.
  2. Run terraform init && terraform apply.
  3. Wait ~5 minutes for the add-on to deploy.
  4. Confirm GPU metrics exist in CloudWatch.
  5. Set up Grafana with CloudWatch integration and visualize your GPU dashboards.
