
Enable GPU-level metrics on EKS using CloudWatch Container Insights and visualize them in Grafana, without the complexity of operator/runtime troubleshooting.


📦 Step 1: Enable Container Insights with GPU Support

AWS now supports GPU observability natively through the Amazon CloudWatch Observability add-on (enhanced Container Insights).

✅ Automatic Setup (recommended):

  1. Ensure your EKS cluster has OIDC enabled:

    eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve
  2. Attach the required IAM policy to the node instance role:

    aws iam attach-role-policy \
      --role-name <EKSNodeInstanceRole> \
      --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  3. Enable the Container Insights add-on with GPU support:

    aws eks create-addon \
      --cluster-name my-cluster \
      --addon-name amazon-cloudwatch-observability \
      --service-account-role-arn <CloudWatchAgentServiceAccountRoleArn> \
      --resolve-conflicts OVERWRITE

This deploys the CloudWatch Agent, the DCGM exporter, and the log agents across your EKS nodes automatically.
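
If you want to confirm the rollout, the add-on's components run in the amazon-cloudwatch namespace (a quick check; exact pod names vary by add-on version):

kubectl get pods -n amazon-cloudwatch
# Expect cloudwatch-agent, dcgm-exporter (GPU nodes only), and fluent-bit pods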


🧩 Step 2: Verify GPU Metrics in CloudWatch

After a few minutes, go to the CloudWatch console → Container Insights → EKS. You'll see multiple built-in dashboards. Look for GPU-specific panels showing:

  • GPU Utilization
  • GPU Memory Usage
  • GPU Temperature
  • GPU Power Consumption

These metrics are now collected without hand-deploying nvidia-device-plugin or heaps of additional setup.
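
You can also confirm the metrics exist from the CLI (a quick check, assuming the my-cluster name used in the commands above):

aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster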


📈 Step 3: Visualize in Grafana

You can use Amazon Managed Grafana or your own self-managed Grafana.

If using Managed Grafana:

  1. Create or use an existing Grafana workspace.

  2. Add CloudWatch as a data source (supports Container Insights).

  3. Import a GPU dashboard:

    • Browse dashboards on Grafana.com, or import a custom one.
    • Alternatively, build your own panel from the ContainerInsights namespace using metrics such as node_gpu_usage_total, split by ClusterName and NodeName ([aws.amazon.com][5], [grafana.com][6], [docs.aws.amazon.com][7]).

If using self-hosted Grafana:

  1. Configure the CloudWatch data source (built into Grafana 6.5 and later); a provisioning sketch follows this list.
  2. Follow the same import or build dashboard steps.
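
A minimal provisioning sketch for a self-hosted instance; the file path and region here are assumptions, so adjust them to your deployment:

# Hypothetical target path -- the standard Grafana provisioning directory
cat <<'EOF' > /etc/grafana/provisioning/datasources/cloudwatch.yaml
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default      # use the instance/task role credentials
      defaultRegion: us-east-1
EOF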

🧭 Dashboard & Metrics to Use

Metrics collected under ContainerInsights:

  • node_gpu_limit, node_gpu_usage_total, node_gpu_reserved_capacity: node-level GPU capacity and usage.
  • pod_gpu_usage_total, pod_gpu_request: pod-level GPU consumption.

Use these to build graphs or alerts in Grafana.
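
As a sanity check before wiring up panels or alarms, you can pull one of these series with the CLI (a sketch; my-cluster is an assumed name, and the available dimension sets vary, so run list-metrics first if this returns nothing):

aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# GNU date shown; on macOS use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ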


🎯 Why This Approach Works

  • Fully automated: no manual device plugin, driver, or containerd tweaks
  • Managed and supported by AWS
  • Integrates seamlessly with Grafana
  • Includes logs + metrics for end-to-end observability




Terraform-based solution: set up GPU-level monitoring on your existing EKS cluster using CloudWatch Container Insights and visualize it in Grafana.


🚀 Step 1: Enable Container Insights with GPU support

Add this to your Terraform (assuming you already have an EKS cluster managed via Terraform):

data "aws_eks_cluster" "this" {
  name = var.eks_cluster_name
}
 
resource "aws_eks_addon" "cw_observability" {
  cluster_name             = data.aws_eks_cluster.this.name
  addon_name               = "amazon-cloudwatch-observability"
  addon_version            = "v2.1.2-eksbuild.1"  # or latest
  resolve_conflicts_on_update = "OVERWRITE"
  service_account_role_arn = aws_iam_role.cloudwatch_sa.arn
 
  configuration_values = jsonencode({
    agent = {
      config = {
        logs = {
          metrics_collected = {
            kubernetes = {
              enhanced_container_insights      = true
              accelerated_compute_metrics       = true
            }
          }
        }
      }
    }
  })
}

And define the cloudwatch_sa role:

resource "aws_iam_role" "cloudwatch_sa" {
  name = "${var.eks_cluster_name}-cw-agent"
 
  assume_role_policy = data.aws_iam_policy_document.eks_assume_sa.json
}
 
resource "aws_iam_role_policy_attachment" "cw_agent_attachment" {
  role       = aws_iam_role.cloudwatch_sa.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

This deploys the CloudWatch Agent + DCGM exporter, enabling GPU metrics collection to CloudWatch ([docs.aws.amazon.com][1], [stackoverflow.com][2], [dev.to][3]).


🛠 Step 2: Grant the IAM Role to the Service Account

data "aws_iam_policy_document" "eks_assume_sa" {
  statement {
    effect = "Allow"
    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.oidc.arn]
    }
    actions = ["sts:AssumeRoleWithWebIdentity"]
    condition {
      test     = "StringEquals"
      variable = "${data.aws_iam_openid_connect_provider.oidc.url}:sub"
      values   = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
    }
  }
}
 
data "aws_iam_openid_connect_provider" "oidc" {
  url = data.aws_eks_cluster.this.identity[0].oidc[0].issuer
}
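
After apply, you can verify that EKS annotated the agent's service account with the role (a quick check; assumes kubectl access to the cluster):

kubectl -n amazon-cloudwatch get serviceaccount cloudwatch-agent \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'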

πŸ” Step 3: Validate in CloudWatch

After terraform apply, wait a few minutes then check:

  • CloudWatch → Container Insights → EKS
  • You should see panels for GPU usage, memory, temperature, and power ([docs.aws.amazon.com][4], [blog.devops.dev][5]).
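
The add-on status is also visible from the CLI (my-cluster is an example; substitute your var.eks_cluster_name value):

aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.status' --output text   # expect ACTIVE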

📊 Step 4: Visualize GPU Metrics in Grafana

Option A: Amazon Managed Grafana

A minimal sketch using the AWS provider's native aws_grafana_workspace resource (the workspace and role names are illustrative):

# Managed Grafana workspace with CloudWatch pre-enabled as a data source
resource "aws_grafana_workspace" "this" {
  name                     = "eks-gpu-monitoring"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  role_arn                 = aws_iam_role.grafana.arn
  data_sources             = ["CLOUDWATCH"]
}

# Role the workspace assumes to read CloudWatch metrics
resource "aws_iam_role" "grafana" {
  name = "${var.eks_cluster_name}-grafana"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "grafana.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "grafana_cw" {
  role       = aws_iam_role.grafana.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonGrafanaCloudWatchAccess"
}

Then in Grafana UI:

  1. Add CloudWatch data source (select Container Insights).

  2. Import or build a dashboard tracking:

    • ContainerInsights/node_gpu_usage_total
    • ContainerInsights/node_gpu_limit
    • ContainerInsights/pod_gpu_usage_total

You can also import a premade GPU dashboard or build one from scratch using these metrics ([aws-observability.github.io][6]).

✅ Summary of the Full Flow

Stage | Tool | Outcome
Terraform setup | AWS EKS add-on | Deploys CloudWatch Agent + DCGM exporter across GPU nodes
CloudWatch | Container Insights | GPU metrics available (utilization, temperature, memory, power)
Grafana visualization | AWS Managed or open-source | Visualize GPU metrics using the CloudWatch data source

🔧 What You Should Do Next

  1. Copy the Terraform snippets above into your existing config.
  2. Run terraform init && terraform apply.
  3. Wait ~5 minutes for the add-on to deploy.
  4. Confirm GPU metrics exist in CloudWatch.
  5. Set up Grafana with CloudWatch integration and visualize your GPU dashboards.
