Enable GPU-level metrics on EKS using CloudWatch Container Insights and visualize them in Grafana, without the complexity of operator or runtime troubleshooting.
Step 1: Enable Container Insights with GPU Support
AWS now supports GPU observability natively through the Container Insights Enhanced Observability add-on.
Automatic Setup (recommended):

1. Ensure your EKS cluster has an OIDC provider associated:

```bash
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve
```

2. Attach the necessary IAM policy to the node role:

```bash
aws iam attach-role-policy \
  --role-name <EKSNodeInstanceRole> \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```

3. Enable the Container Insights add-on with GPU support:

```bash
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn <CloudWatchAgentServiceAccountRoleArn> \
  --resolve-conflicts OVERWRITE
```
This will deploy the CloudWatch Agent, DCGM exporter, and log agents across your EKS nodes automatically.
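To confirm the rollout, you can check what the add-on created in the `amazon-cloudwatch` namespace (a quick sketch; exact pod and daemonset names vary by add-on version, but you should see the CloudWatch agent, Fluent Bit, and a DCGM exporter pod on each GPU node):

```bash
# List everything the add-on deployed; GPU nodes should each run a DCGM exporter pod
kubectl get daemonsets,pods -n amazon-cloudwatch -o wide
```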
Step 2: Verify GPU Metrics in CloudWatch
After a few minutes, go to the CloudWatch console → Container Insights → EKS. You'll see multiple built-in dashboards. Look for GPU-specific panels showing:
- GPU Utilization
- GPU Memory Usage
- GPU Temperature
- GPU Power Consumption
These metrics are now being collected without you having to deploy nvidia-device-plugin or a DCGM exporter yourself, or do heaps of manual setup.
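You can also double-check from the CLI that GPU metrics are actually arriving in the `ContainerInsights` namespace (a sketch; `--region` should match your cluster's region):

```bash
# List every ContainerInsights metric whose name mentions "gpu"
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --region us-east-1 \
  --query "Metrics[?contains(MetricName, 'gpu')].MetricName" \
  --output text | tr '\t' '\n' | sort -u
```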
Step 3: Visualize in Grafana
You can use Amazon Managed Grafana or your own self-managed Grafana.
If using Managed Grafana:

1. Create or use an existing Grafana workspace.
2. Add CloudWatch as a data source (it supports Container Insights).
3. Import a GPU dashboard:
   - Browse dashboards on Grafana.com, or import a custom one.
   - Alternatively, build your own panel using metrics like `ContainerInsights` → `node_gpu_usage_total` across `ClusterName` and `NodeName` ([aws.amazon.com][5], [grafana.com][6], [docs.aws.amazon.com][7]).
If using self-hosted Grafana:
- Configure the CloudWatch data source (Grafana 6.5 or later); a provisioning sketch is shown below.
- Follow the same import-or-build dashboard steps as above.
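For self-hosted Grafana, the CloudWatch data source can also be provisioned from a file rather than the UI. A minimal sketch, assuming a standard package install and that the Grafana host (or its pod/instance role) already has CloudWatch read permissions; the file path and region are assumptions:

```bash
# Drop a provisioning file and restart Grafana to pick it up
sudo tee /etc/grafana/provisioning/datasources/cloudwatch.yaml > /dev/null <<'EOF'
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default      # use the instance/pod role; "keys" for static credentials
      defaultRegion: us-east-1
EOF
sudo systemctl restart grafana-server
```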
Dashboard & Metrics to Use
Metrics collected under ContainerInsights:
- node_gpu_limit, node_gpu_usage_total, node_gpu_reserved_capacity — node-level GPU capacity and usage.
- pod_gpu_usage_total, pod_gpu_request — pod-level GPU consumption.
Use these to build graphs or alerts in Grafana; a quick CLI sanity check of one metric is sketched below.
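Before wiring a metric into a panel, it can help to query it once from the CLI to confirm the dimensions. A sketch (the `ClusterName`/`NodeName`/`InstanceId` dimension set is an assumption; run the `list-metrics` command from Step 2 first and copy the exact dimensions it reports):

```bash
# Average GPU usage for one node over the last hour (GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster \
               Name=NodeName,Value=<node-name> \
               Name=InstanceId,Value=<instance-id> \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```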
Why This Approach Works
- Fully automated: no manual device plugin, driver, or containerd tweaks
- Managed and supported by AWS
- Integrates seamlessly with Grafana
- Includes logs + metrics for end-to-end observability
The following is a Terraform-based solution to set up GPU-level monitoring on your existing EKS cluster using CloudWatch Container Insights and visualize it in Grafana.
Step 1: Enable Container Insights with GPU Support
Add this to your Terraform (assuming you already have an EKS cluster managed via Terraform):
data "aws_eks_cluster" "this" {
name = var.eks_cluster_name
}
resource "aws_eks_addon" "cw_observability" {
cluster_name = data.aws_eks_cluster.this.name
addon_name = "amazon-cloudwatch-observability"
addon_version = "v2.1.2-eksbuild.1" # or latest
resolve_conflicts_on_update = "OVERWRITE"
service_account_role_arn = aws_iam_role.cloudwatch_sa.arn
configuration_values = jsonencode({
agent = {
config = {
logs = {
metrics_collected = {
kubernetes = {
enhanced_container_insights = true
accelerated_compute_metrics = true
}
}
}
}
}
})
}And define the cloudwatch_sa role:
resource "aws_iam_role" "cloudwatch_sa" {
name = "${var.eks_cluster_name}-cw-agent"
assume_role_policy = data.aws_iam_policy_document.eks_assume_sa.json
}
resource "aws_iam_role_policy_attachment" "cw_agent_attachment" {
role = aws_iam_role.cloudwatch_sa.name
policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}This deploys the CloudWatch Agent + DCGM exporter, enabling GPU metrics collection to CloudWatch ([docs.aws.amazon.com][1], [stackoverflow.com][2], [dev.to][3]).
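Once this is applied (together with the trust policy from Step 2 below), you can confirm the add-on finished installing before hunting for metrics (replace `my-cluster` with your `var.eks_cluster_name` value):

```bash
# The add-on should report ACTIVE once the agent and DCGM exporter are rolled out
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.{status: status, version: addonVersion}' \
  --output table
```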
Step 2: Grant the IAM Role to the Service Account
data "aws_iam_policy_document" "eks_assume_sa" {
statement {
effect = "Allow"
principals {
type = "Federated"
identifiers = [data.aws_iam_openid_connect_provider.oidc.arn]
}
actions = ["sts:AssumeRoleWithWebIdentity"]
condition {
test = "StringEquals"
variable = "${data.aws_iam_openid_connect_provider.oidc.url}:sub"
values = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
}
}
}
data "aws_iam_openid_connect_provider" "oidc" {
url = data.aws_eks_cluster.this.identity[0].oidc[0].issuer
}π Step 3: Validate in CloudWatch
After `terraform apply`, wait a few minutes, then check:
- CloudWatch → Container Insights → EKS
- You should see panels for GPU usage, memory, temperature, and power ([docs.aws.amazon.com][4], [blog.devops.dev][5]). If the panels stay empty, the CLI check below can help narrow things down.
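A common culprit for missing metrics is the IRSA binding. A quick check (assuming the add-on created the default `cloudwatch-agent` service account in the `amazon-cloudwatch` namespace):

```bash
# Should print the ARN of the aws_iam_role.cloudwatch_sa role
kubectl -n amazon-cloudwatch get serviceaccount cloudwatch-agent \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```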
Step 4: Visualize GPU Metrics in Grafana
Option A: Amazon Managed Grafana
module "mgw" {
source = "terraform-aws-modules/grafana/aws"
version = "x.y.z"
workspace_name = "eks-gpu-monitoring"
authentication = {
providers = ["AWS_SSO"]
}
permissions = [{
principal = aws_iam_role.cloudwatch_sa.arn
role = "ADMIN"
}]
}Then in Grafana UI:
1. Add the CloudWatch data source (select Container Insights).
2. Upload/import a dashboard watching:
   - ContainerInsights/node_gpu_usage_total
   - ContainerInsights/node_gpu_limit
   - ContainerInsights/pod_gpu_usage_total

You can also import a premade GPU dashboard or build one from scratch using these metrics ([aws-observability.github.io][6]).
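Once Terraform has created the workspace, you can grab its URL from the CLI to sign in and add the data source (a sketch; the workspace name matches the Terraform above):

```bash
# Print the endpoint of the eks-gpu-monitoring workspace
aws grafana list-workspaces \
  --query "workspaces[?name=='eks-gpu-monitoring'].endpoint" \
  --output text
```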
Summary of the Full Flow
| Stage | Tool | Outcome |
|---|---|---|
| Terraform Setup | AWS EKS Add-on | Deploys CloudWatch Agent + DCGM exporter across GPU nodes |
| CloudWatch | Container Insights | GPU metrics available (GPU Utilization, Temp, Memory, Power) |
| Grafana Visualization | AWS/Managed or Open-source | Visualize GPU metrics using CloudWatch data source |
What You Should Do Next
- Copy the Terraform snippets above into your existing config.
- Run `terraform init && terraform apply`.
- Wait ~5 minutes for the add-on to deploy.
- Confirm GPU metrics exist in CloudWatch.
- Set up Grafana with CloudWatch integration and visualize your GPU dashboards.