GPU DCGM Exporter

Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana


✅ Overview of the Stack

| Component | Purpose |
|---|---|
| NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet) |
| Prometheus | Scrapes GPU metrics from DCGM Exporter |
| Grafana | Visualizes metrics using dashboards |

🛠 Step-by-Step Setup

✅ 1. Install NVIDIA DCGM Exporter

DCGM Exporter, built on NVIDIA Data Center GPU Manager (DCGM), exposes GPU metrics in Prometheus format.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployments/k8s/dcgm-exporter.yaml

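Verify that the exporter pods are running and that the metrics endpoint responds. These commands assume the exporter runs in the gpu-operator namespace; change -n if yours is deployed elsewhere.

# List pods in the namespace; expect one dcgm-exporter pod per GPU node
kubectl get pods -n gpu-operator
# Forward the metrics port; this blocks, so run curl from a second terminal
kubectl port-forward svc/dcgm-exporter 9400:9400 -n gpu-operator
curl http://localhost:9400/metrics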

You should see Prometheus-format metrics like DCGM_FI_DEV_GPU_UTIL.
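The full /metrics output is long, so it can help to filter it down to one metric. The following is a minimal sketch, assuming the port-forward from the previous step is still running; the helper name gpu_values is hypothetical, and the gpu label is assumed to be present on each sample line.

```shell
# gpu_values METRIC
# Reads Prometheus text-format metrics on stdin and prints one line per
# sample of METRIC, showing the "gpu" label (if present) and the value.
gpu_values() {
  awk -v metric="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    index($0, metric "{") == 1 {         # sample lines start with the metric name
      gpu = "?"
      if (match($0, /gpu="[^"]*"/)) {
        # strip the leading gpu=" (5 chars) and the trailing quote
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      print "gpu=" gpu, "value=" $NF     # the value is the last field
    }
  '
}

# Usage (requires the port-forward above):
#   curl -s http://localhost:9400/metrics | gpu_values DCGM_FI_DEV_GPU_UTIL
```

An empty result usually means the metric is not being exported or the name is misspelled.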


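✅ 2. Ensure Prometheus is Installed and Scraping

With the Prometheus Operator (installed from Helm), scrape targets are declared through ServiceMonitor resources rather than static scrape configs.

If you use kube-prometheus-stack, Prometheus picks up a ServiceMonitor automatically as long as it carries the labels the chart selects on; by default that is release: <Helm release name>.

If the exporter is not being scraped yet, add this ServiceMonitor. Set the release label to your own Helm release name (prometheus here):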

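apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus   # must match the label your Prometheus selects ServiceMonitors by
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # label on the exporter Service
  namespaceSelector:
    matchNames:
      - gpu-operator   # namespace where the exporter Service lives
  endpoints:
    - port: "metrics"   # name of the Service port, not its number
      interval: 30s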
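A ServiceMonitor only produces scrape targets if a Service matches its selector and exposes a port with the referenced name. The sketch below lists Services carrying the label used above, along with their port names, so you can compare them against the manifest. The helper name check_dcgm_service is hypothetical, and the default namespace is taken from the manifest above.

```shell
# check_dcgm_service [NAMESPACE]
# Lists Services labelled app.kubernetes.io/name=dcgm-exporter in the given
# namespace (default: gpu-operator) together with their port names.
check_dcgm_service() {
  kubectl get svc -n "${1:-gpu-operator}" \
    -l app.kubernetes.io/name=dcgm-exporter \
    -o 'custom-columns=NAME:.metadata.name,PORTS:.spec.ports[*].name'
}

# Usage:
#   check_dcgm_service               # checks the gpu-operator namespace
#   check_dcgm_service my-namespace  # checks another namespace
```

If no Service is listed, or the PORTS column does not include metrics, adjust the selector or the port name in the ServiceMonitor to match what is actually deployed.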

✅ 3. Install Grafana Dashboard

Use NVIDIA's official GPU dashboards:

  • Open Grafana
  • Go to Dashboards > Import
  • Use one of the following dashboard IDs:
| Name | Grafana.com ID |
|---|---|
| NVIDIA DCGM Exporter GPU Dashboard | 12239 |
| Kubernetes GPU Monitoring | 15176 |

You can also customize and clone these dashboards.


✅ 4. Verify GPU Metrics in Prometheus

Run these queries in the Prometheus UI or in Grafana Explore:

DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED

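DCGM_FI_DEV_GPU_UTIL reports GPU utilization in percent, DCGM_FI_DEV_MEM_COPY_UTIL reports memory utilization in percent, and DCGM_FI_DEV_FB_USED reports framebuffer memory used in MiB.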
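The same queries can also be run from a terminal through the Prometheus HTTP API, which is handy for scripted checks. This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward to your Prometheus Service); the PROM_URL variable and the prom_query helper are hypothetical names, and the Hostname label in the usage example is assumed to be present on the DCGM series.

```shell
# Base URL of the Prometheus server; override by exporting PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR
# Sends an instant query to /api/v1/query and prints the JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'DCGM_FI_DEV_GPU_UTIL'
#   prom_query 'avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'
```

A response with "status":"success" and an empty result list means Prometheus is up but has no samples for that metric, which usually points back to the scrape configuration in step 2.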


🔒 Optional: Node Exporter GPU Plugin (Advanced)

If you want host-level detail alongside GPU, you can also use node_exporter with custom GPU scripts, but dcgm-exporter is the preferred method for Kubernetes.
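As an illustration of what a custom GPU script for node_exporter might look like, the sketch below converts nvidia-smi CSV output into Prometheus text format, suitable for node_exporter's textfile collector. The helper name gpu_csv_to_prom, the metric name node_gpu_utilization_percent, and the output path in the usage comment are all hypothetical; this is one possible approach, not a documented plugin.

```shell
# gpu_csv_to_prom
# Reads "index, utilization" CSV rows on stdin (as produced by nvidia-smi
# with --format=csv,noheader,nounits) and prints Prometheus text format.
gpu_csv_to_prom() {
  awk -F', *' '
    BEGIN {
      print "# HELP node_gpu_utilization_percent GPU utilization reported by nvidia-smi."
      print "# TYPE node_gpu_utilization_percent gauge"
    }
    # one sample per GPU; rows without exactly two fields are ignored
    NF == 2 { printf "node_gpu_utilization_percent{gpu=\"%s\"} %s\n", $1, $2 }
  '
}

# Usage (the directory must match node_exporter's textfile collector setting,
# and the file name must end in .prom):
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits \
#     | gpu_csv_to_prom > /path/to/textfile-dir/gpu.prom
```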


✅ Summary

| Tool | Role |
|---|---|
| DCGM Exporter | Exposes GPU metrics |
| Prometheus | Scrapes and stores GPU metrics |
| Grafana | Visualizes with dashboards (12239, 15176) |
