Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana
Overview of the Stack
| Component | Purpose |
|---|---|
| NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet) |
| Prometheus | Scrapes GPU metrics from DCGM Exporter |
| Grafana | Visualizes metrics using dashboards |
Step-by-Step Setup
1. Install NVIDIA DCGM Exporter
The DCGM Exporter, built on NVIDIA's Data Center GPU Manager (DCGM), exposes GPU metrics in Prometheus format.

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployments/k8s/dcgm-exporter.yaml

Verify (replace gpu-operator with the namespace the exporter actually runs in; port-forward blocks, so run curl in a second terminal):

    kubectl get pods -n gpu-operator
    kubectl port-forward svc/dcgm-exporter 9400:9400 -n gpu-operator
    curl http://localhost:9400/metrics

You should see Prometheus-format metrics such as DCGM_FI_DEV_GPU_UTIL.
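The manual check above can also be scripted. The following is a minimal sketch, not part of the exporter itself: `check_dcgm_metrics` is a hypothetical helper that reads Prometheus-format text on stdin and lists which of the three metric names used in this guide are missing. It assumes the exporter's counter configuration emits all three.

```shell
# Hypothetical helper (not shipped with dcgm-exporter): reads Prometheus-format
# text on stdin and prints one "missing: NAME" line per absent metric.
# Returns 0 when every expected metric is present, 1 otherwise.
check_dcgm_metrics() {
  input="$(cat)"
  missing=0
  for name in DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_MEM_COPY_UTIL DCGM_FI_DEV_FB_USED; do
    # A sample line starts with the metric name followed by "{" (labels) or a
    # space; "# HELP" and "# TYPE" comment lines do not match.
    if ! printf '%s\n' "$input" | grep -Eq "^${name}[{ ]"; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the port-forward from the Verify step running in another terminal:
#   curl -s http://localhost:9400/metrics | check_dcgm_metrics && echo "all present"
```

Matching on the metric name at the start of a line keeps the check independent of label sets, which differ between clusters.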
2. Ensure Prometheus is Installed and Scraping
If you run the Prometheus Operator (installed via Helm), scrape targets are declared through ServiceMonitor resources.
With kube-prometheus-stack, discovery is automatic as long as the ServiceMonitor carries the labels your Prometheus instance selects on.
To configure scraping manually, add this ServiceMonitor:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: dcgm-exporter
      labels:
        release: prometheus
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
      namespaceSelector:
        matchNames:
          - gpu-operator
      endpoints:
        - port: "metrics"
          interval: 30s

Adjust the release label to match your Helm release name, and the namespace to wherever the exporter Service lives.

3. Install Grafana Dashboard
Use NVIDIA's official GPU dashboards:
- Open Grafana
- Go to Dashboards > Import
- Use one of the following dashboard IDs:
| Name | Grafana.com ID |
|---|---|
| NVIDIA DCGM Exporter GPU Dashboard | 12239 |
| Kubernetes GPU Monitoring | 15176 |
You can also customize and clone these dashboards.
4. Verify GPU Metrics in Prometheus
Run these queries in the Prometheus UI or in Grafana Explore:

    DCGM_FI_DEV_GPU_UTIL
    DCGM_FI_DEV_MEM_COPY_UTIL
    DCGM_FI_DEV_FB_USED

These return GPU utilization (%), memory copy utilization (%), and framebuffer memory used (MiB), respectively.
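The same queries can also be run from a terminal through the Prometheus HTTP API (`/api/v1/query`). This is a sketch under stated assumptions: `prom_query` is a hypothetical helper name, `PROM_URL` and the `localhost:9090` default assume Prometheus is reachable there (for example through a port-forward), and the example queries assume the exporter emits the `Hostname` label and `DCGM_FI_DEV_FB_FREE`.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own Prometheus endpoint.
prom_query() {
  # -G sends the URL-encoded data as a query string on a GET request.
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example queries (label names and DCGM_FI_DEV_FB_FREE are assumed to be exported):
#   prom_query 'DCGM_FI_DEV_GPU_UTIL'
#   prom_query 'avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'
#   prom_query '100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)'
```

An empty `result` array in the JSON response usually means Prometheus is not scraping the exporter yet, which points back to the ServiceMonitor labels in step 2.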
Optional: Node Exporter GPU Plugin (Advanced)
If you want host-level detail alongside GPU metrics, you can also run node_exporter with custom GPU scripts (for example through its textfile collector), but dcgm-exporter is the preferred method for Kubernetes.
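As one illustration of such a custom script, the sketch below turns `nvidia-smi` CSV output into Prometheus text format for node_exporter's textfile collector. The function name `nvsmi_to_prom`, the metric names, and the output directory are all hypothetical choices for this example; the directory must match whatever `--collector.textfile.directory` is set to on your nodes.

```shell
# Hypothetical converter: reads "index, utilization, memory_used" CSV lines on
# stdin (as produced by nvidia-smi with csv,noheader,nounits) and writes
# Prometheus text-format samples to stdout. Metric names are illustrative only.
nvsmi_to_prom() {
  awk -F', *' '
    NF == 3 {
      printf "node_gpu_utilization_percent{gpu=\"%s\"} %s\n", $1, $2
      printf "node_gpu_memory_used_mib{gpu=\"%s\"} %s\n", $1, $3
    }'
}

# Usage (the target directory is an assumption; files must end in .prom):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#     --format=csv,noheader,nounits \
#     | nvsmi_to_prom > /var/lib/node_exporter/textfile_collector/gpu.prom
```

Writing to a temporary file and renaming it into place avoids node_exporter reading a half-written file.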
Summary
| Tool | Role |
|---|---|
| DCGM Exporter | Exposes GPU metrics |
| Prometheus | Scrapes and stores GPU metrics |
| Grafana | Visualizes with dashboards (12239, 15176) |
Would you like a Helm chart-based setup for Prometheus + Grafana + DCGM on EKS?