Adding labels to your alert configuration helps provide context and metadata that can be used for better alert routing, grouping, and filtering. Here are some useful labels you can consider adding to your Prometheus alert rules:
- severity: Indicates the importance of the alert, such as critical, warning, or info.
- alert_type: Specifies the type of alert, such as resource_usage, performance, availability, or security.
- resource: Indicates the resource involved, such as cpu, memory, disk, or network.
- job: Denotes the job or service name from which the alert originates, such as node_exporter or api_server (inherited from the Prometheus metrics the rule evaluates).
- instance: Represents the specific instance (server) where the alert is firing, such as server1.example.com.
- environment: Specifies the environment in which the alert occurs, such as production, staging, or development.
- region: Indicates the geographical region, such as us-east-1 or eu-west-1; useful for multi-region monitoring.
- datacenter: Specifies the data center where the instance is hosted, e.g., dc1, dc2, dc3.
- service: Denotes the specific service or application generating the alert, such as database, webserver, or cache.
- team: Specifies the responsible team, such as ops, dev, or networking, so alerts can be routed to that team.
- priority: Indicates urgency, using levels such as P1, P2, P3.
- sla: Indicates the Service Level Agreement (SLA) associated with the alert, which helps prioritize based on SLA requirements.
- owner: Specifies the individual or team responsible for the resource; useful for routing alerts to the correct owner.
- cluster: Identifies which cluster the alert relates to in multi-cluster environments.
- component: Represents a specific component of an application or infrastructure, such as frontend, backend, or database.
- impact: Describes the impact level of the alert, e.g., high, medium, low; useful for prioritizing alerts.
- customer: Indicates a specific customer or tenant; useful in multi-tenant environments for filtering and alert management.
- function: Represents the specific function within the service, such as authentication, logging, or api_gateway.
- mode: Describes the operational role of the affected node, e.g., primary, replica, standby; useful in clustered or HA systems.
- cause: Indicates a known cause for the alert if identifiable, e.g., network_issue, hardware_failure, high_traffic, to provide context.
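
Not every label in this list has to be set in the alert rule itself. Labels such as job and instance are attached at scrape time, and target-level labels such as environment or region can be attached the same way. The fragment below is a minimal sketch of a Prometheus scrape configuration that does this; the target address and label values are illustrative assumptions, not taken from a real setup.

```yaml
# prometheus.yml (fragment) -- hypothetical target and label values
scrape_configs:
  - job_name: node_exporter              # becomes the job label on every scraped series
    static_configs:
      - targets: ['server1.example.com:9100']   # instance defaults to this address, port included
        labels:
          environment: production        # extra labels attached to every series from these targets
          region: us-east-1
```

Labels attached this way travel with the metrics, so they are present on any alert that fires from those series without being repeated in each rule.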
Example Usage with New Labels
Here’s how you could incorporate some of these labels in an alert rule:
```yaml
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
  for: 2m
  labels:
    severity: warning
    alert_type: resource_usage
    resource: memory
    environment: production
    region: us-east-1
    team: ops
    priority: P2
  annotations:
    summary: "High Memory Usage Detected on {{ $labels.instance }}"
    # $labels holds the labels of the series returned by expr, so
    # {{ $labels.environment }} resolves only if the scraped series carries that label.
    description: "Memory usage is above 90% for more than 2 minutes on instance {{ $labels.instance }} in the {{ $labels.environment }} environment (job: {{ $labels.job }})."
```

Adding such labels gives each alert more context, making it easier to route, filter, and analyze.
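
To illustrate the routing side, here is a minimal sketch of an Alertmanager configuration that matches on the team, severity, and environment labels from the rule above. The receiver names (default, ops-oncall, non-prod) are hypothetical placeholders, and the receivers are left without notification integrations, so this is a starting point under those assumptions rather than a complete setup.

```yaml
# alertmanager.yml (fragment) -- receiver names are hypothetical
route:
  receiver: default                      # fallback when no child route matches
  group_by: ['alertname', 'environment', 'region']
  routes:
    - matchers:
        - team="ops"
        - severity=~"critical|warning"
      receiver: ops-oncall               # ops alerts at warning level or above
    - matchers:
        - environment!="production"
      receiver: non-prod                 # keep staging/development noise separate

receivers:
  - name: default
  - name: ops-oncall
  - name: non-prod
```

Child routes are evaluated in order and the first match wins by default, so the more specific team route is listed before the broader environment route.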