Labels – ahmadrazalab

Adding labels to your alert configuration helps provide context and metadata that can be used for better alert routing, grouping, and filtering. Here are some useful labels you can consider adding to your Prometheus alert rules:

severity: Indicates the importance of the alert, such as critical, warning, info, etc.
alert_type: Specifies the type of alert, like resource_usage, performance, availability, security, etc.
resource: Indicates the resource involved, such as cpu, memory, disk, network, etc.
job: Denotes the job or service name from which the alert originates (inherited from Prometheus metrics, such as node_exporter, api_server).
instance: Represents the specific instance (server) where the alert is firing, like server1.example.com.
environment: Specifies the environment in which the alert occurs, such as production, staging, development.
region: Indicates the geographical region, such as us-east-1, eu-west-1, etc., useful for multi-region monitoring.
datacenter: Specifies the data center where the instance is hosted, e.g., dc1, dc2, dc3.
service: Denotes the specific service or application generating the alert, like database, webserver, cache.
team: Specifies the responsible team, such as ops, dev, networking, allowing alerts to be routed to specific teams.
priority: Helps indicate urgency, using levels such as P1, P2, P3.
sla: Indicates the Service Level Agreement (SLA) associated with the alert, which can help prioritize based on SLA requirements.
owner: Specifies the individual or team responsible for the resource, useful for routing alerts to the correct owner.
cluster: Helps in multi-cluster environments to identify which cluster the alert is related to.
component: Represents a specific component of an application or infrastructure, such as frontend, backend, database.
impact: Describes the impact level of the alert, e.g., high, medium, low, useful for prioritizing alerts.
customer: Indicates a specific customer or tenant, useful in multi-tenant environments for filtering and alert management.
function: Represents the specific function within the service, such as authentication, logging, api_gateway.
mode: Describes the operational mode of the alert, e.g., primary, replica, standby, useful in clustered or HA systems.
cause: Indicates a known cause for the alert if identifiable, e.g., network_issue, hardware_failure, high_traffic, to provide context.

Example Usage with New Labels

Here’s how you could incorporate some of these labels in an alert rule:

- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
  for: 2m
  labels:
    severity: warning
    alert_type: resource_usage
    resource: memory
    environment: production
    region: us-east-1
    team: ops
    priority: P2
  annotations:
    summary: "High Memory Usage Detected on {{ $labels.instance }}"
    description: "Memory usage is above 90% for more than 2 minutes on instance {{ $labels.instance }} in the {{ $labels.environment }} environment (job: {{ $labels.job }})."

Adding such labels will provide better context in the alert details, making it easier to route, filter, and analyze alerts.

Group Byalerts.yaml Logging