Monitoring your RabbitMQ cluster is crucial for maintaining performance and ensuring that your messaging infrastructure keeps up with its workload without downtime. Prometheus integrates well with RabbitMQ, letting you gather metrics and alert on them. Below is a set of useful alert rules to help you manage your RabbitMQ cluster proactively.
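If you have not wired up metrics collection yet, a common starting point is to enable the built-in rabbitmq_prometheus plugin on each node with rabbitmq-plugins enable rabbitmq_prometheus (it exposes metrics on port 15692 by default) and add a scrape job for it. The snippet below is a minimal sketch; the host name rabbitmq.example.com and the scrape interval are placeholders to adapt to your environment.

scrape_configs:
  - job_name: rabbitmq          # matches the job="rabbitmq" selector used in the rules below
    scrape_interval: 15s
    static_configs:
      - targets: ["rabbitmq.example.com:15692"]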
1. Queue Length Alerts
Queues are at the core of RabbitMQ’s messaging system. Long queues can indicate that consumers are not processing messages at the expected rate.
groups:
- name: rabbitmq-alerts
  rules:
  - alert: HighQueueLength
    expr: rabbitmq_queue_messages{job="rabbitmq"} > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High queue length detected in RabbitMQ"
      description: "Queue {{ $labels.queue }} has more than 1000 messages. Consider investigating consumer performance."
2. Consumer Utilization Alerts
It’s essential to monitor whether consumers are under- or over-utilized. Too few active consumers can cause messages to back up.
  - alert: LowConsumerCount
    # sum() adds up consumers across all queues; count() would only tally the number of series
    expr: sum(rabbitmq_queue_consumers{job="rabbitmq"}) < 2
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Low number of RabbitMQ consumers"
      description: "Only {{ $value }} consumers are active. Verify that consumers are running and healthy."
3. Connection Rate Alerts
High rates of connection attempts can indicate application-level issues or possible attacks. Tracking the connection rate can help you react quickly.
  - alert: HighConnectionRate
    # rate() yields a per-second value; connections_opened_total is the counter exposed by the
    # built-in rabbitmq_prometheus plugin (adjust the name if you use a different exporter)
    expr: rate(rabbitmq_connections_opened_total{job="rabbitmq"}[5m]) > 20
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "High connection rate to RabbitMQ"
      description: "Connections are being opened at more than 20 per second. Investigate potential issues or attacks."
4. Memory Usage Alerts
Memory usage is one of the critical metrics you should monitor. RabbitMQ blocks publishers once memory use crosses its configured high watermark.
  - alert: HighMemoryUsage
    # metric names from the built-in rabbitmq_prometheus plugin; adjust if your exporter differs
    expr: rabbitmq_process_resident_memory_bytes{job="rabbitmq"} / rabbitmq_resident_memory_limit_bytes{job="rabbitmq"} > 0.85
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage in RabbitMQ"
      description: "Memory usage is above 85% of the configured limit. Consider adding resources or checking for memory leaks."
5. Disk Space Alerts
RabbitMQ requires sufficient disk space to function correctly. Monitoring disk space is crucial to ensuring message durability.
  - alert: LowDiskSpace
    # metric name from the built-in rabbitmq_prometheus plugin; adjust if your exporter differs
    expr: rabbitmq_disk_space_available_bytes{job="rabbitmq"} < 10737418240 # 10 GiB
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on RabbitMQ"
      description: "Free disk space is below 10 GiB. Ensure that sufficient disk space is available."
6. TCP Connection Alerts
Monitoring the number of TCP connections is essential to ensure you aren’t exceeding the limits set for the RabbitMQ server.
  - alert: TooManyTcpConnections
    expr: rabbitmq_connections{job="rabbitmq"} > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High TCP connection count to RabbitMQ"
      description: "TCP connections exceed 500. Check application workloads and connections."
Conclusion
Implementing these Prometheus alert rules will help you keep a proactive eye on your RabbitMQ cluster, enabling you to respond to potential issues before they escalate into serious problems. Monitoring system health and having efficient alerting mechanisms can significantly improve your overall DevOps practice and application reliability.
Make sure to test these alert rules and fine-tune the thresholds according to your RabbitMQ usage patterns. Additionally, you can extend these alerting rules based on specific metrics related to your applications or RabbitMQ configurations.
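To put the rules into effect, one common approach is to save them in a file (rabbitmq-alerts.yml is just an example name) and reference it from prometheus.yml:

rule_files:
  - "rabbitmq-alerts.yml"

You can then run promtool check rules rabbitmq-alerts.yml to catch syntax errors before reloading Prometheus.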