VictoriaMetrics/deployment/docker/alerts-health.yml
Roman Khavronenko deb2f87074
deployment: add panel and alerts for displaying go scheduler latency (#7078)
The panel and alerting rule should help understand whether a VM
component doesn't have enough CPU resources or is being throttled. The alert
is applicable to all VM components.
The panel was added to the vmalert, vmagent, vmsingle, vm cluster and
victorialogs dashboards.

-------------------

This alerting rule should have helped us identify the resource shortage for
the sandbox vmagent - see [this
link](https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/graph/#/?g0.range_input=23d13h25m25s424ms&g0.end_input=2024-09-23T14%3A11%3A00&g0.relative_time=none&g0.tab=0&g0.expr=histogram_quantile%280.99%2C+sum%28rate%28go_sched_latencies_seconds_bucket%7Bjob%3D%22vmagent-monitoring-vmagent%22%7D%5B5m%5D%29%29+by+%28le%2C+job%2C+instance%29%29+%3E+0.1)
for example. We weren't aware of the resource shortage because VM metrics
assumed this vmagent had 1 vCPU, while in fact its limit was 0.2 vCPU.
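
For reference, the query behind that link, decoded from the URL (the `job` filter
targets the sandbox vmagent), is:

```promql
histogram_quantile(0.99,
  sum(rate(go_sched_latencies_seconds_bucket{job="vmagent-monitoring-vmagent"}[5m])) by (le, job, instance)
) > 0.1
```

The expression exceeds the threshold when the 99th percentile of Go scheduler latency
stays above 100ms, which is usually a sign of insufficient CPU resources or CPU throttling.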

Signed-off-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit 4d0b41e63b)
2024-09-24 16:58:14 +02:00

# This file contains the default list of alerts for various VM components.
# The following alerts are recommended for any VM installation.
# The alerts below are just recommendations and may require updates
# and threshold calibration for each specific setup.
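#
# Note: this rules file is typically loaded by vmalert via its `-rule` command-line flag.
# A minimal sketch (URLs and file paths are placeholders - adjust them to your setup):
#   ./vmalert -rule=/etc/alerts/alerts-health.yml \
#             -datasource.url=http://victoriametrics:8428 \
#             -notifier.url=http://alertmanager:9093
# The file can also be validated without starting the service by adding the `-dryRun` flag.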
groups:
  - name: vm-health
    # note the `job` filter and update accordingly to your setup
    rules:
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~".*(victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert|vmsingle|vmalertmanager|vmauth).*"}[15m]) > 2
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} too many restarts (instance {{ $labels.instance }})"
          description: >
            Job {{ $labels.job }} (instance {{ $labels.instance }}) has restarted more than twice in the last 15 minutes.
            It might be crashlooping.
      - alert: ServiceDown
        expr: up{job=~".*(victoriametrics|vmselect|vminsert|vmstorage|vmagent|vmalert|vmsingle|vmalertmanager|vmauth).*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down on {{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
      - alert: ProcessNearFDLimits
        expr: (process_max_fds - process_open_fds) < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Number of free file descriptors is less than 100 for \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") for the last 5m"
          description: |
            Exhausting the OS limit on file descriptors can cause severe degradation of the process.
            Consider increasing the limit as soon as possible.
      - alert: TooHighMemoryUsage
        expr: (min_over_time(process_resident_memory_anon_bytes[10m]) / vm_available_memory_bytes) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 80% of memory is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\")"
          description: |
            Too high memory usage may result in multiple issues such as OOMs or degraded performance.
            Consider either increasing available memory or decreasing the load on the process.
      - alert: TooHighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 90% of CPU is used by \"{{ $labels.job }}\"(\"{{ $labels.instance }}\") during the last 5m"
          description: >
            Too high CPU usage may be a sign of insufficient resources and may make the process unstable.
            Consider either increasing available CPU resources or decreasing the load on the process.
      - alert: TooHighGoroutineSchedulingLatency
        expr: histogram_quantile(0.99, sum(rate(go_sched_latencies_seconds_bucket[5m])) by (le, job, instance)) > 0.1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "\"{{ $labels.job }}\"(\"{{ $labels.instance }}\") has insufficient CPU resources for >15m"
          description: >
            The Go runtime is unable to schedule goroutine execution in an acceptable time. This is usually a sign of
            insufficient CPU resources or CPU throttling. Verify that the service has enough CPU resources. Otherwise,
            the service could work unreliably, with delays in processing.
      - alert: TooManyLogs
        expr: sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Too many logs printed for job \"{{ $labels.job }}\" ({{ $labels.instance }})"
          description: >
            The logging rate for job "{{ $labels.job }}" ({{ $labels.instance }}) is {{ $value }} for the last 15m.
            It is worth checking the logs for specific error messages.
      - alert: TooManyTSIDMisses
        expr: rate(vm_missing_tsids_for_metric_id_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Too many TSID misses for job \"{{ $labels.job }}\" ({{ $labels.instance }})"
          description: |
            The rate of TSID misses during query lookups is too high for "{{ $labels.job }}" ({{ $labels.instance }}).
            Make sure you're running VictoriaMetrics v1.85.3 or higher.
            Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502
      - alert: ConcurrentInsertsHitTheLimit
        expr: avg_over_time(vm_concurrent_insert_current[1m]) >= vm_concurrent_insert_capacity
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on instance {{ $labels.instance }} is constantly hitting the concurrent inserts limit"
          description: |
            The limit of concurrent inserts on instance {{ $labels.instance }} depends on the number of CPUs.
            Usually, when a component constantly hits the limit, it is likely overloaded and requires more CPU.
            In some cases, for components like vmagent or vminsert, the alert might trigger if there are too many clients
            making write attempts. If vmagent's or vminsert's CPU usage and network saturation are at a normal level,
            then it might be worth adjusting the `-maxConcurrentInserts` cmd-line flag.
      - alert: IndexDBRecordsDrop
        expr: increase(vm_indexdb_items_dropped_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "IndexDB skipped registering items during data ingestion with reason={{ $labels.reason }}."
          description: |
            VictoriaMetrics could skip registering new time series during ingestion if they fail the validation process.
            For example, `reason=too_long_item` means that a time series cannot exceed 64KB. Please reduce the number
            of labels or label values for such series, or enforce these limits via the `-maxLabelsPerTimeseries` and
            `-maxLabelValueLen` command-line flags.
      - alert: TooLongLabelValues
        expr: increase(vm_too_long_label_values_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "VictoriaMetrics truncates too long label values"
          description: |
            The maximum length of a label value is limited via the `-maxLabelValueLen` cmd-line flag.
            Longer label values are truncated and may result in time series overlapping.
            Please check your logs to find which labels were truncated and
            either reduce the size of label values or increase `-maxLabelValueLen`.
      - alert: TooLongLabelNames
        expr: increase(vm_too_long_label_names_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "VictoriaMetrics truncates too long label names"
          description: >
            The maximum length of a label name is limited to 256 bytes.
            Longer label names are truncated and may result in time series overlapping.