# VictoriaMetrics/deployment/docker/alerts.yml

# This file contains the default list of alerts for VictoriaMetrics single server.
# The alerts below are only recommendations and may require updates
# and threshold calibration for each specific setup.
groups:
  # Alerts group for VM single assumes that Grafana dashboard
  # https://grafana.com/grafana/dashboards/10229 is installed.
  # Please update the `dashboard` annotation according to your setup.
  - name: vmsingle
    interval: 30s
    concurrency: 2
    rules:
      - alert: DiskRunsOutOfSpaceIn3Days
        expr: |
          vm_free_disk_space_bytes / ignoring(path)
          (
            rate(vm_rows_added_to_storage_total[1d])
            * scalar(
              sum(vm_data_size_bytes{type!~"indexdb.*"}) /
              sum(vm_rows{type!~"indexdb.*"})
            )
          ) < 3 * 24 * 3600 > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=73&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} will run out of disk space soon"
          description: "At the current ingestion rate, free disk space will last only
            for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
            Consider limiting the ingestion rate, decreasing retention or scaling the disk space if possible."
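      # Worked example for the DiskRunsOutOfSpaceIn3Days expression, using
      # hypothetical numbers (assumptions for illustration, not measurements):
      #   free disk space:           17,280,000,000 bytes
      #   ingestion rate:            100,000 rows/s  (rate(vm_rows_added_to_storage_total[1d]))
      #   average on-disk row size:  1 byte/row      (data size / rows, indexdb excluded)
      #   estimated time until full: 17,280,000,000 / (100,000 * 1) = 172,800 s = 2 days
      # 172,800 s is below the 3 * 24 * 3600 = 259,200 s threshold, so the alert
      # would fire once the condition has held for 30m. The trailing `> 0`
      # drops non-positive estimates.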

      - alert: DiskRunsOutOfSpace
        expr: |
          sum(vm_data_size_bytes) by(job, instance) /
          (
            sum(vm_free_disk_space_bytes) by(job, instance) +
            sum(vm_data_size_bytes) by(job, instance)
          ) > 0.8
        for: 30m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} (job={{ $labels.job }}) will run out of disk space soon"
          description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
            Having less than 20% of free disk space could cripple merge processes and overall performance.
            Consider limiting the ingestion rate, decreasing retention or scaling the disk space if possible."

      - alert: RequestErrorsToAPI
        expr: increase(vm_http_request_errors_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=35&var-instance={{ $labels.instance }}"
          summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
          description: "Requests to path {{ $labels.path }} are receiving errors.
            Please verify if clients are sending correct requests."

      - alert: RowsRejectedOnIngestion
        expr: rate(vm_rows_ignored_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=58&var-instance={{ $labels.instance }}"
          summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
          description: "VM is refusing to ingest rows on \"{{ $labels.instance }}\" for the
            following reason: \"{{ $labels.reason }}\""

      - alert: TooHighChurnRate
        expr: |
          (
            sum(rate(vm_new_timeseries_created_total[5m])) by(instance)
            /
            sum(rate(vm_rows_inserted_total[5m])) by(instance)
          ) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
          summary: "Churn rate is more than 10% on \"{{ $labels.instance }}\" for the last 15m"
          description: "VM constantly creates new time series on \"{{ $labels.instance }}\".\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighChurnRate24h
        expr: |
          sum(increase(vm_new_timeseries_created_total[24h])) by(instance)
          >
          (sum(vm_cache_entries{type="storage/hour_metric_ids"}) by(instance) * 3)
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
          summary: "Too many new series created on \"{{ $labels.instance }}\" over the last 24h"
          description: "The number of new time series created over the last 24h is more than 3x
            the current number of active series on \"{{ $labels.instance }}\".\n
            This effect is known as Churn Rate.\n
            High Churn Rate is tightly connected with database performance and may
            result in unexpected OOMs or slow queries."

      - alert: TooHighSlowInsertsRate
        expr: |
          (
            sum(rate(vm_slow_row_inserts_total[5m])) by(instance)
            /
            sum(rate(vm_rows_inserted_total[5m])) by(instance)
          ) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=68&var-instance={{ $labels.instance }}"
          summary: "Percentage of slow inserts is more than 5% on \"{{ $labels.instance }}\" for the last 15m"
          description: "High rate of slow inserts on \"{{ $labels.instance }}\" may be a sign of resource exhaustion
            for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series.
            See also https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183"

      - alert: LabelsLimitExceededOnIngestion
        expr: increase(vm_metrics_with_dropped_labels_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=74&var-instance={{ $labels.instance }}"
          summary: "Metrics ingested into {{ $labels.instance }} are exceeding the labels limit"
          description: "VictoriaMetrics limits the number of labels per metric with the `-maxLabelsPerTimeseries` command-line flag.\n
            This prevents ingestion of metrics with too many labels. Please verify that `-maxLabelsPerTimeseries` is configured
            correctly or that clients which send these metrics aren't misbehaving."