# This file contains the default list of alerts for the vmagent service.
# The alerts below are just recommendations and may require some updates
# and threshold calibration according to each specific setup.
groups:
  # Alerts group for vmagent assumes that Grafana dashboard
  # https://grafana.com/grafana/dashboards/12683/ is installed.
  # Please update the `dashboard` annotation according to your setup.
  - name: vmagent
    interval: 30s
    concurrency: 2
    rules:
      - alert: PersistentQueueIsDroppingData
        expr: sum(increase(vm_persistentqueue_bytes_dropped_total[5m])) without (path) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=49&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} is dropping data from persistent queue"
          description: "Vmagent dropped {{ $value | humanize1024 }} of data from the persistent queue
            on instance {{ $labels.instance }} during the last 10m."
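      # Data is usually dropped from the persistent queue when its on-disk size hits the
      # `-remoteWrite.maxDiskUsagePerURL` limit while the remote storage is unreachable or too slow.
      # A hedged sketch of raising that limit (the 10GiB value below is purely illustrative):
      #
      #   -remoteWrite.maxDiskUsagePerURL=10GiB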
      - alert: RejectedRemoteWriteDataBlocksAreDropped
        expr: sum(increase(vmagent_remotewrite_packets_dropped_total[5m])) without (url) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=79&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} drops data blocks rejected by
            the remote-write server. Check the logs to find the reason for the rejects."
      - alert: TooManyScrapeErrors
        expr: increase(vm_promscrape_scrapes_failed_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=31&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to scrape targets for the last 15m"
      - alert: TooManyWriteErrors
        expr: |
          (sum(increase(vm_ingestserver_request_errors_total[5m])) without (name,net,type)
          +
          sum(increase(vmagent_http_request_errors_total[5m])) without (path,protocol)) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=77&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} responds with errors to write requests for the last 15m."
      - alert: TooManyRemoteWriteErrors
        expr: rate(vmagent_remotewrite_retries_count_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=61&var-instance={{ $labels.instance }}"
          summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to push to remote storage"
          description: "Vmagent fails to push data via the remote-write protocol to destination \"{{ $labels.url }}\".\n
            Ensure that the destination is up and reachable."
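      # A quick reachability check from the vmagent host can help when this alert fires; a hedged
      # example, assuming the destination is a VictoriaMetrics instance exposing the standard
      # /health endpoint (host and port below are placeholders):
      #
      #   curl -v http://<remote-storage-host>:8428/health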
      - alert: RemoteWriteConnectionIsSaturated
        expr: |
          (
            rate(vmagent_remotewrite_send_duration_seconds_total[5m])
            /
            vmagent_remotewrite_queues
          ) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=84&var-instance={{ $labels.instance }}"
          summary: "Remote write connection from \"{{ $labels.job }}\" (instance {{ $labels.instance }}) to {{ $labels.url }} is saturated"
          description: "The remote-write connection between vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }}) and destination \"{{ $labels.url }}\"
            is saturated by more than 90% and vmagent won't be able to keep up.\n
            This usually means that the `-remoteWrite.queues` command-line flag must be increased in order to increase
            the number of connections to each remote storage."
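      # A minimal sketch of raising the connection count when this alert fires; the flag is the one
      # named above, but the value is illustrative and should be tuned per setup:
      #
      #   -remoteWrite.queues=16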
      - alert: PersistentQueueForWritesIsSaturated
        expr: rate(vm_persistentqueue_write_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=98&var-instance={{ $labels.instance }}"
          summary: "Persistent queue writes for instance {{ $labels.instance }} are saturated"
          description: "Persistent queue writes for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
            are saturated by more than 90% and vmagent won't be able to keep up with flushing data to disk.
            In this case, consider decreasing the load on vmagent or improving the disk throughput."
      - alert: PersistentQueueForReadsIsSaturated
        expr: rate(vm_persistentqueue_read_duration_seconds_total[5m]) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=99&var-instance={{ $labels.instance }}"
          summary: "Persistent queue reads for instance {{ $labels.instance }} are saturated"
          description: "Persistent queue reads for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
            are saturated by more than 90% and vmagent won't be able to keep up with reading data from disk.
            In this case, consider decreasing the load on vmagent or improving the disk throughput."
      - alert: SeriesLimitHourReached
        expr: (vmagent_hourly_series_limit_current_series / vmagent_hourly_series_limit_max_series) > 0.9
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=88&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} reached 90% of the hourly series limit"
          description: "The number of unique series seen during the last hour is close to the limit set via the -remoteWrite.maxHourlySeries flag.
            Once the limit is exceeded, samples for new time series will be dropped instead of being sent to the remote storage systems."
      - alert: SeriesLimitDayReached
        expr: (vmagent_daily_series_limit_current_series / vmagent_daily_series_limit_max_series) > 0.9
        labels:
          severity: critical
        annotations:
          dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=90&var-instance={{ $labels.instance }}"
          summary: "Instance {{ $labels.instance }} reached 90% of the daily series limit"
          description: "The number of unique series seen during the last day is close to the limit set via the -remoteWrite.maxDailySeries flag.
            Once the limit is exceeded, samples for new time series will be dropped instead of being sent to the remote storage systems."
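      # Both series limits are controlled by the vmagent command-line flags named above; a hedged
      # example with purely illustrative values (pick limits matching your expected series churn):
      #
      #   -remoteWrite.maxHourlySeries=500000 -remoteWrite.maxDailySeries=2000000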
      - alert: ConfigurationReloadFailure
        expr: |
          vm_promscrape_config_last_reload_successful != 1
          or
          vmagent_relabel_config_last_reload_successful != 1
        labels:
          severity: warning
        annotations:
          summary: "Configuration reload failed for vmagent instance {{ $labels.instance }}"
          description: "Configuration hot-reload failed for vmagent on instance {{ $labels.instance }}.
            Check vmagent's logs for the detailed error message."
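      # After fixing the configuration, a hot-reload can be re-triggered by sending SIGHUP to the
      # vmagent process; a sketch (the pgrep pattern is an assumption about your process name):
      #
      #   kill -HUP "$(pgrep -f vmagent)"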
      - alert: StreamAggrFlushTimeout
        expr: |
          increase(vm_streamaggr_flush_timeouts_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Streaming aggregation at \"{{ $labels.job }}\" (instance {{ $labels.instance }}) can't be finished within the configured aggregation interval."
          description: "The stream aggregation process can't keep up with the load and might produce incorrect aggregation results. Check the logs for more details.
            Possible solutions: increase the aggregation interval; aggregate a smaller number of series; reduce the samples' ingestion rate to stream aggregation."
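      # A minimal sketch of a stream aggregation rule with a wider interval, assuming aggregation is
      # configured via a file passed to `-remoteWrite.streamAggr.config` (metric name, interval and
      # output below are illustrative):
      #
      #   - match: 'http_requests_total'
      #     interval: 5m
      #     outputs: [total]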
      - alert: StreamAggrDedupFlushTimeout
        expr: |
          increase(vm_streamaggr_dedup_flush_timeouts_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Deduplication at \"{{ $labels.job }}\" (instance {{ $labels.instance }}) can't be finished within the configured deduplication interval."
          description: "The deduplication process can't keep up with the load and might produce incorrect results. Check the docs https://docs.victoriametrics.com/stream-aggregation/#deduplication and the logs for more details.
            Possible solutions: increase the deduplication interval; deduplicate a smaller number of series; reduce the samples' ingestion rate."
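      # The deduplication interval is typically set via the `-remoteWrite.streamAggr.dedupInterval`
      # command-line flag or the `dedup_interval` option of the aggregation config; the value below
      # only illustrates widening it (flag names may differ between versions, so check the docs):
      #
      #   -remoteWrite.streamAggr.dedupInterval=2m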