Commit Graph

18 Commits

Author SHA1 Message Date
Roman Khavronenko
f772ee8326
deployment/docker: move cluster compose env to master branch (#3130)
* deployment/docker: move cluster compose env to master branch

The change is supposed to simplify maintenance of the
single/cluster docker-compose envs, alerts, and dashboards. It is also
supposed to reduce confusion for users looking for cluster-related
alerts/configs.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* deployment/docker: move cluster compose env to master branch

Review updates.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-09-21 12:03:10 +03:00
Roman Khavronenko
23e85e0fc5
vmagent: expose metric vmagent_remotewrite_queues (#2871)
The new metric `vmagent_remotewrite_queues` exports a static value: the
number of configured remote write queues. This metric is useful for
calculating total saturation for each configured URL given its number
of queues. See the corresponding changes to vmagent alerts and dashboard.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-07-18 14:41:04 +03:00
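
The saturation calculation mentioned in the `vmagent_remotewrite_queues` commit above can be illustrated with a small arithmetic sketch. This is not the alert or dashboard expression from the commit. The function name and the `busy_seconds_per_second` input are hypothetical, and the sketch assumes saturation is the time spent sending per second divided by the number of queues for one URL.

```python
def queue_saturation(busy_seconds_per_second: float, queues: int) -> float:
    """Fraction of sending capacity in use for one remote write URL.

    busy_seconds_per_second: hypothetical input, the combined time per
        wall-clock second that all queues for this URL spend sending data.
    queues: the static value reported by vmagent_remotewrite_queues.
    """
    if queues <= 0:
        raise ValueError("queues must be positive")
    # Each queue can be busy for at most 1 second per second,
    # so the total capacity equals the number of queues.
    return busy_seconds_per_second / queues


# 4 queues that together spend 3.0 seconds sending per second
print(queue_saturation(3.0, 4))  # 0.75
```

A value close to 1.0 would mean that every queue for that URL is busy almost all the time.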
Roman Khavronenko
a42063909f
alerts: correct expression for DiskRunsOutOfSpaceIn3Days (#2856)
The negative value for ETA can happen when deduplication is enabled
and `rate` over `vm_deduplicated_samples_total` becomes bigger
than actual ingestion rate.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-07-12 14:14:47 +02:00
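
The negative ETA described in the `DiskRunsOutOfSpaceIn3Days` commit above can be reproduced with a minimal sketch. It assumes the ETA is free disk space divided by net growth, where net growth is the ingestion rate minus the deduplication rate. All function and parameter names are hypothetical, and the guard shown is one possible approach, not necessarily the correction made in the commit.

```python
from typing import Optional


def naive_eta_seconds(free_bytes: float, ingest_rows_per_sec: float,
                      dedup_rows_per_sec: float, bytes_per_row: float) -> float:
    """ETA until the disk is full, without any guard (assumed formula)."""
    net_rows_per_sec = ingest_rows_per_sec - dedup_rows_per_sec
    return free_bytes / (net_rows_per_sec * bytes_per_row)


def guarded_eta_seconds(free_bytes: float, ingest_rows_per_sec: float,
                        dedup_rows_per_sec: float,
                        bytes_per_row: float) -> Optional[float]:
    """Same formula, but returns None when the net growth is not positive."""
    net_rows_per_sec = ingest_rows_per_sec - dedup_rows_per_sec
    if net_rows_per_sec <= 0:
        return None  # the disk is not filling up, so there is no ETA
    return free_bytes / (net_rows_per_sec * bytes_per_row)


# Deduplication rate (150) exceeds ingestion rate (100): the naive ETA is negative.
print(naive_eta_seconds(1000, 100, 150, 1.0))    # -20.0
print(guarded_eta_seconds(1000, 100, 150, 1.0))  # None
print(guarded_eta_seconds(1000, 100, 50, 1.0))   # 20.0
```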
Yurii Kravets
14397ba23e
Changed the level type in alerts.yml for TooManyLogs alert (#2759)
alerts: filter out non error log messages for `TooManyLogs`

Info and Warn log levels aren't always the result of a malfunctioning
or faulty state, so we filter them out.
2022-06-20 16:45:52 +02:00
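
The filtering idea behind the `TooManyLogs` commit above can be sketched as follows. This is an illustration only, not the alert expression. The function name, the `(level, text)` message shape, and the level strings are assumptions, and the sketch counts only messages whose level is `error`.

```python
from collections import Counter
from typing import Iterable, Tuple


def too_many_logs(messages: Iterable[Tuple[str, str]], threshold: int) -> bool:
    """Return True when the number of error-level messages exceeds threshold.

    messages: hypothetical (level, text) pairs; info and warn entries are
        ignored because they do not necessarily indicate a faulty state.
    """
    counts = Counter(level for level, _ in messages)
    return counts["error"] > threshold


logs = [("info", "started"), ("warn", "slow request"), ("error", "write failed")]
print(too_many_logs(logs, 0))  # True: one error-level message
print(too_many_logs(logs, 1))  # False: info and warn are not counted
```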
Roman Khavronenko
3458a3d593
Monitoring cluster (#2191)
* dashboards: add `CPU percentage` panel for cluster dashboards

The new panel `CPU percentage` was added instead of adding a limit
to the existing `CPU` panel because the dashboard may display a large
number of components, each with its own limits. The separate panel should
provide a clear display of CPU load.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards: sync vmagent and vmalert changes from single version

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* docker: remove unsupported param from vmagent config

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* alerts: add `TooHighCPUUsage` alert for all VM components

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-02-15 11:57:58 +02:00
Roman Khavronenko
ada18cd963
Dashboards vmagent updates (#1973)
* dashboards/vmagent: shuffle panels for better visibility

More important error/dropped panels were moved higher on the main row.
Network usage panel moved to Resource usage row.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate

New panels are supposed to show the top 5 jobs or targets that generate most
of the churn rate. They were placed into a new row "Troubleshooting".

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add panels for showing persistent queue saturation

New panels were added to the Troubleshooting row to show the persistent queue
saturation. The corresponding alerts were added and linked to these
panels as well.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped"

The new alert is supposed to send a notification when vmagent starts to drop
data blocks rejected by the configured remote write destination.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2021-12-20 12:19:17 +02:00
Aliaksandr Valialkin
e4ebcebc8a
deployment/docker/alerts.yml: formatting fixes after 865a60f13e
2021-10-19 09:00:05 +03:00
Yurii Kravets
34f52de3a5
Update alerts.yml
Added Series Limit day/hour alerts
2021-10-19 09:00:05 +03:00
Roman Khavronenko
18313f3f8e
Cluster dashboard update (#1594)
* dashboards: sync `vmagent` updates from master branch

* dashboards: add new `Storage connection saturation` panel for cluster dashboard

* dashboards: add new cluster alert for corresponding `Storage connection saturation` panel
2021-09-01 17:05:17 +03:00
Roman Khavronenko
af8c1feddb
Single dashboards upd (#1593)
* dashboard: replace `null` datasources

A null datasource value may confuse Grafana and make it drop the panel query in some
versions.

* docker: bump grafana image version

* dashboards: add URL variable selector to vmagent dashboard

* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard

* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard

* dashboards: add "Logging rate" panel to vmagent dashboard
2021-09-01 12:24:55 +03:00
Max Golionko
738741ab0d
rename group for cluster (#1546)
rename group for cluster so that the groups do not overlap when vmsingle and vmcluster are deployed alongside each other
2021-08-18 16:03:04 +03:00
Roman Khavronenko
d63842cdbe
Cluster alerts (#1513)
* alerts: move `ProcessNearFDLimits` to `vm-health` group since it is relevant for all services

* alerts: add new `TooHighMemoryUsage` alerting rule
2021-08-02 17:54:24 +03:00
Roman Khavronenko
ce3f087d46
alerts: sync alert expression for DiskRunsOutOfSpaceIn3Days with dashboard (#1435)
2021-07-07 00:47:08 +03:00
k1rk
c6c789db8f
rename serviceHealth group name to vm-health (#1360)
this causes conflicts in `victoria-metrics-k8s-stack` chart =)
2021-06-09 02:26:21 +03:00
Aliaksandr Valialkin
1c09e71f5b
app/vminsert: add -disableRerouting command-line flag for disabling re-routing if some vmstorage nodes have lower performance than the others
Refactor the rerouting mechanism and make it more resilient to cases when some of vmstorage nodes are temporarily unavailable.

Reduce the probability of rerouting storm.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1054
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1165
2021-06-04 04:33:52 +03:00
Roman Khavronenko
c6fc3fa94d
alerts: make alerting rule RPCErrors compatible with PromQL (#1204)
The original query can't be executed via PromQL, which results in an error
if the expression is evaluated by Prometheus. The new expression is
compatible with both engines.
2021-04-13 08:10:23 +03:00
Roman Khavronenko
c4f6b79d76
alerts: add ServiceDown alert to detect "dead" services (#1196)
2021-04-08 18:23:10 +03:00
Roman Khavronenko
51faea5e4b
deployment: add vmalert+alertmanager services and list of default alerts for cluster version (#1187)
2021-04-05 22:29:04 +03:00