Commit Graph

21 Commits

Author SHA1 Message Date
Zakhar Bessarab
6711eec109
docker-compose: move TooManyLogs into vm-health alerts set (#3199) 2022-10-05 19:23:36 +02:00
Roman Khavronenko
5714a68ac6
deployment/docker: move cluster compose env to master branch (#3130)
* deployment/docker: move cluster compose env to master branch

The change supposed to simplify the process of maintaining for
single/cluster docker-compose envs, alerts, dashboards. It also
supposes to reduce confusion for users when looking for cluster
related alerts/configs.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* deployment/docker: move cluster compose env to master branch

Review updates.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-09-21 11:48:38 +03:00
Roman Khavronenko
27f1c65074
vmagent: expose metric vmagent_remotewrite_queues (#2871)
The new metric `vmagent_remotewrite_queues` exports a static value of
number of configured remote write queus. This metric is useful to
calculate total saturation per each configured URL with given number
of queues. See corresponding changes to vmagent alerts and dashboard.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-07-18 14:31:35 +03:00
Aliaksandr Valialkin
8a6fb5ef2b
deployment/docker/alerts.yml: backport a42063909f 2022-07-12 19:53:06 +03:00
Yurii Kravets
aeeaf877ac
Changed the level type in alerts.yml for TooManyLogs alert (#2760)
alerts: filter out non error log messages for `TooManyLogs`

Info and Warn error levels aren't always a result of malfunctioning
or faulty state. So we filter them out.
2022-06-20 16:44:47 +02:00
Roman Khavronenko
7cd371f08f
alerts: lower the threshold for TooHighSlowInsertsRate (#2210)
Lowering threshold from 50% to 5% will be more sufficient
for discovering un-healthy system state. It also goes in
sync with alert definition in cluster branch.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-02-18 13:42:24 +02:00
Roman Khavronenko
e29b2b8444
Monitoring single (#2190)
* dashboards: plot cpu limits for vmagent, vmalert and vm-single dashboards

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* alerts: add `TooHighCPUUsage` alert for all VM components

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards: bump components version requirements

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2022-02-15 11:54:28 +02:00
Roman Khavronenko
bc79bdf68a
Dashboards vmagent updates (#1973)
* dashboards/vmagent: shuffle panels for better visibility

More important error/dropped panels were moved higher on the main row.
Network usage panel moved to Resource usage row.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate

New panels are supposed to show top 5 jobs or targets which generate the most
of the churn rate. They were placed into a new row "Troubleshooting".

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add panels for showing persistent queue saturation

New panels were added to Torubleshooting row to show the persistent queue
saturation. The corresponding alerts were added and linked to these
panels as well.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped"

New alert suppose to send a notification when vmagent starts to drop
data blocks rejected by configured remote write destiantion.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2021-12-20 12:16:53 +02:00
Thomas Danielsson
77e19b3f87
Fix vmsingle dashboard link (#1894) 2021-12-02 14:43:30 +02:00
Aliaksandr Valialkin
ec40affb59
deployment/docker/alerts.yml: formatting fixes after 865a60f13e 2021-10-19 08:53:03 +03:00
Yurii Kravets
865a60f13e
Update alerts.yml
Added Series Limit day\hour alerts
2021-10-18 18:14:49 +03:00
Roman Khavronenko
0f4bcc00b2
Single dashboards upd (#1593)
* dasbhoard: replace `null` datasources

null datasource value may confuse Grafana and make it drop panel query in some
versions.

* docker: bump grafana image version

* dashboards: add URL variable selector to vmagent dashboard

* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard

* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard

* dashboards: add "Logging rate" panel to vmagent dashboard
2021-09-01 11:46:22 +03:00
Roman Khavronenko
408ba43092
Alerts single update (#1510)
* alerts: move `ProcessNearFDLimits` to `vm-health` group since it is relevant for all services

* alerts: add new `TooHighMemoryUsage` alerting rule
2021-08-02 15:51:24 +03:00
Roman Khavronenko
2f54559c89
alerts: sync alert expression for DiskRunsOutOfSpaceIn3Days with dashboard (#1436) 2021-07-07 10:31:09 +03:00
Roman Khavronenko
5e9f3777bf
alerts: add new alert LabelsLimitExceededOnIngestion (#1359) 2021-06-09 12:15:36 +03:00
k1rk
668165f53d
rename serviceHealth group name to vm-health (#1360)
this causes conflicts in `victoria-metrics-k8s-stack` chart =)
2021-06-08 23:34:38 +03:00
Roman Khavronenko
162681e60d
add new alerts (#1195)
* alerts: backport `DiskRunsOutOfSpace` alert and some other tweaks from cluster branch

* alerts: add `ServiceDown` alert to detect "dead" services
2021-04-08 18:24:25 +03:00
Roman Khavronenko
cfdb6762e6
deployment: add new alert TooHighChurnRate24h (#1154)
Alert `TooHighChurnRate24h` suppose to cover cases when churn rate
is low but results in multiple times higher number than total
number of active series.
2021-03-29 12:38:03 +03:00
Roman Khavronenko
b457739f87
Single dashboard (#1126)
* dashboard: update single node dashboard

* add panel `Open FDs` for file descriptors metrics;
* add panel `Disk writes/reads` to show the real read/write
load on storage layer;
* add `process_resident_memory_bytes` metric to memory usage panel;
* add stats panel to show available CPUs, memory and disk space;
* rm flags panel since it didn't prove its usefulness.

* alerts: add alert for reaching FDs limit
2021-03-15 12:04:24 +02:00
Roman Khavronenko
14f0f90507
docker-compose: provide the example list of alerting rules for vm components (#1005)
List contains examples for the alerting rules which might be executed
via `vmalert` to track the health state of VM components. It is assumed
that list will be revised and calibrated for each system individually.
2021-01-11 13:03:15 +02:00
Artem Navoiev
4e391a5e39
[deployment] add vmalert + alertmanager to docker compose (#885) 2020-11-07 17:00:23 +02:00