VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-23 20:37:12 +01:00

Author	SHA1	Message	Date
Hui Wang	4369bc1df2	deployment/dashboards: fix `Storage full ETA` panels (#5747) During background downsampling, rate(vm_deduplicated_samples_total{type="merge"}) could be much bigger than rate(vm_rows_added_to_storage_total) and it could last quite some time, which causes negative values of Storage full ETA and confuses users, see playground. Instead of trying to get more accurate results during downsampling, I think it's OK to ignore vm_deduplicated_samples_total entirely; it's more reasonable to see Storage full ETA increase after downsampling. --------- Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2024-02-08 09:43:39 +01:00
hagen1778	d0e4190969	deployment/alerts: add `job` label to `DiskRunsOutOfSpace` alerting rule So it is easier to understand to which installation the triggered instance belongs. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2024-01-16 09:49:39 +01:00
hagen1778	8fb68152e6	alerts: simplify aggregation of alerting rules This is a follow-up to `75196d7234`. It updates some of the alerting rules to remove unnecessary aggregations. It keeps aggregations for expressions which use multiple time series filters, to make sure their labels match. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-12-11 15:17:30 +01:00
hagen1778	2e4d0d0e41	alerts: move `ConcurrentFlushesHitTheLimit` alert to health alerts The `ConcurrentFlushesHitTheLimit` could be related to components like vminsert, vmstorage, vm-single-node and vmagent. Moving this alert to the `health` section of alerts will be beneficial for all components and will remove the duplicates from single/cluster alerts. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-08-03 10:46:26 +02:00
Aliaksandr Valialkin	91533531f5	docs/Troubleshooting.md: document an additional case, which could result in slow inserts If `-cacheExpireDuration` is lower than the interval between ingested samples for the same time series, then the `vm_slow_row_inserts_total` metric is increased. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183	2023-03-20 13:28:36 -07:00
Aliaksandr Valialkin	c63755c316	lib/writeconcurrencylimiter: improve the logic behind -maxConcurrentInserts limit Previously the -maxConcurrentInserts was limiting the number of established client connections, which write data to VictoriaMetrics. Some of these connections could be idle. Such connections do not consume big amounts of CPU and RAM, so there is little sense in limiting the number of such connections. So now the -maxConcurrentInserts command-line option limits the number of concurrently executed insert requests, not including idle connections. It is recommended to remove the -maxConcurrentInserts command-line option, since the default value for this option should work well for most cases.	2023-01-06 22:20:19 -08:00
Roman Khavronenko	b9dc11612e	alerts: remove `show_at` label for RequestErrorsToAPI alert (#3455) Alert `RequestErrorsToAPI` could be permanently triggered due to mistakes in client configuration. However, such requests are unlikely to cause VM health state change. So there is no need to display this alert because there will be no correlation caused by it. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-12-07 14:19:50 +03:00
Aliaksandr Valialkin	f3e84b4dea	{dashboards,alerts}: substitute `{type="indexdb"}` with `{type=~"indexdb.*"}` inside queries after `8189770c50` Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337	2022-12-05 16:00:22 -08:00
Roman Khavronenko	bdd0683c4a	dashboards: update VM single dash (#3400) The change list is the following: * bump Grafana version to 9.2.6; * replace old "Graph" panel with "TimeSeries" panel; * show % usage of Mem and CPU in addition to absolute values; * `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`; * add Annotations for Alert triggers. Not all alerts are supposed to be displayed on the dashboard, but only those with label `show_at: dashboard`. See `alerts.yml` change. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-11-29 19:28:22 +01:00
Zakhar Bessarab	6711eec109	docker-compose: move `TooManyLogs` into `vm-health` alerts set (#3199)	2022-10-05 19:23:36 +02:00
Roman Khavronenko	5714a68ac6	deployment/docker: move cluster compose env to master branch (#3130) * deployment/docker: move cluster compose env to master branch The change is supposed to simplify maintenance of the single/cluster docker-compose envs, alerts and dashboards. It is also supposed to reduce confusion for users when looking for cluster-related alerts/configs. Signed-off-by: hagen1778 <roman@victoriametrics.com> * deployment/docker: move cluster compose env to master branch Review updates. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-09-21 11:48:38 +03:00
Roman Khavronenko	27f1c65074	vmagent: expose metric `vmagent_remotewrite_queues` (#2871) The new metric `vmagent_remotewrite_queues` exports the static number of configured remote write queues. This metric is useful for calculating the total saturation for each configured URL with the given number of queues. See corresponding changes to vmagent alerts and dashboard. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-18 14:31:35 +03:00
Aliaksandr Valialkin	8a6fb5ef2b	deployment/docker/alerts.yml: backport `a42063909f`	2022-07-12 19:53:06 +03:00
Yurii Kravets	aeeaf877ac	Changed the level type in alerts.yml for TooManyLogs alert (#2760) alerts: filter out non error log messages for `TooManyLogs` Info and Warn error levels aren't always a result of malfunctioning or faulty state. So we filter them out.	2022-06-20 16:44:47 +02:00
Roman Khavronenko	7cd371f08f	alerts: lower the threshold for TooHighSlowInsertsRate (#2210) Lowering the threshold from 50% to 5% should be more effective for discovering an unhealthy system state. It also brings the alert in sync with the alert definition in the cluster branch. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-02-18 13:42:24 +02:00
Roman Khavronenko	e29b2b8444	Monitoring single (#2190) * dashboards: plot cpu limits for vmagent, vmalert and vm-single dashboards Signed-off-by: hagen1778 <roman@victoriametrics.com> * alerts: add `TooHighCPUUsage` alert for all VM components Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards: bump components version requirements Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-02-15 11:54:28 +02:00
Roman Khavronenko	bc79bdf68a	Dashboards vmagent updates (#1973) * dashboards/vmagent: shuffle panels for better visibility More important error/dropped panels were moved higher on the main row. Network usage panel moved to Resource usage row. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate New panels are supposed to show the top 5 jobs or targets which generate most of the churn rate. They were placed into a new row "Troubleshooting". Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add panels for showing persistent queue saturation New panels were added to the Troubleshooting row to show the persistent queue saturation. The corresponding alerts were added and linked to these panels as well. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped" The new alert is supposed to send a notification when vmagent starts to drop data blocks rejected by the configured remote write destination. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2021-12-20 12:16:53 +02:00
Thomas Danielsson	77e19b3f87	Fix vmsingle dashboard link (#1894)	2021-12-02 14:43:30 +02:00
Aliaksandr Valialkin	ec40affb59	deployment/docker/alerts.yml: formatting fixes after `865a60f13e`	2021-10-19 08:53:03 +03:00
Yurii Kravets	865a60f13e	Update alerts.yml Added Series Limit day\hour alerts	2021-10-18 18:14:49 +03:00
Roman Khavronenko	0f4bcc00b2	Single dashboards upd (#1593) * dashboard: replace `null` datasources null datasource value may confuse Grafana and make it drop panel query in some versions. * docker: bump grafana image version * dashboards: add URL variable selector to vmagent dashboard * dashboards: add new panel `Remote write connection saturation` to vmagent dashboard * alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard * dashboards: add "Logging rate" panel to vmagent dashboard	2021-09-01 11:46:22 +03:00
Roman Khavronenko	408ba43092	Alerts single update (#1510) * alerts: move `ProcessNearFDLimits` to `vm-health` group since it is relevant for all services * alerts: add new `TooHighMemoryUsage` alerting rule	2021-08-02 15:51:24 +03:00
Roman Khavronenko	2f54559c89	alerts: sync alert expression for `DiskRunsOutOfSpaceIn3Days` with dashboard (#1436)	2021-07-07 10:31:09 +03:00
Roman Khavronenko	5e9f3777bf	alerts: add new alert `LabelsLimitExceededOnIngestion` (#1359)	2021-06-09 12:15:36 +03:00
k1rk	668165f53d	rename serviceHealth group name to vm-health (#1360) this causes conflicts in `victoria-metrics-k8s-stack` chart =)	2021-06-08 23:34:38 +03:00
Roman Khavronenko	162681e60d	add new alerts (#1195) * alerts: backport `DiskRunsOutOfSpace` alert and some other tweaks from cluster branch * alerts: add `ServiceDown` alert to detect "dead" services	2021-04-08 18:24:25 +03:00
Roman Khavronenko	cfdb6762e6	deployment: add new alert `TooHighChurnRate24h` (#1154) Alert `TooHighChurnRate24h` is supposed to cover cases when the churn rate is low but still results in a number of new series several times higher than the total number of active series.	2021-03-29 12:38:03 +03:00
Roman Khavronenko	b457739f87	Single dashboard (#1126) * dashboard: update single node dashboard * add panel `Open FDs` for file descriptors metrics; * add panel `Disk writes/reads` to show the real read/write load on storage layer; * add `process_resident_memory_bytes` metric to memory usage panel; * add stats panel to show available CPUs, memory and disk space; * rm flags panel since it didn't prove its usefulness. * alerts: add alert for reaching FDs limit	2021-03-15 12:04:24 +02:00
Roman Khavronenko	14f0f90507	docker-compose: provide the example list of alerting rules for vm components (#1005) The list contains examples of alerting rules which might be executed via `vmalert` to track the health state of VM components. It is assumed that the list will be revised and calibrated for each system individually.	2021-01-11 13:03:15 +02:00
Artem Navoiev	4e391a5e39	[deployment] add vmalert + alertmanager to docker compose (#885)	2020-11-07 17:00:23 +02:00

30 Commits
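
The reasoning in commit `4369bc1df2` (why `Storage full ETA` could turn negative during downsampling) can be illustrated with a small arithmetic sketch. This is a simplified model, not the dashboard's actual query, which the log above does not show: the function name `storage_full_eta_seconds`, its parameters, and the sample numbers are all hypothetical and chosen only to make the sign flip visible.

```python
def storage_full_eta_seconds(free_bytes, bytes_per_sample,
                             rows_added_rate, dedup_rate=0.0):
    """Estimate seconds until the disk is full (hypothetical simplified model).

    rows_added_rate stands in for rate(vm_rows_added_to_storage_total),
    dedup_rate for rate(vm_deduplicated_samples_total{type="merge"}).
    """
    # Net growth in samples per second; negative when more samples are
    # deduplicated than added, as can happen during background downsampling.
    net_rate = rows_added_rate - dedup_rate
    return free_bytes / (net_rate * bytes_per_sample)


# Assumed numbers: 1,000,000 free bytes, 1 byte per sample.
# With deduplication subtracted, the estimate goes negative.
with_dedup = storage_full_eta_seconds(1_000_000, 1.0, 1000.0, 3000.0)

# Ignoring deduplication, as the commit proposes, keeps the estimate positive.
without_dedup = storage_full_eta_seconds(1_000_000, 1.0, 1000.0)

print(with_dedup, without_dedup)
```

Under these assumed inputs the first estimate is negative and the second is positive, which matches the commit's argument that dropping the deduplication term avoids confusing negative ETAs.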