VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-12-21 07:56:26 +01:00

Author	SHA1	Message	Date
Roman Khavronenko	f772ee8326	deployment/docker: move cluster compose env to master branch (#3130) * deployment/docker: move cluster compose env to master branch. The change is supposed to simplify maintenance of the single/cluster docker-compose envs, alerts and dashboards. It is also supposed to reduce confusion for users looking for cluster-related alerts/configs. Signed-off-by: hagen1778 <roman@victoriametrics.com> * deployment/docker: move cluster compose env to master branch. Review updates. Signed-off-by: hagen1778 <roman@victoriametrics.com> Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-09-21 12:03:10 +03:00
Roman Khavronenko	23e85e0fc5	vmagent: expose metric `vmagent_remotewrite_queues` (#2871) The new metric `vmagent_remotewrite_queues` exports a static value equal to the number of configured remote write queues. This metric is useful for calculating the total saturation of each configured URL given its number of queues. See the corresponding changes to vmagent alerts and dashboard. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-18 14:41:04 +03:00
Roman Khavronenko	a42063909f	alerts: correct expression for `DiskRunsOutOfSpaceIn3Days` (#2856) A negative ETA value can occur when deduplication is enabled and the `rate` over `vm_deduplicated_samples_total` becomes bigger than the actual ingestion rate. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-07-12 14:14:47 +02:00
Yurii Kravets	14397ba23e	Changed the level type in alerts.yml for TooManyLogs alert (#2759) alerts: filter out non-error log messages for `TooManyLogs`. Info and Warn log levels don't always indicate a malfunction or faulty state, so they are filtered out.	2022-06-20 16:45:52 +02:00
Roman Khavronenko	3458a3d593	Monitoring cluster (#2191) * dashboards: add `CPU percentage` panel for cluster dashboards. The new panel `CPU percentage` was added instead of adding a limit to the existing `CPU` panel, because the dashboard may display a large number of components, each with its own limits. The separate panel should give a clear view of CPU load. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards: sync vmagent and vmalert changes from single version Signed-off-by: hagen1778 <roman@victoriametrics.com> * docker: remove unsupported param from vmagent config Signed-off-by: hagen1778 <roman@victoriametrics.com> * alerts: add `TooHighCPUUsage` alert for all VM components Signed-off-by: hagen1778 <roman@victoriametrics.com>	2022-02-15 11:57:58 +02:00
Roman Khavronenko	ada18cd963	Dashboards vmagent updates (#1973) * dashboards/vmagent: shuffle panels for better visibility. More important error/dropped panels were moved higher on the main row. The network usage panel was moved to the Resource usage row. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate. New panels show the top 5 jobs or targets that contribute most to the churn rate. They were placed into a new row, "Troubleshooting". Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add panels for showing persistent queue saturation. New panels were added to the Troubleshooting row to show persistent queue saturation. The corresponding alerts were added and linked to these panels as well. Signed-off-by: hagen1778 <roman@victoriametrics.com> * dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped". The new alert is supposed to send a notification when vmagent starts to drop data blocks rejected by the configured remote write destination. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2021-12-20 12:19:17 +02:00
Aliaksandr Valialkin	e4ebcebc8a	deployment/docker/alerts.yml: formatting fixes after `865a60f13e`	2021-10-19 09:00:05 +03:00
Yurii Kravets	34f52de3a5	Update alerts.yml. Added Series Limit day/hour alerts	2021-10-19 09:00:05 +03:00
Roman Khavronenko	18313f3f8e	Cluster dashboard update (#1594) * dashboards: sync `vmagent` updates from master branch * dashboards: add new `Storage connection saturation` panel for cluster dashboard * dashboards: add new cluster alert for corresponding `Storage connection saturation` panel	2021-09-01 17:05:17 +03:00
Roman Khavronenko	af8c1feddb	Single dashboards upd (#1593) * dashboard: replace `null` datasources. A `null` datasource value may confuse Grafana and, in some versions, cause it to drop the panel query. * docker: bump grafana image version * dashboards: add URL variable selector to vmagent dashboard * dashboards: add new panel `Remote write connection saturation` to vmagent dashboard * alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard * dashboards: add "Logging rate" panel to vmagent dashboard	2021-09-01 12:24:55 +03:00
Max Golionko	738741ab0d	rename group for cluster (#1546) Rename the cluster alert group so that it does not overlap with the single-node group when vmsingle and vmcluster are deployed side by side.	2021-08-18 16:03:04 +03:00
Roman Khavronenko	d63842cdbe	Cluster alerts (#1513) * alerts: move `ProcessNearFDLimits` to `vm-health` group since it is relevant for all services * alerts: add new `TooHighMemoryUsage` alerting rule	2021-08-02 17:54:24 +03:00
Roman Khavronenko	ce3f087d46	alerts: sync alert expression for `DiskRunsOutOfSpaceIn3Days` with dashboard (#1435)	2021-07-07 00:47:08 +03:00
k1rk	c6c789db8f	rename serviceHealth group name to vm-health (#1360) The old name caused conflicts in the `victoria-metrics-k8s-stack` chart =)	2021-06-09 02:26:21 +03:00
Aliaksandr Valialkin	1c09e71f5b	app/vminsert: add `-disableRerouting` command-line flag for disabling re-routing if some vmstorage nodes have lower performance than the others. Refactor the rerouting mechanism and make it more resilient to cases when some vmstorage nodes are temporarily unavailable. Reduce the probability of rerouting storms. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/791 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1054 Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1165	2021-06-04 04:33:52 +03:00
Roman Khavronenko	c6fc3fa94d	alerts: make alerting rule `RPCErrors` compatible with PromQL (#1204) The original query couldn't be executed as PromQL, which resulted in an error when the expression was evaluated by Prometheus. The new expression is compatible with both engines (MetricsQL and PromQL).	2021-04-13 08:10:23 +03:00
Roman Khavronenko	c4f6b79d76	alerts: add `ServiceDown` alert to detect "dead" services (#1196)	2021-04-08 18:23:10 +03:00
Roman Khavronenko	51faea5e4b	deployment: add vmalert+alertmanager services and list of default alerts for cluster version (#1187)	2021-04-05 22:29:04 +03:00

18 Commits