From 56622bff7338b999900e65b28185ed134d32147b Mon Sep 17 00:00:00 2001 From: Aliaksandr Valialkin Date: Thu, 30 Jun 2022 13:35:20 +0300 Subject: [PATCH] docs: add Troubleshooting doc This doc contains troubleshooting guides for typical problems with VictoriaMetrics. --- README.md | 22 +-- app/vmagent/README.md | 2 + docs/CHANGELOG.md | 11 +- docs/Cluster-VictoriaMetrics.md | 4 + docs/README.md | 22 +-- docs/Single-server-VictoriaMetrics.md | 22 +-- docs/Troubleshooting.md | 226 +++++++++++++++++++++++++ docs/guides/README.md | 2 +- docs/managed_victoriametrics/README.md | 2 +- docs/operator/README.md | 2 +- docs/vmagent.md | 2 + 11 files changed, 264 insertions(+), 53 deletions(-) create mode 100644 docs/Troubleshooting.md diff --git a/README.md b/README.md index 141f20b36..0c625eda5 100644 --- a/README.md +++ b/README.md @@ -1439,24 +1439,12 @@ Graphs on the dashboards contain useful hints - hover the `i` icon in the top le We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml) via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus. -The most interesting health metrics are the following: - -* `vm_cache_entries{type="storage/hour_metric_ids"}` - the number of time series with new data points during the last hour - aka [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). -* `increase(vm_new_timeseries_created_total[1h])` - time series [churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate) during the previous hour. -* `sum(vm_rows{type=~"storage/.*"})` - total number of `(timestamp, value)` data points in the database. -* `sum(rate(vm_rows_inserted_total[5m]))` - ingestion rate, i.e. how many samples are inserted in the database per second. -* `vm_free_disk_space_bytes` - free space left at `-storageDataPath`. -* `sum(vm_data_size_bytes)` - the total size of data on disk. -* `increase(vm_slow_row_inserts_total[5m])` - the number of slow inserts during the last 5 minutes. - If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling - of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). -* `increase(vm_slow_metric_name_loads_total[5m])` - the number of slow loads of metric names during the last 5 minutes. - If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling - of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). - VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page. +VictoriaMetrics exposes queries, which take the most time to execute, at `/api/v1/status/top_queries` page. + +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## TSDB stats VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page: @@ -1621,6 +1609,8 @@ See also more advanced [cardinality limiter in vmagent](https://docs.victoriamet * VictoriaMetrics ignores `NaN` values during data ingestion. +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). 
+ ## Cache removal VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details. diff --git a/app/vmagent/README.md b/app/vmagent/README.md index 6ce496659..76233e19f 100644 --- a/app/vmagent/README.md +++ b/app/vmagent/README.md @@ -582,6 +582,8 @@ It may be useful to perform `vmagent` rolling update without any scrape loss. regex: true ``` +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## Kafka integration [Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka: diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md index f2f95b631..cb2769d62 100644 --- a/docs/CHANGELOG.md +++ b/docs/CHANGELOG.md @@ -32,7 +32,7 @@ scrape_configs: - targets: ["host123:8080"] ``` -* FEATURE: [query tracing](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#query-tracing): show timestamps in query traces in human-readable format (aka `RFC3339` in UTC timezone) instead of milliseconds since Unix epoch. For example, `2022-06-27T10:32:54.506Z` instead of `1656325974506`. +* FEATURE: [query tracing](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#query-tracing): show timestamps in query traces in human-readable format (aka `RFC3339` in UTC timezone) instead of milliseconds since Unix epoch. For example, `2022-06-27T10:32:54.506Z` instead of `1656325974506`. This improves traces' readability. * FEATURE: improve performance of [/api/v1/series](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) requests, which return big number of time series. * FEATURE: expose additional histogram metrics at `http://victoriametrics:8428/metrics`, which may help understanding query workload: @@ -41,7 +41,14 @@ scrape_configs: * `vm_rows_read_per_series` - the number of raw samples read per queried series. * `vm_series_read_per_query` - the number of series read per query. -* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): allow using `__name__` label (aka [metric name](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)) in alerting annotations. For example `{{ $labels.__name__ }}: Too high connection number for "{{ $labels.instance }}`. +* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): allow using `__name__` label (aka [metric name](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)) in alerting annotations. For example: + +{% raw %} +```console +{{ $labels.__name__ }}: Too high connection number for "{{ $labels.instance }} +``` +{% endraw %} + * BUGFIX: limit max memory occupied by the cache, which stores parsed regular expressions. Previously too long regular expressions passed in [MetricsQL queries](https://docs.victoriametrics.com/MetricsQL.html) could result in big amounts of used memory (e.g. multiple of gigabytes). Now the max cache size for parsed regexps is limited to a a few megabytes. 
 * BUGFIX: [vmagent](https://docs.victoriametrics.com/vmagent.html): make sure that [stale markers](https://docs.victoriametrics.com/vmagent.html#prometheus-staleness-markers) are generated with the actual timestamp when unsuccessful scrape occurs. This should prevent from possible time series overlap on scrape target restart in dynmaic envirnoments such as Kubernetes.
 * BUGFIX: [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html): assume that the response is complete if `-search.denyPartialResponse` is enabled and up to `-replicationFactor - 1` `vmstorage` nodes are unavailable. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1767).
 
diff --git a/docs/Cluster-VictoriaMetrics.md b/docs/Cluster-VictoriaMetrics.md
index f50323ae2..d0f5f157d 100644
--- a/docs/Cluster-VictoriaMetrics.md
+++ b/docs/Cluster-VictoriaMetrics.md
@@ -191,6 +191,10 @@ or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/gr
 It is recommended setting up alerts in [vmalert](https://docs.victoriametrics.com/vmalert.html) or in Prometheus from [this config](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml).
 
+## Troubleshooting
+
+See [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
+
 ## Readonly mode
 
 `vmstorage` nodes automatically switch to readonly mode when the directory pointed by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space. `vminsert` nodes stop sending data to such nodes and start re-routing the data to the remaining `vmstorage` nodes.
 
diff --git a/docs/README.md b/docs/README.md
index 141f20b36..0c625eda5 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1439,24 +1439,12 @@ Graphs on the dashboards contain useful hints - hover the `i` icon in the top le
 We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml)
 via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus.
 
-The most interesting health metrics are the following:
-
-* `vm_cache_entries{type="storage/hour_metric_ids"}` - the number of time series with new data points during the last hour
-  aka [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
-* `increase(vm_new_timeseries_created_total[1h])` - time series [churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate) during the previous hour.
-* `sum(vm_rows{type=~"storage/.*"})` - total number of `(timestamp, value)` data points in the database.
-* `sum(rate(vm_rows_inserted_total[5m]))` - ingestion rate, i.e. how many samples are inserted in the database per second.
-* `vm_free_disk_space_bytes` - free space left at `-storageDataPath`.
-* `sum(vm_data_size_bytes)` - the total size of data on disk.
-* `increase(vm_slow_row_inserts_total[5m])` - the number of slow inserts during the last 5 minutes.
-  If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling
-  of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
-* `increase(vm_slow_metric_name_loads_total[5m])` - the number of slow loads of metric names during the last 5 minutes.
- If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling - of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). - VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page. +VictoriaMetrics exposes queries, which take the most time to execute, at `/api/v1/status/top_queries` page. + +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## TSDB stats VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page: @@ -1621,6 +1609,8 @@ See also more advanced [cardinality limiter in vmagent](https://docs.victoriamet * VictoriaMetrics ignores `NaN` values during data ingestion. +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## Cache removal VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details. diff --git a/docs/Single-server-VictoriaMetrics.md b/docs/Single-server-VictoriaMetrics.md index 637a68805..5f382a633 100644 --- a/docs/Single-server-VictoriaMetrics.md +++ b/docs/Single-server-VictoriaMetrics.md @@ -1443,24 +1443,12 @@ Graphs on the dashboards contain useful hints - hover the `i` icon in the top le We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml) via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus. -The most interesting health metrics are the following: - -* `vm_cache_entries{type="storage/hour_metric_ids"}` - the number of time series with new data points during the last hour - aka [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). -* `increase(vm_new_timeseries_created_total[1h])` - time series [churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate) during the previous hour. -* `sum(vm_rows{type=~"storage/.*"})` - total number of `(timestamp, value)` data points in the database. -* `sum(rate(vm_rows_inserted_total[5m]))` - ingestion rate, i.e. how many samples are inserted in the database per second. -* `vm_free_disk_space_bytes` - free space left at `-storageDataPath`. -* `sum(vm_data_size_bytes)` - the total size of data on disk. -* `increase(vm_slow_row_inserts_total[5m])` - the number of slow inserts during the last 5 minutes. - If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling - of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). -* `increase(vm_slow_metric_name_loads_total[5m])` - the number of slow loads of metric names during the last 5 minutes. 
- If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling - of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series). - VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page. +VictoriaMetrics exposes queries, which take the most time to execute, at `/api/v1/status/top_queries` page. + +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## TSDB stats VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page: @@ -1625,6 +1613,8 @@ See also more advanced [cardinality limiter in vmagent](https://docs.victoriamet * VictoriaMetrics ignores `NaN` values during data ingestion. +See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html). + ## Cache removal VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details. diff --git a/docs/Troubleshooting.md b/docs/Troubleshooting.md new file mode 100644 index 000000000..46e15a0fe --- /dev/null +++ b/docs/Troubleshooting.md @@ -0,0 +1,226 @@ +--- +sort: 23 +--- + +# Troubleshooting + +This document contains troubleshooting guides for most common issues when working with VictoriaMetrics: + +- [Unexpected query results](#unexpected-query-results) +- [Slow data ingestion](#slow-data-ingestion) +- [Slow queries](#slow-queries) +- [Out of memory errors](#out-of-memory-errors) + + +## Unexpected query results + +If you see unexpected or unreliable query results from VictoriaMetrics, then try the following steps: + +1. Check whether simplified queries return unexpected results. For example, if the query looks like + `sum(rate(http_requests_total[5m])) by (job)`, then check whether the following queries return + expected results: + + - Remove the outer `sum`: `rate(http_requests_total[5m])`. If this query returns too many time series, + then try adding more specific label filters to it. For example, if you see that the original query + returns unexpected results for the `job="foo"`, then use `rate(http_requests_total{job="foo"}[5m])` query. + If this isn't enough, then continue adding more specific label filters, so the resulting query returns + manageable number of time series. + + - Remove the outer `rate`: `http_requests_total`. Additional label filters may be added here in order + to reduce the number of returned series. + +2. 
If the simplest query still returns unexpected or unreliable results, then export raw samples
+   for this query via [/api/v1/export](https://docs.victoriametrics.com/#how-to-export-data-in-json-line-format)
+   on the given `[start..end]` time range and check whether they are expected:
+
+   ```console
+   curl http://victoriametrics:8428/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...'
+   ```
+
+   Note that responses returned from [/api/v1/query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries)
+   and from [/api/v1/query_range](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) contain **evaluated** data
+   instead of raw samples stored in VictoriaMetrics. See [these docs](https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness)
+   for details.
+
+3. Sometimes response caching may lead to unexpected results when samples with older timestamps
+   are ingested into VictoriaMetrics (aka [backfilling](https://docs.victoriametrics.com/#backfilling)).
+   Try disabling the response cache and see whether this helps. This can be done in the following ways:
+
+   - By passing the `-search.disableCache` command-line flag to a single-node VictoriaMetrics
+     or to all the `vmselect` components if the cluster version of VictoriaMetrics is used.
+
+   - By passing the `nocache=1` query arg to every request to `/api/v1/query` and `/api/v1/query_range`.
+     If you use Grafana, then this query arg can be specified in the `Custom Query Parameters` field
+     at the Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
+
+4. If you use the cluster version of VictoriaMetrics, then it may return partial responses by default
+   when some of the `vmstorage` nodes are temporarily unavailable - see [cluster availability docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-availability)
+   for details. If you want to prioritize query consistency over cluster availability,
+   then you can pass the `-search.denyPartialResponse` command-line flag to all the `vmselect` nodes.
+   In this case VictoriaMetrics returns an error during querying if at least a single `vmstorage` node is unavailable.
+   Another option is to pass the `deny_partial_response=1` query arg to `/api/v1/query` and `/api/v1/query_range`.
+   If you use Grafana, then this query arg can be specified in the `Custom Query Parameters` field
+   at the Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
+
+5. If you pass the `-replicationFactor` command-line flag to `vmselect`, then it is recommended to remove this flag,
+   since it may lead to incomplete responses when `vmstorage` nodes contain less than `-replicationFactor`
+   copies of the requested data.
+
+6. Try upgrading to the [latest available version of VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/releases)
+   and verifying whether the issue is fixed there.
+
+7. Try executing the query with the `trace=1` query arg. This enables query tracing, which may provide
+   useful information on why the query returns unexpected data. See [query tracing docs](https://docs.victoriametrics.com/#query-tracing) for details.
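+
+   For example, a request with tracing enabled may look like the following sketch (the address and the query here are placeholders - substitute your own):
+
+   ```console
+   curl http://victoriametrics:8428/api/v1/query -d 'query=sum(rate(http_requests_total[5m]))' -d 'trace=1'
+   ```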
+
+8. Inspect command-line flags passed to VictoriaMetrics components. If you don't clearly understand the purpose
+   or the effect of some flags, then remove them from the list of flags passed to VictoriaMetrics components,
+   because some command-line flags may change query results in unexpected ways when set to improper values.
+   VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
+
+9. If the steps above didn't help identify the root cause of unexpected query results,
+   then [file a bug report](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new) with details on how to reproduce the issue.
+
+
+## Slow data ingestion
+
+The most common reasons for slow data ingestion in VictoriaMetrics are the following:
+
+1. Memory shortage for the given amount of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
+
+   VictoriaMetrics (or `vmstorage` in the cluster version of VictoriaMetrics) maintains an in-memory cache
+   for quickly looking up internal series ids for each incoming metric.
+   This cache is named `storage/tsid`. VictoriaMetrics automatically determines the maximum size for this cache
+   depending on the available memory on the host where VictoriaMetrics (or `vmstorage`) runs. If the cache size isn't big enough
+   for holding all the entries for active time series, then VictoriaMetrics locates the needed data on disk,
+   unpacks it, reconstructs the missing entry and puts it into the cache. This takes additional CPU time and disk read IO.
+
+   The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
+   contain a `Slow inserts` graph, which shows the cache miss percentage for the `storage/tsid` cache
+   during data ingestion. If the `Slow inserts` graph shows values greater than 5% for more than 10 minutes,
+   then it is likely the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series)
+   doesn't fit the `storage/tsid` cache.
+
+   The following solutions exist for this issue:
+
+   - To increase the available memory on the host where VictoriaMetrics runs until the `Slow inserts` percentage
+     drops below 5%. If you run a VictoriaMetrics cluster, then you need to increase the total available
+     memory at `vmstorage` nodes. This can be done in two ways: either increase the available memory
+     per each existing `vmstorage` node or add more `vmstorage` nodes to the cluster.
+
+   - To reduce the number of active time series. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
+     contain a graph showing the number of active time series. Recent versions of VictoriaMetrics
+     provide [cardinality explorer](https://docs.victoriametrics.com/#cardinality-explorer),
+     which can help determine and fix the source of [high cardinality](https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality).
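+
+   If the dashboards aren't available, a rough approximation of the slow insert share can be queried directly.
+   This is only a sketch: the address is a placeholder, and the expression below (built from the `vm_slow_row_inserts_total`
+   and `vm_rows_inserted_total` counters exposed by VictoriaMetrics) may differ from the exact formula used on the dashboard:
+
+   ```console
+   curl http://victoriametrics:8428/api/v1/query -d 'query=rate(vm_slow_row_inserts_total[5m]) / sum(rate(vm_rows_inserted_total[5m]))'
+   ```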
+
+2. [High churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate),
+   e.g. when old time series are substituted with new time series at a high rate.
+   When VictoriaMetrics encounters a sample for a new time series, it needs to register the time series
+   in the internal index (aka `indexdb`), so it can be quickly located on subsequent select queries.
+   The process of registering new time series in the internal index is an order of magnitude slower
+   than the process of adding a new sample to an already registered time series.
+   So VictoriaMetrics may work slower than expected under [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate).
+
+   The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
+   provide a `Churn rate` graph, which shows the average number of new time series registered
+   during the last 24 hours. If this number exceeds the number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series),
+   then you need to identify and fix the source of [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate).
+   The most common source of high churn rate is a label that frequently changes its value. Try avoiding such labels.
+   The [cardinality explorer](https://docs.victoriametrics.com/#cardinality-explorer) can help identify
+   such labels.
+
+3. Resource shortage. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
+   contain `resource usage` graphs, which show memory usage, CPU usage, disk IO usage and free disk size.
+   Make sure VictoriaMetrics has enough free resources for graceful handling of potential spikes in workload
+   according to the following recommendations:
+
+   - 50% of free CPU
+   - 30% of free memory
+   - 20% of free disk space
+
+   If VictoriaMetrics components have lower amounts of free resources, then this may lead
+   to **significant** performance degradation during data ingestion.
+   For example:
+
+   - If the percentage of free CPU is close to 0, then VictoriaMetrics
+     may experience arbitrarily long delays during data ingestion when it cannot keep up
+     with the data ingestion rate.
+
+   - If the percentage of free memory reaches 0, then the operating system where VictoriaMetrics components run
+     may not have enough memory for the [page cache](https://en.wikipedia.org/wiki/Page_cache).
+     VictoriaMetrics relies on the page cache for quick queries over recently ingested data.
+     If the operating system doesn't have enough free memory for the page cache, then it needs
+     to re-read the requested data from disk. This may **significantly** increase disk read IO.
+
+   - If free disk space is lower than 20%, then VictoriaMetrics is unable to perform optimal
+     background merge of the incoming data. This leads to an increased number of data files on disk,
+     which, in turn, slows down both data ingestion and querying. See [these docs](https://docs.victoriametrics.com/#storage) for details.
+
+4. If you run the cluster version of VictoriaMetrics, then make sure the `vminsert` and `vmstorage` components
+   are located in the same network with low network latency between them.
+   `vminsert` packs incoming data into in-memory packets and sends them to `vmstorage` one-by-one.
+   It waits until `vmstorage` returns an `ack` response before sending the next packet.
+   If the network latency between `vminsert` and `vmstorage` is high (for example, if they run in different datacenters),
+   then this may become a limiting factor for data ingestion speed.
+
+   The [official Grafana dashboard for the cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring)
+   contains a `connection saturation` graph for `vminsert` components. If these graphs reach 100%,
+   then it is likely you have issues with network latency between `vminsert` and `vmstorage`.
+   Another possible reason for 100% connection saturation between `vminsert` and `vmstorage`
+   is resource shortage at the `vmstorage` nodes. In this case you need to increase the amount
+   of available resources (CPU, RAM, disk IO) at the `vmstorage` nodes or add more `vmstorage` nodes to the cluster.
+
+5. Noisy neighbor. Make sure VictoriaMetrics components run in environments without other resource-hungry apps.
+   Such apps may steal the RAM, CPU, disk IO and network bandwidth needed by the VictoriaMetrics components.
+
+## Slow queries
+
+Some queries may take more time and resources (CPU, RAM, network bandwidth) than others.
+VictoriaMetrics logs slow queries if their execution time exceeds the duration passed
+to the `-search.logSlowQueryDuration` command-line flag.
+VictoriaMetrics also provides the `/api/v1/status/top_queries` endpoint, which returns
+the queries that took the most time to execute.
+See [these docs](https://docs.victoriametrics.com/#prometheus-querying-api-enhancements) for details.
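+
+For example, the queries which took the most time to execute may be fetched directly from this endpoint
+(the address is a placeholder - adjust it to your setup):
+
+```console
+curl http://victoriametrics:8428/api/v1/status/top_queries
+```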
+
+The following solutions exist for slow queries:
+
+- Adding more CPU and memory to VictoriaMetrics, so it can execute slow queries faster.
+  If you use the cluster version of VictoriaMetrics, then migrating the `vmselect` nodes to machines
+  with more CPU and RAM should help improve the speed of slow queries.
+  Sometimes adding more `vmstorage` nodes can also help improve the speed of slow queries.
+
+- Rewriting slow queries, so they become faster. Unfortunately, it is hard to determine
+  whether a given query will be slow just by looking at it.
+  VictoriaMetrics provides [query tracing](https://docs.victoriametrics.com/#query-tracing) functionality,
+  which can help determine the source of a slow query.
+  See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
+  which explains how to determine and optimize slow queries.
+
+
+## Out of memory errors
+
+The most common sources of out of memory (aka OOM) crashes in VictoriaMetrics are the following:
+
+1. Improper command-line flag values. Inspect command-line flags passed to VictoriaMetrics components.
+   If you don't clearly understand the purpose or the effect of some flags, then remove them
+   from the list of flags passed to VictoriaMetrics components, because some command-line flags
+   may lead to increased memory usage and increased CPU usage. The increased memory usage increases the chances of OOM crashes.
+   VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
+
+   For example, it isn't recommended to tune cache sizes in VictoriaMetrics, since this frequently leads to OOM crashes.
+   [These docs](https://docs.victoriametrics.com/#cache-tuning) refer to command-line flags which aren't
+   recommended to tune. If you see that VictoriaMetrics needs bigger cache sizes for the current workload,
+   then it is better to migrate to a host with more memory instead of trying to tune cache sizes.
+
+2. Unexpected heavy queries. A query is considered heavy if it needs to select and process millions of unique time series.
+   Such a query may lead to OOM, since VictoriaMetrics needs to keep some per-series data in memory.
+   VictoriaMetrics provides various settings, which can help limit resource usage in this case -
+   see [these docs](https://docs.victoriametrics.com/#resource-usage-limits).
+   See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
+   which explains how to detect and optimize heavy queries.
+   VictoriaMetrics also provides a [query tracer](https://docs.victoriametrics.com/#query-tracing),
+   which may help identify the source of a heavy query.
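+
+   As an illustration only, such limits are applied via command-line flags at startup. The binary path,
+   flag names and values below are assumptions and may differ between versions - verify the exact flags
+   supported by your build with `-help` before relying on them:
+
+   ```console
+   /path/to/victoria-metrics-prod -search.maxUniqueTimeseries=300000 -search.maxQueryDuration=30s -search.maxConcurrentRequests=8
+   ```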
+
+3. Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory
+   under the current workload, then it is recommended to migrate to a host with more memory
+   in order to protect from possible OOM crashes on workload spikes. It is recommended to have at least 30%
+   of free memory for graceful handling of possible workload spikes.
diff --git a/docs/guides/README.md b/docs/guides/README.md
index 8dbbaad7b..bf23036e3 100644
--- a/docs/guides/README.md
+++ b/docs/guides/README.md
@@ -1,5 +1,5 @@
 ---
-sort: 22
+sort: 24
 ---
 
 # Guides
diff --git a/docs/managed_victoriametrics/README.md b/docs/managed_victoriametrics/README.md
index 7e51d1712..d4e00ac00 100644
--- a/docs/managed_victoriametrics/README.md
+++ b/docs/managed_victoriametrics/README.md
@@ -1,5 +1,5 @@
 ---
-sort: 22
+sort: 26
 ---
 
 # Managed VictoriaMetrics
diff --git a/docs/operator/README.md b/docs/operator/README.md
index 0dd677e5c..3fe24691f 100644
--- a/docs/operator/README.md
+++ b/docs/operator/README.md
@@ -1,5 +1,5 @@
 ---
-sort: 23
+sort: 25
 ---
 
 # VictoriaMetrics Operator
diff --git a/docs/vmagent.md b/docs/vmagent.md
index e3e97534c..d6f27d254 100644
--- a/docs/vmagent.md
+++ b/docs/vmagent.md
@@ -586,6 +586,8 @@ It may be useful to perform `vmagent` rolling update without any scrape loss.
       regex: true
 ```
 
+See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
+
 ## Kafka integration
 
 [Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka: