docs: add Troubleshooting doc

This doc contains troubleshooting guides for typical problems with VictoriaMetrics.
This commit is contained in:
Aliaksandr Valialkin 2022-06-30 13:35:20 +03:00
parent 00956e585d
commit 92e6c55b75
No known key found for this signature in database
GPG Key ID: A72BEC6CD3D0DED1
11 changed files with 262 additions and 37 deletions

View File

@ -187,6 +187,10 @@ or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/gr
It is recommended setting up alerts in [vmalert](https://docs.victoriametrics.com/vmalert.html) or in Prometheus from [this config](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml). It is recommended setting up alerts in [vmalert](https://docs.victoriametrics.com/vmalert.html) or in Prometheus from [this config](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml).
## Troubleshooting
See [trobuleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Readonly mode ## Readonly mode
`vmstorage` nodes automatically switch to readonly mode when the directory pointed by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space. `vminsert` nodes stop sending data to such nodes and start re-routing the data to the remaining `vmstorage` nodes. `vmstorage` nodes automatically switch to readonly mode when the directory pointed by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space. `vminsert` nodes stop sending data to such nodes and start re-routing the data to the remaining `vmstorage` nodes.

View File

@ -582,6 +582,8 @@ It may be useful to perform `vmagent` rolling update without any scrape loss.
regex: true regex: true
``` ```
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Kafka integration ## Kafka integration
[Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka: [Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka:

View File

@ -32,7 +32,7 @@ scrape_configs:
- targets: ["host123:8080"] - targets: ["host123:8080"]
``` ```
* FEATURE: [query tracing](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#query-tracing): show timestamps in query traces in human-readable format (aka `RFC3339` in UTC timezone) instead of milliseconds since Unix epoch. For example, `2022-06-27T10:32:54.506Z` instead of `1656325974506`. * FEATURE: [query tracing](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#query-tracing): show timestamps in query traces in human-readable format (aka `RFC3339` in UTC timezone) instead of milliseconds since Unix epoch. For example, `2022-06-27T10:32:54.506Z` instead of `1656325974506`. This improves traces' readability.
* FEATURE: improve performance of [/api/v1/series](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) requests, which return big number of time series. * FEATURE: improve performance of [/api/v1/series](https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers) requests, which return big number of time series.
* FEATURE: expose additional histogram metrics at `http://victoriametrics:8428/metrics`, which may help understanding query workload: * FEATURE: expose additional histogram metrics at `http://victoriametrics:8428/metrics`, which may help understanding query workload:
@ -41,7 +41,14 @@ scrape_configs:
* `vm_rows_read_per_series` - the number of raw samples read per queried series. * `vm_rows_read_per_series` - the number of raw samples read per queried series.
* `vm_series_read_per_query` - the number of series read per query. * `vm_series_read_per_query` - the number of series read per query.
* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): allow using `__name__` label (aka [metric name](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)) in alerting annotations. For example `{{ $labels.__name__ }}: Too high connection number for "{{ $labels.instance }}`. * BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): allow using `__name__` label (aka [metric name](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)) in alerting annotations. For example:
{% raw %}
```console
{{ $labels.__name__ }}: Too high connection number for "{{ $labels.instance }}
```
{% endraw %}
* BUGFIX: limit max memory occupied by the cache, which stores parsed regular expressions. Previously too long regular expressions passed in [MetricsQL queries](https://docs.victoriametrics.com/MetricsQL.html) could result in big amounts of used memory (e.g. multiple of gigabytes). Now the max cache size for parsed regexps is limited to a a few megabytes. * BUGFIX: limit max memory occupied by the cache, which stores parsed regular expressions. Previously too long regular expressions passed in [MetricsQL queries](https://docs.victoriametrics.com/MetricsQL.html) could result in big amounts of used memory (e.g. multiple of gigabytes). Now the max cache size for parsed regexps is limited to a a few megabytes.
* BUGFIX: [vmagent](https://docs.victoriametrics.com/vmagent.html): make sure that [stale markers](https://docs.victoriametrics.com/vmagent.html#prometheus-staleness-markers) are generated with the actual timestamp when unsuccessful scrape occurs. This should prevent from possible time series overlap on scrape target restart in dynmaic envirnoments such as Kubernetes. * BUGFIX: [vmagent](https://docs.victoriametrics.com/vmagent.html): make sure that [stale markers](https://docs.victoriametrics.com/vmagent.html#prometheus-staleness-markers) are generated with the actual timestamp when unsuccessful scrape occurs. This should prevent from possible time series overlap on scrape target restart in dynmaic envirnoments such as Kubernetes.
* BUGFIX: [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html): assume that the response is complete if `-search.denyPartialResponse` is enabled and up to `-replicationFactor - 1` `vmstorage` nodes are unavailable. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1767). * BUGFIX: [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html): assume that the response is complete if `-search.denyPartialResponse` is enabled and up to `-replicationFactor - 1` `vmstorage` nodes are unavailable. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1767).

View File

@ -191,6 +191,10 @@ or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/gr
It is recommended setting up alerts in [vmalert](https://docs.victoriametrics.com/vmalert.html) or in Prometheus from [this config](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml). It is recommended setting up alerts in [vmalert](https://docs.victoriametrics.com/vmalert.html) or in Prometheus from [this config](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/deployment/docker/alerts.yml).
## Troubleshooting
See [trobuleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Readonly mode ## Readonly mode
`vmstorage` nodes automatically switch to readonly mode when the directory pointed by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space. `vminsert` nodes stop sending data to such nodes and start re-routing the data to the remaining `vmstorage` nodes. `vmstorage` nodes automatically switch to readonly mode when the directory pointed by `-storageDataPath` contains less than `-storage.minFreeDiskSpaceBytes` of free space. `vminsert` nodes stop sending data to such nodes and start re-routing the data to the remaining `vmstorage` nodes.

View File

@ -1439,24 +1439,12 @@ Graphs on the dashboards contain useful hints - hover the `i` icon in the top le
We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml) We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml)
via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus. via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus.
The most interesting health metrics are the following:
* `vm_cache_entries{type="storage/hour_metric_ids"}` - the number of time series with new data points during the last hour
aka [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
* `increase(vm_new_timeseries_created_total[1h])` - time series [churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate) during the previous hour.
* `sum(vm_rows{type=~"storage/.*"})` - total number of `(timestamp, value)` data points in the database.
* `sum(rate(vm_rows_inserted_total[5m]))` - ingestion rate, i.e. how many samples are inserted in the database per second.
* `vm_free_disk_space_bytes` - free space left at `-storageDataPath`.
* `sum(vm_data_size_bytes)` - the total size of data on disk.
* `increase(vm_slow_row_inserts_total[5m])` - the number of slow inserts during the last 5 minutes.
If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling
of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
* `increase(vm_slow_metric_name_loads_total[5m])` - the number of slow loads of metric names during the last 5 minutes.
If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling
of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page. VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page.
VictoriaMetrics exposes queries, which take the most time to execute, at `/api/v1/status/top_queries` page.
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## TSDB stats ## TSDB stats
VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page: VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page:
@ -1621,6 +1609,8 @@ See also more advanced [cardinality limiter in vmagent](https://docs.victoriamet
* VictoriaMetrics ignores `NaN` values during data ingestion. * VictoriaMetrics ignores `NaN` values during data ingestion.
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Cache removal ## Cache removal
VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details. VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details.

View File

@ -1443,24 +1443,12 @@ Graphs on the dashboards contain useful hints - hover the `i` icon in the top le
We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml) We recommend setting up [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/alerts.yml)
via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus. via [vmalert](https://docs.victoriametrics.com/vmalert.html) or via Prometheus.
The most interesting health metrics are the following:
* `vm_cache_entries{type="storage/hour_metric_ids"}` - the number of time series with new data points during the last hour
aka [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
* `increase(vm_new_timeseries_created_total[1h])` - time series [churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate) during the previous hour.
* `sum(vm_rows{type=~"storage/.*"})` - total number of `(timestamp, value)` data points in the database.
* `sum(rate(vm_rows_inserted_total[5m]))` - ingestion rate, i.e. how many samples are inserted in the database per second.
* `vm_free_disk_space_bytes` - free space left at `-storageDataPath`.
* `sum(vm_data_size_bytes)` - the total size of data on disk.
* `increase(vm_slow_row_inserts_total[5m])` - the number of slow inserts during the last 5 minutes.
If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling
of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
* `increase(vm_slow_metric_name_loads_total[5m])` - the number of slow loads of metric names during the last 5 minutes.
If this number remains high during extended periods of time, then it is likely more RAM is needed for optimal handling
of the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page. VictoriaMetrics exposes currently running queries and their execution times at `/api/v1/status/active_queries` page.
VictoriaMetrics exposes queries, which take the most time to execute, at `/api/v1/status/top_queries` page.
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## TSDB stats ## TSDB stats
VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page: VictoriaMetrics returns TSDB stats at `/api/v1/status/tsdb` page in the way similar to Prometheus - see [these Prometheus docs](https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-stats). VictoriaMetrics accepts the following optional query args at `/api/v1/status/tsdb` page:
@ -1625,6 +1613,8 @@ See also more advanced [cardinality limiter in vmagent](https://docs.victoriamet
* VictoriaMetrics ignores `NaN` values during data ingestion. * VictoriaMetrics ignores `NaN` values during data ingestion.
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Cache removal ## Cache removal
VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details. VictoriaMetrics uses various internal caches. These caches are stored to `<-storageDataPath>/cache` directory during graceful shutdown (e.g. when VictoriaMetrics is stopped by sending `SIGINT` signal). The caches are read on the next VictoriaMetrics startup. Sometimes it is needed to remove such caches on the next startup. This can be performed by placing `reset_cache_on_startup` file inside the `<-storageDataPath>/cache` directory before the restart of VictoriaMetrics. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1447) for details.

226
docs/Troubleshooting.md Normal file
View File

@ -0,0 +1,226 @@
---
sort: 23
---
# Troubleshooting
This document contains troubleshooting guides for most common issues when working with VictoriaMetrics:
- [Unexpected query results](#unexpected-query-results)
- [Slow data ingestion](#slow-data-ingestion)
- [Slow queries](#slow-queries)
- [Out of memory errors](#out-of-memory-errors)
## Unexpected query results
If you see unexpected or unreliable query results from VictoriaMetrics, then try the following steps:
1. Check whether simplified queries return unexpected results. For example, if the query looks like
`sum(rate(http_requests_total[5m])) by (job)`, then check whether the following queries return
expected results:
- Remove the outer `sum`: `rate(http_requests_total[5m])`. If this query returns too many time series,
then try adding more specific label filters to it. For example, if you see that the original query
returns unexpected results for the `job="foo"`, then use `rate(http_requests_total{job="foo"}[5m])` query.
If this isn't enough, then continue adding more specific label filters, so the resulting query returns
manageable number of time series.
- Remove the outer `rate`: `http_requests_total`. Additional label filters may be added here in order
to reduce the number of returned series.
2. If the simplest query continues returning unexpected / unreliable results, then export raw samples
for this query via [/api/v1/export](https://docs.victoriametrics.com/#how-to-export-data-in-json-line-format)
on the given '[start..end]' time range and check whether they are expected:
```console
curl http://victoriametrics:8428/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...'
```
Note that responses returned from [/api/v1/query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries)
and from [/api/v1/query_range](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) contain **evaluated** data
instead of raw samples stored in VictoriaMetrics. See [these docs](https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness)
for details.
3. Sometimes response caching may lead to unexpected results when samples with older timestamps
are ingested into VictoriaMetrics (aka [backfilling](https://docs.victoriametrics.com/#backfilling)).
Try disabling response cache and see whether this helps. This can be done in the following ways:
- By passing `-search.disableCache` command-line flag to a single-node VictoriaMetrics
or to all the `vmselect` components if cluster version of VictoriaMetrics is used.
- By passing `nocache=1` query arg to every request to `/api/v1/query` and `/api/v1/query_range`.
If you use Grafana, then this query arg can be specified in `Custom Query Parameters` field
at Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
4. If you use cluster version of VictoriaMetrics, then it may return partial responses by default
when some of `vmstorage` nodes are temporarily unavailable - see [cluster availability docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-availability)
for details. If you want prioritizing query consistency over cluster availability,
then you can pass `-search.denyPartialResponse` command-line flag to all the `vmselect` nodes.
In this case VictoriaMetrics returns an error during querying if at least a single `vmstorage` node is unavailable.
Another option is to pass `deny_partial_response=1` query arg to `/api/v1/query` and `/api/v1/query_range`.
If you use Grafana, then this query arg can be specified in `Custom Query Parameters` field
at Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
5. If you pass `-replicationFactor` command-line flag to `vmselect`, then it is recommended removing this flag from `vmselect`,
since it may lead to incomplete responses when `vmstorage` nodes contain less than `-replicationFactor`
copies of the requested data.
6. Try upgrading to the [latest available version of VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/releases)
and verifying whether the issue is fixed there.
7. Try executing the query with `trace=1` query arg. This enables query tracing, which may contain
useful information on why the query returns unexpected data. See [query tracing docs](https://docs.victoriametrics.com/#query-tracing) for details.
8. Inspect command-line flags passed to VictoriaMetrics components. If you don't understand clearly the purpose
or the effect of some flags, then remove them from the list of flags passed to VictoriaMetrics components,
because some command-line flags may change query results in unexpected ways when set to improper values.
VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
9. If the steps above didn't help identifying the root cause of unexpected query results,
then [file a bugreport](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new) with details on how to reproduce the issue.
## Slow data ingestion
There are the following most commons reasons for slow data ingestion in VictoriaMetrics:
1. Memory shortage for the given amounts of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series).
VictoriaMetrics (or `vmstorage` in cluster version of VictoriaMetrics) maintains an in-memory cache
for quick search for internal series ids per each incoming metric.
This cache is named `storage/tsid`. VictoriaMetrics automatically determines the maximum size for this cache
depending on the available memory on the host where VictoriaMetrics (or `vmstorage`) runs. If the cache size isn't enough
for holding all the entries for active time series, then VictoriaMetrics locates the needed data on disk,
unpacks it, re-constructs the missing entry and puts it into the cache. This takes additional CPU time and disk read IO.
The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
contain `Slow inserts` graph, which shows the cache miss percentage for `storage/tsid` cache
during data ingestion. If `slow inserts` graph shows values greater than 5% for more than 10 minutes,
then it is likely the current number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series)
cannot fit the `storage/tsid` cache.
There are the following solutions exist for this issue:
- To increase the available memory on the host where VictoriaMetrics runs until `slow inserts` percentage
will become lower than 5%. If you run VictoriaMetrics cluster, then you need increasing total available
memory at `vmstorage` nodes. This can be done in two ways: either increasing the available memory
per each existing `vmstorage` node or to add more `vmstorage` nodes to the cluster.
- To reduce the number of active time series. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
contain a graph showing the number of active time series. Recent versions of VictoriaMetrics
provide [cardinality explorer](https://docs.victoriametrics.com/#cardinality-explorer),
which can help determining and fixing the source of [high cardinality](https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality).
2. [High churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate),
e.g. when old time series are substituted with new time series at a high rate.
When VitoriaMetrics encounters a sample for new time series, it needs to register the time series
in the internal index (aka `indexdb`), so it can be quickly located on subsequent select queries.
The process of registering new time series in the internal index is an order of magnitude slower
than the process of adding new sample to already registered time series.
So VictoriaMetrics may work slower than expected under [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate).
The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
provides `Churn rate` graph, which shows the average number of new time series registered
during the last 24 hours. If this number exceeds the number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series),
then you need to identify and fix the source of [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate).
The most commons source of high churn rate is a label, which frequently change its value. Try avoiding such labels.
The [cardinality explorer](https://docs.victoriametrics.com/#cardinality-explorer) can help identifying
such labels.
3. Resource shortage. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
contain `resource usage` graphs, which show memory usage, CPU usage, disk IO usage and free disk size.
Make sure VictoriaMetrics has enough free resources for graceful handling of potential spikes in workload
according to the following recommendations:
- 50% of free CPU
- 30% of free memory
- 20% of free disk space
If VictoriaMetrics components have lower amounts of free resources, then this may lead
to **significant** performance degradation during data ingestion.
For example:
- If the percentage of free CPU is close to 0, then VictoriaMetrics
may experience arbitrary long delays during data ingestion when it cannot keep up
with the data ingestion rate.
- If the percentage of free memory reaches 0, then the Operating System where VictoriaMetrics components run
may have no enough memory for [page cache](https://en.wikipedia.org/wiki/Page_cache).
VictoriaMetrics relies on page cache for quick queries over recently ingested data.
If the operating system has no enough free memory for page cache, then it needs
to re-read the requested data from disk. This may **significantly** increase disk read IO.
- If free disk space is lower than 20%, then VictoriaMetrics is unable to perform optimal
background merge of the incoming data. This leads to increased number of data files on disk,
which, in turn, slows down both data ingestion and querying. See [these docs](https://docs.victoriametrics.com/#storage) for details.
4. If you run cluster version of VictoriaMetrics, then make sure `vminsert` and `vmstorage` components
are located in the same network with short network latency between them.
`vminsert` packs incoming data into in-memory packets and sends them to `vmstorage` on-by-one.
It waits until `vmstorage` returns back `ack` response before sending the next packet.
If the network latency between `vminsert` and `vmstorage` is big (for example, if they run in different datacenters),
then this may become limiting factor for data ingestion speed.
The [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring)
contain `connection saturation` graph for `vminsert` components. If these graphs reach 100%,
then it is likely you have issues with network latency between `vminsert` and `vmstorage`.
Another possible issue for 100% connection saturation between `vminsert` and `vmstorage`
is resource shortage at `vmstorage` nodes. In this case you need to increase amounts
of available resources (CPU, RAM, disk IO) at `vmstorage` nodes or to add more `vmstorage` nodes to the cluster.
5. Noisy neighboor. Make sure VictoriaMetrics components run in envirnoments without other resource-hungry apps.
Such apps may steal RAM, CPU, disk IO and network bandwidth, which is needed for VictoriaMetrics components.
## Slow queries
Some queries may take more time and resources (CPU, RAM, network bandwidth) than others.
VictoriaMetrics logs slow queries if their execution time exceeds the duration passed
to `-search.logSlowQueryDuration` command-line flag.
VictoriaMetrics also provides `/api/v1/status/top_queries` endpoint, which returns
queries took the most time to execute.
See [these docs](https://docs.victoriametrics.com/#prometheus-querying-api-enhancements) for details.
There are the following solutions exist for slow queries:
- Adding more CPU and memory to VictoriaMetrics, so it may perform the slow query faster.
If you use cluster version of VictoriaMetrics, then migration of `vmselect` nodes to machines
with more CPU and RAM should help improving speed for slow queries.
Sometimes adding more `vmstorage` nodes also can help improving the speed for slow queries.
- Rewriting slow queries, so they become faster. Unfortunately it is hard determining
whether the given query will be slow by just looking at it.
VictoriaMetrics provides [query tracing](https://docs.victoriametrics.com/#query-tracing) functionality,
which can help determine the source of slow query.
See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
which explains how to determine and optimize slow queries.
## Out of memory errors
There are the following most common sources of out of memory (aka OOM) crashes in VictoriaMetrics:
1. Improper command-line flag values. Inspect command-line flags passed to VictoriaMetrics components.
If you don't understand clearly the purpose or the effect of some flags, then remove them
from the list of flags passed to VictoriaMetrics components, because some command-line flags
may lead to increased memory usage and increased CPU usage. The increased memory usage increases chances for OOM crashes.
VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
For example, it isn't recommended tuning cache sizes in VictoriaMetrics, since it frequently leads to OOM.
[These docs](https://docs.victoriametrics.com/#cache-tuning) refer command-line flags, which aren't
recommended to tune. If you see that VictoriaMetrics needs increasing some cache sizes for the current workload,
then it is better migrating to a host with more memory instead of trying to tune cache sizes.
2. Unexpected heavy queries. The query is considered heavy if it needs to select and process millions of unique time series.
Such query may lead to OOM, since VictoriaMetrics needs to keep some per-series data in memory.
VictoriaMetrics provides various settings, which can help limiting resource usage in this case -
see [these docs](https://docs.victoriametrics.com/#resource-usage-limits).
See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
which explains how to detect and optimize heavy queries.
VictoriaMetrics also provides [query tracer](https://docs.victoriametrics.com/#query-tracing),
which may help identifying the source of heavy query.
3. Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory
under the current workload, then it is recommended migrating to a host with bigger amounts of memory
in order to protect from possible OOM crashes on workload spikes. It is recommended to have at least 30%
of free memory for graceful handling of possible workload spikes.

View File

@ -1,5 +1,5 @@
--- ---
sort: 22 sort: 24
--- ---
# Guides # Guides

View File

@ -1,5 +1,5 @@
--- ---
sort: 22 sort: 26
--- ---
# Managed VictoriaMetrics # Managed VictoriaMetrics

View File

@ -1,5 +1,5 @@
--- ---
sort: 23 sort: 25
--- ---
# VictoriaMetrics Operator # VictoriaMetrics Operator

View File

@ -586,6 +586,8 @@ It may be useful to perform `vmagent` rolling update without any scrape loss.
regex: true regex: true
``` ```
See also [troubleshooting docs](https://docs.victoriametrics.com/Troubleshooting.html).
## Kafka integration ## Kafka integration
[Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka: [Enterprise version](https://victoriametrics.com/products/enterprise/) of `vmagent` can read and write metrics from / to Kafka: