docs/Troubleshooting.md: various typo fixes and clarifications

Aliaksandr Valialkin 2022-06-30 14:56:35 +03:00
parent bcb1175162
commit ca263371a6

@@ -104,7 +104,7 @@ There are the following most commons reasons for slow data ingestion in Victoria
- To increase the available memory on the host where VictoriaMetrics runs until `slow inserts` percentage
will become lower than 5%. If you run VictoriaMetrics cluster, then you need increasing total available
-memory at `vmstorage` nodes. This can be done in two ways: either increasing the available memory
+memory at `vmstorage` nodes. This can be done in two ways: either to increase the available memory
per each existing `vmstorage` node or to add more `vmstorage` nodes to the cluster.
- To reduce the number of active time series. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/#monitoring)
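
The `slow inserts` percentage mentioned above can also be checked with an ad-hoc query. The following is only a sketch: the official dashboards already provide such a panel, and the `vm_slow_row_inserts_total` / `vm_rows_inserted_total` metric names are assumptions, which should be verified against the self-monitoring metrics exposed by your VictoriaMetrics version.

```metricsql
# Sketch: approximate share of slow inserts during the last 5 minutes.
# Values above 0.05 (5%) suggest memory shortage for the current number of active time series.
sum(rate(vm_slow_row_inserts_total[5m]))
  /
sum(rate(vm_rows_inserted_total[5m]))
```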
@@ -114,7 +114,7 @@ There are the following most commons reasons for slow data ingestion in Victoria
2. [High churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate),
e.g. when old time series are substituted with new time series at a high rate.
-When VitoriaMetrics encounters a sample for new time series, it needs to register the time series
+When VictoriaMetrics encounters a sample for new time series, it needs to register the time series
in the internal index (aka `indexdb`), so it can be quickly located on subsequent select queries.
The process of registering new time series in the internal index is an order of magnitude slower
than the process of adding new sample to already registered time series.
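
As a rough sketch, the rate of new time series registrations in `indexdb` can be inspected with a query like the one below. The `vm_new_timeseries_created_total` metric name is an assumption and should be checked against the metrics exposed by your installation.

```metricsql
# Sketch: average number of new time series registered per second during the last hour.
# Compare it to the number of active time series to estimate churn rate.
sum(rate(vm_new_timeseries_created_total[1h]))
```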
@@ -124,7 +124,7 @@ There are the following most commons reasons for slow data ingestion in Victoria
provides `Churn rate` graph, which shows the average number of new time series registered
during the last 24 hours. If this number exceeds the number of [active time series](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series),
then you need to identify and fix the source of [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate).
-The most commons source of high churn rate is a label, which frequently change its value. Try avoiding such labels.
+The most commons source of high churn rate is a label, which frequently changes its value. Try avoiding such labels.
The [cardinality explorer](https://docs.victoriametrics.com/#cardinality-explorer) can help identifying
such labels.
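
Once a high-churn label is identified, it can usually be dropped before the data is written to VictoriaMetrics. Below is a minimal sketch of a Prometheus-compatible relabeling rule; the `request_id` label name is purely hypothetical, and the rule belongs to metric-level relabeling (for example, `metric_relabel_configs` in a scrape config, or the file passed via the `-relabelConfig` command-line flag).

```yaml
# Sketch: drop a hypothetical label whose value changes frequently,
# so it no longer creates a new time series on every value change.
- action: labeldrop
  regex: "request_id"
```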
@@ -138,18 +138,19 @@ There are the following most commons reasons for slow data ingestion in Victoria
- 20% of free disk space
If VictoriaMetrics components have lower amounts of free resources, then this may lead
-to **significant** performance degradation during data ingestion.
+to **significant** performance degradation after workload increases slightly.
For example:
- If the percentage of free CPU is close to 0, then VictoriaMetrics
may experience arbitrary long delays during data ingestion when it cannot keep up
with the data ingestion rate.
-- If the percentage of free memory reaches 0, then the Operating System where VictoriaMetrics components run
+- If the percentage of free memory reaches 0, then the Operating System where VictoriaMetrics components run,
may have no enough memory for [page cache](https://en.wikipedia.org/wiki/Page_cache).
VictoriaMetrics relies on page cache for quick queries over recently ingested data.
If the operating system has no enough free memory for page cache, then it needs
-to re-read the requested data from disk. This may **significantly** increase disk read IO.
+to re-read the requested data from disk. This may **significantly** increase disk read IO
+and slow down both queries and data ingestion.
- If free disk space is lower than 20%, then VictoriaMetrics is unable to perform optimal
background merge of the incoming data. This leads to increased number of data files on disk,
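
As an illustration of the 20% thresholds above, the following alerting-style expressions could be used. They assume node_exporter metrics are collected from the hosts running VictoriaMetrics; the `mountpoint` value is a placeholder for the actual data directory.

```metricsql
# Sketch: free disk space at the data directory drops below 20%
# (the mountpoint label value is a placeholder).
node_filesystem_avail_bytes{mountpoint="/path/to/victoria-metrics-data"}
  / node_filesystem_size_bytes{mountpoint="/path/to/victoria-metrics-data"} < 0.2

# Sketch: available memory on the host drops below 20% of total memory.
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.2
```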
@@ -163,7 +164,7 @@ There are the following most commons reasons for slow data ingestion in Victoria
then this may become limiting factor for data ingestion speed.
The [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#monitoring)
-contain `connection saturation` graph for `vminsert` components. If these graphs reach 100%,
+contain `connection saturation` graph for `vminsert` components. If this graph reaches 100%,
then it is likely you have issues with network latency between `vminsert` and `vmstorage`.
Another possible issue for 100% connection saturation between `vminsert` and `vmstorage`
is resource shortage at `vmstorage` nodes. In this case you need to increase amounts
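
A quick way to rule network latency in or out is to measure round-trip time from a `vminsert` host to the `vmstorage` hosts directly. This is only a rough check; the hostname below is a placeholder.

```console
# Sketch: measure round-trip latency to a vmstorage node (placeholder hostname).
# Consistently high or jittery RTT points at network latency rather than
# resource shortage on the vmstorage node.
ping -c 10 vmstorage-1
```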
@@ -189,8 +190,8 @@ There are the following solutions exist for slow queries:
Sometimes adding more `vmstorage` nodes also can help improving the speed for slow queries.
- Rewriting slow queries, so they become faster. Unfortunately it is hard determining
-whether the given query will be slow by just looking at it.
-VictoriaMetrics provides [query tracing](https://docs.victoriametrics.com/#query-tracing) functionality,
+whether the given query is slow by just looking at it.
+VictoriaMetrics provides [query tracing](https://docs.victoriametrics.com/#query-tracing) feature,
which can help determine the source of slow query.
See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
which explains how to determine and optimize slow queries.
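
For example, a trace for a single query can be requested by passing the `trace=1` query arg to the querying API. The address and the query below are placeholders.

```console
# Sketch: request a query trace for one query (placeholder address and query).
curl -G 'http://<victoriametrics-addr>:8428/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' \
  --data-urlencode 'trace=1'
```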
@@ -233,16 +234,18 @@ for processing the current workload.
The most common sources of cluster instability are:
-- Workload spike. For example, if the number of active time series increases by 2x while
+- Workload spikes. For example, if the number of active time series increases by 2x while
the cluster has no enough free resources for processing the increased workload,
then it may become unstable.
+VictoriaMetrics provides various configuration settings, which can be used for limiting unexpected workload spikes.
+See [these docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits) for details.
- Various maintenance tasks such as rolling upgrades or rolling restarts during configuration changes.
For example, if a cluster contains `N=3` `vmstorage` nodes and they are restarted one-by-one (aka rolling restart),
then the cluster will have only `N-1=2` healthy `vmstorage` nodes during the rolling restart.
This means that the load on healthy `vmstorage` nodes increases by at least `100%/(N-1)=50%`
comparing to the load before rolling restart. E.g. they need to process 50% more incoming
-data and return 50% more data during queries. In reality the load on the remaining `vmstorage`
+data and to return 50% more data during queries. In reality the load on the remaining `vmstorage`
nodes increases even more because they need to register new time series, which were re-routed
from temporarily unavailable `vmstorage` node. If `vmstorage` nodes had less than 50%
of free resources (CPU, RAM, disk IO) before the rolling restart, then the rolling restart
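
The same arithmetic shows why bigger clusters tolerate rolling restarts better:

```
N = 3 vmstorage nodes, 1 node temporarily unavailable during the rolling restart:
    extra load per healthy node >= 100% / (N-1) = 100% / 2 = 50%

N = 5 vmstorage nodes, 1 node temporarily unavailable:
    extra load per healthy node >= 100% / (N-1) = 100% / 4 = 25%
```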
@@ -262,5 +265,4 @@ See [capacity planning docs](https://docs.victoriametrics.com/Cluster-VictoriaMe
and [cluster resizing and scalability docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-resizing-and-scalability)
for details.
-VictoriaMetrics provides various configuration settings, which can be used for limiting unexpected workload spikes.
-See [these docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits) for details.
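
As a sketch of what such configuration settings look like, a few command-line flags of this kind are shown below with arbitrary example values. The exact set of flags, their defaults and which component accepts which flag should be taken from the linked resource usage limits docs.

```console
# Sketch with arbitrary values; consult the linked docs for the authoritative list.
/path/to/vmselect -search.maxConcurrentRequests=8 -search.maxQueryDuration=30s
/path/to/vminsert -maxConcurrentInserts=16
```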