vmselect: introduce search.skipSlowReplicas cmd-line flag (#4538)

* vmselect: introduce `search.skipSlowReplicas` cmd-line flag

vmselect applies two logical conditions during request processing when the
`-replicationFactor` cmd-line flag is set:
1. If more than `len(storageNodes) - replicationFactor` nodes have responded, it can skip
waiting for the rest of the nodes to respond. This could lead to the problems described
at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207.
2. Mark the response as partial unless more than `len(storageNodes) - replicationFactor` nodes responded
without an error.
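
The two conditions can be sketched roughly as follows. This is a minimal illustration, not the vmselect implementation: `queryState`, `canSkipSlowReplicas` and `isPartial` are hypothetical names, and the sketch assumes the strict bound (more than `len(storageNodes) - replicationFactor` successful responses), which matches the documented rule that a response is complete while fewer than `replicationFactor` nodes failed.

```go
package main

import "fmt"

// queryState is a hypothetical summary of one fan-out query to vmstorage nodes.
// It is an illustration only and not a type from the vmselect code base.
type queryState struct {
	storageNodes      int // total number of vmstorage nodes queried
	replicationFactor int // value of -replicationFactor passed to vmselect
	succeeded         int // nodes that responded without an error so far
	failed            int // nodes that failed to respond
}

// canSkipSlowReplicas models P1: when the optimization is enabled, vmselect may
// stop waiting once more than storageNodes-replicationFactor nodes have answered,
// because at least one copy of every sample should then be covered.
func canSkipSlowReplicas(s queryState, skipSlowReplicas bool) bool {
	return skipSlowReplicas &&
		s.replicationFactor > 1 &&
		s.succeeded > s.storageNodes-s.replicationFactor
}

// isPartial models P2: the response is treated as complete while fewer than
// replicationFactor nodes failed, and as partial otherwise.
func isPartial(s queryState) bool {
	rf := s.replicationFactor
	if rf < 1 {
		rf = 1 // without replication, any failed node makes the response partial
	}
	return s.failed >= rf
}

func main() {
	s := queryState{storageNodes: 3, replicationFactor: 2, succeeded: 2}
	fmt.Println("skip with flag enabled: ", canSkipSlowReplicas(s, true))
	fmt.Println("skip with flag disabled:", canSkipSlowReplicas(s, false))

	s = queryState{storageNodes: 3, replicationFactor: 2, succeeded: 2, failed: 1}
	fmt.Println("partial with one failed node: ", isPartial(s))
	s.succeeded, s.failed = 1, 2
	fmt.Println("partial with two failed nodes:", isPartial(s))
}
```

In this sketch only P1 depends on the feature flag, while P2 is evaluated regardless of it.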

P1 proved error-prone and became the main reason why using
`-replicationFactor` at the vmselect level wasn't recommended.
However, this optimization can still be very useful
when the cluster contains both slow and fast replicas.

P2, by contrast, remains viable and important unconditionally.
Hiding P1 behind the `search.skipSlowReplicas` feature flag
should make the `-replicationFactor` flag usable again and let users
choose whether they want P1 to apply.

Related issues
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* docs: update changelog

Signed-off-by: hagen1778 <roman@victoriametrics.com>

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Roman Khavronenko 2023-07-07 11:50:26 +02:00 committed by Aliaksandr Valialkin
parent 45cec4728c
commit fb03762d4d
2 changed files with 4 additions and 0 deletions

@@ -67,6 +67,7 @@ Released at 2023-06-30
* BUGFIX: [storage](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html): prevent a possible crashloop after migrating from versions below `v1.90.0` to newer versions. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4336) for details.
* BUGFIX: [vmui](https://docs.victoriametrics.com/#vmui): fix a memory leak issue associated with chart updates. See [this pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4455).
* BUGFIX: [vmbackupmanager](https://docs.victoriametrics.com/vmbackupmanager.html): fix removing storage data dir before restoring from backup.
* BUGFIX: vmselect: wait for all vmstorage nodes to respond when the `-replicationFactor` flag is set to a value greater than 1. Previously, vmselect could [skip waiting for the slowest replicas](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711) to respond, which could result in the issues described [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207). This optimization is now disabled by default and can be re-enabled by passing the `-search.skipSlowReplicas` cmd-line flag to vmselect. See more details [here](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4538).
## [v1.91.2](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.91.2)

@@ -659,6 +659,7 @@ It is available in the [helm-charts](https://github.com/VictoriaMetrics/helm-cha
By default, VictoriaMetrics offloads replication to the underlying storage pointed to by `-storageDataPath`, such as [Google compute persistent disk](https://cloud.google.com/compute/docs/disks#pdspecs), which guarantees data durability. VictoriaMetrics supports application-level replication if replicated durable persistent disks cannot be used for some reason.
The replication can be enabled by passing `-replicationFactor=N` command-line flag to `vminsert`. This instructs `vminsert` to store `N` copies for every ingested sample on `N` distinct `vmstorage` nodes. This guarantees that all the stored data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable.
Passing the `-replicationFactor=N` command-line flag to `vmselect` instructs it not to mark responses as `partial` if fewer than `replicationFactor` storage nodes failed to respond at query time.
The cluster must contain at least `2*N-1` `vmstorage` nodes, where `N` is the replication factor, in order to maintain the given replication factor for newly ingested data when `N-1` storage nodes are unavailable.
@@ -1207,6 +1208,8 @@ Below is the output for `/path/to/vmselect -help`:
Optional authKey for resetting rollup cache via /internal/resetRollupResultCache call
-search.setLookbackToStep
Whether to fix lookback interval to 'step' query arg value. If set to true, the query model becomes closer to InfluxDB data model. If set to true, then -search.maxLookback and -search.maxStalenessInterval are ignored
-search.skipSlowReplicas
Whether to skip waiting for all replicas to respond during search query. Enabling this setting may improve query speed by serving results from the fastest vmstorage replicas in the cluster. But could also lead to incomplete results if replicas contain data gaps. Consider enabling this setting only if all replicas contain identical data.
-search.treatDotsAsIsInRegexps
Whether to treat dots as is in regexp label filters used in queries. For example, foo{bar=~"a.b.c"} will be automatically converted to foo{bar=~"a\\.b\\.c"}, i.e. all the dots in regexp filters will be automatically escaped in order to match only dot char instead of matching any char. Dots in ".+", ".*" and ".{n}" regexps aren't escaped. This option is DEPRECATED in favor of {__graphite__="a.*.c"} syntax for selecting metrics matching the given Graphite metrics filter
-selectNode array