docs/Cluster-VictoriaMetrics.md: document the best strategies for cluster update / upgrade

Aliaksandr Valialkin 2022-08-21 23:18:36 +03:00
parent 1c7f402598
commit 7f46c17e4e
3 changed files with 83 additions and 11 deletions


@ -296,8 +296,47 @@ All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via
Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version
with new configs.
Cluster should remain in working state if at least a single node of each type remains available during
the update process. See [cluster availability](#cluster-availability) section for details.
The following cluster update / upgrade approaches exist:
* `No downtime` strategy. Gracefully restart every node in the cluster one-by-one with the updated config / upgraded binary.
It is recommended to restart the nodes in the following order:
1. Restart `vmstorage` nodes.
2. Restart `vminsert` nodes.
3. Restart `vmselect` nodes.
This strategy allows upgrading the cluster without downtime if the following conditions are met:
- The cluster has at least a pair of nodes of each type - `vminsert`, `vmselect` and `vmstorage`,
so it can continue to accept new data and serve incoming requests when a single node is temporarily unavailable
during its restart. See [cluster availability docs](#cluster-availability) for details.
- The cluster has enough compute resources (CPU, RAM, network bandwidth, disk IO) for processing
the current workload when a single node of any type (`vminsert`, `vmselect` or `vmstorage`)
is temporarily unavailable during its restart.
- The updated config / upgraded binary is compatible with the remaining components in the cluster.
See the [CHANGELOG](https://docs.victoriametrics.com/CHANGELOG.html) for compatibility notes between different releases.
If at least a single condition isn't met, then the rolling restart may result in cluster unavailability
during the config update / version upgrade. In this case the following strategy is recommended.
* `Minimum downtime` strategy:
1. Gracefully stop all the `vminsert` and `vmselect` nodes in parallel.
2. Gracefully restart all the `vmstorage` nodes in parallel.
3. Start all the `vminsert` and `vmselect` nodes in parallel.
The cluster is unavailable for data ingestion and querying when performing the steps above.
The downtime is minimized by restarting cluster nodes in parallel at every step above.
The `minimum downtime` strategy has the following benefits compared to the `no downtime` strategy:
- It allows performing config update / version upgrade with minimum disruption
when the previous config / version is incompatible with the new config / version.
- It allows performing config update / version upgrade with minimum disruption
when the cluster doesn't have enough compute resources (CPU, RAM, disk IO, network bandwidth)
for a rolling upgrade.
- It allows minimizing the duration of config update / version upgrade for clusters with a large number of nodes
or for clusters with big `vmstorage` nodes, which may take a long time to restart gracefully.
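
The choice between the two strategies above can be sketched as a small planning helper. This is only an illustrative sketch, not part of VictoriaMetrics: the `Cluster` type, the `plan_update` function and the two boolean inputs (`has_spare_capacity`, `versions_compatible`) are hypothetical names, and the caller is assumed to determine those booleans from capacity monitoring and from the CHANGELOG. The function only produces an ordered plan; performing each step (sending `SIGINT`, waiting for exit, starting the new version) is left to whatever tooling manages the nodes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A step is (action, nodes); all nodes within one step are handled in parallel.
Step = Tuple[str, List[str]]


@dataclass
class Cluster:
    """Hypothetical description of a cluster: node names grouped by type."""
    vmstorage: List[str]
    vminsert: List[str]
    vmselect: List[str]


def plan_update(cluster: Cluster, has_spare_capacity: bool,
                versions_compatible: bool) -> List[Step]:
    """Return the ordered update steps for the given cluster."""
    groups = (cluster.vmstorage, cluster.vminsert, cluster.vmselect)
    every_type_has_pair = all(len(nodes) >= 2 for nodes in groups)

    if every_type_has_pair and has_spare_capacity and versions_compatible:
        # `No downtime` strategy: restart nodes one-by-one,
        # vmstorage first, then vminsert, then vmselect.
        ordered = cluster.vmstorage + cluster.vminsert + cluster.vmselect
        return [("restart", [node]) for node in ordered]

    # `Minimum downtime` strategy: three phases, each run in parallel.
    frontends = cluster.vminsert + cluster.vmselect
    return [
        ("stop", frontends),
        ("restart", list(cluster.vmstorage)),
        ("start", frontends),
    ]
```

For a cluster with two nodes of each type, spare capacity and compatible versions, the plan contains six single-node `restart` steps; if any condition fails, it falls back to the three parallel phases.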
## Cluster availability


@ -300,8 +300,47 @@ All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via
Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version
with new configs.
Cluster should remain in working state if at least a single node of each type remains available during
the update process. See [cluster availability](#cluster-availability) section for details.
The following cluster update / upgrade approaches exist:
* `No downtime` strategy. Gracefully restart every node in the cluster one-by-one with the updated config / upgraded binary.
It is recommended to restart the nodes in the following order:
1. Restart `vmstorage` nodes.
2. Restart `vminsert` nodes.
3. Restart `vmselect` nodes.
This strategy allows upgrading the cluster without downtime if the following conditions are met:
- The cluster has at least a pair of nodes of each type - `vminsert`, `vmselect` and `vmstorage`,
so it can continue to accept new data and serve incoming requests when a single node is temporarily unavailable
during its restart. See [cluster availability docs](#cluster-availability) for details.
- The cluster has enough compute resources (CPU, RAM, network bandwidth, disk IO) for processing
the current workload when a single node of any type (`vminsert`, `vmselect` or `vmstorage`)
is temporarily unavailable during its restart.
- The updated config / upgraded binary is compatible with the remaining components in the cluster.
See the [CHANGELOG](https://docs.victoriametrics.com/CHANGELOG.html) for compatibility notes between different releases.
If at least a single condition isn't met, then the rolling restart may result in cluster unavailability
during the config update / version upgrade. In this case the following strategy is recommended.
* `Minimum downtime` strategy:
1. Gracefully stop all the `vminsert` and `vmselect` nodes in parallel.
2. Gracefully restart all the `vmstorage` nodes in parallel.
3. Start all the `vminsert` and `vmselect` nodes in parallel.
The cluster is unavailable for data ingestion and querying when performing the steps above.
The downtime is minimized by restarting cluster nodes in parallel at every step above.
The `minimum downtime` strategy has the following benefits compared to the `no downtime` strategy:
- It allows performing config update / version upgrade with minimum disruption
when the previous config / version is incompatible with the new config / version.
- It allows performing config update / version upgrade with minimum disruption
when the cluster doesn't have enough compute resources (CPU, RAM, disk IO, network bandwidth)
for a rolling upgrade.
- It allows minimizing the duration of config update / version upgrade for clusters with a large number of nodes
or for clusters with big `vmstorage` nodes, which may take a long time to restart gracefully.
## Cluster availability


@ -335,10 +335,4 @@ The query engine may behave differently for some functions. Please see [this art
Single-node VictoriaMetrics cannot be restarted / upgraded or downgraded without downtime, since it needs to be gracefully shut down and then started again. See [how to upgrade VictoriaMetrics](https://docs.victoriametrics.com/#how-to-upgrade-victoriametrics).
[Cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html) can be restarted / upgraded / downgraded without downtime if the following conditions are met:
* If every component of the cluster - `vminsert`, `vmselect` and `vmstorage` - has at least 2 instances.
* If the cluster has enough compute resources (CPU, RAM, disk IO, network bandwidth) to perform rolling restart of all its components.
* If the version used for upgrade / downgrade is compatible with the currently running version. The [CHANGELOG](https://docs.victoriametrics.com/CHANGELOG.html) contains compatibility notes for the published releases.
See [updating / reconfiguring cluster nodes](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#updating--reconfiguring-cluster-nodes) for details on cluster upgrade.
[Cluster version of VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html) can be restarted / upgraded / downgraded without downtime according to [these instructions](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#updating--reconfiguring-cluster-nodes).