docs/Cluster-VictoriaMetrics.md: add Profiling section

This commit is contained in:
Aliaksandr Valialkin 2020-12-16 00:59:44 +02:00
parent 34622f7f9b
commit 69b3ca37d0
2 changed files with 74 additions and 32 deletions

View File

@ -122,7 +122,7 @@ ROOT_IMAGE=scratch make package
## Operation ## Operation
### Cluster setup ## Cluster setup
A minimal cluster must contain the following nodes: A minimal cluster must contain the following nodes:
@ -141,7 +141,7 @@ Ports may be altered by setting `-httpListenAddr` on the corresponding nodes.
It is recommended setting up [monitoring](#monitoring) for the cluster. It is recommended setting up [monitoring](#monitoring) for the cluster.
#### Environment variables ### Environment variables
Each flag values can be set thru environment variables by following these rules: Each flag values can be set thru environment variables by following these rules:
@ -151,7 +151,7 @@ Each flag values can be set thru environment variables by following these rules:
- It is possible setting prefix for environment vars with `-envflag.prefix`. For instance, if `-envflag.prefix=VM_`, then env vars must be prepended with `VM_` - It is possible setting prefix for environment vars with `-envflag.prefix`. For instance, if `-envflag.prefix=VM_`, then env vars must be prepended with `VM_`
### Monitoring ## Monitoring
All the cluster components expose various metrics in Prometheus-compatible format at `/metrics` page on the TCP port set in `-httpListenAddr` command-line flag. All the cluster components expose various metrics in Prometheus-compatible format at `/metrics` page on the TCP port set in `-httpListenAddr` command-line flag.
By default the following TCP ports are used: By default the following TCP ports are used:
@ -165,7 +165,7 @@ with [the official Grafana dashboard for VictoriaMetrics cluster](https://grafan
or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/grafana/dashboards/11831). or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/grafana/dashboards/11831).
### URL format ## URL format
* URLs for data ingestion: `http://<vminsert>:8480/insert/<accountID>/<suffix>`, where: * URLs for data ingestion: `http://<vminsert>:8480/insert/<accountID>/<suffix>`, where:
- `<accountID>` is an arbitrary 32-bit integer identifying namespace for data ingestion (aka tenant). It is possible to set it as `accountID:projectID`, - `<accountID>` is an arbitrary 32-bit integer identifying namespace for data ingestion (aka tenant). It is possible to set it as `accountID:projectID`,
@ -231,7 +231,7 @@ or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/gr
across `vmstorage` nodes. across `vmstorage` nodes.
### Cluster resizing and scalability ## Cluster resizing and scalability
Cluster performance and capacity scales with adding new nodes. Cluster performance and capacity scales with adding new nodes.
@ -250,7 +250,7 @@ Steps to add `vmstorage` node:
3. Gradually restart all the `vminsert` nodes with new `-storageNode` arg containing `<new_vmstorage_host>:8400`. 3. Gradually restart all the `vminsert` nodes with new `-storageNode` arg containing `<new_vmstorage_host>:8400`.
### Updating / reconfiguring cluster nodes ## Updating / reconfiguring cluster nodes
All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via graceful shutdown. All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via graceful shutdown.
Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version
@ -260,7 +260,7 @@ Cluster should remain in working state if at least a single node of each type re
the update process. See [cluster availability](#cluster-availability) section for details. the update process. See [cluster availability](#cluster-availability) section for details.
### Cluster availability ## Cluster availability
* HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes. * HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
* The cluster remains available if at least a single `vmstorage` node exists: * The cluster remains available if at least a single `vmstorage` node exists:
@ -271,11 +271,11 @@ the update process. See [cluster availability](#cluster-availability) section fo
Data replication can be used for increasing storage durability. See [these docs](#replication-and-data-safety) for details. Data replication can be used for increasing storage durability. See [these docs](#replication-and-data-safety) for details.
### Capacity planning ## Capacity planning
Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the most suitable hardware. Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the most suitable hardware.
#### vminsert ### vminsert
* The recommended total number of vCPU cores for all the `vminsert` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`. * The recommended total number of vCPU cores for all the `vminsert` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`.
* The recommended number of vCPU cores per each `vminsert` instance should equal to the number of `vmstorage` instances in the cluster. * The recommended number of vCPU cores per each `vminsert` instance should equal to the number of `vmstorage` instances in the cluster.
@ -285,7 +285,7 @@ Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the mos
* Sometimes `-rpc.disableCompression` command-line flag on `vminsert` instances could increase ingestion capacity at the cost * Sometimes `-rpc.disableCompression` command-line flag on `vminsert` instances could increase ingestion capacity at the cost
of higher network bandwidth usage between `vminsert` and `vmstorage`. of higher network bandwidth usage between `vminsert` and `vmstorage`.
#### vmstorage ### vmstorage
* The recommended total number of vCPU cores for all the `vmstorage` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`. * The recommended total number of vCPU cores for all the `vmstorage` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`.
* The recommended total amount of RAM for all the `vmstorage` instances can be calculated from the number of active time series: `RAM = 2 * active_time_series * 1KB`. * The recommended total amount of RAM for all the `vmstorage` instances can be calculated from the number of active time series: `RAM = 2 * active_time_series * 1KB`.
@ -299,7 +299,7 @@ Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the mos
* The recommended total amount of storage space for all the `vmstorage` instances can be calculated * The recommended total amount of storage space for all the `vmstorage` instances can be calculated
from the ingestion rate and retention: `storage_space = ingestion_rate * retention_seconds`. from the ingestion rate and retention: `storage_space = ingestion_rate * retention_seconds`.
#### vmselect ### vmselect
The recommended hardware for `vmselect` instances highly depends on the type of queries. Lightweight queries over small number of time series usually require The recommended hardware for `vmselect` instances highly depends on the type of queries. Lightweight queries over small number of time series usually require
small number of vCPU cores and small amount of RAM on `vmselect`, while heavy queries over big number of time series (>10K) usually require small number of vCPU cores and small amount of RAM on `vmselect`, while heavy queries over big number of time series (>10K) usually require
@ -309,7 +309,7 @@ In general it is recommended increasing the number of vCPU cores and RAM per `vm
while adding new `vmselect` nodes only when old nodes are overloaded with incoming query stream. while adding new `vmselect` nodes only when old nodes are overloaded with incoming query stream.
### High availability ## High availability
It is recommended to run all the components for a single cluster in the same subnetwork with high bandwidth, low latency and low error rates. It is recommended to run all the components for a single cluster in the same subnetwork with high bandwidth, low latency and low error rates.
This improves cluster performance and availability. This improves cluster performance and availability.
@ -321,18 +321,18 @@ If you need multi-AZ setup, then it is recommended running independed clusters i
into all the cluster. Then [promxy](https://github.com/jacksontj/promxy) could be used for querying the data from multiple clusters. into all the cluster. Then [promxy](https://github.com/jacksontj/promxy) could be used for querying the data from multiple clusters.
### Helm ## Helm
Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes. Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes.
It is available in the [helm-charts](https://github.com/VictoriaMetrics/helm-charts) repository. It is available in the [helm-charts](https://github.com/VictoriaMetrics/helm-charts) repository.
### Kubernetes operator ## Kubernetes operator
[K8s operator](https://github.com/VictoriaMetrics/operator) simplifies managing VictoriaMetrics components in Kubernetes. [K8s operator](https://github.com/VictoriaMetrics/operator) simplifies managing VictoriaMetrics components in Kubernetes.
### Replication and data safety ## Replication and data safety
In order to enable application-level replication, `-replicationFactor=N` command-line flag must be passed to `vminsert`. In order to enable application-level replication, `-replicationFactor=N` command-line flag must be passed to `vminsert`.
This guarantees that all the data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable. This guarantees that all the data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable.
@ -355,7 +355,7 @@ HDD-based persistent disks should be enough for the majority of use cases.
It is recommended using durable replicated persistent volumes in Kubernetes. It is recommended using durable replicated persistent volumes in Kubernetes.
### Backups ## Backups
It is recommended performing periodical backups from [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282) It is recommended performing periodical backups from [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282)
for protecting from user errors such as accidental data deletion. for protecting from user errors such as accidental data deletion.
@ -376,6 +376,27 @@ Restoring from backup:
3. Start `vmstorage` node. 3. Start `vmstorage` node.
## Profiling
All the cluster components provide the following handlers for [profiling](https://blog.golang.org/profiling-go-programs):
* `http://vminsert:8480/debug/pprof/heap` for memory profile and `http://vminsert:8480/debug/pprof/profile` for CPU profile
* `http://vmselect:8481/debug/pprof/heap` for memory profile and `http://vmselect:8481/debug/pprof/profile` for CPU profile
* `http://vmstorage:8482/debug/pprof/heap` for memory profile and `http://vmstorage:8482/debug/pprof/profile` for CPU profile
Example command for collecting cpu profile from `vmstorage`:
```bash
curl -s http://<victoria-metrics-host>:8428/debug/pprof/profile > cpu.pprof
```
Example command for collecting memory profile from `vminsert`:
```bash
curl -s http://<victoria-metrics-host>:8428/debug/pprof/heap > mem.pprof
```
## Community and contributions ## Community and contributions
We are open to third-party pull requests provided they follow [KISS design principle](https://en.wikipedia.org/wiki/KISS_principle): We are open to third-party pull requests provided they follow [KISS design principle](https://en.wikipedia.org/wiki/KISS_principle):

View File

@ -122,7 +122,7 @@ ROOT_IMAGE=scratch make package
## Operation ## Operation
### Cluster setup ## Cluster setup
A minimal cluster must contain the following nodes: A minimal cluster must contain the following nodes:
@ -141,7 +141,7 @@ Ports may be altered by setting `-httpListenAddr` on the corresponding nodes.
It is recommended setting up [monitoring](#monitoring) for the cluster. It is recommended setting up [monitoring](#monitoring) for the cluster.
#### Environment variables ### Environment variables
Each flag values can be set thru environment variables by following these rules: Each flag values can be set thru environment variables by following these rules:
@ -151,7 +151,7 @@ Each flag values can be set thru environment variables by following these rules:
- It is possible setting prefix for environment vars with `-envflag.prefix`. For instance, if `-envflag.prefix=VM_`, then env vars must be prepended with `VM_` - It is possible setting prefix for environment vars with `-envflag.prefix`. For instance, if `-envflag.prefix=VM_`, then env vars must be prepended with `VM_`
### Monitoring ## Monitoring
All the cluster components expose various metrics in Prometheus-compatible format at `/metrics` page on the TCP port set in `-httpListenAddr` command-line flag. All the cluster components expose various metrics in Prometheus-compatible format at `/metrics` page on the TCP port set in `-httpListenAddr` command-line flag.
By default the following TCP ports are used: By default the following TCP ports are used:
@ -165,7 +165,7 @@ with [the official Grafana dashboard for VictoriaMetrics cluster](https://grafan
or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/grafana/dashboards/11831). or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/grafana/dashboards/11831).
### URL format ## URL format
* URLs for data ingestion: `http://<vminsert>:8480/insert/<accountID>/<suffix>`, where: * URLs for data ingestion: `http://<vminsert>:8480/insert/<accountID>/<suffix>`, where:
- `<accountID>` is an arbitrary 32-bit integer identifying namespace for data ingestion (aka tenant). It is possible to set it as `accountID:projectID`, - `<accountID>` is an arbitrary 32-bit integer identifying namespace for data ingestion (aka tenant). It is possible to set it as `accountID:projectID`,
@ -231,7 +231,7 @@ or [an alternative dashboard for VictoriaMetrics cluster](https://grafana.com/gr
across `vmstorage` nodes. across `vmstorage` nodes.
### Cluster resizing and scalability ## Cluster resizing and scalability
Cluster performance and capacity scales with adding new nodes. Cluster performance and capacity scales with adding new nodes.
@ -250,7 +250,7 @@ Steps to add `vmstorage` node:
3. Gradually restart all the `vminsert` nodes with new `-storageNode` arg containing `<new_vmstorage_host>:8400`. 3. Gradually restart all the `vminsert` nodes with new `-storageNode` arg containing `<new_vmstorage_host>:8400`.
### Updating / reconfiguring cluster nodes ## Updating / reconfiguring cluster nodes
All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via graceful shutdown. All the node types - `vminsert`, `vmselect` and `vmstorage` - may be updated via graceful shutdown.
Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version Send `SIGINT` signal to the corresponding process, wait until it finishes and then start new version
@ -260,7 +260,7 @@ Cluster should remain in working state if at least a single node of each type re
the update process. See [cluster availability](#cluster-availability) section for details. the update process. See [cluster availability](#cluster-availability) section for details.
### Cluster availability ## Cluster availability
* HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes. * HTTP load balancer must stop routing requests to unavailable `vminsert` and `vmselect` nodes.
* The cluster remains available if at least a single `vmstorage` node exists: * The cluster remains available if at least a single `vmstorage` node exists:
@ -271,11 +271,11 @@ the update process. See [cluster availability](#cluster-availability) section fo
Data replication can be used for increasing storage durability. See [these docs](#replication-and-data-safety) for details. Data replication can be used for increasing storage durability. See [these docs](#replication-and-data-safety) for details.
### Capacity planning ## Capacity planning
Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the most suitable hardware. Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the most suitable hardware.
#### vminsert ### vminsert
* The recommended total number of vCPU cores for all the `vminsert` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`. * The recommended total number of vCPU cores for all the `vminsert` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`.
* The recommended number of vCPU cores per each `vminsert` instance should equal to the number of `vmstorage` instances in the cluster. * The recommended number of vCPU cores per each `vminsert` instance should equal to the number of `vmstorage` instances in the cluster.
@ -285,7 +285,7 @@ Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the mos
* Sometimes `-rpc.disableCompression` command-line flag on `vminsert` instances could increase ingestion capacity at the cost * Sometimes `-rpc.disableCompression` command-line flag on `vminsert` instances could increase ingestion capacity at the cost
of higher network bandwidth usage between `vminsert` and `vmstorage`. of higher network bandwidth usage between `vminsert` and `vmstorage`.
#### vmstorage ### vmstorage
* The recommended total number of vCPU cores for all the `vmstorage` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`. * The recommended total number of vCPU cores for all the `vmstorage` instances can be calculated from the ingestion rate: `vCPUs = ingestion_rate / 150K`.
* The recommended total amount of RAM for all the `vmstorage` instances can be calculated from the number of active time series: `RAM = 2 * active_time_series * 1KB`. * The recommended total amount of RAM for all the `vmstorage` instances can be calculated from the number of active time series: `RAM = 2 * active_time_series * 1KB`.
@ -299,7 +299,7 @@ Each instance type - `vminsert`, `vmselect` and `vmstorage` - can run on the mos
* The recommended total amount of storage space for all the `vmstorage` instances can be calculated * The recommended total amount of storage space for all the `vmstorage` instances can be calculated
from the ingestion rate and retention: `storage_space = ingestion_rate * retention_seconds`. from the ingestion rate and retention: `storage_space = ingestion_rate * retention_seconds`.
#### vmselect ### vmselect
The recommended hardware for `vmselect` instances highly depends on the type of queries. Lightweight queries over small number of time series usually require The recommended hardware for `vmselect` instances highly depends on the type of queries. Lightweight queries over small number of time series usually require
small number of vCPU cores and small amount of RAM on `vmselect`, while heavy queries over big number of time series (>10K) usually require small number of vCPU cores and small amount of RAM on `vmselect`, while heavy queries over big number of time series (>10K) usually require
@ -309,7 +309,7 @@ In general it is recommended increasing the number of vCPU cores and RAM per `vm
while adding new `vmselect` nodes only when old nodes are overloaded with incoming query stream. while adding new `vmselect` nodes only when old nodes are overloaded with incoming query stream.
### High availability ## High availability
It is recommended to run all the components for a single cluster in the same subnetwork with high bandwidth, low latency and low error rates. It is recommended to run all the components for a single cluster in the same subnetwork with high bandwidth, low latency and low error rates.
This improves cluster performance and availability. This improves cluster performance and availability.
@ -321,18 +321,18 @@ If you need multi-AZ setup, then it is recommended running independed clusters i
into all the cluster. Then [promxy](https://github.com/jacksontj/promxy) could be used for querying the data from multiple clusters. into all the cluster. Then [promxy](https://github.com/jacksontj/promxy) could be used for querying the data from multiple clusters.
### Helm ## Helm
Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes. Helm chart simplifies managing cluster version of VictoriaMetrics in Kubernetes.
It is available in the [helm-charts](https://github.com/VictoriaMetrics/helm-charts) repository. It is available in the [helm-charts](https://github.com/VictoriaMetrics/helm-charts) repository.
### Kubernetes operator ## Kubernetes operator
[K8s operator](https://github.com/VictoriaMetrics/operator) simplifies managing VictoriaMetrics components in Kubernetes. [K8s operator](https://github.com/VictoriaMetrics/operator) simplifies managing VictoriaMetrics components in Kubernetes.
### Replication and data safety ## Replication and data safety
In order to enable application-level replication, `-replicationFactor=N` command-line flag must be passed to `vminsert`. In order to enable application-level replication, `-replicationFactor=N` command-line flag must be passed to `vminsert`.
This guarantees that all the data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable. This guarantees that all the data remains available for querying if up to `N-1` `vmstorage` nodes are unavailable.
@ -355,7 +355,7 @@ HDD-based persistent disks should be enough for the majority of use cases.
It is recommended using durable replicated persistent volumes in Kubernetes. It is recommended using durable replicated persistent volumes in Kubernetes.
### Backups ## Backups
It is recommended performing periodical backups from [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282) It is recommended performing periodical backups from [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282)
for protecting from user errors such as accidental data deletion. for protecting from user errors such as accidental data deletion.
@ -376,6 +376,27 @@ Restoring from backup:
3. Start `vmstorage` node. 3. Start `vmstorage` node.
## Profiling
All the cluster components provide the following handlers for [profiling](https://blog.golang.org/profiling-go-programs):
* `http://vminsert:8480/debug/pprof/heap` for memory profile and `http://vminsert:8480/debug/pprof/profile` for CPU profile
* `http://vmselect:8481/debug/pprof/heap` for memory profile and `http://vmselect:8481/debug/pprof/profile` for CPU profile
* `http://vmstorage:8482/debug/pprof/heap` for memory profile and `http://vmstorage:8482/debug/pprof/profile` for CPU profile
Example command for collecting cpu profile from `vmstorage`:
```bash
curl -s http://<victoria-metrics-host>:8428/debug/pprof/profile > cpu.pprof
```
Example command for collecting memory profile from `vminsert`:
```bash
curl -s http://<victoria-metrics-host>:8428/debug/pprof/heap > mem.pprof
```
## Community and contributions ## Community and contributions
We are open to third-party pull requests provided they follow [KISS design principle](https://en.wikipedia.org/wiki/KISS_principle): We are open to third-party pull requests provided they follow [KISS design principle](https://en.wikipedia.org/wiki/KISS_principle):