diff --git a/deployment/docker/vmanomaly/vmanomaly-integration/docker-compose.yml b/deployment/docker/vmanomaly/vmanomaly-integration/docker-compose.yml index 84d09b4ae4..b90047a836 100644 --- a/deployment/docker/vmanomaly/vmanomaly-integration/docker-compose.yml +++ b/deployment/docker/vmanomaly/vmanomaly-integration/docker-compose.yml @@ -73,7 +73,7 @@ services: restart: always vmanomaly: container_name: vmanomaly - image: victoriametrics/vmanomaly:v1.16.3 + image: victoriametrics/vmanomaly:v1.17.0 depends_on: - "victoriametrics" ports: diff --git a/docs/anomaly-detection/CHANGELOG.md b/docs/anomaly-detection/CHANGELOG.md index f6e4d60dc4..338519591b 100644 --- a/docs/anomaly-detection/CHANGELOG.md +++ b/docs/anomaly-detection/CHANGELOG.md @@ -11,6 +11,29 @@ aliases: --- Please find the changelog for VictoriaMetrics Anomaly Detection below. +## v1.17.0 +Released: 2024-10-17 + +- FEATURE: Added `max_points_per_query` (global and [query-specific](https://docs.victoriametrics.com/anomaly-detection/components/reader/#per-query-parameters)) [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader) arg to control query chunking. This overrides how `search.maxPointsPerTimeseries` flag (introduced in [v1.14.1](#v1141)) is used in `vmanomaly` for splitting long `fit_window` queries into smaller sub-intervals. This helps users avoid hitting the `search.maxQueryDuration` limit for individual queries by distributing initial query across multiple subquery requests with minimal overhead. + +- IMPROVEMENT: Enhanced the [self-monitoring](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly) metrics for consistency across the components. Key changes include: + - Converted several [self-monitoring](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly) metrics from `Summary` to `Histogram` to enable quantile calculation. 
This addresses the limitation of the `prometheus_client`'s [Summary](https://prometheus.github.io/client_python/instrumenting/summary/) implementation, which does not support quantiles. The change ensures metrics are more informative for performance analysis. Affected metrics are: + - `vmanomaly_reader_request_duration_seconds` ([VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics)) + - `vmanomaly_reader_response_parsing_seconds` ([VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics)) + - `vmanomaly_writer_request_duration_seconds` ([VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics)) + - `vmanomaly_writer_request_serialize_seconds` ([VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics)) + - Added a `query_key` label to the `vmanomaly_reader_response_parsing_seconds` [metric](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics) to provide finer granularity in tracking the performance of individual queries. This metric has also been switched from `Summary` to `Histogram` to align with the other metrics and support quantile calculations. + - Added `preset` and `scheduler_alias` keys to [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics) and [VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics) metrics for consistency in multi-[scheduler](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/) setups. + - Renamed [Counters](https://prometheus.io/docs/concepts/metric_types/#counter) `vmanomaly_reader_response_count` to `vmanomaly_reader_responses` and `vmanomaly_writer_response_count` to `vmanomaly_writer_responses`. 
+ - Updated [docs](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly) for better clarity. + +- IMPROVEMENT: Accelerated performance of model fitting stages on multicore systems. +- IMPROVEMENT: Optimized query handling in multi-[scheduler](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/) setups by filtering [queries](https://docs.victoriametrics.com/anomaly-detection/components/models/#queries) for each scheduler based on model requirements. This reduces unnecessary data fetching from VictoriaMetrics, ensuring only relevant queries are processed by the [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader#vm-reader), leading to better performance and efficiency of configs with multiple active schedulers. + +- IMPROVEMENT: Implemented automatic cleanup of files in subdirectories within `/tmp` ([generated by the Stan backend](https://mc-stan.org/cmdstanpy/users-guide/outputs.html) when utilizing [Prophet](https://docs.victoriametrics.com/anomaly-detection/components/models/#prophet) models) after each `fit` operation. This prevents the accumulation of unused data over time in `/tmp`, addressing a potential issue where these files would only be deleted upon termination of the current Python session or service, leading to uncontrolled disk growth. + +- FIX: Re-enable the `vmanomaly_reader_response_count` (now called `vmanomaly_reader_responses`) self-monitoring [metric](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics) for the [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader), which was unintentionally disabled in previous releases and now updates correctly as intended. + ## v1.16.3 Released: 2024-10-08 - IMPROVEMENT: Added `tls_cert_file` and `tls_key_file` arguments to support mTLS (mutual TLS) in `vmanomaly` components. 
This enhancement applies to the following components: [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader), [VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/writer/#vm-writer), and [Monitoring/Push](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#push-config-parameters). You can also use these arguments in conjunction with `verify_tls` when it is set as a path to a custom CA certificate file. diff --git a/docs/anomaly-detection/FAQ.md b/docs/anomaly-detection/FAQ.md index 773ead7096..24b11806cd 100644 --- a/docs/anomaly-detection/FAQ.md +++ b/docs/anomaly-detection/FAQ.md @@ -132,7 +132,7 @@ services: # ... vmanomaly: container_name: vmanomaly - image: victoriametrics/vmanomaly:v1.16.3 + image: victoriametrics/vmanomaly:v1.17.0 # ... ports: - "8490:8490" @@ -225,6 +225,76 @@ P.s. `infer` data volume will remain the same for both models, so it does not af - Old: 8064 hours/week (fit) + 168 hours/week (infer) - New: 4 hours/week (fit) + 168 hours/week (infer) +## Handling large queries in vmanomaly + + +If you're dealing with a large query in the `queries` argument of [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader) (especially when running [within a scheduler using a long](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/?highlight=fit_window#periodic-scheduler) `fit_window`), you may encounter issues such as query timeouts (due to the `search.maxQueryDuration` server limit) or rejections (if the `search.maxPointsPerTimeseries` server limit is exceeded). 
+ +We recommend upgrading to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170), which introduced the `max_points_per_query` argument (both global and [query-specific](https://docs.victoriametrics.com/anomaly-detection/components/reader/#per-query-parameters)) for the [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader). This argument overrides how the `search.maxPointsPerTimeseries` flag (supported since [v1.14.1](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1141)) is used by `vmanomaly` to split long `fit_window` queries into smaller sub-intervals. Splitting helps avoid hitting the `search.maxQueryDuration` limit for individual queries by distributing the initial query across multiple subquery requests with minimal overhead. + +
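
The splitting of a long `fit_window` into sub-intervals can be pictured with a small sketch. This is not `vmanomaly`'s actual implementation: the helper name `split_fit_window`, its parameters, and the assumption that a sub-interval's point count equals its duration divided by the sampling step are all illustrative.

```python
from datetime import datetime, timedelta


def split_fit_window(start, end, step, max_points_per_query):
    """Cut [start, end) into consecutive sub-intervals of at most
    `max_points_per_query` samples each (hypothetical helper)."""
    if max_points_per_query <= 0:
        raise ValueError("max_points_per_query must be positive")
    if step <= timedelta(0):
        raise ValueError("step must be positive")

    # Longest duration a single sub-query may cover at this sampling step.
    chunk = step * max_points_per_query
    intervals = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + chunk, end)
        intervals.append((cursor, chunk_end))
        cursor = chunk_end
    return intervals


# Example: a 14-day fit_window sampled every minute (20160 points),
# split with a budget of 5000 points per sub-query.
window_start = datetime(2024, 10, 1)
window_end = window_start + timedelta(days=14)
chunks = split_fit_window(window_start, window_end, timedelta(minutes=1), 5000)
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), "->", chunk_end.isoformat())
```

With these numbers the window is covered by five sub-intervals, the last one shorter than the rest. A smaller point budget means more, shorter requests, each less likely to run into a per-query duration limit.
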
To resolve the issue, reduce `max_points_per_query` to a value lower than `search.maxPointsPerTimeseries` until the problem is gone: + +```yaml +reader: + # other reader args + max_points_per_query: 10000 # reader-level constraint + queries: + sum_alerts: + expr: 'sum(ALERTS{alertstate=~"(pending|firing)"}) by (alertstate)' + max_points_per_query: 5000 # query-level override +models: + prophet: + # other model args + queries: [ + 'sum_alerts', + ] +# other config sections +``` + +### Alternative workaround for older versions + +If upgrading is not an option, you can partially address the issue by splitting your large query into smaller ones using appropriate label filters: + +For example, a query such as + +```yaml +reader: + # other reader args + queries: + sum_alerts: + expr: 'sum(ALERTS{alertstate=~"(pending|firing)"}) by (alertstate)' +models: + prophet: + # other model args + queries: [ + 'sum_alerts', + ] +# other config sections +``` + +can be modified to: + +```yaml +reader: + # other reader args + queries: + sum_alerts_pending: + expr: 'sum(ALERTS{alertstate="pending"}) by ()' + sum_alerts_firing: + expr: 'sum(ALERTS{alertstate="firing"}) by ()' +models: + prophet: + # other model args + queries: [ + 'sum_alerts_pending', + 'sum_alerts_firing', + ] +# other config sections +``` + +Please note that this approach may not fully resolve the issue if subqueries are not evenly distributed in terms of returned timeseries. Additionally, this workaround is not suitable for queries used in [multivariate models](https://docs.victoriametrics.com/anomaly-detection/components/models#multivariate-models) (especially when using the [groupby](https://docs.victoriametrics.com/anomaly-detection/components/models/#group-by) argument). + + ## Scaling vmanomaly > **Note:** As of latest release we do not support cluster or auto-scaled version yet (though, it's in our roadmap for - better backends, more parallelization, etc.), so proposed workarounds should be addressed *manually*. 
diff --git a/docs/anomaly-detection/Overview.md b/docs/anomaly-detection/Overview.md index 0db99ecfe6..4c7856cece 100644 --- a/docs/anomaly-detection/Overview.md +++ b/docs/anomaly-detection/Overview.md @@ -229,7 +229,7 @@ This will expose metrics at `http://0.0.0.0:8080/metrics` page. To use *vmanomaly* you need to pull docker image: ```sh -docker pull victoriametrics/vmanomaly:v1.16.3 +docker pull victoriametrics/vmanomaly:v1.17.0 ``` > Note: please check what is latest release in [CHANGELOG](https://docs.victoriametrics.com/anomaly-detection/changelog/) @@ -239,7 +239,7 @@ docker pull victoriametrics/vmanomaly:v1.16.3 You can put a tag on it for your convenience: ```sh -docker image tag victoriametrics/vmanomaly:v1.16.3 vmanomaly +docker image tag victoriametrics/vmanomaly:v1.17.0 vmanomaly ``` Here is an example of how to run *vmanomaly* docker container with [license file](#licensing): diff --git a/docs/anomaly-detection/components/models.md b/docs/anomaly-detection/components/models.md index 4ebd669c1d..56ae1b4ccc 100644 --- a/docs/anomaly-detection/components/models.md +++ b/docs/anomaly-detection/components/models.md @@ -962,7 +962,7 @@ monitoring: Let's pull the docker image for `vmanomaly`: ```sh -docker pull victoriametrics/vmanomaly:v1.16.3 +docker pull victoriametrics/vmanomaly:v1.17.0 ``` Now we can run the docker container putting as volumes both config and model file: @@ -976,7 +976,7 @@ docker run -it \ -v $(PWD)/license:/license \ -v $(PWD)/custom_model.py:/vmanomaly/model/custom.py \ -v $(PWD)/custom.yaml:/config.yaml \ -victoriametrics/vmanomaly:v1.16.3 /config.yaml \ +victoriametrics/vmanomaly:v1.17.0 /config.yaml \ --licenseFile=/license ``` diff --git a/docs/anomaly-detection/components/monitoring.md b/docs/anomaly-detection/components/monitoring.md index a2563b3504..235e38f0d2 100644 --- a/docs/anomaly-detection/components/monitoring.md +++ b/docs/anomaly-detection/components/monitoring.md @@ -11,6 +11,16 @@ aliases: --- There are 2 models 
to monitor VictoriaMetrics Anomaly Detection behavior - [push](https://docs.victoriametrics.com/keyconcepts/#push-model) and [pull](https://docs.victoriametrics.com/keyconcepts/#pull-model). Parameters for each of them should be specified in the config file, `monitoring` section. +> **Note**: there was an enhancement of [self-monitoring](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly) metrics for consistency across the components ([v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)). Documentation was updated accordingly. Key changes included: +- Converting several [self-monitoring](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#metrics-generated-by-vmanomaly) metrics from `Summary` to `Histogram` to enable quantile calculation. This addresses the limitation of the `prometheus_client`'s [Summary](https://prometheus.github.io/client_python/instrumenting/summary/) implementation, which does not support quantiles. The change ensures metrics are more informative for performance analysis. 
Affected metrics are: + - `vmanomaly_reader_request_duration_seconds` ([VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics)) + - `vmanomaly_reader_response_parsing_seconds` ([VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics)) + - `vmanomaly_writer_request_duration_seconds` ([VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics)) + - `vmanomaly_writer_request_serialize_seconds` ([VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics)) +- Adding a `query_key` label to the `vmanomaly_reader_response_parsing_seconds` [metric](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics) to provide finer granularity in tracking the performance of individual queries. This metric has also been switched from `Summary` to `Histogram` to align with the other metrics and support quantile calculations. +- Adding `preset` and `scheduler_alias` keys to [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics) and [VmWriter](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#writer-behaviour-metrics) metrics for consistency in multi-[scheduler](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/) setups. +- Renaming [Counters](https://prometheus.io/docs/concepts/metric_types/#counter) `vmanomaly_reader_response_count` to `vmanomaly_reader_responses` and `vmanomaly_writer_response_count` to `vmanomaly_writer_responses`. + ## Pull Model Config parameters
Counter | -How many times models ran (per model) | +`Counter` | +How many successful `stage` (`fit`, `infer`, `fit_infer`) runs occurred for models of class `model_alias` based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`stage, query_key, model_alias, scheduler_alias, preset` +`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` | |
Summary | -How much time (in seconds) model invocations took | +`Histogram` (was `Summary` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | +The total time (in seconds) taken by model invocations during the `stage` (`fit`, `infer`, `fit_infer`), based on the results of the `query_key` query, for models of class `model_alias`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`stage, query_key, model_alias, scheduler_alias, preset` +`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` | |
Counter | -How many datapoints did models accept | +`Counter` | +The number of datapoints accepted (excluding NaN or Inf values) by models of class `model_alias` from the results of the `query_key` query during the `stage` (`infer`, `fit_infer`), within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`stage, query_key, model_alias, scheduler_alias, preset` +`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` | |
Counter | -How many datapoints were generated by models | +`Counter` | +The number of datapoints generated by models of class `model_alias` during the `stage` (`infer`, `fit_infer`) based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`stage, query_key, model_alias, scheduler_alias, preset` +`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` | |
Gauge | -How many models are currently inferring | +`Gauge` | +The number of model instances of class `model_alias` currently available for inference for the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key, model_alias, scheduler_alias, preset` +`query_key`, `model_alias`, `scheduler_alias`, `preset` | |
Counter | -How many times a run was skipped (per model) | +`Counter` | +The number of times model runs (of class `model_alias`) were skipped in expected situations (e.g., no data for fitting/inference, or no new data to infer on) during the `stage` (`fit`, `infer`, `fit_infer`), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`stage, query_key, model_alias, scheduler_alias, preset` +`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` + | +|
+ +`vmanomaly_model_run_errors` + | +`Counter` | +The number of times model runs (of class `model_alias`) failed due to internal service errors during the `stage` (`fit`, `infer`, `fit_infer`), based on results from the `query_key` query, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | ++`stage`, `query_key`, `model_alias`, `scheduler_alias`, `preset` |
Summary | -How much time (in seconds) did requests to VictoriaMetrics take | +`Histogram` (was `Summary` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | +The total time (in seconds) taken by write requests to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. + | -`url, query_key` +`url`, `query_key`, `scheduler_alias`, `preset` | |
 -`vmanomaly_writer_response_count` +`vmanomaly_writer_responses` (named `vmanomaly_writer_response_count` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | -Counter | -Response code counts we got from VictoriaMetrics | +`Counter` | +The count of response codes received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. + | -`url, query_key, code` +`url`, `code`, `query_key`, `scheduler_alias`, `preset` | 
Counter | -How much bytes were sent to VictoriaMetrics | +`Counter` | +The total number of bytes sent to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`url, query_key` +`url`, `query_key`, `scheduler_alias`, `preset` | |
Summary | -How much time (in seconds) did serializing take | +`Histogram` (was `Summary` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | +The total time (in seconds) taken for serializing data for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key` +`query_key`, `scheduler_alias`, `preset` | |
Counter | -How many datapoints were sent to VictoriaMetrics | +`Counter` | +The total number of datapoints sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key` +`query_key`, `scheduler_alias`, `preset` | |
- `vmanomaly_writer_timeseries_sent` | -Counter | -How many timeseries were sent to VictoriaMetrics | +`Counter` | +The total number of timeseries sent to VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | - -`query_key` +`query_key`, `scheduler_alias`, `preset` | Summary | -How much time (in seconds) did queries to VictoriaMetrics take | +`Histogram` (was `Summary` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | +The total time (in seconds) taken by queries to VictoriaMetrics `url` for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`url, query_key` +`url`, `query_key`, `scheduler_alias`, `preset` |
-`vmanomaly_reader_response_count` +`vmanomaly_reader_responses` (named `vmanomaly_reader_response_count` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | -Counter | -Response code counts we got from VictoriaMetrics | +`Counter` | +The count of responses received from VictoriaMetrics `url` for the `query_key` query, categorized by `code`, within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`url, query_key, code` +`url`, `query_key`, `code`, `scheduler_alias`, `preset` |
Counter | -How much bytes were received in responses | +`Counter` | +The total number of bytes received in responses for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key` +`query_key`, `scheduler_alias`, `preset` | |
Summary | -How much time (in seconds) did parsing take for each step | +`Histogram` (was `Summary` prior to [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170)) | +The total time (in seconds) taken for data parsing at each `step` (json, dataframe) for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`step` +`step`, `query_key`, `scheduler_alias`, `preset` | |
`vmanomaly_reader_timeseries_received` - | -Counter | -How many timeseries were received from VictoriaMetrics | + +`Counter` | +The total number of timeseries received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key` +`query_key`, `scheduler_alias`, `preset` |
Counter | -How many rows were received from VictoriaMetrics | +`Counter` | +The total number of datapoints received from VictoriaMetrics for the `query_key` query within the specified scheduler `scheduler_alias`, in the `vmanomaly` service running in `preset` mode. | -`query_key` +`query_key`, `scheduler_alias`, `preset` | |
 +`max_points_per_query` + | ++`10000` + | ++Introduced in [v1.17.0](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1170), this optional argument overrides how the `search.maxPointsPerTimeseries` flag (available since [v1.14.1](https://docs.victoriametrics.com/anomaly-detection/changelog/#v1141)) is used by `vmanomaly` to split long `fit_window` [queries](https://docs.victoriametrics.com/anomaly-detection/components/reader/?highlight=queries#vm-reader) into smaller sub-intervals. This helps users avoid hitting the `search.maxQueryDuration` limit for individual queries by distributing the initial query across multiple subquery requests with minimal overhead. Set it lower than `search.maxPointsPerTimeseries` if you are hitting the `search.maxQueryDuration` limit. You can also set it on a [per-query](#per-query-parameters) basis to override the global value. + | + 