vmanomaly guides - fix formatting, add missing piece, clarify statement, use common language for VM ecosystem (#5620)

Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
Author: Artem Navoiev, 2024-01-15 13:26:26 -08:00 (committed by Aliaksandr Valialkin)
Commit: a90c9bf8c9 (parent: 44ff8b3647)

---
weight: 1
sort: 1
title: Getting started with vmanomaly
menu:
  docs:
---

# Getting started with vmanomaly

**Prerequisites**:

- To use *vmanomaly*, part of the enterprise package, a license key is required. Obtain your key [here](https://victoriametrics.com/products/enterprise/trial) for this tutorial or for enterprise use.
- In the tutorial, we'll be using the following VictoriaMetrics components:
  - [VictoriaMetrics Single-Node](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) (v.1.96.0)
  - [vmalert](https://docs.victoriametrics.com/vmalert.html) (v.1.96.0)
  - [vmagent](https://docs.victoriametrics.com/vmagent.html) (v.1.96.0)
- [Grafana](https://grafana.com/) (v.10.2.1)
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/)
- [Node exporter](https://github.com/prometheus/node_exporter#node-exporter)

If you're unfamiliar with the listed components, please read [QuickStart](https://docs.victoriametrics.com/Quick-Start.html) first.
## 1. What is vmanomaly?

*VictoriaMetrics Anomaly Detection* ([vmanomaly](https://docs.victoriametrics.com/vmanomaly.html)) is a service that continuously scans time series stored in VictoriaMetrics and detects unexpected changes within data patterns in real-time. It does so by utilizing user-configurable machine learning models.

All the service parameters are defined in a config file.
A single config file supports only one model. It is ok to run multiple vmanomaly processes, each using its own config file.

In short, vmanomaly:
- periodically queries user-specified metrics
- computes an **anomaly score** for them
- pushes back the computed **anomaly score** to VictoriaMetrics.
### What is anomaly score?

**Anomaly score** is a calculated non-negative (in interval [0, +inf)) numeric value. It takes into account how well the data fit a predicted distribution, periodic patterns, trends, seasonality, etc.

The value is designed to:
- *fall between 0 and 1* if the model considers the datapoint to be following the usual pattern
- *exceed 1* if the datapoint is abnormal
Then, users can enable alerting rules based on the **anomaly score** with [vmalert](#what-is-vmalert).

## 2. What is vmalert?

[vmalert](https://docs.victoriametrics.com/vmalert.html) is an alerting tool for VictoriaMetrics. It executes a list of the given alerting or recording rules against the configured `-datasource.url`.

[Alerting rules](https://docs.victoriametrics.com/vmalert.html#alerting-rules) allow you to define conditions that, when met, will notify the user. The alerting condition is defined in the form of a query expression via the [MetricsQL query language](https://docs.victoriametrics.com/MetricsQL.html). For example, in our case, the expression `anomaly_score > 1.0` will notify a user when the calculated anomaly score exceeds a threshold of `1.0`.
## 3. How does vmanomaly work with vmalert?

Compared to classical alerting rules, anomaly detection is more "hands-off" and data-aware. Instead of thinking up critical conditions to define, a user can rely on catching anomalies that were not expected to happen. In other words, by setting up alerting rules, a user must know what to look for ahead of time, while anomaly detection looks for any deviations from past behavior.

A practical use case is to put the anomaly score generated by vmanomaly into alerting rules with some threshold.

**In this tutorial we are going to:**
- Configure docker-compose file with all needed services ([VictoriaMetrics Single-Node](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html), [vmalert](https://docs.victoriametrics.com/vmalert.html), [vmagent](https://docs.victoriametrics.com/vmagent.html), [Grafana](https://grafana.com/), [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) and [vmanomaly](https://docs.victoriametrics.com/vmanomaly.html)).
- Explore configuration files for [vmanomaly](https://docs.victoriametrics.com/vmanomaly.html) and [vmalert](https://docs.victoriametrics.com/vmalert.html).
- Run our own [VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) database with data scraped from [Node Exporter](https://prometheus.io/docs/guides/node-exporter/).
- Explore data for analysis in [Grafana](https://grafana.com/).
_____________________________

## 4. Data to analyze

Let's talk about the data used for anomaly detection in this tutorial.

We are going to collect our own CPU usage data with [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) into the VictoriaMetrics database.

On the Node Exporter's metrics page, part of the output looks like this:
```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 94965.14
node_cpu_seconds_total{cpu="1",mode="idle"} 94386.53
node_cpu_seconds_total{cpu="1",mode="iowait"} 51.22
...
```
Here, the metric `node_cpu_seconds_total` tells us how many seconds each CPU spent in different modes: _user_, _system_, _iowait_, _idle_, _irq&softirq_, _guest_, or _steal_.

These modes are mutually exclusive. A high _iowait_ means that you are disk or network bound, while high _user_ or _system_ means that you are CPU bound.
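For reference, the MetricsQL expression we will analyze throughout this tutorial is `rate(node_cpu_seconds_total)`. Written in the `<QUERY_ALIAS>: "QUERY"` form used later in the vmanomaly config, it could be registered under the alias `node_cpu_rate`, which later appears as the `for` label on the generated metrics (a sketch, not the tutorial's exact file):

```yaml
# MetricsQL expression used as vmanomaly input in this tutorial,
# registered under the query alias "node_cpu_rate":
node_cpu_rate: "rate(node_cpu_seconds_total)"
```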
This query generates 8 time series per CPU (one per mode), and we will use them as input for our anomaly detection. vmanomaly will train the configured model type separately on each of these time series.
## 5. vmanomaly configuration and parameter description

**Parameter description**:

There are 4 required sections in the config file: `scheduler`, `model`, `reader`, and `writer`.
Let's look into parameters in each section (a combined config sketch follows this list):
* `scheduler`
  * `infer_every` - how often trained models will make inferences on new data. Basically, how often to generate new datapoints for `anomaly_score`. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days).
    You can look at this as how often a model will write its conclusions on newly added data. In our example we are asking every 1 minute: based on the previous data, do these new datapoints look abnormal?
  * `fit_every` - how often to retrain the models. The higher the frequency, the fresher the model, but the more CPU it consumes. If omitted, the models will be retrained on each `infer_every` cycle. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days).
  * `fit_window` - what data interval to use for model training. Longer intervals capture longer historical behavior and detect seasonalities better, but are slower to adapt to permanent changes in metric behavior. The recommended value is at least two full seasons. Format examples: 30s, 4m, 2h, 1d. Time granularity ('s' - seconds, 'm' - minutes, 'h' - hours, 'd' - days).
    In our example we put the previous 14 days of data into model training.
* `model`
  * `class` - what model to run. You can use your own model or choose from built-in models: Seasonal Trend Decomposition, Facebook Prophet, ZScore, Rolling Quantile, Holt-Winters, Isolation Forest and ARIMA. Here we use Facebook Prophet (`model.prophet.ProphetModel`).
  * `args` - model-specific parameters, represented as a YAML dictionary in a simple `key: value` form. For example, you can use parameters that are available in [FB Prophet](https://facebook.github.io/prophet/docs/quick_start.html).
* `reader`
  * `datasource_url` - data source. An HTTP endpoint that serves `/api/v1/query_range`.
  * `queries` - MetricsQL (an extension of PromQL) expressions where you want to find anomalies.
    You can put several queries in the form `<QUERY_ALIAS>: "QUERY"`. QUERY_ALIAS will be used as a `for` label in generated metrics and anomaly scores.
* `writer`
  * `datasource_url` - output destination. An HTTP endpoint that serves `/api/v1/import`.
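For reference, here is a minimal sketch of a config that combines the parameters described above. The datasource URLs, the `fit_every` value and the Prophet argument are illustrative assumptions, not the tutorial's exact values:

```yaml
# Sketch of a vmanomaly config combining the sections described above.
# URLs and model args are assumptions; adjust them to your setup.
scheduler:
  infer_every: "1m"    # make inferences on new data every minute
  fit_every: "2m"      # retrain the model every 2 minutes (assumption)
  fit_window: "14d"    # train on the previous 14 days of data

model:
  class: "model.prophet.ProphetModel"
  args:
    interval_width: 0.98   # Prophet-specific argument (assumption)

reader:
  datasource_url: "http://victoriametrics:8428/"   # serves /api/v1/query_range (assumed service name)
  queries:
    node_cpu_rate: "rate(node_cpu_seconds_total)"

writer:
  datasource_url: "http://victoriametrics:8428/"   # serves /api/v1/import (assumed service name)
```

With a config like this, vmanomaly would query `rate(node_cpu_seconds_total)` every minute, retrain the Prophet model on the previous 14 days of data, and push the resulting `anomaly_score`, `yhat`, `yhat_lower` and `yhat_upper` series back to VictoriaMetrics.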
## 6. vmanomaly output

As a result of running vmanomaly, it produces the following metrics:

- `anomaly_score` - the main one. Ideally, if it is between 0.0 and 1.0, the value is considered to be non-anomalous. If it is greater than 1.0, it is considered an anomaly (but you can reconfigure that in the alerting config, of course),
- `yhat` - predicted expected value,
Here is an example of how the output metric will be written into VictoriaMetrics:
`anomaly_score{for="node_cpu_rate", cpu="0", instance="node-exporter:9100", job="node-exporter", mode="idle"} 0.85`
## 7. vmalert configuration

Here we provide an example of the config for vmalert `vmalert_config.yml`.
In the query expression we need to put a condition on the generated anomaly scores. Usually, if the anomaly score is between 0.0 and 1.0, the analyzed value is not abnormal. The more the anomaly score exceeds 1, the more confident our model is that the value is an anomaly.

You can choose a threshold value that you consider reasonable based on the anomaly score metric generated by vmanomaly. One of the best ways is to estimate it visually, by plotting the `anomaly_score` metric along with the predicted "expected" range of `yhat_lower` and `yhat_upper`. Later in this tutorial we will show an example.
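For reference, a minimal sketch of such a rules file, built around the `anomaly_score > 1.0` condition discussed above, might look like this (the group name, alert name, labels and annotations are illustrative, not the tutorial's exact file):

```yaml
# Sketch of a vmalert rules file; names and labels are illustrative.
groups:
  - name: vmanomaly
    rules:
      - alert: HighAnomalyScore
        expr: 'anomaly_score > 1.0'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Anomaly detected on {{ $labels.instance }} (anomaly_score={{ $value }})"
```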
## 8. Docker Compose configuration

Now we are going to configure the `docker-compose.yml` file to run all needed services.

Here are all the services we are going to run: VictoriaMetrics Single-Node, vmagent, vmalert, Grafana, Node Exporter, and vmanomaly.
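A minimal Grafana datasource provisioning file that points Grafana at the single-node VictoriaMetrics instance might look like the sketch below; the `victoriametrics` service name and port `8428` are assumptions based on the default setup:

```yaml
# Sketch of a Grafana datasource provisioning file; service name and port are assumptions.
apiVersion: 1

datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```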
### Scrape config

Let's create `prometheus.yml` file for `vmagent` configuration.
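A minimal sketch of such a scrape config for Node Exporter might look like this; the job name and target port match the labels seen in section 6, while the scrape interval is an assumption:

```yaml
# Sketch of a vmagent scrape config; the scrape interval is an assumption.
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
```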
### vmanomaly licensing

We are going to use a license stored locally in the file `vmanomaly_licence.txt` with the key in it.
You can explore other license options [here](https://docs.victoriametrics.com/vmanomaly.html#licensing).
## 9. Model results

To look at the model results, we need to go to Grafana at `localhost:3000`.
vmanomaly needs some time to generate enough data to visualize.

Let's investigate the model output visualization in Grafana.
In the Grafana Explore tab enter the queries:

* `anomaly_score`
* `yhat`
* `yhat_lower`
* `yhat_upper`
Each of these metrics will contain the same labels our query `rate(node_cpu_seconds_total)` returns.

### Anomaly scores for each metric with its corresponding labels.

Query: `anomaly_score`
<br>Check whether the anomaly score is high for the datapoints you think are anomalies. If not, you can try other parameters in the config file or try another model type.

As you may notice, a lot of data shows an anomaly score greater than 1. This is expected, as we have just started to scrape and store data and there are not enough datapoints to train on. Just wait some more time to gather more data and see how well this particular model can find anomalies. In our configs we require 2 days of data.
### Actual value from input query with predicted `yhat` metric.

Query: `yhat`

<img alt="yhat" src="guide-vmanomaly-vmalert_yhat.webp">

Here we are using one particular set of metrics for visualization. Check out the difference between the model prediction and the actual values. If the values differ significantly from the prediction, they can be considered anomalous.
### Lower and upper boundaries that model predicted.

Queries: `yhat_lower` and `yhat_upper`

<img alt="yhat lower and yhat upper" src="guide-vmanomaly-vmalert_yhat-lower-upper.webp">

Boundaries of 'normal' metric values according to model inference.
### Alerting

On the page `http://localhost:8880/vmalert/groups` you can find our configured Alerting rule:

<img alt="alert rule" src="guide-vmanomaly-vmalert_alert-rule.webp">
According to the rule configured for vmalert, we will see an Alert when the anomaly score exceeds 1:
<img alt="alerts firing" src="guide-vmanomaly-vmalert_alerts-firing.webp"> <img alt="alerts firing" src="guide-vmanomaly-vmalert_alerts-firing.webp">
## 10. Conclusion

Now we know how to set up the VictoriaMetrics Anomaly Detection tool (vmanomaly) and use it together with vmalert. We also discovered the core metrics generated by vmanomaly and their behaviour.