VictoriaMetrics/app/vmalert-tool/README.md


# vmalert-tool

VMAlert command-line tool

## Unit testing for rules

You can use `vmalert-tool` to run unit tests for alerting and recording rules.
It will perform the following actions:
* sets up an isolated VictoriaMetrics instance;
* simulates the periodic ingestion of time series;
* queries the ingested data for recording and alerting rules evaluation like [vmalert](https://docs.victoriametrics.com/vmalert.html);
* checks whether the firing alerts or resulting recording rules match the expected results.

See how to run vmalert-tool for unit test below:

```
# Run vmalert-tool with one or multiple test files via --files cmd-line flag
./vmalert-tool unittest --files test1.yaml --files test2.yaml
```

vmalert-tool unittest is compatible with [Prometheus config format for tests](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#test-file-format)
except `promql_expr_test` field. Use `metricsql_expr_test` field name instead. The name is different because vmalert-tool
validates and executes [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html) expressions,
which aren't always backward compatible with [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/).

### Test file format

The configuration format for files specified in `--files` cmd-line flag is the following:

```yaml
# Path to the files or http url containing [rule groups](https://docs.victoriametrics.com/vmalert.html#groups) configuration.
# Enterprise version of vmalert-tool supports S3 and GCS paths to rules.
rule_files:
  [ - <string> ]

# The evaluation interval for rules specified in `rule_files`
[ evaluation_interval: <duration> | default = 1m ]

# Groups listed below will be evaluated by order.
# Not All the groups need not be mentioned, if not, they will be evaluated by define order in rule_files.
group_eval_order:
  [ - <string> ]

# The list of unit test files to be checked during evaluation.
tests:
  [ - <test_group> ]
```

#### `<test_group>`

```yaml
# Interval between samples for input series
interval: <duration>
# Time series to persist into the database according to configured <interval> before running tests.
input_series:
  [ - <series> ]

# Name of the test group, optional
[ name: <string> ]

# Unit tests for alerting rules
alert_rule_test:
  [ - <alert_test_case> ]

# Unit tests for Metricsql expressions.
metricsql_expr_test:
  [ - <metricsql_expr_test> ]

# External labels accessible for templating.
external_labels:
  [ <labelname>: <string> ... ]

```

#### `<series>`

```yaml
# series in the following format '<metric name>{<label name>=<label value>, ...}'
# Examples:
#      series_name{label1="value1", label2="value2"}
#      go_goroutines{job="prometheus", instance="localhost:9090"}
series: <string>

# values support several special equations:
#    'a+bxc' becomes 'a a+b a+(2*b) a+(3*b) … a+(c*b)'
#     Read this as series starts at a, then c further samples incrementing by b.
#    'a-bxc' becomes 'a a-b a-(2*b) a-(3*b) … a-(c*b)'
#     Read this as series starts at a, then c further samples decrementing by b (or incrementing by negative b).
#    '_' represents a missing sample from scrape
#    'stale' indicates a stale sample
# Examples:
#     1. '-2+4x3' becomes '-2 2 6 10' - series starts at -2, then 3 further samples incrementing by 4.
#     2. ' 1-2x4' becomes '1 -1 -3 -5 -7' - series starts at 1, then 4 further samples decrementing by 2.
#     3. ' 1x4' becomes '1 1 1 1 1' - shorthand for '1+0x4', series starts at 1, then 4 further samples incrementing by 0.
#     4. ' 1 _x3 stale' becomes '1 _ _ _ stale' - the missing sample cannot increment, so 3 missing samples are produced by the '_x3' expression.
values: <string>
```

#### `<alert_test_case>`

vmalert by default adds `alertgroup` and `alertname` to the generated alerts and time series.
So you will need to specify both `groupname` and `alertname` under a single `<alert_test_case>`,
but no need to add them under `exp_alerts`.
You can also pass `--disableAlertgroupLabel` to skip `alertgroup` check.

```yaml
# The time elapsed from time=0s when this alerting rule should be checked.
# Means this rule should be firing at this point, or shouldn't be firing if 'exp_alerts' is empty.
eval_time: <duration>

# Name of the group name to be tested.
groupname: <string>

# Name of the alert to be tested.
alertname: <string>

# List of the expected alerts that are firing under the given alertname at
# the given evaluation time. If you want to test if an alerting rule should
# not be firing, then you can mention only the fields above and leave 'exp_alerts' empty.
exp_alerts:
  [ - <alert> ]
```

#### `<alert>`

```yaml
# These are the expanded labels and annotations of the expected alert.
# Note: labels also include the labels of the sample associated with the alert
exp_labels:
  [ <labelname>: <string> ]
exp_annotations:
  [ <labelname>: <string> ]
```

#### `<metricsql_expr_test>`

```yaml
# Expression to evaluate
expr: <string>

# The time elapsed from time=0s when this expression be evaluated.
eval_time: <duration>

# Expected samples at the given evaluation time.
exp_samples:
  [ - <sample> ]
```

#### `<sample>`

```yaml
# Labels of the sample in usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
#      series_name{label1="value1", label2="value2"}
#      go_goroutines{job="prometheus", instance="localhost:9090"}
labels: <string>

# The expected value of the Metricsql expression.
value: <number>
```

### Example

This is an example input file for unit testing which will pass.
`test.yaml` is the test file which follows the syntax above and `alerts.yaml` contains the alerting rules.

With `rules.yaml` in the same directory, run `./vmalert-tool unittest --files=./unittest/testdata/test.yaml`.

#### `test.yaml`

```yaml
rule_files:
  - rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: "0+0x1440"

    metricsql_expr_test:
      - expr: suquery_interval_test
        eval_time: 4m
        exp_samples:
          - labels: '{__name__="suquery_interval_test", datacenter="dc-123", instance="localhost:9090", job="prometheus"}'
            value: 1

    alert_rule_test:
      - eval_time: 2h
        groupname: group1
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: prometheus
              severity: page
              instance: localhost:9090
              datacenter: dc-123
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

      - eval_time: 0
        groupname: group1
        alertname: AlwaysFiring
        exp_alerts:
          - exp_labels:
              datacenter: dc-123

      - eval_time: 0
        groupname: group1
        alertname: InstanceDown
        exp_alerts: []

    external_labels:
      datacenter: dc-123
```

#### `alerts.yaml`

```yaml
# This is the rules file.

groups:
  - name: group1
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: AlwaysFiring
        expr: 1

  - name: group2
    rules:
      - record: job:test:count_over_time1m
        expr: sum without(instance) (count_over_time(test[1m]))
      - record: suquery_interval_test
        expr: count_over_time(up[5m:])
```