From df59ac7f0eb5cdfbbf38041d951b93d00fd3fc85 Mon Sep 17 00:00:00 2001
From: Roman Khavronenko
Date: Fri, 26 Jan 2024 22:42:21 +0100
Subject: [PATCH] app/vmalert: fix data race during hot-config reload (#5698)

* app/vmalert: fix data race during hot-config reload

During hot-reload, the logic invokes the group update and the interruption
of rules evaluation simultaneously, falsely assuming that the interruption
happens before the update. However, the group could be updated first and
only then would the rules evaluation be cancelled, resulting in permanent
interruption for all rules within the group.

The fix caches the context cancel function in a local variable first and
only then performs the group update. With the cached cancel function we can
safely call it without worrying that we cancel the evaluation of an already
updated group.

Signed-off-by: hagen1778

* Revert "app/vmalert: fix data race during hot-config reload"

This reverts commit a4bb7e8932c998ca0f3c7bd2da1a6d4da10fe1a1.

* app/vmalert: fix data race during hot-config reload

During hot-reload, the logic invokes the group update and the interruption
of rules evaluation simultaneously, falsely assuming that the interruption
happens before the update. However, the group could be updated first and
only then would the rules evaluation be cancelled, resulting in permanent
interruption for all rules within the group.

The fix cancels the evaluation context before applying the update, making
sure that the context is always cancelled for the old group.

Signed-off-by: hagen1778

* wip

Signed-off-by: hagen1778

---------

Signed-off-by: hagen1778
---
 app/vmalert/manager.go | 5 ++++-
 docs/CHANGELOG.md      | 3 ++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/app/vmalert/manager.go b/app/vmalert/manager.go
index a15b4c97e..b8ae851c9 100644
--- a/app/vmalert/manager.go
+++ b/app/vmalert/manager.go
@@ -156,11 +156,14 @@ func (m *manager) update(ctx context.Context, groupsCfg []config.Group, restore
 	var wg sync.WaitGroup
 	for _, item := range toUpdate {
 		wg.Add(1)
+		// cancel evaluation so the Update will be applied as fast as possible.
+		// it is important to call InterruptEval before the update, because cancel fn
+		// can be re-assigned during the update.
+		item.old.InterruptEval()
 		go func(old *rule.Group, new *rule.Group) {
 			old.UpdateWith(new)
 			wg.Done()
 		}(item.old, item.new)
-		item.old.InterruptEval()
 	}
 	wg.Wait()
 }
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
index c7f75c6b3..09f2dca29 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -71,7 +71,8 @@ The sandbox cluster installation is running under the constant load generated by
 * BUGFIX: `vmstorage`: properly expire `storage/prefetchedMetricIDs` cache. Previously this cache was never expired, so it could grow big under [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate). This could result in increasing CPU load over time.
 * BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): check `-external.url` schema when starting vmalert, must be `http` or `https`. Before, alertmanager could reject alert notifications if `-external.url` contained no or wrong schema.
 * BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): automatically add `exported_` prefix for original evaluation result label if it's conflicted with external or reserved one, previously it was overridden. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5161).
-* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): autogenerate `ALERTS_FOR_STATE` time series for alerting rules with `for: 0`. Previously, `ALERTS_FOR_STATE` was generated only for alerts with `for > 0`. The change aligns with Prometheus behavior. See more details in [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5648).
+* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): autogenerate `ALERTS_FOR_STATE` time series for alerting rules with `for: 0`. Previously, `ALERTS_FOR_STATE` was generated only for alerts with `for > 0`. The change aligns with Prometheus behavior. See more details in [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5648).
+* BUGFIX: [vmalert](https://docs.victoriametrics.com/vmalert.html): fix data race during hot-config reload. The result of the race could cause all rules from the updated group to fail with `context cancelled` errors.
 * BUGFIX: [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html): consistently sort results for `q1 or q2` query, so they do not change colors with each refresh in Grafana. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5393).
 * BUGFIX: [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html): properly return results from [bottomk](https://docs.victoriametrics.com/MetricsQL.html#bottomk) and `bottomk_*()` functions when some of these results contain NaN values. See [this issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5506). Thanks to @xiaozongyang for [the fix](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5509).
 * BUGFIX: [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html): properly handle queries, which wrap [rollup functions](https://docs.victoriametrics.com/MetricsQL.html#rollup-functions) with multiple arguments without explicitly specified lookbehind window in square brackets into [aggregate functions](https://docs.victoriametrics.com/MetricsQL.html#aggregate-functions). For example, `sum(quantile_over_time(0.5, process_resident_memory_bytes))` was resulting to `expecting at least 2 args to ...; got 1 args` error. Thanks to @atykhyy for [the pull request](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5414).
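
The standalone Go sketch below illustrates the ordering the patch enforces. It does not use vmalert's actual rule.Group API: the group type and its start, interruptEval, and updateWith methods are simplified stand-ins. The point it demonstrates is the one from the commit message and the new code comment: the cancel function is re-assigned during the update, so cancelling after the update would interrupt the new evaluation context instead of the old one.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// group is a simplified stand-in for vmalert's rule.Group, keeping only
// the part relevant to the race: a cancel function for the current
// evaluation context that gets re-assigned on update.
type group struct {
	mu         sync.Mutex
	cancelEval context.CancelFunc
}

// start launches a fake evaluation loop and remembers its cancel function.
func (g *group) start(name string) {
	ctx, cancel := context.WithCancel(context.Background())
	g.mu.Lock()
	g.cancelEval = cancel
	g.mu.Unlock()
	go func() {
		<-ctx.Done()
		fmt.Printf("%s evaluation stopped: %v\n", name, ctx.Err())
	}()
}

// interruptEval plays the role of Group.InterruptEval: it cancels the
// evaluation context currently stored in the group.
func (g *group) interruptEval() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.cancelEval != nil {
		g.cancelEval()
	}
}

// updateWith plays the role of Group.UpdateWith: it restarts evaluation,
// which re-assigns cancelEval to the new context's cancel function.
func (g *group) updateWith() {
	g.start("updated group")
}

func main() {
	g := &group{}
	g.start("old group")

	// The ordering the patch enforces: interrupt the old evaluation first,
	// then apply the update.
	g.interruptEval()
	g.updateWith()

	// If the two calls above were reversed (or raced), interruptEval could
	// cancel the context created by updateWith, leaving the refreshed group
	// permanently interrupted and its rules failing with "context canceled".
	time.Sleep(100 * time.Millisecond)
}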