Commit Graph

278 Commits

Author SHA1 Message Date
Aliaksandr Valialkin
ac5b740750
lib/promscrape/discovery/kubernetes: typo fix in the comment for ContainerStateTerminated struct
This is a follow-up for ef12598ad4
2024-01-24 15:06:46 +02:00
Aliaksandr Valialkin
ef12598ad4
lib/promscrape/discovery/kubernetes: do not generate targets for already terminated pods and containers
Already terminated pods and containers cannot be scraped and will never resurrect,
so there is zero sense in creating scrape targets for them.
2024-01-24 14:57:53 +02:00
Aliaksandr Valialkin
3449d563bd
all: add up to 10% random jitter to the interval between periodic tasks performed by various components
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
2024-01-22 18:40:32 +02:00
Hui Wang
4e3242b02d
lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice (#5557)
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice

Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.

* remove mislead comment

* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640

* wip

* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls

groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.

Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.

P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.

* typo fix

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2024-01-21 23:13:15 +02:00
Aliaksandr Valialkin
1f105dde98
all: allow dynamically reading *AuthKey flag values from files and urls
Examples:

1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url

The flag value is automatically updated when the file contents changes.
2024-01-21 22:03:38 +02:00
Aliaksandr Valialkin
7fba73ce11
lib/promscrape/discovery/kubernetes: add -promscrape.kubernetes.attachNodeMetadataAll command-line flag
This flag allows setting attach_metadata.node=true for all the kubernetes_sd_configs defined at -promscrape.config

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640

Thanks to wasim-nihal for the initial implementation at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5593
2024-01-21 03:13:56 +02:00
Aliaksandr Valialkin
74448a7e57
lib/promscrape/discovery/hetzner: follow-up after 03a97dc678
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
  document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
  and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
  populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
2024-01-20 17:01:53 +02:00
Aliaksandr Valialkin
4b42c8abbb
lib/promscrape/discovery/hetzner: fix golangci-lint warnings after 03a97dc678
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
2024-01-15 17:12:40 +02:00
Aleksandr Stepanov
03a97dc678
vmagent: added hetzner sd config (#5550)
* added hetzner robot and hetzner cloud sd configs

* remove gettoken fun and update docs

* Updated CHANGELOG and vmagent docs

* Updated CHANGELOG and vmagent docs

---------

Co-authored-by: Nikolay <nik@victoriametrics.com>
2024-01-15 10:13:22 +01:00
Aliaksandr Valialkin
613b545dfd
lib/promscrape/discovery/kubernetes: propagate possible errors at newAPIWatcher() to the caller
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.

This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
2023-10-27 20:24:46 +02:00
Hui Wang
90427abc65
lib/promscrape/discovery/kubernetes: avoid possible panic if given caFile under kubernetes.SDConfig.HTTPClientConfig is not exist (#5243)
follow up d5a599badc
2023-10-27 20:20:22 +02:00
Aliaksandr Valialkin
632d788b63
lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 13:51:35 +02:00
Hui Wang
7c90ce39cb
do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 13:31:55 +08:00
Aliaksandr Valialkin
d5a599badc
lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-25 23:19:37 +02:00
Aliaksandr Valialkin
c22e3e7b1d
lib/promscrape/discovery/kubernetes/kubeconfig_test.go: make TestParseKubeConfigSuccess test code easier to follow 2023-10-25 23:17:18 +02:00
Aliaksandr Valialkin
eed5206376
lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-25 23:12:21 +02:00
Aliaksandr Valialkin
42dd71bb63
all: consistently use %w instead of %s in when error is passed to fmt.Errorf()
This allows consistently using errors.Is() for verifying whether the given error wraps some other known error.
2023-10-25 21:24:03 +02:00
Hui Wang
e16d3f5639
fix inconsistent behaviors with prometheus when scraping (#5153)
* fix inconsistent behaviors with prometheus when scraping

1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
    1. vmagent remoteWrite;
    2. vmalert datasource/remoteRead/remoteWrite/notifier.

* add changelogs

* address review comments

* fix ut
2023-10-17 17:58:19 +08:00
Zakhar Bessarab
8d99c12a7d
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs (#5048)
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs

It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).

Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-22 13:01:33 +02:00
Aliaksandr Valialkin
30a645cd82
lib/promscrape/discovery/kubernetes: follow-up after 03fece44e0
- Properly update vm_promscrape_discovery_kubernetes_url_watchers
  and vm_promscrape_discovery_kubernetes_group_watchers metrics after config changes

- Properly stop goroutine responsible for recreating scrapeWorks after the corresponding urlWatcher is stopped

- Log the event when urlWatcher is stopped in order to simplify debugging

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 23:23:45 +02:00
Aliaksandr Valialkin
03fece44e0
lib/promscrape/discovery/kubernetes: wait for 10 seconds before checking whether the urlWatcher must be stopped
This should prevent from excess urlWatcher churn on config reload, since it leads to removal of all the apiWatchers
before creating new apiWatchers. So, every config reload would lead to stopping of all the previous urlWatchers
and starting new urlWatchers.

The new logic gives 10 seconds for config reload before stopping unused urlWatchers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 17:45:12 +02:00
Aliaksandr Valialkin
76af32d869
lib/promscrape/discovery/kubernetes: follow-up after eeb862f3ff
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
  since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-09-18 17:06:39 +02:00
Zakhar Bessarab
eeb862f3ff
lib/promscrape/discovery/kubernetes: fix leaking api watcher (#4861)
* lib/promscrape/discovery/kubernetes: fix leaking api watcher

goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server

This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: clarify comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* Apply suggestions from code review

* lib/promscrape/discovery/kubernetes: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-15 19:40:13 +02:00
Aliaksandr Valialkin
edee262ecc
Makefile: update golangci-lint from v1.51.2 to v1.54.2
See https://github.com/golangci/golangci-lint/releases/tag/v1.54.2
2023-09-01 10:16:42 +02:00
Nikolay
00685b627f
lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch … (#4901)
* lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch requests
it must reduce load for kubernetes ETCD servers. Since requests without resourceVersion performs force cache sync at kubernetes API server with ETCD
more info at https://kubernetes.io/docs/reference/using-api/api-concepts/\#semantics-for-watch
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4855

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-08-30 16:03:41 +02:00
Aliaksandr Valialkin
3d73640815
lib/promscrape/discovery: close unused HTTP connections to service discovery servers
This should prevent from connection leaks

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4724
2023-07-27 14:48:56 -07:00
Aliaksandr Valialkin
140e7b6b74
all: replace atomic.Value with atomic.Pointer[T]
This eliminates the need in .(*T) casting for results obtained from Load()

Leave atomic.Value for map, since atomic.Pointer[map[...]...] makes double pointer to map,
because map is already a pointer type.
2023-07-19 17:42:06 -07:00
Aliaksandr Valialkin
8a07621a0c
lib/promscrape: disable support for service discovery and metrics scrape via http2
Reasons for disabling http2:

- http2 is used very rarely comparing to http for Prometheus metrics exposition and service discovery
- http2 is much harder to debug than http
- http2 has very bad security record because of its complexity - see https://portswigger.net/research/http2

VictoriaMetrics components are compiled with nethttpomithttp2 tag because of these issues.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4283
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4274

This is a follow-up for 72c3cd47eb
2023-07-06 16:03:37 -07:00
Alexander Marshalov
40d12be607
fixed service name detection for consulagent service discovery in case of a difference in service name and service id (#4390) (#4439)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
2023-06-12 16:16:43 +02:00
Haleygo
72c3cd47eb
vmagent:scrape config support enable_http2 (#4295)
app/vmagent: support `enable_http2` in scrape config 

This change adds HTTP2 support for scrape config
and improves compatibility with Prometheus config. 

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4283
2023-06-05 15:56:49 +02:00
Haleygo
b3d0ff463a
vmagent:support follow_redirects on SD level (#4286)
* vmagent:support follow_redirects on SD level

* fix follow_redirects on sd level

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4282
2023-05-26 09:39:45 +02:00
Aliaksandr Valialkin
b9bb64ce55
lib/promscrape/discovery/consulagent: substitute metaPrefix with the __meta_consulagent_ plaintext string
This simplifies future code navigation and search for the specific meta-label starting from __meta_consulagent_* prefix.
For example, `grep __meta_consulagent_namespace` finds the exact place where this label is defined.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3953
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4217
2023-05-08 23:40:13 -07:00
Aliaksandr Valialkin
74155afb71
docs: clarify docs after 5ee344824f
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4183
2023-05-08 16:11:44 -07:00
Zakhar Bessarab
4e71003620
lib/promscrape/discovery/kubernetes: follow-up for d5e94721db (#4255)
- add changelog reference to an author
- fix tests
- add metadata to match Prometheus behavior

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-05-05 14:41:17 +02:00
Vasilchenko Anton
22e65402af
Add endpoint labels for pod targets discovered form endpoint but has different ports (#4253)
Signed-off-by: Vasilchenko Anton <vasilchenko-as@yandex.ru>
2023-05-05 15:46:07 +04:00
Alexander Marshalov
56b84140a9
added new consulagent service discovery (#3953) (#4217) 2023-05-04 11:36:21 +02:00
Zakhar Bessarab
bf3b6732bd
lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints (#4235)
* lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints

Sets
`__meta_kubernetes_endpoints_name` and `__meta_kubernetes_namespace` labels to all ports of pod.
Prometheus sets those labels to all ports in pod (0ab9553611/discovery/kubernetes/endpoints.go (L267C15-L269)) even if port is not matching any service.

See: #4154

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: fix test for updated discovery logic

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-05-03 02:17:33 +02:00
Nikolay
5ee344824f
lib/promscrape: adds filter for consul_sd_configs: (#4184)
* lib/promscrape: adds filter for consul_sd_configs:
it allows advanced filtering for consul service discovery requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4183

* typo fix

* removes deprecation mentions since it's not relevant

* Update docs/CHANGELOG.md

Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>

---------

Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-04-26 19:16:27 +02:00
Aliaksandr Valialkin
f7ef80aaad
.golangci.yml: properly enable revive linter and fix all the warnings it detects 2023-02-26 12:18:59 -08:00
Aliaksandr Valialkin
e688121de8
lib/promscrape/discovery/kuma: substitute blocking HTTP call with non-blocking HTTP call at discoveryutils.Client 2023-02-23 15:13:08 -08:00
Mattias Ängehov
6d019a3c37
Azure Service Discovery - Fix token fetch for Container Apps/App Services (#3832)
* Modify API version when running in Container App

* Handle expires on from token response

Response from IMDS does not always contain expires in value which is
currently used to get the token expiry time. An example resources that
doesn't provide it are Container Apps and App Service.

Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>

* Fix client id parameter for user assigned identity

* Apply suggestions from code review

---------

Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
2023-02-22 19:19:53 -08:00
Aliaksandr Valialkin
510f78a96b
all: consistently use http.Method{Get,Post,Put} across the codebase
This is a follow-up after 9dec3c8f80
2023-02-22 18:58:46 -08:00
my-git9
9dec3c8f80
chore: Use http constants to replace numbers (#3846)
Signed-off-by: xin.li <xin.li@daocloud.io>
2023-02-22 18:53:05 -08:00
Aliaksandr Valialkin
9fbd45a22f
lib/promscrape/discovery/kuma: follow-up for 317fef95f9
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
  so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
  directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
  from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
  in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
  since these are public types.

Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3389

See https://github.com/prometheus/prometheus/issues/7919
and https://github.com/prometheus/prometheus/pull/8844
as a reference implementation in Prometheus
2023-02-22 17:51:51 -08:00
Aliaksandr Valialkin
eb08579452
lib/promscrape/discovery: add a comment explaining why duplicates are removed from the generated target labels 2023-02-22 17:51:51 -08:00
Alexander Marshalov
317fef95f9
add kuma_sd_config for Kuma Control Plane targets discovery (#3389) (#3840) 2023-02-22 13:59:56 +01:00
Oleksandr Redko
9fff48c3e3
app,lib: fix typos in comments (#3804) 2023-02-13 13:27:13 +01:00
Aliaksandr Valialkin
f9b3409ee3
lib/promscrape/discovery/openstack: use port 80 for the discovered target by default if it isnt specified in the config 2023-02-11 14:41:58 -08:00
Karan Sharma
146fd2eca3
sd/nomad: panic in nomad watcher because of nil map (#3784)
properly initialize url.Values
2023-02-08 09:43:29 +01:00
Aliaksandr Valialkin
0788be35eb
lib/promscrape/discovery/azure: add __meta_azure_machine_size label in the same way as Prometheus does
See https://github.com/prometheus/prometheus/pull/11650
2023-01-27 17:07:12 -08:00