Commit Graph

2088 Commits

Author SHA1 Message Date
Aliaksandr Valialkin
7b33a27874
lib/logstorage: follow-up for 8a23d08c21
- Compare the actual free disk space to the value provided via -storage.minFreeDiskSpaceBytes
  directly inside the Storage.IsReadOnly(). This should work fast in most cases.
  This simplifies the logic at lib/storage.

- Do not take into account -storage.minFreeDiskSpaceBytes during background merges, since
  it results in uncontrolled growth of small parts when the free disk space approaches -storage.minFreeDiskSpaceBytes.
  The background merge logic uses another mechanism for determining whether there is enough
  disk space for the merge - it reserves the needed disk space before the merge
  and releases it after the merge. This prevents from out of disk space errors during background merge.

- Properly handle corner cases for flushing in-memory data to disk when the storage
  enters read-only mode. This is better than losing the in-memory data.

- Return back Storage.MustAddRows() instead of Storage.AddRows(),
  since the only case when AddRows() can return error is when the storage is in read-only mode.
  This case must be handled by the caller by calling Storage.IsReadOnly()
  before adding rows to the storage.
  This simplifies the code a bit, since the caller of Storage.MustAddRows() shouldn't handle
  errors returned by Storage.AddRows().

- Properly store parsed logs to Storage if parts of the request contain invalid log lines.
  Previously the parsed logs could be lost in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4945
2023-10-02 16:52:23 +02:00
Aliaksandr Valialkin
10d9214980
lib/logstorage: run up to GOMAXPROCS flushers of old in-memory parts to disk
One flusher isn't enough under high data ingestion rate.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4775
2023-10-02 16:20:59 +02:00
Aliaksandr Valialkin
da9ef90277
lib/logstorage: assist merging in-memory parts at data ingestion path if their number starts exceeding maxInmemoryPartsPerPartition
This is a follow-up for 9310e9f584 , which removed data ingestion pacing.
This can result in uncontrolled growth of in-memory parts under high data ingestion rate,
which, in turn, can result in unbounded RAM usage, OOM crashes and slow query performance.

While at it, consistently reset isInMerge field for parts passed to mergeParts() before returning from this function.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4775
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4828
2023-10-02 08:24:58 +02:00
Aliaksandr Valialkin
d41841c0c9
lib/{mergeset,storage}: consistently reset isInMerge field in parts passed to mergeParts() before returning from the function
While at it consistently check that the isInMerge field is set in all the parts passed to mergeParts()
2023-10-02 08:05:29 +02:00
Aliaksandr Valialkin
3ca6fea858
lib/{mergeset,storage}: perform at most one assisted merge per each call to addRows/addItems
This should reduce tail latency during data ingestion.

This shouldn't slow down data ingestion in the worst case, since assisted merges are spread among
distinct addRows/addItems calls after this change.
2023-10-01 22:19:46 +02:00
Zakhar Bessarab
94627113db
lib/logstorage: prevent from panic during background merge (#4969)
* lib/logstorage: prevent from panic during background merge

Fixes panic during background merge when resulting block would contain more columns than maxColumnsPerBlock.
Buffered data will be flushed and replaced by the next block.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/logstorage: clarify field description and comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-29 11:58:20 +02:00
Zakhar Bessarab
8a23d08c21
lib/logstorage: switch to read-only mode when running out of disk space (#4945)
* lib/logstorage: switch to read-only mode when running out of disk space

Added support of `--storage.minFreeDiskSpaceBytes` command-line flag to allow graceful handling of running out of disk space at `--storageDataPath`.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/logstorage: fix error handling logic during merge

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/logstorage: fix log level

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-29 11:55:38 +02:00
Zakhar Bessarab
9310e9f584
lib/logstorage/datadb: remove parts merge cond (#4828)
It was added in order to limit number of goroutines performing assisted merges during ingestion.
It turned out that blocking ingestion goroutines lower ingestion performance and limits overall ingestion around 40k items per seconds because of lock contention.
Removing parts merge sync.Cond allows to remove lock contention at write path and significantly improves write performance.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4775

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-29 11:50:14 +02:00
Aliaksandr Valialkin
223ef96198
lib/storage: remove unused atomicSetBool function after 717c53af27 2023-09-25 17:37:24 +02:00
Aliaksandr Valialkin
15dfd94f3b
lib/storage: make it clear that the number of big merge workers always equals to 4
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4915#issuecomment-1733922830
2023-09-25 17:15:45 +02:00
Aliaksandr Valialkin
717c53af27
lib/storage: stop exposing vm_merge_need_free_disk_space metric
This metric confuses users and has no any useful information.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/686#issuecomment-1733844128
2023-09-25 16:52:39 +02:00
Zakhar Bessarab
8d99c12a7d
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs (#5048)
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs

It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).

Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-22 13:01:33 +02:00
Aliaksandr Valialkin
3140ef7261
lib/storage: log fatal error inside searchMetricName() instead of propagating it to the caller
This simplifies the code a bit at searchMetricName() and searchMetricNameWithCache() call sites

This is a result of investigating https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4972
2023-09-22 11:41:06 +02:00
Zakhar Bessarab
760cdcec68
lib/backup: fix issue with inconsistent copying of appliedRetention.txt (#5027)
* lib/backup: fix issue with inconsistent copying of appliedRetention.txt

appliedRetention.txt can be modified in place, so it should be always copied just the same as parts.json

Updates: https://github.com/victoriaMetrics/victoriaMetrics/issues/5005
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs: add changelog entry for appliedRetention.txt copying fix

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-09-21 11:25:19 +02:00
Zakhar Bessarab
bea3431ed1
lib/storage/partition: add check to ensure parts exist on disk (#5017)
* lib/storage/partition: add check to ensure parts exist on disk

If part exists in parts.json but is missing on disk there will be a misleading error similar to "unexpected number of substrings in the part name".

This change forces verification of part existence and throws a correct error in case it is missing on disk.

Such issue can be result of https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005 or disk corruption.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/storage/partition: use filepath.Join instead of string concatenation

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/storage/partition: add action points for error message

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* all: add a check for missing part in lib/mergeset and lib/logstorage

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-09-19 11:17:41 +02:00
Aliaksandr Valialkin
30a645cd82
lib/promscrape/discovery/kubernetes: follow-up after 03fece44e0
- Properly update vm_promscrape_discovery_kubernetes_url_watchers
  and vm_promscrape_discovery_kubernetes_group_watchers metrics after config changes

- Properly stop goroutine responsible for recreating scrapeWorks after the corresponding urlWatcher is stopped

- Log the event when urlWatcher is stopped in order to simplify debugging

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 23:23:45 +02:00
Aliaksandr Valialkin
03fece44e0
lib/promscrape/discovery/kubernetes: wait for 10 seconds before checking whether the urlWatcher must be stopped
This should prevent from excess urlWatcher churn on config reload, since it leads to removal of all the apiWatchers
before creating new apiWatchers. So, every config reload would lead to stopping of all the previous urlWatchers
and starting new urlWatchers.

The new logic gives 10 seconds for config reload before stopping unused urlWatchers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4861
2023-09-18 17:45:12 +02:00
Aliaksandr Valialkin
76af32d869
lib/promscrape/discovery/kubernetes: follow-up after eeb862f3ff
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
  since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-09-18 17:06:39 +02:00
Aliaksandr Valialkin
4d01bc6d52
lib/backup: properly copy parts.json files inside indexdb directory additional to data directory
This is a follow-up for 264ffe3fa1

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5006
2023-09-18 16:16:50 +02:00
Aliaksandr Valialkin
f93a7b8457
lib/backup/common: consistently use canonical path with / directory separators at Part.Path
Previously Part.Path could contain `\` directory separators on Windows OS,
which could result in incorrect filepaths generation when making backups at object storage.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4704

This is a follow-up for f2df8ad480
2023-09-18 16:15:34 +02:00
Zakhar Bessarab
eeb862f3ff
lib/promscrape/discovery/kubernetes: fix leaking api watcher (#4861)
* lib/promscrape/discovery/kubernetes: fix leaking api watcher

goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server

This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/promscrape/discovery/kubernetes: clarify comment

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* Apply suggestions from code review

* lib/promscrape/discovery/kubernetes: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-15 19:40:13 +02:00
faceair
b6ad581b45
lib/storage: remove ForceMergeAllParts internal loop (#4999)
Signed-off-by: faceair <git@faceair.me>
2023-09-15 19:04:54 +02:00
Zakhar Bessarab
264ffe3fa1
lib/backup: force copying of parts.json (#5006)
* lib/backup: force copying of parts.json

Copying of parts.json is required because `part.key()` comparison can create same key value for files with different contents. This will result in inconsistent backup being created or restored.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/backup: ensure parts.json is only copied once

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
2023-09-15 19:04:38 +02:00
Aliaksandr Valialkin
a09c680170
lib/storage: handle fatal errors inside indexSearch.getTSIDByMetricID() instead of returning them to the caller
This simplifies the code a bit at caller side
2023-09-15 11:55:42 +02:00
Aliaksandr Valialkin
9de440c803
lib/logger: increase the maximum log arg size from 200 to 500
The 200 chars limit has been appeared too small for typical log messages emitted by VictoriaMetrics components

This is a follow-up for 87fea7d8ac
2023-09-07 16:11:08 +02:00
Aliaksandr Valialkin
87fea7d8ac
lib/logger: limit the maximum arg length, which can be emitted to log lines
This should prevent from emitting too long lines when too long args are passed to logger.* functions.
For example, too long MetricsQL queries or too long data samples.
2023-09-07 15:22:46 +02:00
Aliaksandr Valialkin
24d61bf193
lib/flagutil: add Duration.Milliseconds() convenience function after 0c7d46d637
This function is a faster replacement for Duration.Duration().Milliseconds() call
2023-09-03 10:56:44 +02:00
Dima Lazerka
0c7d46d637
flagutil: Make .Msecs private (#4906)
* Introduce flagutil.Duration

To avoid conversion bugs

* Fix tests

* Clarify documentation re. month=31 days

* Add fasttime.UnixTime() to obtain time.Time

The goal is to refactor out the last usage of `.Msecs`.

* Use fasttime for time.Now()

* wip

- Remove fasttime.UnixTime(), since it doesn't improve code readability and maintainability
- Run `make docs-sync` for syncing changes from README.md to docs/ folder
- Make lib/flagutil.Duration.Msec private
- Rename msecsPerMonth const to msecsPer31Days in order to be consistent with retention31Days

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-09-03 10:33:37 +02:00
Aliaksandr Valialkin
edee262ecc
Makefile: update golangci-lint from v1.51.2 to v1.54.2
See https://github.com/golangci/golangci-lint/releases/tag/v1.54.2
2023-09-01 10:16:42 +02:00
Dima Lazerka
e0e856d2e7
Add flagutil.Duration to avoid conversion bugs (#4835)
* Introduce flagutil.Duration

To avoid conversion bugs

* Fix tests

* Comment why not .Seconds()
2023-09-01 09:27:51 +02:00
Nikolay
00685b627f
lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch … (#4901)
* lib/promscrape/k8s_sd: set resourceVersion to 0 by default for watch requests
it must reduce load for kubernetes ETCD servers. Since requests without resourceVersion performs force cache sync at kubernetes API server with ETCD
more info at https://kubernetes.io/docs/reference/using-api/api-concepts/\#semantics-for-watch
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4855

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-08-30 16:03:41 +02:00
Aliaksandr Valialkin
1bba4c5118
lib/auth: add NewTokenPossibleMultitenant() for parsing auth token, which can be multitenant
Disallow parsing multitenant token at auth.NewToken().

Use auth.NewTokenPossibleMultitenant() at vminsert only. All the other callers should call auth.NewToken(),
since they do not support multitenant token.

This is a follow-up for f0c06b428e

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
2023-08-30 14:17:55 +02:00
Aliaksandr Valialkin
039b8667c4
lib/proxy: consistently use gopkg.in/yaml.v2 across all the code 2023-08-29 13:12:38 +02:00
Aliaksandr Valialkin
317a273c6d
lib/logstorage: eliminate data race when clearing s.ptwHot after deleting the corresponding partition
The previous code could result in the following data race:
1. The s.ptwHot partition is marked to be deleted
2. ptw.decRef() is called on it
3. ptw.pt is set to nil
4. s.ptwHot.pt is accessed from concurrent goroutine, which leads to panic.

The change clears s.ptwHot under s.partitionsLock in order to prevent from the data race.

This is a follow-up for 8d50032dd6

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4895
2023-08-29 11:09:55 +02:00
crossoverJie
8d50032dd6
lib/logstorage: Set ptwHot to nil when the partition pointed by ptwHot is dropped (#4902) 2023-08-29 11:01:19 +02:00
hagen1778
5d848363f0
lib/promscrape: follow-up after eabcfc9bcd
`-promscrape.cluster.membersCount` by default should be `1`, like every
single vmagent is a cluster of one member on its own.
The change additionally validates that user can't set `-promscrape.cluster.membersCount`
to value lower than `1`.

eabcfc9bcd
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-08-29 10:04:57 +02:00
Haleygo
eabcfc9bcd
fix clusterMembersCount check (#4900) 2023-08-29 15:58:24 +08:00
crossoverJie
cde5029bce
lib/logstorage: add nil check for ptwHot.pt (#4896) 2023-08-27 01:24:26 +02:00
Zakhar Bessarab
6e8611f301
lib/promscrape/client: sync timeout for HostClient and http.Client (#4889)
Initially, stream parse mode was reading data from response and parsing it on flight. This was causing longer delay to read the whole response and required increasing timeout value to allow data processing while reading. So that 908e35affd increased timeout value to fix this.

But after 74c00a8762 response in stream parse mode is saved into memory and then parsed eliminating necessity of having timeout value higher that for usual scrape.

Updates: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4847
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-08-25 15:47:11 +02:00
Aliaksandr Valialkin
f1c2508243
lib/promscrape: add -promscrape.cluster.memberLabel command-line flag
This flag allows specifying an additional label to add to all the scraped metrics.
The flag must contain label name to add. The label value will be equal to -promscrape.cluster.memberNum.

This functionality can help when there is a need to differentiate metrics scraped
by distinct vmagent instances in the cluster according to https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247#issuecomment-1692279393
2023-08-24 22:03:54 +02:00
hagen1778
4ebe8bb1d5
app/vmagent: follow-up after 6788704152
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4884
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-08-24 11:36:42 +02:00
Zakhar Bessarab
6788704152
lib/promscrape/client: make User-Agent consistent between fasthttp and native client (#4886)
User agent was not set for native client which resulted in using one provided by Golang.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4884

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-08-24 11:31:13 +02:00
Nikolay
c5aac34b68
lib/storage: properly caclucate nextRotationTimestamp (#4874)
cause of typo unix millis was used instead of unix for current timestamp
calculation
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4873
2023-08-23 13:22:53 +02:00
Dmytro Kozlov
b7d07e5acf
lib/protoparser: handle unexpected EOF error when parsing lines in prometheus exposition format (#4851)
Previously only io.EOF was handled, and io.ErrUnexpectedEOF was ignored, but it may happen if the client interrupts the connection.

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4817
2023-08-18 08:55:42 +02:00
Aliaksandr Valialkin
cd9f86afe1
lib/envflag: do not allow unsupported form for boolean command-line flags in the form -boolFlag value
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4845
2023-08-17 13:26:53 +02:00
Aliaksandr Valialkin
fdae53a75b
lib/promrelabel: properly replace : char with _ in metric names when -usePromCompatibleNaming command-line flag is set
This addresses https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3113#issuecomment-1275077071 comment from @johnseekins
2023-08-14 16:14:42 +02:00
Aliaksandr Valialkin
63e3571e8c
lib/promrelabel: stop emitting DEBUG log lines when parsing if expressions
These lines were accidentally left in the commit 62651570bb

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4635
2023-08-14 15:24:31 +02:00
Aliaksandr Valialkin
ac6c40e896
all: refer to https://docs.victoriametrics.com/#resource-usage-limits in the error message about -search.max* limit
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4827
2023-08-14 01:57:34 -07:00
Aliaksandr Valialkin
a0f695f5de
app/vmbackup: add ability to make server-side copying of existing backups 2023-08-13 17:24:24 -07:00
Nikolay
d144e39592
lib/protoparser/openetelemetry: fixes panic (#4821)
Opentelemetry format allows histograms with non-counter buckets. In this case it makes no sense to add buckets into database
and save only counter with _count suffix.
It could be used as gauge.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4814

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-08-12 05:09:18 -07:00