2023-06-20 07:55:12 +02:00
|
|
|
package logstorage
|
|
|
|
|
|
|
|
import (
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
"strings"
|
2023-06-20 07:55:12 +02:00
|
|
|
"sync"
|
|
|
|
|
|
|
|
"github.com/VictoriaMetrics/VictoriaMetrics/lib/bytesutil"
|
|
|
|
"github.com/VictoriaMetrics/VictoriaMetrics/lib/encoding"
|
|
|
|
"github.com/VictoriaMetrics/VictoriaMetrics/lib/logger"
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
"github.com/VictoriaMetrics/VictoriaMetrics/lib/slicesutil"
|
2023-06-20 07:55:12 +02:00
|
|
|
)
|
|
|
|
|
2024-05-14 01:49:20 +02:00
|
|
|
// The number of blocks to search at once by a single worker
|
|
|
|
//
|
|
|
|
// This number must be increased on systems with many CPU cores in order to amortize
|
|
|
|
// the overhead for passing the blockSearchWork to worker goroutines.
|
|
|
|
const blockSearchWorksPerBatch = 64
|
|
|
|
|
2023-06-20 07:55:12 +02:00
|
|
|
type blockSearchWork struct {
|
|
|
|
// p is the part where the block belongs to.
|
|
|
|
p *part
|
|
|
|
|
2024-05-12 16:33:29 +02:00
|
|
|
// so contains search options for the block search.
|
2023-06-20 07:55:12 +02:00
|
|
|
so *searchOptions
|
|
|
|
|
|
|
|
// bh is the header of the block to search.
|
|
|
|
bh blockHeader
|
|
|
|
}
|
|
|
|
|
2024-05-14 01:49:20 +02:00
|
|
|
func (bsw *blockSearchWork) reset() {
|
|
|
|
bsw.p = nil
|
|
|
|
bsw.so = nil
|
|
|
|
bsw.bh.reset()
|
|
|
|
}
|
|
|
|
|
|
|
|
type blockSearchWorkBatch struct {
|
|
|
|
bsws []blockSearchWork
|
|
|
|
}
|
|
|
|
|
|
|
|
func (bswb *blockSearchWorkBatch) reset() {
|
|
|
|
bsws := bswb.bsws
|
|
|
|
for i := range bsws {
|
|
|
|
bsws[i].reset()
|
|
|
|
}
|
|
|
|
bswb.bsws = bsws[:0]
|
|
|
|
}
|
|
|
|
|
|
|
|
func getBlockSearchWorkBatch() *blockSearchWorkBatch {
|
|
|
|
v := blockSearchWorkBatchPool.Get()
|
|
|
|
if v == nil {
|
|
|
|
return &blockSearchWorkBatch{
|
|
|
|
bsws: make([]blockSearchWork, 0, blockSearchWorksPerBatch),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return v.(*blockSearchWorkBatch)
|
|
|
|
}
|
|
|
|
|
|
|
|
func putBlockSearchWorkBatch(bswb *blockSearchWorkBatch) {
|
|
|
|
bswb.reset()
|
|
|
|
blockSearchWorkBatchPool.Put(bswb)
|
|
|
|
}
|
|
|
|
|
|
|
|
var blockSearchWorkBatchPool sync.Pool
|
|
|
|
|
|
|
|
func (bswb *blockSearchWorkBatch) appendBlockSearchWork(p *part, so *searchOptions, bh *blockHeader) bool {
|
|
|
|
bsws := bswb.bsws
|
|
|
|
|
|
|
|
bsws = append(bsws, blockSearchWork{
|
|
|
|
p: p,
|
|
|
|
so: so,
|
|
|
|
})
|
|
|
|
bsw := &bsws[len(bsws)-1]
|
2023-06-20 07:55:12 +02:00
|
|
|
bsw.bh.copyFrom(bh)
|
2024-05-14 01:49:20 +02:00
|
|
|
|
|
|
|
bswb.bsws = bsws
|
|
|
|
|
|
|
|
return len(bsws) < cap(bsws)
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
func getBlockSearch() *blockSearch {
|
|
|
|
v := blockSearchPool.Get()
|
|
|
|
if v == nil {
|
|
|
|
return &blockSearch{}
|
|
|
|
}
|
|
|
|
return v.(*blockSearch)
|
|
|
|
}
|
|
|
|
|
|
|
|
func putBlockSearch(bs *blockSearch) {
|
|
|
|
bs.reset()
|
2024-09-24 20:51:13 +02:00
|
|
|
|
|
|
|
// reset seenStreams before returning bs to the pool in order to reduce memory usage.
|
|
|
|
bs.seenStreams = nil
|
|
|
|
|
2023-06-20 07:55:12 +02:00
|
|
|
blockSearchPool.Put(bs)
|
|
|
|
}
|
|
|
|
|
|
|
|
var blockSearchPool sync.Pool
|
|
|
|
|
|
|
|
type blockSearch struct {
|
|
|
|
// bsw is the actual work to perform on the given block pointed by bsw.ph
|
|
|
|
bsw *blockSearchWork
|
|
|
|
|
|
|
|
// br contains result for the search in the block after search() call
|
|
|
|
br blockResult
|
|
|
|
|
|
|
|
// timestampsCache contains cached timestamps for the given block.
|
|
|
|
timestampsCache *encoding.Int64s
|
|
|
|
|
|
|
|
// bloomFilterCache contains cached bloom filters for requested columns in the given block
|
|
|
|
bloomFilterCache map[string]*bloomFilter
|
|
|
|
|
|
|
|
// valuesCache contains cached values for requested columns in the given block
|
|
|
|
valuesCache map[string]*stringBucket
|
|
|
|
|
|
|
|
// sbu is used for unmarshaling local columns
|
|
|
|
sbu stringsBlockUnmarshaler
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
// cshIndexBlockCache holds columnsHeaderIndex data for the given block.
|
|
|
|
//
|
|
|
|
// It is initialized lazily by calling getColumnsHeaderIndex().
|
|
|
|
cshIndexBlockCache []byte
|
|
|
|
|
2024-10-13 13:25:36 +02:00
|
|
|
// cshBlockCache holds columnsHeader data for the given block.
|
|
|
|
//
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
// It is initialized lazily by calling getColumnsHeaderBlock().
|
|
|
|
cshBlockCache []byte
|
|
|
|
cshBlockInitialized bool
|
|
|
|
|
|
|
|
// ccsCache is the cache for accessed const columns
|
|
|
|
ccsCache []Field
|
|
|
|
|
|
|
|
// chsCache is the cache for accessed column headers
|
|
|
|
chsCache []columnHeader
|
|
|
|
|
|
|
|
// cshIndexCache is the columnsHeaderIndex associated with the given block.
|
|
|
|
//
|
|
|
|
// It is initialized lazily by calling getColumnsHeaderIndex().
|
|
|
|
cshIndexCache *columnsHeaderIndex
|
2024-10-13 13:25:36 +02:00
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
// cshCache is the columnsHeader associated with the given block.
|
2024-09-25 16:59:58 +02:00
|
|
|
//
|
2024-10-18 00:21:20 +02:00
|
|
|
// It is initialized lazily by calling getColumnsHeader().
|
2024-10-13 12:56:08 +02:00
|
|
|
cshCache *columnsHeader
|
2024-09-25 16:59:58 +02:00
|
|
|
|
2024-09-24 20:51:13 +02:00
|
|
|
// seenStreams contains seen streamIDs for the recent searches.
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
//
|
2024-09-24 20:51:13 +02:00
|
|
|
// It is used for speeding up fetching _stream column.
|
|
|
|
seenStreams map[u128]string
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
func (bs *blockSearch) reset() {
|
|
|
|
bs.bsw = nil
|
|
|
|
bs.br.reset()
|
|
|
|
|
|
|
|
if bs.timestampsCache != nil {
|
|
|
|
encoding.PutInt64s(bs.timestampsCache)
|
|
|
|
bs.timestampsCache = nil
|
|
|
|
}
|
|
|
|
|
|
|
|
bloomFilterCache := bs.bloomFilterCache
|
|
|
|
for k, bf := range bloomFilterCache {
|
|
|
|
putBloomFilter(bf)
|
|
|
|
delete(bloomFilterCache, k)
|
|
|
|
}
|
|
|
|
|
|
|
|
valuesCache := bs.valuesCache
|
|
|
|
for k, values := range valuesCache {
|
|
|
|
putStringBucket(values)
|
|
|
|
delete(valuesCache, k)
|
|
|
|
}
|
|
|
|
|
|
|
|
bs.sbu.reset()
|
2024-09-25 16:59:58 +02:00
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
bs.cshIndexBlockCache = bs.cshIndexBlockCache[:0]
|
|
|
|
|
2024-10-13 13:25:36 +02:00
|
|
|
bs.cshBlockCache = bs.cshBlockCache[:0]
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
bs.cshBlockInitialized = false
|
|
|
|
|
|
|
|
ccsCache := bs.ccsCache
|
|
|
|
for i := range ccsCache {
|
|
|
|
ccsCache[i].Reset()
|
|
|
|
}
|
|
|
|
bs.ccsCache = ccsCache[:0]
|
|
|
|
|
|
|
|
chsCache := bs.chsCache
|
|
|
|
for i := range chsCache {
|
|
|
|
chsCache[i].reset()
|
|
|
|
}
|
|
|
|
bs.chsCache = chsCache[:0]
|
|
|
|
|
|
|
|
if bs.cshIndexCache != nil {
|
|
|
|
putColumnsHeaderIndex(bs.cshIndexCache)
|
|
|
|
bs.cshIndexCache = nil
|
|
|
|
}
|
2024-10-13 13:25:36 +02:00
|
|
|
|
2024-10-13 12:56:08 +02:00
|
|
|
if bs.cshCache != nil {
|
|
|
|
putColumnsHeader(bs.cshCache)
|
|
|
|
bs.cshCache = nil
|
|
|
|
}
|
2024-09-25 16:59:58 +02:00
|
|
|
|
2024-09-24 20:51:13 +02:00
|
|
|
// Do not reset seenStreams, since its' lifetime is managed by blockResult.addStreamColumn() code.
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
func (bs *blockSearch) partPath() string {
|
|
|
|
return bs.bsw.p.path
|
|
|
|
}
|
|
|
|
|
2024-05-20 04:08:30 +02:00
|
|
|
func (bs *blockSearch) search(bsw *blockSearchWork, bm *bitmap) {
|
2023-06-20 07:55:12 +02:00
|
|
|
bs.reset()
|
|
|
|
|
|
|
|
bs.bsw = bsw
|
|
|
|
|
|
|
|
// search rows matching the given filter
|
2024-05-20 04:08:30 +02:00
|
|
|
bm.init(int(bsw.bh.rowsCount))
|
2023-06-20 07:55:12 +02:00
|
|
|
bm.setBits()
|
2024-05-20 04:08:30 +02:00
|
|
|
bs.bsw.so.filter.applyToBlockSearch(bs, bm)
|
2023-06-20 07:55:12 +02:00
|
|
|
|
|
|
|
if bm.isZero() {
|
2024-05-12 16:33:29 +02:00
|
|
|
// The filter doesn't match any logs in the current block.
|
2023-06-20 07:55:12 +02:00
|
|
|
return
|
|
|
|
}
|
|
|
|
|
2024-05-20 04:08:30 +02:00
|
|
|
bs.br.mustInit(bs, bm)
|
|
|
|
|
2023-06-20 07:55:12 +02:00
|
|
|
// fetch the requested columns to bs.br.
|
2024-05-12 16:33:29 +02:00
|
|
|
if bs.bsw.so.needAllColumns {
|
2024-09-25 16:16:53 +02:00
|
|
|
bs.br.initAllColumns()
|
2024-05-12 16:33:29 +02:00
|
|
|
} else {
|
2024-09-25 16:16:53 +02:00
|
|
|
bs.br.initRequestedColumns()
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
func (bs *blockSearch) partFormatVersion() uint {
|
|
|
|
return bs.bsw.p.ph.FormatVersion
|
|
|
|
}
|
|
|
|
|
2024-10-13 14:28:59 +02:00
|
|
|
func (bs *blockSearch) getConstColumnValue(name string) string {
|
|
|
|
if name == "_msg" {
|
|
|
|
name = ""
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
if bs.partFormatVersion() < 1 {
|
2024-10-18 00:21:20 +02:00
|
|
|
csh := bs.getColumnsHeader()
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
for _, cc := range csh.constColumns {
|
|
|
|
if cc.Name == name {
|
|
|
|
return cc.Value
|
|
|
|
}
|
2024-10-13 14:28:59 +02:00
|
|
|
}
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
return ""
|
|
|
|
}
|
|
|
|
|
|
|
|
columnNameID, ok := bs.getColumnNameID(name)
|
|
|
|
if !ok {
|
|
|
|
return ""
|
|
|
|
}
|
|
|
|
|
|
|
|
for i := range bs.ccsCache {
|
|
|
|
if bs.ccsCache[i].Name == name {
|
|
|
|
return bs.ccsCache[i].Value
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
cshIndex := bs.getColumnsHeaderIndex()
|
|
|
|
for _, cr := range cshIndex.constColumnsRefs {
|
|
|
|
if cr.columnNameID != columnNameID {
|
|
|
|
continue
|
|
|
|
}
|
|
|
|
|
|
|
|
b := bs.getColumnsHeaderBlock()
|
|
|
|
if cr.offset > uint64(len(b)) {
|
|
|
|
logger.Panicf("FATAL: %s: header offset for const column %q cannot exceed %d bytes; got %d bytes", bs.bsw.p.path, name, len(b), cr.offset)
|
|
|
|
}
|
|
|
|
b = b[cr.offset:]
|
|
|
|
bs.ccsCache = slicesutil.SetLength(bs.ccsCache, len(bs.ccsCache)+1)
|
|
|
|
cc := &bs.ccsCache[len(bs.ccsCache)-1]
|
|
|
|
if _, err := cc.unmarshalNoArena(b, false); err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal header for const column %q: %s", bs.bsw.p.path, name, err)
|
|
|
|
}
|
|
|
|
cc.Name = strings.Clone(name)
|
|
|
|
return cc.Value
|
2024-10-13 14:28:59 +02:00
|
|
|
}
|
|
|
|
return ""
|
|
|
|
}
|
|
|
|
|
|
|
|
func (bs *blockSearch) getColumnHeader(name string) *columnHeader {
|
|
|
|
if name == "_msg" {
|
|
|
|
name = ""
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
if bs.partFormatVersion() < 1 {
|
2024-10-18 00:21:20 +02:00
|
|
|
csh := bs.getColumnsHeader()
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
chs := csh.columnHeaders
|
|
|
|
for i := range chs {
|
|
|
|
ch := &chs[i]
|
|
|
|
if ch.name == name {
|
|
|
|
return ch
|
|
|
|
}
|
2024-10-13 14:28:59 +02:00
|
|
|
}
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
|
|
|
columnNameID, ok := bs.getColumnNameID(name)
|
|
|
|
if !ok {
|
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
|
|
|
for i := range bs.chsCache {
|
|
|
|
if bs.chsCache[i].name == name {
|
|
|
|
return &bs.chsCache[i]
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
cshIndex := bs.getColumnsHeaderIndex()
|
|
|
|
for _, cr := range cshIndex.columnHeadersRefs {
|
|
|
|
if cr.columnNameID != columnNameID {
|
|
|
|
continue
|
|
|
|
}
|
|
|
|
|
|
|
|
b := bs.getColumnsHeaderBlock()
|
|
|
|
if cr.offset > uint64(len(b)) {
|
|
|
|
logger.Panicf("FATAL: %s: header offset for column %q cannot exceed %d bytes; got %d bytes", bs.bsw.p.path, name, len(b), cr.offset)
|
|
|
|
}
|
|
|
|
b = b[cr.offset:]
|
|
|
|
bs.chsCache = slicesutil.SetLength(bs.chsCache, len(bs.chsCache)+1)
|
|
|
|
ch := &bs.chsCache[len(bs.chsCache)-1]
|
|
|
|
if _, err := ch.unmarshalNoArena(b, partFormatLatestVersion); err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal header for column %q: %s", bs.bsw.p.path, name, err)
|
|
|
|
}
|
|
|
|
ch.name = strings.Clone(name)
|
|
|
|
return ch
|
2024-10-13 14:28:59 +02:00
|
|
|
}
|
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
func (bs *blockSearch) getColumnNameID(name string) (uint64, bool) {
|
|
|
|
id, ok := bs.bsw.p.columnNameIDs[name]
|
|
|
|
return id, ok
|
|
|
|
}
|
|
|
|
|
2024-10-28 20:49:50 +01:00
|
|
|
func (bs *blockSearch) getColumnNameByID(columnNameID uint64) string {
|
|
|
|
columnNames := bs.bsw.p.columnNames
|
|
|
|
if columnNameID >= uint64(len(columnNames)) {
|
|
|
|
logger.Panicf("FATAL: %s: too big columnNameID=%d; it must be smaller than %d", bs.bsw.p.path, columnNameID, len(columnNames))
|
|
|
|
}
|
|
|
|
return columnNames[columnNameID]
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
func (bs *blockSearch) getColumnsHeaderIndex() *columnsHeaderIndex {
|
|
|
|
if bs.partFormatVersion() < 1 {
|
|
|
|
logger.Panicf("BUG: getColumnsHeaderIndex() can be called only for part encoding v1+, while it has been called for v%d", bs.partFormatVersion())
|
|
|
|
}
|
|
|
|
|
|
|
|
if bs.cshIndexCache == nil {
|
|
|
|
bs.cshIndexBlockCache = readColumnsHeaderIndexBlock(bs.cshIndexBlockCache[:0], bs.bsw.p, &bs.bsw.bh)
|
|
|
|
|
|
|
|
bs.cshIndexCache = getColumnsHeaderIndex()
|
|
|
|
if err := bs.cshIndexCache.unmarshalNoArena(bs.cshIndexBlockCache); err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal columns header index: %s", bs.bsw.p.path, err)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return bs.cshIndexCache
|
2024-10-13 14:28:59 +02:00
|
|
|
}
|
|
|
|
|
2024-10-18 00:21:20 +02:00
|
|
|
func (bs *blockSearch) getColumnsHeader() *columnsHeader {
|
2024-10-13 12:56:08 +02:00
|
|
|
if bs.cshCache == nil {
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
b := bs.getColumnsHeaderBlock()
|
2024-10-13 13:25:36 +02:00
|
|
|
|
2024-10-18 00:21:20 +02:00
|
|
|
csh := getColumnsHeader()
|
|
|
|
partFormatVersion := bs.partFormatVersion()
|
|
|
|
if err := csh.unmarshalNoArena(b, partFormatVersion); err != nil {
|
2024-10-13 13:25:36 +02:00
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal columns header: %s", bs.bsw.p.path, err)
|
|
|
|
}
|
2024-10-18 00:21:20 +02:00
|
|
|
if partFormatVersion >= 1 {
|
|
|
|
cshIndex := bs.getColumnsHeaderIndex()
|
|
|
|
if err := csh.setColumnNames(cshIndex, bs.bsw.p.columnNames); err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: %s", bs.bsw.p.path, err)
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
bs.cshCache = csh
|
2024-09-25 16:59:58 +02:00
|
|
|
}
|
2024-10-13 12:56:08 +02:00
|
|
|
return bs.cshCache
|
2024-09-25 16:59:58 +02:00
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
func (bs *blockSearch) getColumnsHeaderBlock() []byte {
|
|
|
|
if !bs.cshBlockInitialized {
|
|
|
|
bs.cshBlockCache = readColumnsHeaderBlock(bs.cshBlockCache[:0], bs.bsw.p, &bs.bsw.bh)
|
|
|
|
bs.cshBlockInitialized = true
|
|
|
|
}
|
|
|
|
return bs.cshBlockCache
|
|
|
|
}
|
|
|
|
|
|
|
|
func readColumnsHeaderIndexBlock(dst []byte, p *part, bh *blockHeader) []byte {
|
|
|
|
n := bh.columnsHeaderIndexSize
|
|
|
|
if n > maxColumnsHeaderIndexSize {
|
|
|
|
logger.Panicf("FATAL: %s: columns header index size cannot exceed %d bytes; got %d bytes", p.path, maxColumnsHeaderIndexSize, n)
|
|
|
|
}
|
|
|
|
|
|
|
|
dstLen := len(dst)
|
|
|
|
dst = bytesutil.ResizeNoCopyMayOverallocate(dst, int(n)+dstLen)
|
|
|
|
p.columnsHeaderIndexFile.MustReadAt(dst[dstLen:], int64(bh.columnsHeaderIndexOffset))
|
|
|
|
|
|
|
|
return dst
|
|
|
|
}
|
|
|
|
|
2024-10-13 13:25:36 +02:00
|
|
|
func readColumnsHeaderBlock(dst []byte, p *part, bh *blockHeader) []byte {
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
n := bh.columnsHeaderSize
|
|
|
|
if n > maxColumnsHeaderSize {
|
|
|
|
logger.Panicf("FATAL: %s: columns header size cannot exceed %d bytes; got %d bytes", p.path, maxColumnsHeaderSize, n)
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
2024-10-13 13:25:36 +02:00
|
|
|
dstLen := len(dst)
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
dst = bytesutil.ResizeNoCopyMayOverallocate(dst, int(n)+dstLen)
|
2024-10-13 13:25:36 +02:00
|
|
|
p.columnsHeaderFile.MustReadAt(dst[dstLen:], int64(bh.columnsHeaderOffset))
|
|
|
|
return dst
|
2023-06-20 07:55:12 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
// getBloomFilterForColumn returns bloom filter for the given ch.
|
|
|
|
//
|
|
|
|
// The returned bloom filter belongs to bs, so it becomes invalid after bs reset.
|
|
|
|
func (bs *blockSearch) getBloomFilterForColumn(ch *columnHeader) *bloomFilter {
|
|
|
|
bf := bs.bloomFilterCache[ch.name]
|
|
|
|
if bf != nil {
|
|
|
|
return bf
|
|
|
|
}
|
|
|
|
|
|
|
|
p := bs.bsw.p
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
bloomValuesFile := p.getBloomValuesFileForColumnName(ch.name)
|
2023-06-20 07:55:12 +02:00
|
|
|
|
|
|
|
bb := longTermBufPool.Get()
|
|
|
|
bloomFilterSize := ch.bloomFilterSize
|
|
|
|
if bloomFilterSize > maxBloomFilterBlockSize {
|
|
|
|
logger.Panicf("FATAL: %s: bloom filter block size cannot exceed %d bytes; got %d bytes", bs.partPath(), maxBloomFilterBlockSize, bloomFilterSize)
|
|
|
|
}
|
|
|
|
bb.B = bytesutil.ResizeNoCopyMayOverallocate(bb.B, int(bloomFilterSize))
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
|
|
|
|
bloomValuesFile.bloom.MustReadAt(bb.B, int64(ch.bloomFilterOffset))
|
2023-06-20 07:55:12 +02:00
|
|
|
bf = getBloomFilter()
|
|
|
|
if err := bf.unmarshal(bb.B); err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal bloom filter: %s", bs.partPath(), err)
|
|
|
|
}
|
|
|
|
longTermBufPool.Put(bb)
|
|
|
|
|
|
|
|
if bs.bloomFilterCache == nil {
|
|
|
|
bs.bloomFilterCache = make(map[string]*bloomFilter)
|
|
|
|
}
|
|
|
|
bs.bloomFilterCache[ch.name] = bf
|
|
|
|
return bf
|
|
|
|
}
|
|
|
|
|
|
|
|
// getValuesForColumn returns block values for the given ch.
|
|
|
|
//
|
|
|
|
// The returned values belong to bs, so they become invalid after bs reset.
|
|
|
|
func (bs *blockSearch) getValuesForColumn(ch *columnHeader) []string {
|
|
|
|
values := bs.valuesCache[ch.name]
|
|
|
|
if values != nil {
|
|
|
|
return values.a
|
|
|
|
}
|
|
|
|
|
|
|
|
p := bs.bsw.p
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
bloomValuesFile := p.getBloomValuesFileForColumnName(ch.name)
|
2023-06-20 07:55:12 +02:00
|
|
|
|
|
|
|
bb := longTermBufPool.Get()
|
|
|
|
valuesSize := ch.valuesSize
|
|
|
|
if valuesSize > maxValuesBlockSize {
|
|
|
|
logger.Panicf("FATAL: %s: values block size cannot exceed %d bytes; got %d bytes", bs.partPath(), maxValuesBlockSize, valuesSize)
|
|
|
|
}
|
|
|
|
bb.B = bytesutil.ResizeNoCopyMayOverallocate(bb.B, int(valuesSize))
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
bloomValuesFile.values.MustReadAt(bb.B, int64(ch.valuesOffset))
|
2023-06-20 07:55:12 +02:00
|
|
|
|
|
|
|
values = getStringBucket()
|
|
|
|
var err error
|
|
|
|
values.a, err = bs.sbu.unmarshal(values.a[:0], bb.B, bs.bsw.bh.rowsCount)
|
|
|
|
longTermBufPool.Put(bb)
|
|
|
|
if err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal column %q: %s", bs.partPath(), ch.name, err)
|
|
|
|
}
|
|
|
|
|
|
|
|
if bs.valuesCache == nil {
|
|
|
|
bs.valuesCache = make(map[string]*stringBucket)
|
|
|
|
}
|
|
|
|
bs.valuesCache[ch.name] = values
|
|
|
|
return values.a
|
|
|
|
}
|
|
|
|
|
|
|
|
// getTimestamps returns timestamps for the given bs.
|
|
|
|
//
|
|
|
|
// The returned timestamps belong to bs, so they become invalid after bs reset.
|
|
|
|
func (bs *blockSearch) getTimestamps() []int64 {
|
|
|
|
timestamps := bs.timestampsCache
|
|
|
|
if timestamps != nil {
|
|
|
|
return timestamps.A
|
|
|
|
}
|
|
|
|
|
|
|
|
p := bs.bsw.p
|
|
|
|
|
|
|
|
bb := longTermBufPool.Get()
|
|
|
|
th := &bs.bsw.bh.timestampsHeader
|
|
|
|
blockSize := th.blockSize
|
|
|
|
if blockSize > maxTimestampsBlockSize {
|
|
|
|
logger.Panicf("FATAL: %s: timestamps block size cannot exceed %d bytes; got %d bytes", bs.partPath(), maxTimestampsBlockSize, blockSize)
|
|
|
|
}
|
|
|
|
bb.B = bytesutil.ResizeNoCopyMayOverallocate(bb.B, int(blockSize))
|
|
|
|
p.timestampsFile.MustReadAt(bb.B, int64(th.blockOffset))
|
|
|
|
|
|
|
|
rowsCount := int(bs.bsw.bh.rowsCount)
|
|
|
|
timestamps = encoding.GetInt64s(rowsCount)
|
|
|
|
var err error
|
|
|
|
timestamps.A, err = encoding.UnmarshalTimestamps(timestamps.A[:0], bb.B, th.marshalType, th.minTimestamp, rowsCount)
|
|
|
|
longTermBufPool.Put(bb)
|
|
|
|
if err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal timestamps: %s", bs.partPath(), err)
|
|
|
|
}
|
|
|
|
bs.timestampsCache = timestamps
|
|
|
|
return timestamps.A
|
|
|
|
}
|
|
|
|
|
|
|
|
// mustReadBlockHeaders reads ih block headers from p, appends them to dst and returns the result.
|
|
|
|
func (ih *indexBlockHeader) mustReadBlockHeaders(dst []blockHeader, p *part) []blockHeader {
|
|
|
|
bbCompressed := longTermBufPool.Get()
|
|
|
|
indexBlockSize := ih.indexBlockSize
|
|
|
|
if indexBlockSize > maxIndexBlockSize {
|
|
|
|
logger.Panicf("FATAL: %s: index block size cannot exceed %d bytes; got %d bytes", p.indexFile.Path(), maxIndexBlockSize, indexBlockSize)
|
|
|
|
}
|
|
|
|
bbCompressed.B = bytesutil.ResizeNoCopyMayOverallocate(bbCompressed.B, int(indexBlockSize))
|
|
|
|
p.indexFile.MustReadAt(bbCompressed.B, int64(ih.indexBlockOffset))
|
|
|
|
|
|
|
|
bb := longTermBufPool.Get()
|
|
|
|
var err error
|
|
|
|
bb.B, err = encoding.DecompressZSTD(bb.B, bbCompressed.B)
|
|
|
|
longTermBufPool.Put(bbCompressed)
|
|
|
|
if err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot decompress indexBlock read at offset %d with size %d: %s", p.indexFile.Path(), ih.indexBlockOffset, ih.indexBlockSize, err)
|
|
|
|
}
|
|
|
|
|
lib/logstorage: refactor storage format to be more efficient for querying wide events
It has been appeared that VictoriaLogs is frequently used for collecting logs with tens of fields.
For example, standard Kuberntes setup on top of Filebeat generates more than 20 fields per each log.
Such logs are also known as "wide events".
The previous storage format was optimized for logs with a few fields. When at least a single field
was referenced in the query, then the all the meta-information about all the log fields was unpacked
and parsed per each scanned block during the query. This could require a lot of additional disk IO
and CPU time when logs contain many fields. Resolve this issue by providing an (field -> metainfo_offset)
index per each field in every data block. This index allows reading and extracting only the needed
metainfo for fields used in the query. This index is stored in columnsHeaderIndexFilename ( columns_header_index.bin ).
This allows increasing performance for queries over wide events by 10x and more.
Another issue was that the data for bloom filters and field values across all the log fields except of _msg
was intermixed in two files - fieldBloomFilename ( field_bloom.bin ) and fieldValuesFilename ( field_values.bin ).
This could result in huge disk read IO overhead when some small field was referred in the query,
since the Operating System usually reads more data than requested. It reads the data from disk
in at least 4KiB blocks (usually the block size is much bigger in the range 64KiB - 512KiB).
So, if 512-byte bloom filter or values' block is read from the file, then the Operating System
reads up to 512KiB of data from disk, which results in 1000x disk read IO overhead. This overhead isn't visible
for recently accessed data, since this data is usually stored in RAM (aka Operating System page cache),
but this overhead may become very annoying when performing the query over large volumes of data
which isn't present in OS page cache.
The solution for this issue is to split bloom filters and field values across multiple shards.
This reduces the worst-case disk read IO overhead by at least Nx where N is the number of shards,
while the disk read IO overhead is completely removed in best case when the number of columns doesn't exceed N.
Currently the number of shards is 8 - see bloomValuesShardsCount . This solution increases
performance for queries over large volumes of newly ingested data by up to 1000x.
The new storage format is versioned as v1, while the old storage format is version as v0.
It is stored in the partHeader.FormatVersion.
Parts with the old storage format are converted into parts with the new storage format during background merge.
It is possible to force merge by querying /internal/force_merge HTTP endpoint - see https://docs.victoriametrics.com/victorialogs/#forced-merge .
2024-10-16 16:18:28 +02:00
|
|
|
dst, err = unmarshalBlockHeaders(dst, bb.B, p.ph.FormatVersion)
|
2023-06-20 07:55:12 +02:00
|
|
|
longTermBufPool.Put(bb)
|
|
|
|
if err != nil {
|
|
|
|
logger.Panicf("FATAL: %s: cannot unmarshal block headers read at offset %d with size %d: %s", p.indexFile.Path(), ih.indexBlockOffset, ih.indexBlockSize, err)
|
|
|
|
}
|
|
|
|
|
|
|
|
return dst
|
|
|
|
}
|
2024-09-24 20:51:13 +02:00
|
|
|
|
|
|
|
// getStreamStr returns _stream value for the given block at bs.
|
|
|
|
func (bs *blockSearch) getStreamStr() string {
|
|
|
|
sid := bs.bsw.bh.streamID.id
|
|
|
|
streamStr := bs.seenStreams[sid]
|
|
|
|
if streamStr != "" {
|
|
|
|
// Fast path - streamStr is found in the seenStreams.
|
|
|
|
return streamStr
|
|
|
|
}
|
|
|
|
|
|
|
|
// Slow path - load streamStr from the storage.
|
|
|
|
streamStr = bs.getStreamStrSlow()
|
|
|
|
if streamStr != "" {
|
|
|
|
// Store the found streamStr in seenStreams.
|
|
|
|
if len(bs.seenStreams) > 20_000 {
|
|
|
|
bs.seenStreams = nil
|
|
|
|
}
|
|
|
|
if bs.seenStreams == nil {
|
|
|
|
bs.seenStreams = make(map[u128]string)
|
|
|
|
}
|
|
|
|
bs.seenStreams[sid] = streamStr
|
|
|
|
}
|
|
|
|
return streamStr
|
|
|
|
}
|
|
|
|
|
|
|
|
func (bs *blockSearch) getStreamStrSlow() string {
|
|
|
|
bb := bbPool.Get()
|
|
|
|
defer bbPool.Put(bb)
|
|
|
|
|
|
|
|
bb.B = bs.bsw.p.pt.idb.appendStreamTagsByStreamID(bb.B[:0], &bs.bsw.bh.streamID)
|
|
|
|
if len(bb.B) == 0 {
|
|
|
|
// Couldn't find stream tags by sid. This may be the case when the corresponding log stream
|
|
|
|
// was recently registered and its tags aren't visible to search yet.
|
|
|
|
// The stream tags must become visible in a few seconds.
|
|
|
|
// See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6042
|
|
|
|
return ""
|
|
|
|
}
|
|
|
|
|
|
|
|
st := GetStreamTags()
|
|
|
|
mustUnmarshalStreamTags(st, bb.B)
|
|
|
|
bb.B = st.marshalString(bb.B[:0])
|
|
|
|
PutStreamTags(st)
|
|
|
|
|
|
|
|
return string(bb.B)
|
|
|
|
}
|