diff --git a/docs/keyConcepts.md b/docs/keyConcepts.md index d54a70705..61746e597 100644 --- a/docs/keyConcepts.md +++ b/docs/keyConcepts.md @@ -13,6 +13,7 @@ compare it to other processes, perform some calculations with it, or even define user-defined thresholds. The most common use-cases for metrics are: + - check how the system behaves at the particular time period; - correlate behavior changes to other measurements; - observe or forecast trends; @@ -22,79 +23,87 @@ Collecting and analyzing metrics provides advantages that are difficult to overe ### Structure of a metric -Let's start with an example. To track how many requests our application serves, -we'll define a metric with the name `requests_total`. +Let's start with an example. To track how many requests our application serves, we'll define a metric with the +name `requests_total`. You can be more specific here by saying `requests_success_total` (for only successful requests) -or `request_errors_total` (for requests which failed). Choosing a metric name is very important and supposed -to clarify what is actually measured to every person who reads it, just like variable names in programming. +or `request_errors_total` (for requests which failed). Choosing a metric name is very important and supposed to clarify +what is actually measured to every person who reads it, just like variable names in programming. Every metric can contain additional meta information in the form of label-value pairs: + ``` requests_total{path="/", code="200"} requests_total{path="/", code="403"} ``` The meta-information (set of `labels` in curly braces) gives us a context for which `path` and with what `code` -the `request` was served. Label-value pairs are always of a `string` type. VictoriaMetrics data model -is schemaless, which means there is no need to define metric names or their labels in advance. -User is free to add or change ingested metrics anytime. +the `request` was served. Label-value pairs are always of a `string` type. VictoriaMetrics data model is schemaless, +which means there is no need to define metric names or their labels in advance. User is free to add or change ingested +metrics anytime. Actually, the metric's name is also a label with a special name `__name__`. So the following two series are identical: + ``` requests_total{path="/", code="200"} {__name__="requests_total", path="/", code="200"} ``` -A combination of a metric name and its labels defines a `time series`. -For example, `requests_total{path="/", code="200"}` and `requests_total{path="/", code="403"}` +#### Time series + +A combination of a metric name and its labels defines a `time series`. For +example, `requests_total{path="/", code="200"}` and `requests_total{path="/", code="403"}` are two different time series. -The number of all unique label combinations for one metric defines its `cardinality`. -For example, if `requests_total` has 3 unique `path` values and 5 unique `code` values, -then its cardinality will be `3*5=15` of unique time series. If you add one more -unique `path` value, cardinality will bump to `20`. See more in +Number of time series has an impact on database resource usage. See +also [What is an active time series?](https://docs.victoriametrics.com/FAQ.html#what-is-an-active-time-series) +and [What is high churn rate?](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate). + +#### Cardinality + +The number of all unique label combinations for one metric defines its `cardinality`. For example, if `requests_total` +has 3 unique `path` values and 5 unique `code` values, then its cardinality will be `3*5=15` of unique time series. If +you add one more unique `path` value, cardinality will bump to `20`. See more in [What is cardinality](https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality). -Every time series consists of `datapoints` (also called `samples`). -A `datapoint` is value-timestamp pair associated with the specific series: +#### Data points + +Every time series consists of `data points` (also called `samples`). A `data point` is value-timestamp pair associated +with the specific series: + ``` requests_total{path="/", code="200"} ``` -In VictoriaMetrics data model, datapoint's value is of type `float64`. -And timestamp is unix time with milliseconds precision. Each series can contain an infinite number of datapoints. - +In VictoriaMetrics data model, data point's value is always of type `float64`. And timestamp is unix time with +milliseconds precision. Each series can contain an infinite number of data points. ### Types of metrics -Internally, VictoriaMetrics does not have a notion of a metric type. All metrics are the same. -The concept of a metric type exists specifically to help users to understand how the metric was measured. -There are 4 common metric types. +Internally, VictoriaMetrics does not have a notion of a metric type. All metrics are the same. The concept of a metric +type exists specifically to help users to understand how the metric was measured. There are 4 common metric types. #### Counter Counter metric type is a [monotonically increasing counter](https://en.wikipedia.org/wiki/Monotonic_function) -used for capturing a number of events. -It represents a cumulative metric whose value never goes down and always shows the current number of captured -events. In other words, `counter` always shows the number of observed events since the application has started. -In programming, `counter` is a variable that you **increment** each time something happens. +used for capturing a number of events. It represents a cumulative metric whose value never goes down and always shows +the current number of captured events. In other words, `counter` always shows the number of observed events since the +application has started. In programming, `counter` is a variable that you **increment** each time something happens. {% include img.html href="keyConcepts_counter.png" %} - -`vm_http_requests_total` is a typical example of a counter - a metric which only grows. -The interpretation of a graph above is that time series +`vm_http_requests_total` is a typical example of a counter - a metric which only grows. The interpretation of a graph +above is that time series `vm_http_requests_total{instance="localhost:8428", job="victoriametrics", path="api/v1/query_range"}` was rapidly changing from 1:38 pm to 1:39 pm, then there were no changes until 1:41 pm. -Counter is used for measuring a number of events, like a number of requests, errors, logs, messages, etc. -The most common [MetricsQL](#metricsql) functions used with counters are: -* [rate](https://docs.victoriametrics.com/MetricsQL.html#rate) - calculates the speed of metric's change. - For example, `rate(requests_total)` will show how many requests are served per second; -* [increase](https://docs.victoriametrics.com/MetricsQL.html#increase) - calculates the growth of a metric - on the given time period. For example, `increase(requests_total[1h])` will show how many requests were - served over `1h` interval. +Counter is used for measuring a number of events, like a number of requests, errors, logs, messages, etc. The most +common [MetricsQL](#metricsql) functions used with counters are: + +* [rate](https://docs.victoriametrics.com/MetricsQL.html#rate) - calculates the speed of metric's change. For + example, `rate(requests_total)` will show how many requests are served per second; +* [increase](https://docs.victoriametrics.com/MetricsQL.html#increase) - calculates the growth of a metric on the given + time period. For example, `increase(requests_total[1h])` will show how many requests were served over `1h` interval. #### Gauge @@ -102,22 +111,28 @@ Gauge is used for measuring a value that can go up and down: {% include img.html href="keyConcepts_gauge.png" %} - -The metric `process_resident_memory_anon_bytes` on the graph shows the number of bytes of memory -used by the application during the runtime. It is changing frequently, going up and down showing how -the process allocates and frees the memory. +The metric `process_resident_memory_anon_bytes` on the graph shows the number of bytes of memory used by the application +during the runtime. It is changing frequently, going up and down showing how the process allocates and frees the memory. In programming, `gauge` is a variable to which you **set** a specific value as it changes. -Gauge is used for measuring temperature, memory usage, disk usage, etc. The most common [MetricsQL](#metricsql) +Gauge is used in the following scenarios: + +* measuring temperature, memory usage, disk usage etc; +* storing the state of some process. For example, gauge `config_reloaded_successful` can be set to `1` if everything is + good, and to `0` if configuration failed to reload; +* storing the timestamp when event happened. For example, `config_last_reload_success_timestamp_seconds` + can store the timestamp of the last successful configuration relaod. + +The most common [MetricsQL](#metricsql) functions used with gauges are [aggregation and grouping functions](#aggregation-and-grouping-functions). #### Histogram Histogram is a set of [counter](#counter) metrics with different labels for tracking the dispersion -and [quantiles](https://prometheus.io/docs/practices/histograms/#quantiles) of the observed value. -For example, in VictoriaMetrics we track how many rows is processed per query -using the histogram with the name `vm_per_query_rows_processed_count`. -The exposition format for this histogram has the following form: +and [quantiles](https://prometheus.io/docs/practices/histograms/#quantiles) of the observed value. For example, in +VictoriaMetrics we track how many rows is processed per query using the histogram with the +name `vm_per_query_rows_processed_count`. The exposition format for this histogram has the following form: + ``` vm_per_query_rows_processed_count_bucket{vmrange="4.084e+02...4.642e+02"} 2 vm_per_query_rows_processed_count_bucket{vmrange="5.275e+02...5.995e+02"} 1 @@ -129,51 +144,56 @@ vm_per_query_rows_processed_count_count 11 ``` In practice, histogram `vm_per_query_rows_processed_count` may be used in the following way: + ```Go // define the histogram perQueryRowsProcessed := metrics.NewHistogram(`vm_per_query_rows_processed_count`) // use the histogram during processing for _, query := range queries { - perQueryRowsProcessed.Update(len(query.Rows)) +perQueryRowsProcessed.Update(len(query.Rows)) } ``` Now let's see what happens each time when `perQueryRowsProcessed.Update` is called: -* counter `vm_per_query_rows_processed_count_sum` increments by value of `len(query.Rows)` expression - and accounts for total sum of all observed values; -* counter `vm_per_query_rows_processed_count_count` increments by 1 and accounts for total number - of observations; -* counter `vm_per_query_rows_processed_count_bucket` gets incremented only if observed value is within - the range (`bucket`) defined in `vmrange`. -Such a combination of `counter` metrics allows plotting [Heatmaps in Grafana](https://grafana.com/docs/grafana/latest/visualizations/heatmap/) +* counter `vm_per_query_rows_processed_count_sum` increments by value of `len(query.Rows)` expression and accounts for + total sum of all observed values; +* counter `vm_per_query_rows_processed_count_count` increments by 1 and accounts for total number of observations; +* counter `vm_per_query_rows_processed_count_bucket` gets incremented only if observed value is within the + range (`bucket`) defined in `vmrange`. + +Such a combination of `counter` metrics allows +plotting [Heatmaps in Grafana](https://grafana.com/docs/grafana/latest/visualizations/heatmap/) and calculating [quantiles](https://prometheus.io/docs/practices/histograms/#quantiles): {% include img.html href="keyConcepts_histogram.png" %} -Histograms are usually used for measuring latency, sizes of elements (batch size, for example) etc. -There are two implementations of a histogram supported by VictoriaMetrics: +Histograms are usually used for measuring latency, sizes of elements (batch size, for example) etc. There are two +implementations of a histogram supported by VictoriaMetrics: + 1. [Prometheus histogram](https://prometheus.io/docs/practices/histograms/). The canonical histogram implementation - supported by most of the [client libraries for metrics instrumentation](https://prometheus.io/docs/instrumenting/clientlibs/). - Prometheus histogram requires a user to define ranges (`buckets`) statically. + supported by most of + the [client libraries for metrics instrumentation](https://prometheus.io/docs/instrumenting/clientlibs/). Prometheus + histogram requires a user to define ranges (`buckets`) statically. 2. [VictoriaMetrics histogram](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350) - supported by [VictoriaMetrics/metrics](https://github.com/VictoriaMetrics/metrics) instrumentation library. Victoriametrics - histogram automatically adjusts buckets, so users don't need to think about them. + supported by [VictoriaMetrics/metrics](https://github.com/VictoriaMetrics/metrics) instrumentation library. + Victoriametrics histogram automatically adjusts buckets, so users don't need to think about them. Histograms aren't trivial to learn and use. We recommend reading the following articles before you start: + 1. [Prometheus histogram](https://prometheus.io/docs/concepts/metric_types/#histogram) 2. [Histograms and summaries](https://prometheus.io/docs/practices/histograms/) 3. [How does a Prometheus Histogram work?](https://www.robustperception.io/how-does-a-prometheus-histogram-work) 4. [Improving histogram usability for Prometheus and Grafana](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350) - #### Summary Summary is quite similar to [histogram](#histogram) and is used for -[quantiles](https://prometheus.io/docs/practices/histograms/#quantiles) calculations. -The main difference to histograms is that calculations are made on the client-side, so -metrics exposition format already contains pre-calculated quantiles: +[quantiles](https://prometheus.io/docs/practices/histograms/#quantiles) calculations. The main difference to histograms +is that calculations are made on the client-side, so metrics exposition format already contains pre-calculated +quantiles: + ``` go_gc_duration_seconds{quantile="0"} 0 go_gc_duration_seconds{quantile="0.25"} 0 @@ -189,36 +209,56 @@ The visualisation of summaries is pretty straightforward: {% include img.html href="keyConcepts_summary.png" %} Such an approach makes summaries easier to use but also puts significant limitations - summaries can't be aggregated. -The [histogram](#histogram) exposes the raw values via counters. It means a user can aggregate these counters -for different metrics (for example, for metrics with different `instance` label) and **then calculate quantiles**. -For summary, quantiles are already calculated, so they [can't be aggregated](https://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html) +The [histogram](#histogram) exposes the raw values via counters. It means a user can aggregate these counters for +different metrics (for example, for metrics with different `instance` label) and **then calculate quantiles**. For +summary, quantiles are already calculated, so +they [can't be aggregated](https://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html) with other metrics. -Summaries are usually used for measuring latency, sizes of elements (batch size, for example) etc. -But taking into account the limitation mentioned above. +Summaries are usually used for measuring latency, sizes of elements (batch size, for example) etc. But taking into +account the limitation mentioned above. +### Instrumenting application with metrics -#### Instrumenting application with metrics - -As was said at the beginning of the section [Types of metrics](#types-of-metrics), metric type defines -how it was measured. VictoriaMetrics TSDB doesn't know about metric types, all it sees are labels, -values, and timestamps. And what are these metrics, what do they measure, and how - all this depends -on the application which emits them. +As was said at the beginning of the section [Types of metrics](#types-of-metrics), metric type defines how it was +measured. VictoriaMetrics TSDB doesn't know about metric types, all it sees are labels, values, and timestamps. And what +are these metrics, what do they measure, and how - all this depends on the application which emits them. To instrument your application with metrics compatible with VictoriaMetrics TSDB we recommend -using [VictoriaMetrics/metrics](https://github.com/VictoriaMetrics/metrics) instrumentation library. -See more about how to use it on example of +using [VictoriaMetrics/metrics](https://github.com/VictoriaMetrics/metrics) instrumentation library. See more about how +to use it on example of [How to monitor Go applications with VictoriaMetrics](https://victoriametrics.medium.com/how-to-monitor-go-applications-with-victoriametrics-c04703110870) article. VictoriaMetrics is also compatible with Prometheus [client libraries for metrics instrumentation](https://prometheus.io/docs/instrumenting/clientlibs/). +#### Naming + +We recommend following [naming convention introduced by Prometheus](https://prometheus.io/docs/practices/naming/). There +are no strict (except allowed chars) restrictions and any metric name would be accepted by VictoriaMetrics. But +convention will help to keep names meaningful, descriptive and clear to other people. Following convention is a good +practice. + +#### Labels + +Every metric can contain an arbitrary number of label names. The good practice is to keep this number limited. +Otherwise, it would be difficult to use or plot on the graphs. By default, VictoriaMetrics limits the number of labels +per series to `30` and drops all excessive labels. This limit can be changed via `-maxLabelsPerTimeseries` flag. + +Every label value can contain arbitrary string value. The good practice is to use short and meaningful label values to +describe the attribute of the metric, not to tell the story about it. For example, label-value pair +`environment=prod` is ok, but `log_message=long log message with a lot of details...` is not ok. By default, +VcitoriaMetrics limits label's value size with 16kB. This limit can be changed via `-maxLabelValueLen` flag. + +It is very important to control the max number of unique label values since it defines the number +of [time series](#time-series). Try to avoid using volatile values such as session ID or query ID in label values to +avoid excessive resource usage and database slowdown. ## Write data -There are two main models in monitoring for data collection: [push](#push-model) and [pull](#pull-model). -Both are used in modern monitoring and both are supported by VictoriaMetrics. +There are two main models in monitoring for data collection: [push](#push-model) and [pull](#pull-model). Both are used +in modern monitoring and both are supported by VictoriaMetrics. ### Push model @@ -226,29 +266,41 @@ Push model is a traditional model of the client sending data to the server: {% include img.html href="keyConcepts_push_model.png" %} -The client (application) decides when and where to send/ingest its metrics. -VictoriaMetrics supports following protocols for ingesting: +The client (application) decides when and where to send/ingest its metrics. VictoriaMetrics supports following protocols +for ingesting: + * [Prometheus remote write API](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#prometheus-setup). -* [Prometheus exposition format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-prometheus-exposition-format). -* [InfluxDB line protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf) over HTTP, TCP and UDP. -* [Graphite plaintext protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-graphite-compatible-agents-such-as-statsd) with [tags](https://graphite.readthedocs.io/en/latest/tags.html#carbon). -* [OpenTSDB put message](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#sending-data-via-telnet-put-protocol). -* [HTTP OpenTSDB /api/put requests](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#sending-opentsdb-data-via-http-apiput-requests). -* [JSON line format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-json-line-format). +* [Prometheus exposition format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-prometheus-exposition-format) + . +* [InfluxDB line protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf) + over HTTP, TCP and UDP. +* [Graphite plaintext protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-graphite-compatible-agents-such-as-statsd) + with [tags](https://graphite.readthedocs.io/en/latest/tags.html#carbon). +* [OpenTSDB put message](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#sending-data-via-telnet-put-protocol) + . +* [HTTP OpenTSDB /api/put requests](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#sending-opentsdb-data-via-http-apiput-requests) + . +* [JSON line format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-json-line-format) + . * [Arbitrary CSV data](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-csv-data). -* [Native binary format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-native-format). +* [Native binary format](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-native-format) + . All the protocols are fully compatible with VictoriaMetrics [data model](#data-model) and can be used in production. -There are no officially supported clients by VictoriaMetrics team for data ingestion. -We recommend choosing from already existing clients compatible with the listed above protocols -(like [Telegraf](https://github.com/influxdata/telegraf) for [InfluxDB line protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf)). +There are no officially supported clients by VictoriaMetrics team for data ingestion. We recommend choosing from already +existing clients compatible with the listed above protocols +(like [Telegraf](https://github.com/influxdata/telegraf) +for [InfluxDB line protocol](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf)) +. Creating custom clients or instrumenting the application for metrics writing is as easy as sending a POST request: + ```bash curl -d '{"metric":{"__name__":"foo","job":"node_exporter"},"values":[0,1,2],"timestamps":[1549891472010,1549891487724,1549891503438]}' -X POST 'http://localhost:8428/api/v1/import' ``` -It is allowed to push/write metrics to [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html), +It is allowed to push/write metrics +to [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html), [cluster component vminsert](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#architecture-overview) and [vmagent](https://docs.victoriametrics.com/vmagent.html). @@ -264,40 +316,37 @@ elaborating more on why Percona switched from pull to push model. The cons of push protocol: -* it requires applications to be more complex, - since they need to be responsible for metrics delivery; +* it requires applications to be more complex, since they need to be responsible for metrics delivery; * applications need to be aware of monitoring systems; -* using a monitoring system it is hard to tell whether the application - went down or just stopped sending metrics for a different reason; -* applications can overload the monitoring system by pushing - too many metrics. +* using a monitoring system it is hard to tell whether the application went down or just stopped sending metrics for a + different reason; +* applications can overload the monitoring system by pushing too many metrics. ### Pull model -Pull model is an approach popularized by [Prometheus](https://prometheus.io/), -where the monitoring system decides when and where to pull metrics from: +Pull model is an approach popularized by [Prometheus](https://prometheus.io/), where the monitoring system decides when +and where to pull metrics from: {% include img.html href="keyConcepts_pull_model.png" %} -In pull model, the monitoring system needs to be aware of all the applications it needs -to monitor. The metrics are scraped (pulled) with fixed intervals via HTTP protocol. +In pull model, the monitoring system needs to be aware of all the applications it needs to monitor. The metrics are +scraped (pulled) with fixed intervals via HTTP protocol. -For metrics scraping VictoriaMetrics supports [Prometheus exposition format](https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter) -and needs to be configured with `-promscrape.config` flag pointing to the file with scrape configuration. -This configuration may include list of static `targets` (applications or services) +For metrics scraping VictoriaMetrics +supports [Prometheus exposition format](https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter) +and needs to be configured with `-promscrape.config` flag pointing to the file with scrape configuration. This +configuration may include list of static `targets` (applications or services) or `targets` discovered via various service discoveries. -Metrics scraping is supported by [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) +Metrics scraping is supported +by [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) and [vmagent](https://docs.victoriametrics.com/vmagent.html). The pros of the pull model: -* monitoring system decides how and when to scrape data, - so it can't be overloaded; -* applications aren't aware of the monitoring system and don't need - to implement the logic for delivering metrics; -* the list of all monitored targets belongs to the monitoring system - and can be quickly checked; +* monitoring system decides how and when to scrape data, so it can't be overloaded; +* applications aren't aware of the monitoring system and don't need to implement the logic for delivering metrics; +* the list of all monitored targets belongs to the monitoring system and can be quickly checked; * easy to detect faulty or crashed services when they don't respond. The cons of the pull model: @@ -308,47 +357,51 @@ The cons of the pull model: ### Common approaches for data collection VictoriaMetrics supports both [Push](#push-model) and [Pull](#pull-model) -models for data collection. Many installations are using -exclusively one or second model, or both at once. +models for data collection. Many installations are using exclusively one or second model, or both at once. The most common approach for data collection is using both models: {% include img.html href="keyConcepts_data_collection.png" %} -In this approach the additional component is used - [vmagent](https://docs.victoriametrics.com/vmagent.html). -Vmagent is a lightweight agent whose main purpose is to collect and deliver metrics. -It supports all the same mentioned protocols and approaches mentioned for both data collection models. +In this approach the additional component is used - [vmagent](https://docs.victoriametrics.com/vmagent.html). Vmagent is +a lightweight agent whose main purpose is to collect and deliver metrics. It supports all the same mentioned protocols +and approaches mentioned for both data collection models. -The basic setup for using VictoriaMetrics and vmagent for monitoring is described -in example of [docker-compose manifest](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker). -In this example, vmagent [scrapes a list of targets](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/prometheus.yml) -and [forwards collected data to VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/9d7da130b5a873be334b38c8d8dec702c9e8fac5/deployment/docker/docker-compose.yml#L15). -VictoriaMetrics is then used as a [datasource for Grafana](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/provisioning/datasources/datasource.yml) +The basic setup for using VictoriaMetrics and vmagent for monitoring is described in example +of [docker-compose manifest](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker). In this +example, +vmagent [scrapes a list of targets](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/prometheus.yml) +and [forwards collected data to VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/9d7da130b5a873be334b38c8d8dec702c9e8fac5/deployment/docker/docker-compose.yml#L15) +. VictoriaMetrics is then used as +a [datasource for Grafana](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/provisioning/datasources/datasource.yml) installation for querying collected data. -VictoriaMetrics components allow building more advanced topologies. -For example, vmagents pushing metrics from separate datacenters to the central VictoriaMetrics: +VictoriaMetrics components allow building more advanced topologies. For example, vmagents pushing metrics from separate +datacenters to the central VictoriaMetrics: {% include img.html href="keyConcepts_two_dcs.png" %} -VictoriaMetrics in example may be [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) -or [VictoriaMetrics Cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html). -Vmagent also allows to fan-out the same data to multiple destinations. +VictoriaMetrics in example may +be [Single-server-VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) +or [VictoriaMetrics Cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html). Vmagent also allows to +fan-out the same data to multiple destinations. ## Query data -VictoriaMetrics provides an [HTTP API](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#prometheus-querying-api-usage) +VictoriaMetrics provides +an [HTTP API](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#prometheus-querying-api-usage) for serving read queries. The API is used in various integrations such as -[Grafana](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#grafana-setup). -The same API is also used by -[VMUI](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) - graphical User Interface -for querying and visualizing metrics. +[Grafana](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#grafana-setup). The same API is also used +by +[VMUI](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui) - graphical User Interface for querying +and visualizing metrics. The API consists of two main handlers: [instant](#instant-query) and [range queries](#range-query). ### Instant query Instant query executes the query expression at the given moment of time: + ``` GET | POST /api/v1/query @@ -359,6 +412,7 @@ step - max lookback window if no datapoints found at the given time. If omitted, ``` To understand how instant queries work, let's begin with a data sample: + ``` foo_bar 1.00 1652169600000 # 2022-05-10 10:00:00 foo_bar 2.00 1652169660000 # 2022-05-10 10:01:00 @@ -375,8 +429,8 @@ foo_bar 1.00 1652170500000 # 2022-05-10 10:15:00 foo_bar 4.00 1652170560000 # 2022-05-10 10:16:00 ``` -The data sample contains a list of samples for one time series with time intervals between -samples from 1m to 3m. If we plot this data sample on the system of coordinates, it will have the following form: +The data sample contains a list of samples for one time series with time intervals between samples from 1m to 3m. If we +plot this data sample on the system of coordinates, it will have the following form:

@@ -384,13 +438,31 @@ samples from 1m to 3m. If we plot this data sample on the system of coordinates,

-To get the value of `foo_bar` metric at some specific moment of time, for example `2022-05-10 10:03:00`, -in VictoriaMetrics we need to issue an **instant query**: +To get the value of `foo_bar` metric at some specific moment of time, for example `2022-05-10 10:03:00`, in +VictoriaMetrics we need to issue an **instant query**: + ```bash curl "http:///api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z" ``` + ```json -{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"foo_bar"},"value":[1652169780,"3"]}]}} +{ + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": { + "__name__": "foo_bar" + }, + "value": [ + 1652169780, + "3" + ] + } + ] + } +} ``` In response, VictoriaMetrics returns a single sample-timestamp pair with a value of `3` for the series @@ -408,16 +480,17 @@ requested timestamp, VictoriaMetrics will try to locate the closest sample on th The time range at which VictoriaMetrics will try to locate a missing data sample is equal to `5m` by default and can be overridden via `step` parameter. -Instant query can return multiple time series, but always only one data sample per series. -Instant queries are used in the following scenarios: +Instant query can return multiple time series, but always only one data sample per series. Instant queries are used in +the following scenarios: + * Getting the last recorded value; * For alerts and recording rules evaluation; * Plotting Stat or Table panels in Grafana. - ### Range query Range query executes the query expression at the given time range with the given step: + ``` GET | POST /api/v1/query_range @@ -428,20 +501,104 @@ end - end (rfc3339 | unix_timestamp) of the time range. If omitted, current time step - step in seconds for evaluating query expression on the time range. If omitted, is set to 5m ``` -To get the values of `foo_bar` on time range from `2022-05-10 09:59:00` to `2022-05-10 10:17:00`, -in VictoriaMetrics we need to issue a range query: +To get the values of `foo_bar` on time range from `2022-05-10 09:59:00` to `2022-05-10 10:17:00`, in VictoriaMetrics we +need to issue a range query: + ```bash curl "http:///api/v1/query_range?query=foo_bar&step=1m&start=2022-05-10T09:59:00.000Z&end=2022-05-10T10:17:00.000Z" ``` + ```json -{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"foo_bar"},"values":[[1652169600,"1"],[1652169660,"2"],[1652169720,"3"],[1652169780,"3"],[1652169840,"7"],[1652169900,"7"],[1652169960,"7.5"],[1652170020,"7.5"],[1652170080,"6"],[1652170140,"6"],[1652170260,"5.5"],[1652170320,"5.25"],[1652170380,"5"],[1652170440,"3"],[1652170500,"1"],[1652170560,"4"],[1652170620,"4"]]}]}} +{ + "status": "success", + "data": { + "resultType": "matrix", + "result": [ + { + "metric": { + "__name__": "foo_bar" + }, + "values": [ + [ + 1652169600, + "1" + ], + [ + 1652169660, + "2" + ], + [ + 1652169720, + "3" + ], + [ + 1652169780, + "3" + ], + [ + 1652169840, + "7" + ], + [ + 1652169900, + "7" + ], + [ + 1652169960, + "7.5" + ], + [ + 1652170020, + "7.5" + ], + [ + 1652170080, + "6" + ], + [ + 1652170140, + "6" + ], + [ + 1652170260, + "5.5" + ], + [ + 1652170320, + "5.25" + ], + [ + 1652170380, + "5" + ], + [ + 1652170440, + "3" + ], + [ + 1652170500, + "1" + ], + [ + 1652170560, + "4" + ], + [ + 1652170620, + "4" + ] + ] + } + ] + } +} ``` In response, VictoriaMetrics returns `17` sample-timestamp pairs for the series `foo_bar` at the given time range -from `2022-05-10 09:59:00` to `2022-05-10 10:17:00`. But, if we take a look at the original data sample again, -we'll see that it contains only 13 data points. What happens here is that the range query is actually -an [instant query](#instant-query) executed `(start-end)/step` times on the time range from `start` to `end`. -If we plot this request in VictoriaMetrics the graph will be shown as the following: +from `2022-05-10 09:59:00` to `2022-05-10 10:17:00`. But, if we take a look at the original data sample again, we'll +see that it contains only 13 data points. What happens here is that the range query is actually +an [instant query](#instant-query) executed `(start-end)/step` times on the time range from `start` to `end`. If we plot +this request in VictoriaMetrics the graph will be shown as the following:

@@ -450,87 +607,100 @@ If we plot this request in VictoriaMetrics the graph will be shown as the follow

-The blue dotted lines on the pic are the moments when instant query was executed. -Since instant query retains the ability to locate the missing point, the graph contains two types of -points: `real` and `ephemeral` data points. `ephemeral` data point always repeats the left closest +The blue dotted lines on the pic are the moments when instant query was executed. Since instant query retains the +ability to locate the missing point, the graph contains two types of points: `real` and `ephemeral` data +points. `ephemeral` data point always repeats the left closest `real` data point (see red arrow on the pic above). This behavior of adding ephemeral data points comes from the specifics of the [Pull model](#pull-model): + * Metrics are scraped at fixed intervals; * Scrape may be skipped if the monitoring system is overloaded; * Scrape may fail due to network issues. -According to these specifics, the range query assumes that if there is a missing data point then it is likely -a missed scrape, so it fills it with the previous data point. The same will work for cases when `step` is -lower than the actual interval between samples. In fact, if we set `step=1s` for the same request, we'll get about -1 thousand data points in response, where most of them are `ephemeral`. +According to these specifics, the range query assumes that if there is a missing data point then it is likely a missed +scrape, so it fills it with the previous data point. The same will work for cases when `step` is lower than the actual +interval between samples. In fact, if we set `step=1s` for the same request, we'll get about 1 thousand data points in +response, where most of them are `ephemeral`. -Sometimes, the lookbehind window for locating the datapoint isn't big enough and the graph will contain a gap. -For range queries, lookbehind window isn't equal to the `step` parameter. It is calculated as the median of the -intervals between the first 20 data points in the requested time range. In this way, VictoriaMetrics automatically -adjusts the lookbehind window to fill gaps and detect stale series at the same time. +Sometimes, the lookbehind window for locating the datapoint isn't big enough and the graph will contain a gap. For range +queries, lookbehind window isn't equal to the `step` parameter. It is calculated as the median of the intervals between +the first 20 data points in the requested time range. In this way, VictoriaMetrics automatically adjusts the lookbehind +window to fill gaps and detect stale series at the same time. + +Range queries are mostly used for plotting time series data over specified time ranges. These queries are extremely +useful in the following scenarios: -Range queries are mostly used for plotting time series data over specified time ranges. -These queries are extremely useful in the following scenarios: * Track the state of a metric on the time interval; * Correlate changes between multiple metrics on the time interval; * Observe trends and dynamics of the metric change. ### MetricsQL -VictoriaMetrics provide a special query language for executing read queries - [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html). -MetricsQL is a [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics) -like query language -with a powerful set of functions and features for working specifically with time series data. -MetricsQL is backwards-compatible with PromQL, so it shares most of the query concepts. -For example, the basics concepts of PromQL are described [here](https://valyala.medium.com/promql-tutorial-for-beginners-9ab455142085) -are applicable to MetricsQL as well. +VictoriaMetrics provide a special query language for executing read queries + +- [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html). MetricsQL is + a [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics) -like query language with a powerful set of + functions and features for working specifically with time series data. MetricsQL is backwards-compatible with PromQL, + so it shares most of the query concepts. For example, the basics concepts of PromQL are + described [here](https://valyala.medium.com/promql-tutorial-for-beginners-9ab455142085) + are applicable to MetricsQL as well. #### Filtering -In sections [instant query](#instant-query) and [range query](#range-query) we've already used MetricsQL -to get data for metric `foo_bar`. It is as simple as just writing a metric name in the query: +In sections [instant query](#instant-query) and [range query](#range-query) we've already used MetricsQL to get data for +metric `foo_bar`. It is as simple as just writing a metric name in the query: + ```MetricsQL foo_bar ``` A single metric name may correspond to multiple time series with distinct label sets. For example: + ```MetricsQL requests_total{path="/", code="200"} requests_total{path="/", code="403"} ``` To select only time series with specific label value specify the matching condition in curly braces: + ```MetricsQL requests_total{code="200"} ``` -The query above will return all time series with the name `requests_total` and `code="200"`. -We use the operator `=` to match a label value. For negative match use `!=` operator. -Filters also support regex matching `=~` for positive and `!~` for negative matching: +The query above will return all time series with the name `requests_total` and `code="200"`. We use the operator `=` to +match a label value. For negative match use `!=` operator. Filters also support regex matching `=~` for positive +and `!~` for negative matching: + ```MetricsQL requests_total{code=~"2.*"} ``` Filters can also be combined: + ```MetricsQL requests_total{code=~"200|204", path="/home"} ``` -The query above will return all time series with a name `requests_total`, -status `code` `200` or `204` and `path="/home"`. + +The query above will return all time series with a name `requests_total`, status `code` `200` or `204`and `path="/home"` +. #### Filtering by name -Sometimes it is required to return all the time series for multiple metric names. -As was mentioned in the [data model section](#data-model), the metric name is just an ordinary label with -a special name — `__name__`. So filtering by multiple metric names may be performed by applying regexps -on metric names: +Sometimes it is required to return all the time series for multiple metric names. As was mentioned in +the [data model section](#data-model), the metric name is just an ordinary label with a special name — `__name__`. So +filtering by multiple metric names may be performed by applying regexps on metric names: + ```MetricsQL {__name__=~"requests_(error|success)_total"} ``` + The query above is supposed to return series for two metrics: `requests_error_total` and `requests_success_total`. #### Arithmetic operations + MetricsQL supports all the basic arithmetic operations: + * addition (+) * subtraction (-) * multiplication (*) @@ -538,20 +708,23 @@ MetricsQL supports all the basic arithmetic operations: * modulo (%) * power (^) -This allows performing various calculations. For example, the following query will calculate -the percentage of error requests: +This allows performing various calculations. For example, the following query will calculate the percentage of error +requests: + ```MetricsQL (requests_error_total / (requests_error_total + requests_success_total)) * 100 ``` #### Combining multiple series -Combining multiple time series with arithmetic operations requires an understanding of matching rules. -Otherwise, the query may break or may lead to incorrect results. The basics of the matching rules are simple: -* MetricsQL engine strips metric names from all the time series on the left and right side of the arithmetic - operation without touching labels. -* For each time series on the left side MetricsQL engine searches for the corresponding time series on - the right side with the same set of labels, applies the operation for each data point and returns the resulting - time series with the same set of labels. If there are no matches, then the time series is dropped from the result. + +Combining multiple time series with arithmetic operations requires an understanding of matching rules. Otherwise, the +query may break or may lead to incorrect results. The basics of the matching rules are simple: + +* MetricsQL engine strips metric names from all the time series on the left and right side of the arithmetic operation + without touching labels. +* For each time series on the left side MetricsQL engine searches for the corresponding time series on the right side + with the same set of labels, applies the operation for each data point and returns the resulting time series with the + same set of labels. If there are no matches, then the time series is dropped from the result. * The matching rules may be augmented with ignoring, on, group_left and group_right modifiers. This could be complex, but in the majority of cases isn’t needed. @@ -559,6 +732,7 @@ This could be complex, but in the majority of cases isn’t needed. #### Comparison operations MetricsQL supports the following comparison operators: + * equal (==) * not equal (!=) * greater (>) @@ -566,51 +740,56 @@ MetricsQL supports the following comparison operators: * less (<) * less-or-equal (<=) -These operators may be applied to arbitrary MetricsQL expressions as with arithmetic operators. -The result of the comparison operation is time series with only matching data points. -For instance, the following query would return series only for processes where memory usage is > 100MB: +These operators may be applied to arbitrary MetricsQL expressions as with arithmetic operators. The result of the +comparison operation is time series with only matching data points. For instance, the following query would return +series only for processes where memory usage is > 100MB: + ```MetricsQL process_resident_memory_bytes > 100*1024*1024 ``` #### Aggregation and grouping functions -MetricsQL allows aggregating and grouping time series. -Time series are grouped by the given set of labels and then the given aggregation function is applied -for each group. For instance, the following query would return memory used by various processes grouped -by instances (for the case when multiple processes run on the same instance): +MetricsQL allows aggregating and grouping time series. Time series are grouped by the given set of labels and then the +given aggregation function is applied for each group. For instance, the following query would return memory used by +various processes grouped by instances (for the case when multiple processes run on the same instance): + ```MetricsQL sum(process_resident_memory_bytes) by (instance) ``` #### Calculating rates -One of the most widely used functions for [counters](#counter) is [rate](https://docs.victoriametrics.com/MetricsQL.html#rate). -It calculates per-second rate for all the matching time series. For example, the following query will show -how many bytes are received by the network per second: +One of the most widely used functions for [counters](#counter) +is [rate](https://docs.victoriametrics.com/MetricsQL.html#rate). It calculates per-second rate for all the matching time +series. For example, the following query will show how many bytes are received by the network per second: + ```MetricsQL rate(node_network_receive_bytes_total) ``` -To calculate the rate, the query engine will need at least two data points to compare. -Simplified rate calculation for each point looks like `(Vcurr-Vprev)/(Tcurr-Tprev)`, -where `Vcurr` is the value at the current point — `Tcurr`, `Vprev` is the value at the point `Tprev=Tcurr-step`. -The range between `Tcurr-Tprev` is usually equal to `step` parameter. -If `step` value is lower than the real interval between data points, then it is ignored and a minimum real interval is used. +To calculate the rate, the query engine will need at least two data points to compare. Simplified rate calculation for +each point looks like `(Vcurr-Vprev)/(Tcurr-Tprev)`, where `Vcurr` is the value at the current point — `Tcurr`, `Vprev` +is the value at the point `Tprev=Tcurr-step`. The range between `Tcurr-Tprev` is usually equal to `step` parameter. +If `step` value is lower than the real interval between data points, then it is ignored and a minimum real interval is +used. The interval on which `rate` needs to be calculated can be specified explicitly as `duration` in square brackets: + ```MetricsQL rate(node_network_receive_bytes_total[5m]) ``` -For this query the time duration to look back when calculating per-second rate for each point on the graph -will be equal to `5m`. -`rate` strips metric name while leaving all the labels for the inner time series. -Do not apply `rate` to time series which may go up and down, such as [gauges](#gauge). -`rate` must be applied only to [counters](#counter), which always go up. -Even if counter gets reset (for instance, on service restart), `rate` knows how to deal with it. +For this query the time duration to look back when calculating per-second rate for each point on the graph will be equal +to `5m`. + +`rate` strips metric name while leaving all the labels for the inner time series. Do not apply `rate` to time series +which may go up and down, such as [gauges](#gauge). +`rate` must be applied only to [counters](#counter), which always go up. Even if counter gets reset (for instance, on +service restart), `rate` knows how to deal with it. ### Visualizing time series + VictoriaMetrics has a built-in graphical User Interface for querying and visualizing metrics [VMUI](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#vmui). Open `http://victoriametrics:8428/vmui` page, type the query and see the results: @@ -618,5 +797,30 @@ Open `http://victoriametrics:8428/vmui` page, type the query and see the results {% include img.html href="keyConcepts_vmui.png" %} VictoriaMetrics supports [Prometheus HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/) -which makes it possible to [use with Grafana](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#grafana-setup). -Play more with Grafana integration in VictoriaMetrics sandbox [https://play-grafana.victoriametrics.com](https://play-grafana.victoriametrics.com). +which makes it possible +to [use with Grafana](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#grafana-setup). Play more with +Grafana integration in VictoriaMetrics +sandbox [https://play-grafana.victoriametrics.com](https://play-grafana.victoriametrics.com). + +## Modify data + +VictoriaMetrics stores time series data in [MergeTree](https://en.wikipedia.org/wiki/Log-structured_merge-tree)-like +data structures. While this approach if very efficient for write-heavy databases, it applies some limitations on data +updates. In short, modifying already written [time series](#time-series) requires re-writing the whole data block where +it is stored. Due to this limitation, VictoriaMetrics does not support direct data modification. + +### Deletion + +See [How to delete time series](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-delete-time-series) +. + +### Relabeling + +Relabeling is a powerful mechanism for modifying time series before they have been written to the database. Relabeling +may be applied for both [push](#push-model) and [pull](#pull-model) models. See more +details [here](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#relabeling). + +### Deduplication + +VictoriaMetrics supports data points deduplication after data was written to the storage. See more +details [here](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#deduplication). \ No newline at end of file