mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2024-11-23 20:37:12 +01:00

Aliaksandr Valialkin de113806bb docs/CaseStudies.md: add CERN case study

2020-05-11 14:05:20 +03:00

Case studies and talks

Below are approved public case studies and talks from VictoriaMetrics users. Join our community Slack channel and feel free to ask there for references, reviews and additional case studies from real VictoriaMetrics users.

Adidas

See slides and video from the Remote Write Storage Wars talk at PromCon 2019. The talk compares VictoriaMetrics to Thanos, Cortex and M3DB.

CERN

The European Organization for Nuclear Research, known as CERN, uses VictoriaMetrics for real-time monitoring of the CMS detector system. According to a published talk, VictoriaMetrics is used for the following purposes as a part of the "CMS Monitoring cluster":

As long-term storage for messages consumed from the NATS messaging system. Consumed messages are pushed directly to VictoriaMetrics over HTTP.
As long-term storage for the Prometheus monitoring system (30-day retention policy; there are plans to increase it to ½ year).
As a data source for visualizing metrics in Grafana.
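The "pushed directly to VictoriaMetrics via HTTP" step could look roughly like the sketch below. It is illustrative only: the source does not say which ingestion endpoint or data format CERN uses. The sketch assumes the InfluxDB line protocol endpoint (`/write`) of a single-node VictoriaMetrics instance listening on the default port 8428, and the measurement, tag and field names (`cms_job`, `site`, `state`, `count`) are hypothetical.

```python
import urllib.request


def _escape_tag(text: str) -> str:
    """Escape the characters that are special in line protocol tag keys and values."""
    return text.replace(",", "\\,").replace(" ", "\\ ").replace("=", "\\=")


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one consumed message as an InfluxDB line protocol line.

    Assumes the measurement and field names are plain identifiers
    (no spaces or commas) and that all field values are numeric.
    """
    tag_part = "".join(
        f",{_escape_tag(k)}={_escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def push(lines, base_url="http://localhost:8428"):
    """POST a batch of lines to the /write endpoint (base_url is an assumption)."""
    body = "\n".join(lines).encode("utf-8")
    request = urllib.request.Request(f"{base_url}/write", data=body, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical message; formatting it needs no running server.
line = to_line_protocol(
    "cms_job",
    {"site": "T2 CH", "state": "running"},
    {"count": 12},
    1589194800000000000,
)
print(line)
```

Calling `push([line])` would send the batch to a running instance. Batching several lines per request keeps the number of HTTP connections low.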

R&D topic: Evaluate VictoriaMetrics vs InfluxDB for large cardinality data.

COLOPL

COLOPL is a Japanese game development company. It started using VictoriaMetrics after evaluating the following remote storage solutions for Prometheus:

Cortex
Thanos
M3DB
VictoriaMetrics

See slides and video from the Large-scale, super-load system monitoring platform built with VictoriaMetrics talk at Prometheus Meetup Tokyo #3.

Wix.com

Wix.com is the leading web development platform.

We needed to redesign our metrics infrastructure from the ground up after the move to Kubernetes. We tried a few approaches and designs before choosing the one that works great: a Prometheus instance in every datacenter with 2 hours of retention for local storage and remote write into an HA pair of single-node VictoriaMetrics instances.

Numbers:

The number of active time series per VictoriaMetrics instance is 20M.
The total number of time series per VictoriaMetrics instance is 400M+.
Ingestion rate per VictoriaMetrics instance is 800K data points per second.
The average time series churn rate is ~3M per day.
The average query rate is ~1K per minute (mostly alert queries).
Query duration: median is ~70ms, 99th percentile is ~2sec.
Retention: 6 months.

The alternatives we tried before choosing VictoriaMetrics were federated Prometheus, Cortex, IronDB and Thanos. The points that were critical to us when choosing a central TSDB, in order of importance:

At least 3 months' worth of history.
Raw data, no aggregation, no sampling.
High query speed.
Clean fail state for HA (multi-node clusters may return partial data resulting in false alerts).
Enough headroom and scaling capacity for future growth, up to 100M active time series.
Ability to split DB replicas per workload. Alert queries go to one replica, user queries go to another (speed for users, effective cache).

Optimizing for those points and our specific workload, VictoriaMetrics proved to be the best option. As icing on the cake we’ve got PromQL extensions: default 0 and histogram are my favorite ones, for example. What we especially like is having a lot of TSDB params easily available via config options, which makes the TSDB easy to tune for a specific use case. Also worth noting are the great community in the Slack channel and, of course, maintainer support.

Alex Ulstein, Head of Monitoring, Wix.com

Wedos.com

Wedos is the biggest Czech hosting provider. We have our own private data center that holds only our servers and technologies. A second data center, where the servers will be cooled in an oil bath, is being built. We started using the cluster version of VictoriaMetrics to store Prometheus metrics from all our infrastructure after receiving positive references from our friends who successfully use VictoriaMetrics.

Numbers:

The number of active time series: 5M.
Ingestion rate: 170K data points per second.
Query duration: median is ~2ms, 99th percentile is ~50ms.

We like the configuration simplicity and zero maintenance of VictoriaMetrics: install it once and forget about it. It works out of the box without any issues.

Synthesio

Synthesio is the leading social intelligence tool for social media monitoring & social analytics.

We fully migrated from Metrictank to VictoriaMetrics.

Numbers:

Single node
Active time series - 5 Million
Datapoints: 1.25 Trillion
Ingestion rate - 550k datapoints per second
Disk usage - 150 GB
Index size - 3 GB
Query duration 99th percentile - 147ms
Churn rate - 100 new time series per hour

MHI Vestas Offshore Wind

The mission of MHI Vestas Offshore Wind is to co-develop offshore wind as an economically viable and sustainable energy resource to benefit future generations.

MHI Vestas Offshore Wind is using VictoriaMetrics to ingest and visualize sensor data from offshore wind turbines. The very efficient storage and the ability to backfill were key in choosing VictoriaMetrics. MHI Vestas Offshore Wind is running the cluster version of VictoriaMetrics on Kubernetes, using the Helm charts for deployment, to be able to scale up capacity as the solution is rolled out.

Numbers with current limited roll out:

Active time series: 270K
Ingestion rate: 70K/sec
Total number of datapoints: 850 billion
Data size on disk: 800 GiB
Retention time: 3 years

Dreamteam

Dreamteam successfully uses single-node VictoriaMetrics in multiple environments.

Numbers:

Active time series: from 350K to 725K.
Total number of time series: from 100M to 320M.
Total number of datapoints: from 120 billion to 155 billion.
Retention: 3 months.

In the production environment, VictoriaMetrics runs on 2 M5 EC2 instances in "HA" mode, managed by Terraform and the Ansible TF module. 2 Prometheus instances write to both VMs, with 2 Promxy replicas acting as a load balancer for reads.

Brandwatch

Brandwatch is the world's pioneering digital consumer intelligence suite, helping over 2,000 of the world's most admired brands and agencies to make insightful, data-driven business decisions.

The engineering department at Brandwatch has been using InfluxDB for storing application metrics for many years, and when the End-of-Life of InfluxDB version 1.x was announced, we decided to re-evaluate our whole metrics collection and storage stack.

Main goals for the new metrics stack were:

improved performance
lower maintenance
support for native clustering in open source version
the less metrics shipment had to change, the better
achieving longer data retention would be great but not critical

We initially looked at CrateDB and TimescaleDB which both turned out to have limitations or requirements in the open source versions that made them unfit for our use case. Prometheus was also considered but push vs. pull metrics was a big change we did not want to include in the already significant change.

Once we found VictoriaMetrics it solved the following problems:

it is very lightweight, so we can now run virtual machines instead of dedicated hardware machines for metrics storage
very short startup time and any possible gaps in data can easily be filled in by using Promxy
we could continue using Telegraf as our metrics agent and ship identical metrics to both InfluxDB and VictoriaMetrics during a migration period (migration just about to start)
compression is really good, so we can store more metrics; if we need to extend our retention period beyond what a single virtual machine disk allows, we can spin up new VictoriaMetrics instances for new data, keep read-only nodes with older data, and aggregate all the data from VictoriaMetrics with Promxy

High availability is done the same way we did with InfluxDB, by running parallel single nodes of VictoriaMetrics.

Numbers:

active time series: up to 25 million
ingestion rate: ~300 000
total number of datapoints: 380 billion and growing
total number of entries in inverted index: 575 million and growing
daily time series churn rate: ~550 000
data size on disk: ~660GB and growing
index size on disk: ~9.3GB and growing
average datapoint size on disk: ~1.75 bytes
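The average datapoint size follows from the other figures in the list: data size on disk divided by the total number of datapoints. A minimal sketch of that arithmetic, assuming "GB" means decimal gigabytes (10^9 bytes) and that the index is not counted:

```python
def bytes_per_datapoint(data_bytes: float, datapoints: float) -> float:
    """Average on-disk size of one datapoint."""
    if datapoints <= 0:
        raise ValueError("datapoints must be positive")
    return data_bytes / datapoints


GB = 10**9  # assumption: decimal gigabytes

# Figures taken from the list above: ~660GB on disk, 380 billion datapoints.
ratio = bytes_per_datapoint(660 * GB, 380e9)
print(f"{ratio:.2f} bytes per datapoint")  # prints "1.74 bytes per datapoint"
```

The result is close to the reported ~1.75 bytes. The small difference is expected, since both inputs are rounded and still growing.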

Query rates are insignificant as we have concentrated on data ingestion so far.

Anders Bomberg, Monitoring and Infrastructure Team Lead, brandwatch.com

Adsterra

Adsterra Network is a leading digital advertising company that offers performance-based solutions for advertisers and media partners worldwide.

We used to collect and store our metrics via Prometheus. Over time the number of our servers and metrics increased, so we gradually reduced the retention. When retention reached 7 days, we started to look for alternative solutions. We were choosing among Thanos, VictoriaMetrics and Prometheus federation.

We ended up with the following configuration:

Local Prometheus instances with VictoriaMetrics as remote storage on our backend servers.
A single Prometheus on our monitoring server scrapes metrics from other servers and writes to VictoriaMetrics.
A separate Prometheus that federates from the other Prometheus instances and processes alerts.

It turned out that the remote write protocol generated too much traffic and too many connections, so after 8 months we started to look for alternatives.

Around the same time VictoriaMetrics released vmagent. We tried to scrape all the metrics via a single instance of vmagent, but that didn't work - vmagent wasn't able to catch up with writes into VictoriaMetrics. We tested different options and ended up with the following scheme:

We removed Prometheus from our setup.
VictoriaMetrics can scrape targets as well, so we removed vmagent. Now VictoriaMetrics scrapes all the metrics from 110 jobs and 5531 targets.
We use Promxy for alerting.

Such a scheme has the following benefits compared to Prometheus:

We can store more metrics.
We need less RAM and CPU for the same workload.

Cons are the following:

VictoriaMetrics doesn't support replication - we run an extra instance of VictoriaMetrics, with Promxy in front of the VictoriaMetrics pair, for high availability.
VictoriaMetrics stores 1 extra month beyond the defined retention (if retention is set to N months, then VM stores N+1 months of data), but this is still better than other solutions.

Some numbers from our single-node VictoriaMetrics setup:

active time series: 10M
ingestion rate: 800K samples/sec
total number of datapoints: more than 2 trillion
total number of entries in inverted index: more than 1 billion
daily time series churn rate: 2.6M
data size on disk: 1.5 TB
index size on disk: 27 GB
average datapoint size on disk: 0.75 bytes
range query rate: 16 rps
instant query rate: 25 rps
range query duration: max: 0.5s; median: 0.05s; 97th percentile: 0.29s
instant query duration: max: 2.1s; median: 0.04s; 97th percentile: 0.15s

VictoriaMetrics consumes about 50GiB of RAM.

Setup:

We have 2 single-node instances of VictoriaMetrics. The first instance collects and stores high-resolution metrics (10s scrape interval) for a month. The second instance collects and stores low-resolution metrics (300s scrape interval) for a month. We use Promxy + Alertmanager for global view and alerts evaluation.

ARNES

The Academic and Research Network of Slovenia (ARNES) is a public institute that provides network services to research, educational and cultural organizations, and enables them to establish connections and cooperation with each other and with related organizations abroad.

After using Cacti, Graphite and StatsD for years, we wanted to upgrade our monitoring stack to something that:

has native alerting support
can run on-prem
has multi-dimension metrics
has lower hardware requirements
is scalable
supports simple client provisioning and discovery with Puppet

We were running Prometheus for about a year in a test environment and it worked great. But there was a need/wish for a few years of retention time, like the old systems provided. We tested Thanos, which was a bit resource hungry back then, but it worked great for about half a year until we discovered VictoriaMetrics. As our scale is not that big and we have neither on-prem S3 nor Kubernetes, VM's single node instance provided the same result with less maintenance overhead and lower hardware requirements.

After testing it for a few months and having great support from the maintainers on Slack, we decided to go with it. VM's support for ingesting InfluxDB metrics was an additional bonus, since our hardware team uses SNMPCollector to collect metrics from network devices, and switching from InfluxDB to VictoriaMetrics was a simple change in the config file for them.

Numbers:

2 single node instances per DC (one for prometheus and one for influxdb metrics)
Active time series per VictoriaMetrics instance: ~500k (prometheus) + ~320k (influxdb)
Ingestion rate per VictoriaMetrics instance: 45k/s (prometheus) / 30k/s (influxdb)
Query duration: median is ~5ms, 99th percentile is ~45ms
Total number of datapoints per instance: 390B (prometheus), 110B (influxdb)
Average datapoint size on drive: 0.4 bytes
Disk usage per VictoriaMetrics instance: 125GB (prometheus), 185GB (influxdb)
Index size per VictoriaMetrics instance: 1.6GB (prometheus), 1.2GB (influxdb)

We are running 1 Prometheus, 1 VictoriaMetrics and 1 Grafana server in each datacenter on baremetal servers, scraping 350+ targets (and 3k+ devices collected via SNMPCollector sending metrics directly to VM). Each Prometheus is scraping all targets, so we have all metrics in both VictoriaMetrics instances. We are using Promxy to deduplicate metrics from both instances. Grafana has a LB in front of it, so if one DC has problems, we can still view all metrics from both DCs on the other Grafana instance.
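The idea of deduplicating metrics held by two replicas can be illustrated with the sketch below. It is a simplified model and not Promxy's actual algorithm: it assumes both replicas hold samples with identical timestamps, whereas two independent Prometheus servers usually scrape at slightly different moments. The series names and values are made up.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]        # (timestamp in milliseconds, value)
Series = Dict[str, List[Sample]]  # series identifier -> samples


def merge_replicas(a: Series, b: Series) -> Series:
    """Merge two replicas, keeping one sample per (series, timestamp)."""
    merged: Series = {}
    for name in a.keys() | b.keys():
        by_timestamp: Dict[int, float] = {}
        # Replica b is applied first, so replica a wins on conflicting timestamps.
        for ts, value in b.get(name, []):
            by_timestamp[ts] = value
        for ts, value in a.get(name, []):
            by_timestamp[ts] = value
        merged[name] = sorted(by_timestamp.items())
    return merged


# Each replica missed a different sample of "up"; only dc2 has "load".
dc1 = {"up": [(1000, 1.0), (3000, 1.0)]}
dc2 = {"up": [(1000, 1.0), (2000, 1.0)], "load": [(1000, 0.5)]}
merged = merge_replicas(dc1, dc2)
print(merged["up"])
```

In this model, a gap in one replica is filled from the other, which is why the view stays complete when one datacenter has problems.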

We are still in the process of migration, but we are really happy with the whole stack. It has proven to be an essential piece for insight into our services during COVID-19 and has enabled us to provide better service and spot problems faster.
