* Refactor netclass_rtnl collector
Merge the netclass_rtnl collector into the netclass collector.
* Disabled by default
* Followup to #2492
Signed-off-by: Ben Kochie <superq@gmail.com>
* update rtnetlink package to v1.2.3
* add RTNL version of netclass collector that has all the metrics that the netdev collector provides, too.
Signed-off-by: Haoyu Sun <hasun@redhat.com>
Some systems have broken netlink messages due to patched kernels. Since
these messages can not be parsed, add a flag to fall back to parsing
from `/proc/net/dev`.
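
As an illustration of the fallback path, here is a minimal sketch of parsing one data line of `/proc/net/dev`. The helper name `parseNetDevLine` is hypothetical and not taken from node_exporter; the sketch assumes the interface name contains no colon.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNetDevLine is a hypothetical helper that splits one data line of
// /proc/net/dev into the interface name and its numeric counters.
// Assumption: the interface name itself contains no ':' character.
func parseNetDevLine(line string) (string, []uint64, error) {
	parts := strings.SplitN(line, ":", 2)
	if len(parts) != 2 {
		return "", nil, fmt.Errorf("malformed line: %q", line)
	}
	name := strings.TrimSpace(parts[0])
	fields := strings.Fields(parts[1])
	values := make([]uint64, 0, len(fields))
	for _, f := range fields {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return "", nil, fmt.Errorf("interface %s: %w", name, err)
		}
		values = append(values, v)
	}
	return name, values, nil
}

func main() {
	name, values, err := parseNetDevLine("    lo: 100 2 0 0 0 0 0 0 100 2 0 0 0 0 0 0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The first counter is received bytes, the ninth is transmitted bytes.
	fmt.Println(name, values[0], values[8]) // lo 100 100
}
```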
Fixes: https://github.com/prometheus/node_exporter/issues/2502
Signed-off-by: Ben Kochie <superq@gmail.com>
Note, however, that the InetDiagMsg struct contains an InetDiagSockID
member, which itself contains some members which are explicitly
specified as big-endian in Linux kernel source:
    struct inet_diag_sockid {
        __be16 idiag_sport;
        __be16 idiag_dport;
        __be32 idiag_src[4];
        __be32 idiag_dst[4];
        __u32 idiag_if;
        __u32 idiag_cookie[2];
    };
node_exporter currently does not use these members for anything, so this
is acceptable (for now).
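
Should these members ever be needed, they would have to be decoded as big-endian regardless of host byte order. A minimal sketch, assuming a hypothetical helper `sockPort` that is not part of node_exporter:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sockPort is a hypothetical helper: it decodes a __be16 field such as
// idiag_sport from the two raw bytes as they appear in the netlink message.
func sockPort(raw [2]byte) uint16 {
	return binary.BigEndian.Uint16(raw[:])
}

func main() {
	// Port 443 is 0x01BB; in big-endian order the high byte comes first.
	fmt.Println(sockPort([2]byte{0x01, 0xBB})) // 443
}
```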
Signed-off-by: Daniel Swarbrick <daniel.swarbrick@gmail.com>
We don't need to fully sanitize the hwmon label values to metric/label
name strings.
* Just make sure they're valid UTF-8.
* Always include the label metric to avoid group_left failures.
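
One way to guarantee valid UTF-8 is the standard library's `strings.ToValidUTF8`. The helper name `validLabelValue` and the choice of replacement character below are assumptions for illustration, not necessarily what the collector does.

```go
package main

import (
	"fmt"
	"strings"
)

// validLabelValue is a hypothetical helper: it replaces each run of invalid
// UTF-8 bytes with the Unicode replacement character and leaves valid input
// untouched.
func validLabelValue(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Printf("%q\n", validLabelValue("temp\xff1"))
	fmt.Printf("%q\n", validLabelValue("Core 0"))
}
```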
Signed-off-by: Ben Kochie <superq@gmail.com>
Correctly handle the new `collector.diskstats.device-exclude` flag to
avoid errors when using the old `collector.diskstats.ignored-devices`
flag.
Fixes: https://github.com/prometheus/node_exporter/issues/2486
Signed-off-by: Ben Kochie <superq@gmail.com>
* [CHANGE] Merge metrics descriptions in textfile collector #2475
* [FEATURE] [node-mixin] Add darwin dashboard to mixin #2351
* [FEATURE] Add "isolated" metric on cpu collector on linux #2251
* [FEATURE] Add cgroup summary collector #2408
* [FEATURE] Add selinux collector #2205
* [FEATURE] Add slab info collector #2376
* [FEATURE] Add sysctl collector #2425
* [FEATURE] Also track the CPU Spin time for OpenBSD systems #1971
* [FEATURE] Add support for MacOS version #2471
* [ENHANCEMENT] [node-mixin] Add missing selectors #2426
* [ENHANCEMENT] [node-mixin] Change current datasource to grafana's default #2281
* [ENHANCEMENT] [node-mixin] Change disk graph to disk table #2364
* [ENHANCEMENT] [node-mixin] Change io time units to %util #2375
* [ENHANCEMENT] Add user_wired_bytes and laundry_bytes on *bsd #2266
* [ENHANCEMENT] Add additional vm_stat memory metrics for darwin #2240
* [ENHANCEMENT] Add device filter flags to arp collector #2254
* [ENHANCEMENT] Add diskstats include and exclude device flags #2417
* [ENHANCEMENT] Add node_softirqs_total metric #2221
* [ENHANCEMENT] Add rapl zone name label option #2401
* [ENHANCEMENT] Add slabinfo collector #1799
* [ENHANCEMENT] Allow user to select port on NTP server to query #2270
* [ENHANCEMENT] collector/diskstats: Add labels and metrics from udev #2404
* [ENHANCEMENT] Enable builds against older macOS SDK #2327
* [ENHANCEMENT] qdisk-linux: Add exclude and include flags for interface name #2432
* [ENHANCEMENT] systemd: Expose systemd minor version #2282
* [ENHANCEMENT] Use netlink for tcpstat collector #2322
* [ENHANCEMENT] Use netlink to get netdev stats #2074
* [ENHANCEMENT] Add additional perf counters for stalled frontend/backend cycles #2191
* [ENHANCEMENT] Add btrfs device error stats #2193
* [BUGFIX] [node-mixin] Fix fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarning #2352
* [BUGFIX] Fix concurrency issue in ethtool collector #2289
* [BUGFIX] Fix concurrency issue in netdev collector #2267
* [BUGFIX] Fix diskstat reads and write metrics for disks with different sector sizes #2311
* [BUGFIX] Fix iostat on macos broken by deprecation warning #2292
* [BUGFIX] Fix NodeFileDescriptorLimit alerts #2340
* [BUGFIX] Sanitize rapl zone names #2299
* [BUGFIX] Add file descriptor close safely in test #2447
* [BUGFIX] Fix race condition in os_release.go #2454
* [BUGFIX] Skip ZFS IO metrics if their paths are missing #2451
Signed-off-by: Ben Kochie <superq@gmail.com>
* Improve metrics filesystem scanning logic
* Makes ioctl syscalls to load the device error stats.
* Adds filesystem mountpoint labels to existing metrics for ease of use.
Signed-off-by: Marcus Cobden <leth@users.noreply.github.com>
The textfile collector will now provide a unified metric description
(that will look like "Metric read from file/a.prom, file/b.prom")
for metrics collected across several text-files that don't already
have a description.
Also change the error handling in the textfile collector tests to
ContinueOnError to better mirror the real-life use-case.
Signed-off-by: Guillaume Espanel <guillaume.espanel.ext@ovhcloud.com>
* Allow user to select port on NTP server to query
Some people (me!) run NTP servers on non-privileged ports. The `github.com/beevik/ntp` package allows overriding the port, so this change just adds a flag `collector.ntp.server-port` (defaults to 123) and then passes that value through to the query via the `QueryOptions`.
Signed-off-by: Andrew Rowson <github@growse.com>
On Linux, we get more detailed interface statistics from netlink than we did
from `/proc/net/dev`.
This commit adds a new flag (`--collector.netdev.enable-detailed-metrics`) to
expose those statistics under new (incompatible) metric names. When enabled,
the metric names are also changed on Darwin and BSD platforms to keep
everything consistent, but it doesn't provide more detailed statistics on those
platforms.
The old metrics can be derived from the new ones using the following rules
([dev_seq_printf_stats]):
- `receive_errs` = `receive_errors`
- `receive_drop` = `receive_dropped` + `receive_missed_errors`
- `receive_fifo` = `receive_fifo_errors`
- `receive_frame` = `receive_length_errors` + `receive_over_errors` + `receive_crc_errors` + `receive_frame_errors`
- `receive_multicast` = `multicast`
- `transmit_errs` = `transmit_errors`
- `transmit_drop` = `transmit_dropped`
- `transmit_fifo` = `transmit_fifo_errors`
- `transmit_colls` = `collisions`
- `transmit_carrier` = `transmit_aborted_errors` + `transmit_carrier_errors` + `transmit_heartbeat_errors` + `transmit_window_errors`
[dev_seq_printf_stats]: https://github.com/torvalds/linux/blob/master/net/core/net-procfs.c#L75-L97
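
The aggregation rules above can be sketched as follows. The struct and function names are hypothetical; only the three rules that sum several counters are shown, since the remaining ones are plain renames.

```go
package main

import "fmt"

// detailedStats is a hypothetical subset of the detailed per-interface
// counters, limited to the ones that feed into aggregated legacy metrics.
type detailedStats struct {
	ReceiveDropped          uint64
	ReceiveMissedErrors     uint64
	ReceiveLengthErrors     uint64
	ReceiveOverErrors       uint64
	ReceiveCRCErrors        uint64
	ReceiveFrameErrors      uint64
	TransmitAbortedErrors   uint64
	TransmitCarrierErrors   uint64
	TransmitHeartbeatErrors uint64
	TransmitWindowErrors    uint64
}

// legacyStats derives the old aggregated values from the detailed ones,
// following the rules listed in the commit message.
func legacyStats(s detailedStats) map[string]uint64 {
	return map[string]uint64{
		"receive_drop": s.ReceiveDropped + s.ReceiveMissedErrors,
		"receive_frame": s.ReceiveLengthErrors + s.ReceiveOverErrors +
			s.ReceiveCRCErrors + s.ReceiveFrameErrors,
		"transmit_carrier": s.TransmitAbortedErrors + s.TransmitCarrierErrors +
			s.TransmitHeartbeatErrors + s.TransmitWindowErrors,
	}
}

func main() {
	legacy := legacyStats(detailedStats{
		ReceiveDropped:        3,
		ReceiveMissedErrors:   2,
		ReceiveCRCErrors:      1,
		TransmitCarrierErrors: 4,
	})
	fmt.Println(legacy["receive_drop"], legacy["receive_frame"], legacy["transmit_carrier"]) // 5 1 4
}
```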
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
To prepare for the introduction of new metric names, add tests for the legacy
metric names and values. This will make it easier to ensure that the code that
converts the new metrics to the old ones (for compatibility) behaves correctly.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Since netdev metrics are now read from netlink instead of `/proc/net/dev`, we
can't easily spoof them for the end-to-end tests by reading a fixture file in
place of `/proc/net/dev`.
Therefore, we only get metrics for `lo` and ignore those that would return
unpredictable values (i.e. the byte and packet counters).
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Instead of parsing `/proc/net/dev` to get network interface statistics, get
them from a netlink call.
Internally, both come from the [rtnl_link_stats64] struct, but with
`/proc/net/dev`, some of the values are aggregated together in
[dev_seq_printf_stats], so we get less information out of them.
This commit maintains compatibility by aggregating those stats back into the
same metrics.
[rtnl_link_stats64]: https://github.com/torvalds/linux/blob/master/include/uapi/linux/if_link.h#L42-L246
[dev_seq_printf_stats]: https://github.com/torvalds/linux/blob/master/net/core/net-procfs.c#L75-L97
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
These two memory classes have been in FreeBSD for a while now;
adding them provides information for all memory classes.
Signed-off-by: François Charlier <fcharlier@ploup.net>
Log a single error message when the udev data directory (`/run/udev/data` by
default) is unreadable, and then don't try to get device properties out of it.
Also lower the log level from error to debug when we can't parse the udev files
properly, since these messages would be sent every time the node exporter gets
scraped.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
When parsing udev data, skip lines that don't start with `E:`.
Lines prefixed with `E:` represent device properties, as documented in
udevadm(8).
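
A minimal sketch of that filtering, assuming a hypothetical helper `parseUdevProperties` that takes the file content as a string (the sample data is made up):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseUdevProperties is a hypothetical helper: it keeps only the lines
// prefixed with "E:" and returns them as KEY=VALUE device properties.
func parseUdevProperties(data string) map[string]string {
	props := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "E:") {
			continue // not a device property, e.g. "S:" or "G:" lines
		}
		kv := strings.SplitN(strings.TrimPrefix(line, "E:"), "=", 2)
		if len(kv) != 2 {
			continue // property line without a value
		}
		props[kv[0]] = kv[1]
	}
	return props
}

func main() {
	data := "S:disk/by-id/ata-EXAMPLE\nE:ID_MODEL=EXAMPLE_DISK\nE:ID_SERIAL=abc123\nG:systemd\n"
	props := parseUdevProperties(data)
	fmt.Println(props["ID_MODEL"], len(props)) // EXAMPLE_DISK 2
}
```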
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Set the `--path.udev.data` flag to point to the udev fixture, and update the
output fixture with
```console
$ ./end-to-end-test.sh -u
```
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Now that we read some data from `/run/udev/data`, add the corresponding
fixtures and update the expected test results accordingly.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Instead of hard-coding the path to `/run/udev/data`, introduce a
`--path.udev.data` flag that defaults to that value.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Add labels to the `node_disk_info` metric extracted from udev, such as `model`,
`path`, `revision`, `serial` and `wwn`.
Also add a few metrics related to filesystem and device mapper, which are also
extracted from udev information.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Use standard include/exclude pattern for device include/exclude in the
diskstats collector.
Signed-off-by: Ben Kochie <superq@gmail.com>
Co-authored-by: rushilenekar20 <rushilenekar20@gmail.com>
Fix up handling of CPU info collector on non-x86_64 systems due to
fixtures containing `/proc/cpuinfo` from x86_64.
* Update e2e 64k page test fixture from an arm64 system.
* Enable ARM testing in CircleCI.
Fixes: https://github.com/prometheus/node_exporter/issues/1959
Signed-off-by: Ben Kochie <superq@gmail.com>
* Correctly name collector file.
* Fix cgroup summary type as gauge.
* Use a boolean metric rather than a label for enabled.
Signed-off-by: Ben Kochie <superq@gmail.com>
Use unix.ByteSliceToString to convert Utsname []byte fields to strings.
This also allows dropping the bytesToString helper, which serves the same
purpose and matches ByteSliceToString's implementation.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This is necessary to build on darwin using nix, as nix-darwin uses an
older macOS SDK, built from Apple's open source releases.
Signed-off-by: Peter Woodman <peter@shortbus.org>
In certain instances on heavily loaded nodes with many network
devices, there may be concurrent access to the netdev collector's
`metricDescs` map, resulting in a panic. This adds a mutex to prevent
concurrent reads and writes to the map.
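
The pattern looks roughly like this. It is a simplified sketch in which the type `descCache` and the string values are placeholders; the real map holds metric descriptors.

```go
package main

import (
	"fmt"
	"sync"
)

// descCache is a hypothetical stand-in for the collector's descriptor map,
// guarded by a mutex so concurrent scrapes cannot race on it.
type descCache struct {
	mu    sync.Mutex
	descs map[string]string
}

// get returns the cached descriptor for key, creating it on first use.
// Both the lookup and the insert happen while holding the lock.
func (c *descCache) get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.descs[key]; ok {
		return d
	}
	d := "node_network_" + key + "_total"
	c.descs[key] = d
	return d
}

func main() {
	c := &descCache{descs: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.get("receive_bytes")
		}()
	}
	wg.Wait()
	fmt.Println(len(c.descs)) // 1
}
```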
Signed-off-by: Brad Ison <bison@xvdf.io>
Move the systemd version function to an exporter method. This way we can
update the version information at every scrape, in case the underlying
version changes.
Signed-off-by: Ben Kochie <superq@gmail.com>
systemd patch versions are as important as the major version number;
they indicate security or bug fixes or other behavioural changes between
versions.
Use float64 over float32 as the rounding error with float32 rendered
250.3 as 250.3000030517578 in my testing.
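
The rounding difference can be reproduced with `strconv.ParseFloat`, whose `bitSize` argument controls the precision the result is rounded to. The helper name `parseVersion` is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseVersion is a hypothetical helper that parses a version string such as
// "250.3" with the given precision (32 or 64 bits) and returns it as float64.
func parseVersion(s string, bitSize int) float64 {
	v, err := strconv.ParseFloat(s, bitSize)
	if err != nil {
		return 0
	}
	return v
}

func main() {
	fmt.Println(parseVersion("250.3", 32)) // 250.3000030517578
	fmt.Println(parseVersion("250.3", 64)) // 250.3
}
```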
Signed-off-by: Joe Groocock <jgroocock@cloudflare.com>
Signed-off-by: Joe Groocock <me@frebib.net>
Analogous to the /var/lib/docker exclude added in
https://github.com/prometheus/node_exporter/pull/814
Podman rootful containers mount e.g. shm filesystems at
/var/lib/containers/storage/*-containers/*/userdata/shm. These should be
treated like things under /var/lib/docker by default.
Signed-off-by: Lauri Tirkkonen <lauri@hacktheplanet.fi>
Allow filtering ARP entries based on device. Useful for ignoring
entries for network namespaces (containers).
Signed-off-by: Ben Kochie <superq@gmail.com>
This adds a new Linux metric, node_softirqs_total, which corresponds
to the 'softirq' line in /proc/stat. This metric is disabled by
default and it can be enabled with '--collector.stat.softirq'.
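
For reference, a sketch of parsing that line. Per proc(5), the first number is the total and the following columns are per-vector counters; the label names used here and the helper `parseSoftirqLine` are illustrative assumptions, not necessarily what the collector exposes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// softirqVectors names the per-vector columns that follow the total on the
// softirq line. The spelling of these names is an assumption.
var softirqVectors = []string{
	"hi", "timer", "net_tx", "net_rx", "block",
	"irq_poll", "tasklet", "sched", "hrtimer", "rcu",
}

// parseSoftirqLine is a hypothetical helper that turns the softirq line of
// /proc/stat into a map from vector name to count.
func parseSoftirqLine(line string) (map[string]uint64, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || fields[0] != "softirq" {
		return nil, fmt.Errorf("not a softirq line: %q", line)
	}
	counts := make(map[string]uint64)
	// fields[1] is the grand total; per-vector counters start at fields[2].
	for i, f := range fields[2:] {
		if i >= len(softirqVectors) {
			break
		}
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return nil, err
		}
		counts[softirqVectors[i]] = v
	}
	return counts, nil
}

func main() {
	counts, err := parseSoftirqLine("softirq 100 1 20 3 40 5 0 6 7 0 18")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(counts["net_rx"]) // 40
}
```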
Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
Use the non-cgo version for all OpenBSD architectures.
The old code only pulled some defines from header files. Just add them
as enumerations in native Go. Also be careful about what SysctlRaw returns.
Implement a way that supports both recent and old pre-6.4 OpenBSD systems.
With go-1.16 OpenBSD binaries will link to libc and because of this binaries
built on OpenBSD 6.9-current do not run on OpenBSD 6.3. OpenBSD 6.3 has also
not been supported for more than 2 years, so maybe the compat code is not needed.
Still, validating the object length before doing an unsafe pointer conversion
is probably reasonable, but I'm no golang expert.
Signed-off-by: Claudio Jeker <claudio@openbsd.org>
TCP timeouts count is a useful signal to show
abnormal network performance and is another
signal to aid debugging. This metric can be
used to generate proactive alerts for host
network namespace workloads.
Signed-off-by: Martin Kennelly <mkennell@redhat.com>
The new `lnstat` collector produces a high number of metrics, per-cpu,
and results in approximately double the number of metrics previously
scraped. For example, a typical server with 64 cores produces 3832
lnstat metrics compared to 4147 metrics for the remaining collectors.
Therefore disable the `lnstat` collector by default.
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
Sanitizing the metric names can lead to duplicate metric names:
```
caller=level.go:63 level=error caller="error gathering metrics: [from Gatherer #2] collected metric \"node_ethtool_giant_hdr\" { label:<name:\"device\" value:\"ens192\" > untyped:<value:0" msg=" > } was collected before with the same name and label values"
```
Generate a map from the sanitized metric names to the metric names from
ethtool. In case of duplicate sanitized metric names drop both metrics,
because it is unknown which one to take.
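
The drop-on-collision logic can be sketched as follows. Both function names are hypothetical, and `simpleSanitize` is a deliberately simplified stand-in for the real sanitizer.

```go
package main

import (
	"fmt"
	"strings"
)

// simpleSanitize is a simplified stand-in for the real sanitizer: it only
// lower-cases the name and replaces spaces with underscores.
func simpleSanitize(name string) string {
	return strings.ReplaceAll(strings.ToLower(name), " ", "_")
}

// uniqueSanitized maps each sanitized name to its original ethtool name.
// If two or more originals collide on the same sanitized name, all of them
// are dropped because it is unknown which one to keep.
func uniqueSanitized(names []string) map[string]string {
	result := make(map[string]string)
	dropped := make(map[string]bool)
	for _, n := range names {
		s := simpleSanitize(n)
		if dropped[s] {
			continue
		}
		if _, exists := result[s]; exists {
			delete(result, s)
			dropped[s] = true
			continue
		}
		result[s] = n
	}
	return result
}

func main() {
	kept := uniqueSanitized([]string{"giant hdr", "giant_hdr", "rx_packets"})
	fmt.Println(kept) // map[rx_packets:rx_packets]
}
```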
Fixes: https://github.com/prometheus/node_exporter/issues/2185
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
Use SysctlTimeval from the golang.org/x/sys/unix package to
simplify the implementation of the boottime collector for the BSDs and
allow building it without cgo.
Tested on macOS 11.6, FreeBSD 13 and OpenBSD 7.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Add a DMI collector to expose the Desktop Management Interface (DMI)
info from `/sys/class/dmi/id/`. This will expose information about the
BIOS, mainboard, chassis, and product.
Closes: https://github.com/prometheus/node_exporter/issues/303
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
Use `time.NewTimer()` and explicit `Stop()` to avoid memory bloat / GC problems with `time.After()` in the Linux filesystem collector timeout handling.
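
A minimal sketch of the pattern, assuming a hypothetical helper `waitWithTimeout`. Unlike `time.After()`, the timer is released as soon as the function returns instead of lingering until it fires.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out")

// waitWithTimeout is a hypothetical helper: it waits for done to be closed
// or for the timeout to elapse, whichever happens first.
func waitWithTimeout(done <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	// Stop releases the timer's resources immediately on the success path.
	defer timer.Stop()
	select {
	case <-done:
		return nil
	case <-timer.C:
		return errTimeout
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for a slow stat call
		close(done)
	}()
	fmt.Println(waitWithTimeout(done, time.Second)) // <nil>
}
```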
Signed-off-by: bawenmao <bawenmao@sogou-inc.com>
The ethtool_cmd struct from the linux kernel contains information about the speeds and features supported by a
network device. This includes speeds and duplex but also features like autonegotiate and 802.3x pause frames.
Closes #1444
Signed-off-by: W. Andrew Denton <git@flying-snail.net>
* collector: Unwrap glob textfile directories
* collector: Store full path in mtime's file label
The point is to avoid duplicated gauges from files with the same name in
different directories.
This introduces support for exporting from multiple directories matching a
given pattern (e.g. `/home/*/metrics/`).
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
Expose GPU metrics using `sysfs/drm`.
`amdgpu` is the only driver which exposes this information through DRM.
Signed-off-by: Siavash Safi <siavash.safi@gmail.com>
Use the same flag pattern as netdev to make filtering methods the same.
* Move SanitizeMetricName to helper.go
Signed-off-by: Ben Kochie <superq@gmail.com>
* Refactor diskstats_linux to use procfs.
* Add `node_disk_info` metric.
Signed-off-by: W. Andrew Denton <git@flying-snail.net>
Co-authored-by: W. Andrew Denton <git@flying-snail.net>
Currently Node Exporter has a metric called `node_uname_info` which of
course exposes uname info. While this is nice, it does not help if you
are running different OSes which could have similar uname info.
Therefore parse `/etc/os-release` or `/usr/lib/os-release` and expose a
`node_os_info` metric which provides information regarding the OS
release/version of the node. Also expose the major.minor part of the OS
release version as `node_os_version`.
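
A simplified sketch of parsing os-release content, assuming a hypothetical helper `parseOSRelease`. It strips surrounding quotes but does not handle shell-style escapes, which the full format permits.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseOSRelease is a hypothetical helper that parses KEY=VALUE lines,
// skipping blank lines and comments and stripping surrounding quotes.
// Simplification: shell-style escape sequences are not interpreted.
func parseOSRelease(content string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		kv := strings.SplitN(line, "=", 2)
		if len(kv) != 2 {
			continue
		}
		fields[kv[0]] = strings.Trim(kv[1], `"'`)
	}
	return fields
}

func main() {
	content := "NAME=\"Ubuntu\"\nVERSION_ID=\"20.04\"\n# a comment\nID=ubuntu\n"
	release := parseOSRelease(content)
	fmt.Println(release["NAME"], release["VERSION_ID"], release["ID"]) // Ubuntu 20.04 ubuntu
}
```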
Since the os-release files will not change often, cache the parsed
content and only refresh the cache if the modification time changes.
This `os` collector will read files outside of `/proc` and `/sys`, but
the os-release file is widely used and the format is standardized:
https://www.freedesktop.org/software/systemd/man/os-release.html
Bug: https://github.com/prometheus/node_exporter/issues/1574
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
In high scale virtualized / cloud environments there are typically
no guest VMs. Add a boolean flag to allow disabling the Linux guest
CPU metrics.
Signed-off-by: Ben Kochie <superq@gmail.com>
Add a `node_ethtool_info` metric to all ethtool devices to expose driver
information with the following labels:
* bus_info
* driver
* expansion_rom_version
* firmware_version
* version
This metric is useful for monitoring whether the firmware version is up to date.
Note: The version label might be malformed due to bug #39 in ethtool:
https://github.com/safchain/ethtool/issues/39
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
OpenMetrics and the Prometheus exposition format require metric names
to consist only of alphanumeric characters, "_", and ":", and they must
not start with a digit. The metric names from the ethtool stats might contain
spaces, brackets, and dots. Converting them directly to metric names
will produce invalid metric names.
Therefore sanitize the metric names and convert them to lower case.
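
A sketch of such a sanitizer is shown below. The function name and the exact character class are assumptions; the sketch also replaces ":" and does not treat leading digits specially, on the assumption that a fixed prefix is prepended to the result.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidChars matches every character that this sketch does not allow in a
// metric name. Assumption: ":" is replaced too, although the format permits it.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitizeMetricName is a hypothetical helper: it replaces each invalid
// character with "_" and converts the result to lower case.
func sanitizeMetricName(name string) string {
	return strings.ToLower(invalidChars.ReplaceAllString(name, "_"))
}

func main() {
	fmt.Println(sanitizeMetricName("Rx Bytes.Total")) // rx_bytes_total
}
```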
Fixes: https://github.com/prometheus/node_exporter/issues/2083
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
This adds a new flag --collector.ethtool.metrics-include to the ethtool
collector. Only metrics matching this regexp will be collected.
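
The filtering amounts to something like the following sketch, where `filterMetrics` is a hypothetical helper and the stat names are made up:

```go
package main

import (
	"fmt"
	"regexp"
)

// filterMetrics is a hypothetical helper that keeps only the stats whose
// name matches the include pattern.
func filterMetrics(stats map[string]uint64, pattern string) (map[string]uint64, error) {
	include, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	kept := make(map[string]uint64)
	for name, value := range stats {
		if include.MatchString(name) {
			kept[name] = value
		}
	}
	return kept, nil
}

func main() {
	stats := map[string]uint64{"rx_bytes": 1, "tx_bytes": 2, "rx_errors": 3}
	kept, err := filterMetrics(stats, "^rx_")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(kept)) // 2
}
```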
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
Other network-related collectors allow filtering out unwanted devices.
Add this support to the new ethtool collector as well.
Signed-off-by: Benjamin Drung <benjamin.drung@ionos.com>
Update procfs library to include ignored fields ParseInt handling.
Wrap error returns so that the user can know more about what failed.
Returns from getAllocatedThreads() are errors anyway.
Fixes: https://github.com/prometheus/node_exporter/issues/2110
Signed-off-by: Ben Kochie <superq@gmail.com>