1230 Commits

Author SHA1 Message Date
Matt Bostock
9e0aee8ae7 Add metrics exposing extended md RAID info (#958)
Add metrics that expose more information about MD RAID devices and
disks:

- the RAID level in use
- the RAID set that a disk belongs to

This allows for things like alerting on unusually high I/O
utilisation for a disk compared to other disks in the same RAID set,
which usually means the disk is failing, and for comparing
write/read latency across RAID sets.

Output looks like:

    node_md_disk_info{disk_device="/dev/dm-0", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-3", md_device="md1", md_set="B"} 1
    node_md_disk_info{disk_device="/dev/dm-2", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-1", md_device="md1", md_set="B"} 1
    node_md_disk_info{disk_device="/dev/dm-4", md_device="md1", md_set="A"} 1
    node_md_disk_info{disk_device="/dev/dm-5", md_device="md1", md_set="B"} 1
    node_md_info{md_device="md1", md_name="foo", raid_level="10", md_metadata_version="1.2"} 1
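
For illustration only, here is a minimal Python sketch of how a textfile script could render lines in the format shown above. It is not the script added by this commit: the function names and the input dictionaries are assumptions made for the example, and label values are not escaped.

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format.

    Assumption: label values contain no quotes, backslashes or newlines,
    so no escaping is performed here.
    """
    label_str = ", ".join('{}="{}"'.format(k, v) for k, v in labels.items())
    return "{}{{{}}} {}".format(name, label_str, value)


def render_md_metrics(array, disks):
    """Return metric lines for one MD array and its member disks.

    `array` and `disks` are hypothetical plain dictionaries; a real script
    would have to fill them from sysfs or mdadm output.
    """
    lines = []
    for disk in disks:
        lines.append(format_metric("node_md_disk_info", {
            "disk_device": disk["device"],
            "md_device": array["device"],
            "md_set": disk["set"],
        }, 1))
    # Array-wide labels live on a separate info metric, one line per array.
    lines.append(format_metric("node_md_info", {
        "md_device": array["device"],
        "md_name": array["name"],
        "raid_level": array["level"],
        "md_metadata_version": array["metadata_version"],
    }, 1))
    return lines
```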

The `node_md_info` metric, which gives additional information about the
RAID array, is intentionally separate to avoid adding all of those
labels to each disk. If you need to query using the labels contained in
`node_md_info`, you can do that using PromQL:
https://www.robustperception.io/how-to-have-labels-for-machine-roles/

I looked at adding the array UUID, but there's no sysfs entry for it and
I'm not sure there's a strong use case for it.

This patch to add a sysfs entry for the UUID was apparently not
accepted:
https://www.spinics.net/lists/raid/msg40667.html

Add these metrics as a textfile script rather than adding them to the Go
'md' module as they're perhaps less commonly useful. If lots of people
find them useful, we can later rewrite this in Go.

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-08-18 08:57:51 +00:00
Matt Layher
d84873727f vendor: bump github.com/mdlayher/wifi and dependencies (#1045)
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2018-08-14 21:15:07 +02:00
James Hartig
60c827231a NRestarts or NRefused aren't available on older systemd versions (#1039)
* If NRestarts or NRefused are not available, don't ignore the unit itself
* Don't report systemd metrics (NRestarts/NRefused) that are not available
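
A minimal Python sketch of the behaviour described in the two points above, not the collector's actual Go code. `get_property` is a hypothetical lookup that raises `KeyError` when the running systemd does not expose a property, and the metric names are placeholders.

```python
def collect_unit_metrics(unit, get_property):
    """Return the metrics that are available for one unit.

    `unit` is a hypothetical dict with "name" and "active_state" keys.
    `get_property(unit_name, prop)` is a hypothetical callable that raises
    KeyError when the property does not exist on this systemd version.
    """
    # The unit itself is always reported, whatever optional properties exist.
    metrics = {"active": 1 if unit["active_state"] == "active" else 0}
    optional = (("NRestarts", "restarts_total"), ("NRefused", "refused_total"))
    for prop, metric_name in optional:
        try:
            metrics[metric_name] = get_property(unit["name"], prop)
        except KeyError:
            # Older systemd: leave this one metric out, keep the unit.
            continue
    return metrics
```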

Signed-off-by: James Hartig <james@getadmiral.com>
2018-08-14 14:28:26 +02:00
Ben Kochie
fe5a117831 Handle vanishing PIDs (#1043)
PIDs can vanish (exit) from /proc/ between gathering the list of PIDs
and getting all of their stats.

* Ignore file not found errors.
* Explicitly count the PIDs we find.
* Cleanup some error style issues.
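
The approach in the list above can be sketched in Python as follows. This is an illustration under assumptions, not the Go collector: the function name and the `proc_root` parameter are invented for the example.

```python
import os


def count_process_states(proc_root="/proc"):
    """Count PIDs and tally their states, tolerating PIDs that exit mid-scan.

    `proc_root` is a parameter added for this sketch so the function can be
    pointed at a fixture directory.
    """
    pids = 0
    states = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except (FileNotFoundError, ProcessLookupError):
            # The process exited between listdir() and the read: skip it.
            continue
        # The state is the first field after the parenthesised command name.
        state = stat.rsplit(")", 1)[1].split()[0]
        pids += 1  # count only the PIDs we actually managed to read
        states[state] = states.get(state, 0) + 1
    return pids, states
```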

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-08-13 17:27:23 +02:00
Ben Kochie
099c1527f1 Update build (#1041)
Update build

* Update to Go 1.10.
* Enable `ppc64le` build.
* Enable MIPS builds.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-08-13 17:26:55 +02:00
Ben Kochie
0662673ad6 Disable wifi collector by default (#1037)
* Disable wifi collector by default

Disable the wifi collector by default due to suspected caching issues and goroutine leaks.
* https://github.com/prometheus/node_exporter/issues/870
* https://github.com/prometheus/node_exporter/issues/1008

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-08-07 10:27:20 +02:00
Ben Kochie
5d23ad0ca7 Fix supervisord collector (#978)
* Replace supervisord xmlrpc library
* Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
* Fix uptime metric

* Use Prometheus best practices for uptime metric.
  * Use "start time" rather than "uptime".
  * Don't emit a start time if the process is down.
* Add changelog entry.
* Add example compatibility rules.
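
The "start time rather than uptime" point can be illustrated with a short Python sketch. The function name and parameters are assumptions for this example, not the collector's code. A start timestamp stays constant while the process runs, and the uptime can still be derived at query time by subtracting it from the current time.

```python
import time


def start_time_seconds(uptime_seconds, is_running, now=None):
    """Convert a reported uptime into a start timestamp.

    Returns None when the process is down, so that no sample is emitted.
    `now` is injectable for testing; it defaults to the current Unix time.
    """
    if not is_running:
        return None
    if now is None:
        now = time.time()
    return now - uptime_seconds
```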

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-08-06 16:54:46 +02:00
Julius Volz
2c52b8c761 systemd: Remove unneeded/unhandled error returns (#1035)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-05 16:55:25 +02:00
Christian Hoffmann
6bdc5558ec build: make staticcheck happy by using real regexp patterns #1025 (#1026)
Signed-off-by: Christian Hoffmann <mail@hoffmann-christian.info>
2018-07-30 07:57:18 +02:00
Rene Treffer
80a5712b97 Fix sample rules for migration (#1022)
- add conversion from _ms to _seconds on disk metrics
- add missing node_textfile_mtime section
- add groups: header to pass promtool check rules

Signed-off-by: Rene Treffer <rene.treffer@soundcloud.com>
2018-07-27 14:27:44 +02:00
Hannes Körber
14a4f0028e Enable nfs protocol (#998)
* vendor: Update prometheus/procfs

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* mountstats: Use new NFS protocol field

In https://github.com/prometheus/procfs/pull/100, the NFSTransportStats
struct was expanded by a field called protocol that specifies the NFS
protocol in use, either "tcp" or "udp". This commit adds the protocol as
a label to all NFS metrics exported via the mountstats collector.

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>

* Update fixtures for UDP mount

Signed-off-by: Hannes Körber <hannes.koerber@haktec.de>
2018-07-24 00:47:12 +02:00
Johannes Wienke
5c780d132c Exclude only subdirectories of /var/lib/docker (#1003)
It is quite common to put /var/lib/docker itself on a separate partition
and that should be monitored as well.

Signed-off-by: Johannes Wienke <languitar@semipol.de>
2018-07-23 15:43:42 +02:00
Ben Kochie
ca2fa4684b Fix docker build (#1016)
Fix override of make docker target to include new `DOCKER_REPO`
variable pattern.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-07-23 10:56:20 +02:00
Ben Kochie
981de58fad Update build (#1010)
* Update from upstream `Makefile.common`.
* Update CircleCI with simplified upstream templating.
* Cleanup `Makefile`.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-07-23 09:38:39 +02:00
Ben Kochie
23f95c8e04 Fix ntp collector thread safety (#1014)
Make the ntp collector thread safe by wrapping a mutex lock around the
leapMidnight variable.
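
As a Python analogue of guarding a shared variable with a mutex (the real change is in Go), a sketch might look like this. The class and method names are assumptions for the example.

```python
import threading


class LeapState:
    """Holds a leap-midnight timestamp behind a lock.

    Hypothetical stand-in for the collector's shared variable: every read
    and write goes through the same lock, so concurrent scrapes cannot
    observe or produce a torn update.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._leap_midnight = None

    def set(self, value):
        with self._lock:
            self._leap_midnight = value

    def get(self):
        with self._lock:
            return self._leap_midnight
```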

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-07-22 14:36:33 +02:00
xginn8
140b8b85c3 Filter out uninstalled systemd units when collecting all units (#1011)
fixes #567

Signed-off-by: Matthew McGinn <mamcgi@gmail.com>
2018-07-22 09:20:03 +02:00
Sven Lange
2ae8c1c7a7 Add systemd uptime metric collection (#952)
* Add systemd uptime metric collection

Signed-off-by: Sven Lange <tdl@hadiko.de>
2018-07-18 16:02:05 +02:00
Ben Kochie
354115511c Add note about SYS_TIME capability for Docker. (#1001)
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-07-16 18:30:19 +02:00
neiledgar
7e4d9bd150 Update wifi stats to support multiple stations (#977) (#980)
Signed-off-by: neiledgar <neil.edgar@btinternet.com>
2018-07-16 16:02:25 +02:00
xginn8
9b97f44a70 Add a counter for refused socket unit connections, available as of systemd 239 (#995)
Signed-off-by: xginn8 <mamcgi@gmail.com>
2018-07-16 16:01:42 +02:00
Brandon Gilmore
76bbd8dd18 Use /proc/mounts instead of statfs(2) for ro state (#1002)
While the statfs(2) approach is reliable for normally mounted filesystems, the
flags returned can be inconsistent when a filesystem has been remounted read-only
after encountering an error. The returned flags do accurately represent the
internal state of the filesystem, but they do not reflect whether the VFS layer
will accept writes. Instead, it makes sense to parse the current VFS mount
state from the options field in /proc/mounts since it takes precedence.
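
A minimal Python sketch of reading the read-only state from the options field, assuming the standard `/proc/mounts` layout (device, mount point, filesystem type, options, dump, pass). The function name is invented for the example, and this is not the collector's Go implementation.

```python
def read_only_mounts(mounts_text):
    """Map each mount point to True if "ro" is among its mount options.

    `mounts_text` is the content of /proc/mounts. If a mount point appears
    more than once, the last line wins in this sketch.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed or empty line
        mount_point = fields[1]
        options = fields[3].split(",")
        result[mount_point] = "ro" in options
    return result
```

In real use the text would come from reading `/proc/mounts`, for example `read_only_mounts(open("/proc/mounts").read())`.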

Signed-off-by: Brandon Gilmore <bgilmore@valvesoftware.com>
2018-07-16 15:56:27 +02:00
Jan Klat
c4102f1175 Add sys/class/net parsing from procfs and expose its metrics (#851)
* add sys/class/net parsing from procfs and expose its metrics

Signed-off-by: Jan Klat <jenik@klatys.cz>

* change code to use int pointers per procfs change, move netclass to separate collector, change metric naming

Signed-off-by: Jan Klat <jenik@klatys.cz>

* bump year in licence, remove redundant newline, correct fixtures

Signed-off-by: Jan Klat <jenik@klatys.cz>

* fix style

Signed-off-by: Jan Klat <jenik@klatys.cz>

* change carrier changes to counter type

Signed-off-by: Jan Klat <jenik@klatys.cz>

* fix e2e output

Signed-off-by: Jan Klat <jenik@klatys.cz>

* add fixtures

Signed-off-by: Jan Klat <jenik@klatys.cz>

* update vendor, use fixtures correctly

Signed-off-by: Jan Klat <jenik@klatys.cz>

* change fixtures (device in /sys/class/net should be symlinked)

Signed-off-by: Jan Klat <jenik@klatys.cz>

* correct fixtures for 64k page, updated readme

Signed-off-by: Jan Klat <jenik@klatys.cz>
2018-07-16 15:08:18 +02:00
mknapphrt
09b4305090 Changed the way that stuck mounts are handled. If a mount fails to return, it will stop being queried until it returns. (#997)
Fixed spelling mistakes.

Update transport_generic.go

Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck.

Removed unnecessary lock channel and clarified some var names.

Fixed style nits.
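
The mutex-plus-timeout idea described above can be sketched in Python as follows. This is only an illustration under assumptions: `MountChecker`, `stat_fn` and the timeout value are invented for the example and do not mirror the Go code.

```python
import threading


class MountChecker:
    """Query mounts with a timeout and skip mounts whose query is still hung."""

    def __init__(self, stat_fn, timeout=0.5):
        self._stat_fn = stat_fn      # hypothetical blocking call, e.g. a statfs wrapper
        self._timeout = timeout      # seconds to wait before declaring a mount stuck
        self._lock = threading.Lock()
        self._stuck = set()

    def check(self, mount_point):
        """Return stat_fn(mount_point), or None if the mount is (still) stuck."""
        with self._lock:
            if mount_point in self._stuck:
                return None  # an earlier query has not returned; do not pile up
        result = {}
        done = threading.Event()

        def worker():
            try:
                result["value"] = self._stat_fn(mount_point)
            finally:
                with self._lock:
                    # The call returned, so the mount may be queried again.
                    self._stuck.discard(mount_point)
                    done.set()

        threading.Thread(target=worker, daemon=True).start()
        if done.wait(self._timeout):
            return result.get("value")
        with self._lock:
            if done.is_set():  # finished just after the timeout expired
                return result.get("value")
            self._stuck.add(mount_point)
        return None
```

Setting the event while holding the lock means the timeout path cannot mark a mount as stuck after its query has already returned.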

Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
2018-07-14 11:10:28 +02:00
xginn8
ac5a981761 Adding socket stat collection for systemd socket units (#968)
Signed-off-by: xginn8 <mamcgi@gmail.com>
2018-07-05 16:26:48 +02:00
xginn8
8af84a215d Add support for NRestarts counter introduced in systemd 235 (#992)
* Add support for NRestarts counter introduced in systemd 235

`.service` units increment this counter any time the Restart= condition is
triggered.

Signed-off-by: Matthew McGinn <mamcgi@gmail.com>
2018-07-05 13:31:45 +02:00
Bernd Müller
ee1e1997bc Add scsi smart data to prometheus exporter (#862)
Add scsi smart data to prometheus exporter

Signed-off-by: mueller <mueller@b1-systems.de>
2018-07-04 00:30:20 +02:00
Ivan Kiselev
ae90bac5b8 Add example of translating new metrics to old format in case of migration to 0.16 version (#982)
Add additional example of how to save old metrics

Signed-off-by: Ivan Kiselev <ivan@messagebird.com>
2018-07-02 12:39:32 +02:00
Ben Kochie
107e5dfecc Fix mdadm collector issues (#985)
* Send "Personality unknown" to debug, not info, remove unnecessary newline.
* Add support for "linear" personality.
* Always set number of active disks to 0 when a device is inactive.
* Add total disks calculation to unknown personalities.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-07-02 12:38:20 +02:00
Roman Vynar
55c32fcf02 Add compat rules for filesystem collector. (#973)
Signed-off-by: Roman Vynar <roman.vynar@goquiq.com>
2018-06-13 18:32:07 +02:00
Matt Bostock
f56e8fcdf4 Fix spelling of celsius in IPMI example script (#967)
'Celsius' should be spelt with an 's':
https://en.wikipedia.org/wiki/Celsius

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-06-08 19:21:19 +02:00
Derek Marcotte
2678d68dcc Fix for #945, cpu temperature is signed. (#965)
* Fix for #945, cpu temperature is signed.

Added a type conversion to cpu temperature sysctl.  This will still
collect/report -1 when the value is -1, because it should be up to
interpretation whether this is the correct value for the system or
not.

Some drivers will report -1 for cpu temperature.  Other sensors will
report "an input into the fan control algorithm", i.e. not the actual
temperature, but how much fan it wants.  Some people cool their machines
with liquid nitrogen.

Signed-off-by: Derek Marcotte <554b8425@razorfever.net>
2018-06-07 15:01:25 +02:00
Brad Beam
e3cf1d5187 Adding support for evaluating octal characters in mountpoint (#954)
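
For context, `/proc/mounts` escapes certain characters in mount points as three-digit octal sequences, for example `\040` for a space. A minimal Python sketch of decoding them follows. The function name is an assumption and this is not the Go code from the commit.

```python
import re

# A backslash followed by exactly three octal digits, e.g. \040.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def unescape_mount_point(raw):
    r"""Decode \NNN octal escapes in a mount point read from /proc/mounts.

    Limitation of this sketch: each escape is mapped to a single code point,
    which is correct for the ASCII characters these escapes normally encode
    (space, tab, newline, backslash).
    """
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), raw)
```
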
Signed-off-by: Brad Beam <brad.beam@b-rad.info>
2018-06-06 16:49:19 +02:00
Matt Layher
cd217b77f5 Merge pull request #963 from prometheus/mdl-vendor-wifi
vendor: bump github.com/mdlayher/wifi
2018-06-05 16:44:19 -04:00
Pavlo Kutishchev
456bf5094a Add processes exporter (#950)
* Add processes exporter

Signed-off-by: Pavel Kutishchev <pavel.kutishchev@olx.com>
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-06-05 19:38:32 +02:00
Matt Layher
0f7eba1dec vendor: bump github.com/mdlayher/wifi
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2018-06-01 13:38:40 -04:00
Ben Kochie
278a98fee0 Merge pull request #960 from prometheus/fish-remove-travis-build-batch
Remove travis build badge
2018-05-30 21:46:00 +02:00
Johannes 'fish' Ziemke
a6a8ec3c1c Remove travis build badge
Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
2018-05-30 19:16:18 +02:00
Matt Bostock
516e5d4beb Add metric for outdated libraries (#957)
Add metrics that count how many running processes are linking to deleted
libraries on each machine. Deleted libraries are usually outdated
libraries, and outdated libraries may have known security
vulnerabilities.

The rationale behind storing these as metrics is to allow the rollout of
security fixes to be tracked across a fleet of machines, ensuring that
all affected processes are restarted (e.g. via a reboot).

I'm parsing the output from `/proc/*/maps` because using
`lsof -d DEL` can be too slow, particularly if you have sockets that
bind to thousands of IP addresses.

The metric labels include the library path and the base filename, which
allows us to pinpoint the exact path of the deleted library but also
allows us to aggregate on the library name (or approximations of it)
even if library locations differ between operating system versions.

The metrics output and the CPU time consumed is as follows:

    user@host:~$ time sudo python processes.py
    # HELP node_processes_linking_deleted_libraries Count of running processes that link a deleted library
    # TYPE node_processes_linking_deleted_libraries gauge
    node_processes_linking_deleted_libraries{library_path="/usr/lib/locale", library_name="locale-archive"} 3
    node_processes_linking_deleted_libraries{library_path="/usr/lib/x86_64-linux-gnu", library_name="libevent-2.0.so.5.1.9"} 4

    real    0m0.071s
    user    0m0.030s
    sys     0m0.041s

Including the library filename and path will result in reasonably high
metrics cardinality; however, I think the benefits when an urgent
security patch is being deployed outweigh concerns around cardinality.

This script assumes that library files do not contain spaces in their
path.
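
The parsing approach described in this message can be sketched roughly as follows. This is not the `processes.py` script itself. The function names are assumptions, the input is passed in as text rather than read from `/proc`, and unlike a real script it does not filter out deleted mappings that are not libraries.

```python
import os
from collections import Counter


def deleted_libraries(maps_text):
    """Return the deleted mapped files found in one /proc/<pid>/maps dump.

    Assumes, like the commit message, that paths contain no spaces, so each
    line of interest splits into exactly seven fields:
    address perms offset dev inode pathname "(deleted)".
    """
    found = set()
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[6] == "(deleted)" and fields[5].startswith("/"):
            found.add(fields[5])
    return found


def count_processes(maps_by_pid):
    """Count processes per (directory, filename) of each deleted mapping.

    `maps_by_pid` is a hypothetical dict of PID -> maps file content. A
    process is counted once per library even if it maps it several times.
    """
    counts = Counter()
    for maps_text in maps_by_pid.values():
        for path in deleted_libraries(maps_text):
            counts[(os.path.dirname(path), os.path.basename(path))] += 1
    return counts
```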

Signed-off-by: Matt Bostock <mbostock@cloudflare.com>
2018-05-25 18:20:42 +02:00
Ivan Voronchihin
606568314b Add Makefile.common (#940)
* Add Makefile.common

Signed-off-by: bege13mot <bege13mot@gmail.com>

* Change Makefile.common to initial Prometheus common

Signed-off-by: bege13mot <bege13mot@gmail.com>

* fix checkmetrics

Signed-off-by: bege13mot <bege13mot@gmail.com>

* fix promu

Signed-off-by: bege13mot <bege13mot@gmail.com>

* Add test to common

Signed-off-by: bege13mot <bege13mot@gmail.com>

* Fix GOPATH

Signed-off-by: bege13mot <bege13mot@gmail.com>

* Initial Makefile.common

Signed-off-by: bege13mot <bege13mot@gmail.com>

* original Makefile.common

Signed-off-by: bege13mot <bege13mot@gmail.com>

* delete promu

Signed-off-by: bege13mot <bege13mot@gmail.com>

* delete redundant .PHONY params

Signed-off-by: bege13mot <bege13mot@gmail.com>
2018-05-24 23:31:48 +02:00
Ben Kochie
04d69158b4 Merge pull request #949 from szeestraten/patch-1
Fix metric name in directory size text collector example
2018-05-19 23:53:32 +02:00
Sandor Zeestraten
578d814744 Fix metric name in directory size text collector example
The directory size text collector example uses the wrong metric name in the HELP and TYPE lines, rendering the comments unusable.

This fixes that by using the same metric name.

Signed-off-by: Sandor Zeestraten <sandor@zeestrataca.com>
2018-05-19 21:11:46 +02:00
Ben Kochie
699b6d7f15 Merge pull request #948 from prometheus/superq/rules
Update example rules
2018-05-18 08:57:31 +02:00
Ben Kochie
ec28a8e9d4 Fix cpu utilization rule.
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-17 18:15:07 +02:00
Ben Kochie
eb3f922c50 Update naming.
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-17 17:52:07 +02:00
Ben Kochie
f00f3db08b Add a CPU in-use recording example.
Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-17 17:49:07 +02:00
Ben Kochie
628b2db5bc Update example rules
* Remove Prometheus 1.x example file.
* Update CPU rules for 0.16.0.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-17 17:44:39 +02:00
Ben Kochie
b8918c7d32 Merge pull request #947 from nicholascapo/add_v16_MemAvailable_rule
docs: Add example recording rule for node_memory_MemAvailable
2018-05-17 08:11:42 +02:00
Nicholas Capo
09d11817d0 docs: Add example recording rule for node_memory_MemAvailable
Signed-off-by: Nicholas Capo <nicholas.capo@gmail.com>
2018-05-16 17:01:51 -05:00
Ben Kochie
d42bd70f43 Merge pull request #939 from prometheus/superq/0.16
Release 0.16.0
2018-05-15 17:49:24 +02:00
Ben Kochie
1882a08041 Release 0.16.0
Changes since 0.16.0-rc.3

* [CHANGE] align Darwin disk stat names with Linux #930

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-15 16:16:05 +02:00