[arch-devops] Changing the monitoring for new servers

Sun May 8 19:25:46 UTC 2016

On 05.05.2016 15:57, Sven-Hendrik Haase wrote:
> 1) munin for graphs, timed ansible for notifications

Sounds a little hackish imho, but should be fairly quick to set up.
I don't know how to properly implement trigger/trending/metrics alerts
with it though, e.g. alerting on the derivative over the last couple of
measurements.
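
To make the "derivative over the last couple of measurements" idea
concrete, here is a minimal sketch of what such a check could look like
if it had to be scripted by hand. The function names, the sample data
and the threshold are all made up for illustration; this is not how
munin or ansible do it.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, measured value)


def mean_rate(samples: List[Sample]) -> float:
    """Average rate of change (units per second) from first to last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (v1 - v0) / (t1 - t0)


def should_alert(samples: List[Sample], max_rate: float, window: int = 3) -> bool:
    """True if the value grew faster than max_rate over the last `window` samples."""
    return mean_rate(samples[-window:]) > max_rate


# Hypothetical disk usage in MiB, sampled every 60 seconds.
disk_used = [(0, 1000.0), (60, 1010.0), (120, 1020.0), (180, 1500.0)]
print(should_alert(disk_used, max_rate=2.0))
```

A real setup would also need somewhere to keep the recent samples
between runs, which is exactly the part a timed ansible job doesn't
give you for free.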

> 2) influxdb for storage, collectd for stats collection, graphite for
> graphs, timed ansible for notifications

Instead of graphite I would recommend grafana, but overall the
components (apart from collectd) are still in heavy development. This
setup lacks sophisticated alerting too, just like 1), although grafana
is supposed to get some alerting functionality in future releases IIRC.

> 3) zabbix (does everything)

Haven't gotten into it that much yet, but I have had it running for a
week now and the basic setup is really simple.
From what I have read/tried so far it has some pretty nice features,
including:
  * simple on/off monitoring (boolean item + trigger)
  * very flexible item configuration
  * very flexible trigger options (including mathematical operations)
  * agent, with support for encrypted connection to the master
  * autodiscovery and template support
  * low-level-discovery (LLD) support, e.g. detection of all local
    disks/mountpoints/filesystems + free disk space monitoring for them
  * storage backend can be postgresql, which has proven reliability and
    performance
  * support for ACL stuff, e.g. user/group permissions for server groups

There is a zabbix-grafana plugin that lets you build graphs and
dashboards more flexibly (and make them fancier) than zabbix's built-in
graphing allows.

The webinterface (starting with 3.0.2) is PHP 7 compatible and hasn't
caused any issues for me yet.

In #archlinux.de we have at least one person who is familiar with
zabbix and available for further questions: bastelfreak. One other
hasn't given permission to be listed here yet, but I think he would be
happy to help as well.

> 4) cacti (does everything with plugins)

Personally I think this is horrible. Setup is a real PITA and there
were quite a few mails about SQL injection vulns in the webinterface.
I have only used it for network switch SNMP monitoring. I dunno how to
do service monitoring with it, but I can only imagine that it isn't fun
either.

> 5) prometheus (does everything)

±0, haven't even heard about it until now.

Just to add some more names to the list:

6) ganglia - somewhat like munin, with some clustering support for
scalability, builtin API but no alerting AFAIK

7) nagios - simple on/off monitoring + alerting, config is a bit of a
PITA, scales rather poorly, would require something like munin/ganglia
for metrics

8) icinga - nagios fork, doesn't seem to have metrics

9) check_mk - another fork of nagios, heavily extended, including some
metrics features, mostly coded in python AFAIK

10) sensu - looks like "does everything", haven't used it yet

11) centreon - looks like another "does everything", dunno anything in
detail

12) riemann - event processing and alerting daemon, which could be
integrated into the setup from 2) to replace ansible for notifications.
Totally flexible, but you have to do pretty much everything from scratch
(the config is essentially a clojure program).

So there are quite a few options out there. I have stumbled upon at
least another 3 solutions in the past months whose names I can't even
remember.

--- TL;DR ---
I think the best option would be zabbix, as it provides all the
required features, has an API and can even be combined with grafana if
you want pretty graphs. Fine-tuning and getting all checks in there
with proper templates and grouping will be quite time-consuming, but
it's a widespread solution with excellent help in #zabbix as well.
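
As a rough illustration of the API mentioned above: it is JSON-RPC 2.0
over HTTP, with requests POSTed to api_jsonrpc.php on the frontend. The
sketch below only builds the request bodies and sends nothing. The
helper name, the credentials and the token are placeholders, and the
field names follow the 3.0-era API as far as I know it, so check them
against the version you actually run.

```python
import json
from typing import Any, Dict, Optional


def rpc_request(method: str, params: Dict[str, Any],
                auth: Optional[str] = None, req_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body for the zabbix API."""
    body: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": req_id,
        "auth": auth,  # null for user.login, session token afterwards
    }
    return json.dumps(body)


# Step 1: log in (placeholder credentials) to obtain a session token.
print(rpc_request("user.login", {"user": "Admin", "password": "secret"}))

# Step 2: list hosts using the token returned by step 1 (placeholder value).
print(rpc_request("host.get", {"output": ["hostid", "host"]},
                  auth="PLACEHOLDER_TOKEN", req_id=2))
```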

Thore