On 05.05.2016 15:57, Sven-Hendrik Haase wrote:
> 1) munin for graphs, timed ansible for notifications
Sounds a little hackish imho, but should be fairly quick to set up. However, I don't know how to properly implement trigger/trend/metric-based alerts with it, e.g. monitoring the derivative over the last couple of measurements.
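The math for such a trend alert isn't the hard part; wiring it into munin/ansible is. One tool-agnostic approach is to fit a least-squares slope over the last N measurements and alert when it exceeds a threshold. A rough Python sketch, where the class name, window size, threshold and sample source are my own assumptions and not something munin or ansible provide:

```python
from collections import deque


class TrendAlert:
    """Hypothetical trend alert: fires when the slope over a window is too steep."""

    def __init__(self, window, max_rate):
        # keep only the last `window` (timestamp, value) samples
        self.samples = deque(maxlen=window)
        self.max_rate = max_rate  # allowed increase in value units per second

    def add(self, timestamp, value):
        """Record a sample and return True if the current trend exceeds max_rate."""
        self.samples.append((timestamp, value))
        if len(self.samples) < 2:
            return False
        return self.slope() > self.max_rate

    def slope(self):
        # least-squares slope of value over time for the samples in the window
        n = len(self.samples)
        mean_t = sum(t for t, _ in self.samples) / n
        mean_v = sum(v for _, v in self.samples) / n
        num = sum((t - mean_t) * (v - mean_v) for t, v in self.samples)
        den = sum((t - mean_t) ** 2 for t, _ in self.samples)
        return num / den if den else 0.0
```

E.g. fed with disk usage in percent every 60s, `max_rate=0.01` would fire once usage grows by more than roughly 0.6 % per minute on average across the window.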
> 2) influxdb for storage, collectd for stats collection, graphite for graphs, timed ansible for notifications
Instead of graphite I would recommend grafana, but overall the components (apart from collectd) are still under heavy development. This option lacks sophisticated alerting too, just like 1), although grafana is supposed to get some alerting functionality in a future release IIRC.
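For custom metrics that collectd doesn't cover, data points could also be pushed to influxdb directly. A minimal sketch of formatting one point, assuming influxdb 0.9+ line protocol and float-only field values (integer, string and boolean fields are left out); `format_point` is a hypothetical helper, not part of any influxdb client:

```python
def _escape(s):
    # commas, spaces and equals signs must be backslash-escaped in tag/field keys and tag values
    return s.replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")


def format_point(measurement, tags, fields, timestamp_ns=None):
    """Build one line-protocol line: measurement[,tags] fields [timestamp]."""
    # measurement names only need commas and spaces escaped
    line = measurement.replace(",", r"\,").replace(" ", r"\ ")
    # tags sorted by key, which influxdb recommends for write performance
    for key in sorted(tags):
        line += f",{_escape(key)}={_escape(str(tags[key]))}"
    # assumption: all field values are floats, written via repr()
    line += " " + ",".join(f"{_escape(k)}={float(v)!r}" for k, v in sorted(fields.items()))
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # default precision is nanoseconds
    return line
```

The resulting lines would then be sent in an HTTP POST body to influxdb's `/write?db=<database>` endpoint.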
> 3) zabbix (does everything)
Haven't gotten into it that much yet, but I've had it running for a week now and the basic setup is really simple. From what I have read/tried so far it has some pretty nice features, including:
* simple on/off monitoring (boolean item + trigger)
* very flexible item configuration
* very flexible trigger options (including mathematical operations)
* agent, with support for encrypted connections to the server
* autodiscovery and template support
* low-level discovery (LLD) support, e.g. detection of all local disks/mountpoints/filesystems + free disk space monitoring for them
* storage backend can be postgresql, which has proven reliability and performance
* support for ACLs, e.g. user/group permissions for server groups

There is a zabbix-grafana plugin that lets you build graphs and dashboards more flexibly (and fancier) than zabbix's built-in ones. The web interface (starting with 3.0.2) is PHP 7 compatible and hasn't caused any issues for me yet.

In #archlinux.de we have at least one guy who is familiar with zabbix and available for further questions: bastelfreak. One other hasn't given permission to be listed here yet, but I think he would be happy to help as well.
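To illustrate how LLD works: zabbix ships a built-in `vfs.fs.discovery` key for filesystems, and a custom discovery rule (e.g. an agent `UserParameter`) only has to print JSON in the same `{"data": [...]}` shape, with `{#MACRO}` keys that item prototypes then use. A rough Python sketch of such a script; the function name and the filter list are my own assumptions:

```python
import json

# assumption: pseudo filesystems we don't want free-space items for
IGNORED_FSTYPES = {"proc", "sysfs", "devtmpfs", "tmpfs", "cgroup", "devpts"}


def discover(mounts_text):
    """Turn /proc/mounts-style lines into zabbix LLD JSON (hypothetical helper)."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # note: /proc/mounts escapes spaces in paths as \040; not decoded here
        _device, mountpoint, fstype = fields[:3]
        if fstype in IGNORED_FSTYPES:
            continue
        entries.append({"{#FSNAME}": mountpoint, "{#FSTYPE}": fstype})
    return json.dumps({"data": entries})
```

Run against the contents of `/proc/mounts`, an item prototype like `vfs.fs.size[{#FSNAME},pfree]` would then get one free-space item per discovered mountpoint.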
> 4) cacti (does everything with plugins)
Personally I think this is horrible. Setup is a real PITA and there have been quite a few mails about SQL injection vulns in the web interface. I only used it for SNMP monitoring of network switches; dunno how to do service monitoring with it, but I can only imagine that it isn't fun either.
> 5) prometheus (does everything)
±0, haven't even heard of it until now.

Just to add some more names to the list:
6) ganglia - somewhat like munin, with some clustering support for scalability, built-in API but no alerting AFAIK
7) nagios - simple on/off monitoring + alerting, config is a bit of a PITA, scales rather poorly, would require something like munin/ganglia for metrics
8) icinga - nagios fork, doesn't seem to have metrics
9) check_mk - based on nagios and heavily extended, including some metrics features, mostly coded in python AFAIK
10) sensu - looks like "does everything", haven't used it yet
11) centreon - looks like another "does everything", don't know anything about it in detail
12) riemann - event processing and alerting daemon that could be integrated into the setup from 2) to replace ansible for notifications; totally flexible, but you have to do pretty much everything from scratch (the config is essentially a clojure program)

So there are quite a few options out there; I have stumbled upon at least 3 other solutions in the past months whose names I can't even remember.

--- TL;DR ---
I think the best option would be zabbix, as it provides all the required features, has an API and can even be combined with grafana if you want pretty graphs. Fine-tuning and getting all checks in there with proper templates and grouping will be quite time-consuming, but it's a widespread solution with excellent help available in #zabbix as well.

Thore