[arch-devops] Changing the monitoring for new servers
As you have probably read, we're getting a new bunch of servers quite soon. I want to change our monitoring solution on those new servers, as we are currently using munin, which has static graphs only and no built-in notifications. I think notifications would be helpful so that we're the first ones to know when a service is down or a server runs out of memory/storage.

The alternatives that were mentioned are these:

1) munin for graphs, timed ansible for notifications
2) influxdb for storage, collectd for stats collection, graphite for graphs, timed ansible for notifications
3) zabbix (does everything)
4) cacti (does everything with plugins)
5) prometheus (does everything)

My thoughts:

1) Basically the same as what we have now. Static daily/weekly/monthly graphs only. On top of that, we'd use a systemd-timer unit for ansible, I suppose. Munin generally is fairly easy to maintain and set up.
2) Four different programs, but each taking care of a certain task. I'm not sure whether this counts as complex or KISS. But really, I have no idea about this stack as I have not worked with any element in it.
3) No idea about this one, but at a glance it looked rather complex and kind of icky to use.
4) I used this one once. It's a heavy-duty PHP-based monitoring software that I probably wouldn't recommend. It was a bitch to set up and use when I tried it some years back. It might have improved.
5) This is the newcomer. Seems to be doing everything we'd need, but I haven't checked it out in detail yet. Configuration format seems fairly easy and graphs are pretty. Might be worthwhile.

Would be interested in hearing your thoughts. If you have something to add, just add numbered bullet points. If nothing comes of this, we'll probably go back to munin.

Sven
On 05.05.2016 15:57, Sven-Hendrik Haase wrote:
3) zabbix (does everything)
I've heard good things about this so far. We'd need to decide where to run the server though (which might be tricky), and it probably requires some work to set up all the checks and alerts. I don't know if it has any auto configuration like munin.

I'm ±0 on the others since I don't know enough about them.

Florian
On 05.05.2016 15:57, Sven-Hendrik Haase wrote:
1) munin for graphs, timed ansible for notifications
Sounds a little hackish imho, but should be fairly quick to set up. I don't know how to properly implement trigger/trending/metrics alerts with it, though, e.g. monitoring the derivative over the last couple of measurements.
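To make the "derivative over the last couple of measurements" idea concrete, here is a minimal sketch of such a trend check. It assumes the samples are already available as (timestamp, value) pairs; the names `average_rate` and `should_alert`, the window size and the threshold are all hypothetical and not tied to munin, ansible or any other tool from this list.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, measured value); hypothetical sample layout
Sample = Tuple[float, float]


def average_rate(samples: Sequence[Sample]) -> float:
    """Average rate of change (value units per second) from first to last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    return (v1 - v0) / (t1 - t0)


def should_alert(samples: Sequence[Sample], max_rate: float, window: int = 5) -> bool:
    """True if the value grew faster than max_rate over the last `window` samples."""
    recent = samples[-window:]
    if len(recent) < 2:
        # Not enough data to estimate a trend yet.
        return False
    return average_rate(recent) > max_rate
```

Something like `should_alert(disk_usage_samples, max_rate=0.1, window=5)` could then run from a timer, with `disk_usage_samples` being whatever the collector has stored. A real setup would probably want smoothing or a regression over the window instead of just the two end points.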
2) influxdb for storage, collectd for stats collection, graphite for graphs, timed ansible for notifications
Instead of graphite I would recommend grafana, but overall the components (apart from collectd) are still in heavy development. It lacks sophisticated alerting too, just like 1), although grafana is supposed to get some alerting functionality in future releases IIRC.
3) zabbix (does everything)
Haven't gotten into it that much yet, but I have had it running for a week now and the basic setup is really simple. From what I have read/tried so far it has some pretty nice features, including:

* simple on/off monitoring (boolean item + trigger)
* very flexible item configuration
* very flexible trigger options (including mathematical operations)
* agent, with support for encrypted connections to the master
* autodiscovery and template support
* low-level-discovery (LLD) support, e.g. detection of all local disks/mountpoints/filesystems + free disk space monitoring for them
* storage backend is postgresql, which has proven reliability and performance
* support for ACL stuff, e.g. user/group permissions for server groups

There is a zabbix-grafana plugin that lets you build graphs and dashboards more flexibly (and fancier) than the zabbix internals.

The webinterface (starting with 3.0.2) is PHP 7 compatible and hasn't caused any issues for me yet.

In #archlinux.de we have at least 1 guy who is familiar with zabbix and available for further questions: bastelfreak. One other hasn't given permission to be listed here yet, but I think he would be happy to help as well.
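The low-level-discovery item from the feature list (find all local filesystems, then check free space on each) can be illustrated independently of zabbix. The sketch below is not zabbix's API or configuration; it assumes a Linux-style mounts table (the `/proc/mounts` layout of device, mountpoint, fstype, ...) passed in as text, and the filesystem selection, the 10% threshold and all function names are hypothetical.

```python
import shutil
from typing import Callable, Dict, Iterable, List

# Hypothetical selection of "real" filesystems worth monitoring.
MONITORED_FS = frozenset({"ext4", "xfs", "btrfs"})


def discover_mountpoints(mounts_text: str,
                         fs_types: Iterable[str] = MONITORED_FS) -> List[str]:
    """Return mountpoints whose filesystem type is in fs_types.

    Expects lines shaped like /proc/mounts: device mountpoint fstype ...
    Escaped characters in mountpoints (e.g. \\040 for a space) are not decoded.
    """
    wanted = set(fs_types)
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in wanted:
            found.append(fields[1])
    return found


def low_space(mountpoints: Iterable[str],
              min_free: float = 0.1,
              usage: Callable = shutil.disk_usage) -> Dict[str, float]:
    """Map each mountpoint with less than min_free free fraction to that fraction."""
    problems = {}
    for mountpoint in mountpoints:
        stats = usage(mountpoint)
        if stats.total == 0:
            # Pseudo filesystems can report a size of zero; skip them.
            continue
        fraction = stats.free / stats.total
        if fraction < min_free:
            problems[mountpoint] = fraction
    return problems
```

On a Linux host this could be driven with `low_space(discover_mountpoints(open("/proc/mounts").read()))`. The `usage` parameter only exists so the check can be exercised without touching real disks.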
4) cacti (does everything with plugins)
Personally I think this is horrible. Setup is a real PITA and there were quite a few mails about SQL injection vulns in the webinterface. I only used it for network switch SNMP monitoring; dunno how to do service monitoring with it, but I can only imagine that it isn't fun either.
5) prometheus (does everything)
±0, haven't even heard about it until now.

Just to add some more names to the list:

6) ganglia - somewhat like munin, with some clustering support for scalability, builtin API but no alerting AFAIK
7) nagios - simple on/off monitoring + alerting, config is a bit of a PITA, scales rather poorly, would require something like munin/ganglia for metrics
8) icinga - nagios fork, doesn't seem to have metrics
9) check_mk - another fork of nagios, heavily extended, including some metrics features, mostly coded in python AFAIK
10) sensu - looks like "does everything", haven't used it yet
11) centreon - looks like another "does everything", dunno anything in detail
12) riemann - event processing and alerting daemon that could be integrated into the construct from 2) to replace ansible for notifications, totally flexible but you have to do pretty much everything from scratch (config is essentially a clojure program)

So there are quite a few options out there; I have stumbled upon at least another 3 solutions in the past months whose names I can't even remember.

--- TL;DR ---

I think the best option would be zabbix, as it provides all the required features, has an API and can even be combined with grafana if you want pretty graphs. Fine-tuning and getting all checks in there with proper templates and grouping will be quite time-consuming, but it's a widespread solution with excellent help in #zabbix as well.

Thore
On 2016-05-05 15:57, Sven-Hendrik Haase wrote:
Would be interested in hearing your thoughts. If you have something to add, just add numbered bullet points. If nothing comes of this, we'll probably go back to munin.
Sven
I completely fail to see any advantages of 2 and 5. Prometheus isn't much older than what I proposed with the InfluxDB stack (and I didn't mention Ansible usage there for alerting).

I'd really like to give Sensu a shot, but I have already made some snarky comments about complicated projects that need an MQ, so let's skip it.

Given that there is a Grafana plugin for Zabbix, I guess it's the best option considering our low number of machines to monitor. It covers both graphing and alerting, so it sounds very good to me. We should also consider promoting its AUR maintainer, as I remember he has packaged it for quite a long time and the PKGBUILD is quite flawless.

Bartłomiej
Alright, thanks for the feedback, guys. We'll roll with zabbix + grafana.

barthalion, I like your suggestion about making the AUR maintainer a TU. Will you handle that? I'll set up our new servers in the meantime.

Sven

On May 11, 2016 9:02 AM, "Bartłomiej Piotrowski" <bpiotrowski@archlinux.org> wrote:
[...]
participants (4)
- Bartłomiej Piotrowski
- Florian Pritz
- Sven-Hendrik Haase
- Thore Boedecker