[arch-devops] Changing the monitoring for new servers

5 May 2016

As you have probably read, we're getting a new bunch of servers quite soon.
I want to change our monitoring solution on those new servers. We are
currently using munin, but it only has static graphs and no built-in
notifications. I think notifications would be helpful so that we're the
first ones to know when a service is down or a server runs out of
memory/storage.

The alternatives that were mentioned are these:

1) munin for graphs, timed ansible for notifications
2) influxdb for storage, collectd for stats collection, graphite for
graphs, timed ansible for notifications
3) zabbix (does everything)
4) cacti (does everything with plugins)
5) prometheus (does everything)

My thoughts:

1) Basically the same as what we have now. Static
daily/weekly/monthly graphs only. On top of that, we'd use a systemd-timer
unit for ansible I suppose. Munin generally is fairly easy to maintain and
set up.
2) Four different programs, but each takes care of a certain task. I'm
not sure whether this counts as complex or KISS. But really I have no idea
about this stack as I have not worked with any element in it.
3) No idea about this one, but at a glance it looked rather complex and
kind of icky to use.
4) I used this one once. It's a heavy-duty PHP-based monitoring software
that I probably wouldn't recommend. It was a bitch to set up and use when I
tried it some years back. It might have improved.
5) This is the newcomer. Seems to be doing everything we'd need but I
haven't checked it out in detail yet. Configuration format seems fairly
easy and graphs are pretty. Might be worthwhile.
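
As a rough illustration of the notification side of 1) and 2), the kind of
check a timed job could run might look like the sketch below. It is not
tied to ansible or munin. The function names and thresholds are made up
for illustration, and it assumes a Linux host whose /proc/meminfo has a
MemAvailable field.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def memory_available_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal.

    `meminfo_text` is the content of /proc/meminfo (assumed Linux format,
    e.g. "MemTotal:  16384000 kB").
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def collect_alerts(disk_percent, mem_available_percent,
                   disk_limit=90.0, mem_floor=10.0):
    """Return human-readable alerts; the default limits are arbitrary."""
    alerts = []
    if disk_percent >= disk_limit:
        alerts.append(
            f"disk usage at {disk_percent:.0f}% (limit {disk_limit:.0f}%)")
    if mem_available_percent <= mem_floor:
        alerts.append(
            f"only {mem_available_percent:.0f}% memory available "
            f"(floor {mem_floor:.0f}%)")
    return alerts
```

A timer-driven script would read /proc/meminfo, pass the two percentages
to collect_alerts() and mail or otherwise report whatever comes back. How
the result gets delivered is left open here.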

Would be interested in hearing your thoughts. If you have something to add,
just add numbered bullet points. If nothing comes of this, we'll probably
go back to munin.

Sven

Sven-Hendrik Haase
