[arch-devops] Centralized log monitoring and alerting?

Mon Sep 10 09:06:06 UTC 2018

On Mon, Sep 10, 2018 at 08:30:33AM +0200, Bartłomiej Piotrowski via arch-devops <arch-devops at lists.archlinux.org> wrote:
> We already have centralized alerting with Zabbix. What are you trying to
> solve exactly? I don't recall anyone from devops team complaining about
> having to ssh to run journalctl somewhere. Each new gear added to infra
> means more time spent on maintaining it, while our goal is quite the
> opposite thing.

Sure, Zabbix handles alerting for the things we have in there, but log
monitoring can do much more. For example, we don't monitor error logs
for web apps yet, and even if we did, system services might log errors
directly to the journal. They may exit, in which case systemd/Zabbix
will tell us about it, but they might also just continue running.
Warnings in logs are similar: they generally don't cause an exit, so
we'd miss them unless we have some tool that watches for them. I think
log monitoring is something we should do to notice and deal with subtle
problems early on. Thanks to log monitoring on my own servers, I
regularly see deprecation warnings about configuration or language
changes (mostly PHP deprecation warnings) and sometimes kernel
warnings/errors about failing disks or other hardware weirdness. I think
log monitoring could help us prevent or reduce future problems and make
the work easier to plan.
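
To illustrate what I mean by catching warnings from services that keep
running, here is a minimal sketch, not a proposal for a specific tool.
It assumes the input is the output of "journalctl -o json" (one JSON
object per line, with PRIORITY as a syslog level in a string). The
function name, the sample entries and the default for a missing
PRIORITY are made up for the example.

```python
import json

WARNING = 4  # syslog levels: 0=emerg ... 3=err, 4=warning ... 7=debug


def noteworthy(lines, max_priority=WARNING):
    """Yield (source, message) for entries at warning severity or worse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        try:
            # Assumption: treat entries without PRIORITY as info (6).
            priority = int(entry.get("PRIORITY", 6))
        except (TypeError, ValueError):
            continue
        if priority <= max_priority:
            # Kernel messages carry no unit, so fall back to the identifier.
            source = entry.get("_SYSTEMD_UNIT",
                               entry.get("SYSLOG_IDENTIFIER", "unknown"))
            yield source, entry.get("MESSAGE")


# Made-up sample entries in the shape journalctl -o json emits.
SAMPLE = [
    json.dumps({"PRIORITY": "6", "_SYSTEMD_UNIT": "nginx.service",
                "MESSAGE": "started worker"}),
    json.dumps({"PRIORITY": "4", "_SYSTEMD_UNIT": "php-fpm.service",
                "MESSAGE": "PHP Deprecated: example warning"}),
    json.dumps({"PRIORITY": "3", "SYSLOG_IDENTIFIER": "kernel",
                "MESSAGE": "example disk error"}),
]

for source, message in noteworthy(SAMPLE):
    print(f"{source}: {message}")
```

Neither of the two reported entries would make a service exit, so an
alert that only fires on failed units would not show them.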

Possibly unlike other services, we can always simply stop doing it if we
feel it takes too much time. It's just an internal thing without any
outside users.

Personally I use tenshi on my servers and while it's not ideal, it works
fairly well. It does require regular maintenance though since sometimes
log messages change with updates. One problem I worry about is that I
might filter out interesting messages by accident, especially if the
formats change and an old regex now matches some new message that I
actually didn't want it to match. Also, after a while, the regex list
can become quite long. You can split it into files and group them
nicely, but I rarely actually bothered to review them and remove old
stuff, so it mostly just grows. To put things in perspective, my current
personal rule counts are between 3 and 10 for most small things, 10 to
20 for "normal" services and around 40 to 70 for dovecot, amavis and
postfix.
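
As a rough sketch of the filtering idea and of the review problem (this
is not tenshi's config format or its exact semantics, just a generic
first-match-wins filter): lines matching an "ignore" rule are dropped,
everything else is reported, and rules that never matched are listed as
candidates for removal. The rules and log lines below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    ("ignore", re.compile(r"^postfix/smtpd\[\d+\]: (dis)?connect from ")),
    ("report", re.compile(r"^kernel: .*I/O error")),
    ("ignore", re.compile(r"^amavis\[\d+\]: .*Passed CLEAN")),
]

# Made-up log lines for the example.
SAMPLE = [
    "postfix/smtpd[123]: connect from mail.example.org[192.0.2.1]",
    "kernel: sd 0:0:0:0: [sda] I/O error, dev sda",
    "dovecot: imap-login: some message in a new format",
]


def classify(lines, rules=RULES):
    """Return (reported lines, patterns of rules that never matched)."""
    hits = Counter()
    reported = []
    for line in lines:
        for index, (action, pattern) in enumerate(rules):
            if pattern.search(line):
                hits[index] += 1
                if action == "report":
                    reported.append(line)
                break
        else:
            # No rule matched: report it, so unknown messages stay visible.
            reported.append(line)
    stale = [pattern.pattern
             for index, (_, pattern) in enumerate(rules)
             if hits[index] == 0]
    return reported, stale


reported, stale = classify(SAMPLE)
print("reported:", reported)
print("never matched:", stale)
```

Reporting unmatched lines by default means a changed message format
shows up as noise instead of silently disappearing, and the "never
matched" list gives a starting point for pruning old regexes. It does
not help with an old regex that starts matching a new message.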

Another issue I have with using tenshi for us is that I'm conflicted
about publishing the config we use. I'm worried that an attacker might
look at the config and try to stay under the radar and within any
alerting limits we set. Then again, there are probably easier ways to
attack us. Any opinions here are welcome.

I haven't used other solutions so far, so I welcome a discussion about
this. In general I think log monitoring could help us reduce future
workload and make things more predictable, but yeah, it requires some
investment at the beginning and some ongoing maintenance.

Florian