On Mon, Sep 10, 2018 at 08:30:33AM +0200, Bartłomiej Piotrowski via arch-devops <arch-devops@lists.archlinux.org> wrote:
> We already have centralized alerting with Zabbix. What are you trying to solve exactly? I don't recall anyone from devops team complaining about having to ssh to run journalctl somewhere. Each new gear added to infra means more time spent on maintaining it, while our goal is quite the opposite thing.

Sure, Zabbix handles alerting for the things we have in there, but log monitoring can do much more. For example, we don't monitor error logs for web apps yet, but even if we did, system services might log errors directly to the journal. Sure, they may exit, and systemd/Zabbix will tell us about that, but they might also just continue running. Warnings in logs are similar: they generally don't cause an exit, so we'd miss them unless we have some tool that monitors for them.

I think log monitoring is something we should do to notice and deal with subtle problems early on. Thanks to this, I regularly see deprecation warnings about configuration or language changes (mostly PHP deprecation warnings) and sometimes kernel warnings/errors about failing disks or other hardware weirdness. I think log monitoring could help us prevent or reduce future problems and make the work easier to plan.

Unlike with other services, we can always simply stop doing it if we feel it requires too much time. It's just an internal thing without any outside users.

Personally, I use tenshi on my servers, and while it's not ideal, it works fairly well. It does require regular maintenance, though, since log messages sometimes change with updates. One problem I worry about is that I might filter out interesting messages by accident, especially if the formats change and an old regex now matches some new message that I actually didn't want it to match. Also, after a while, the regex list can become quite long. You can split it into files and group them nicely, but I rarely bother to review them and remove old stuff, so it mostly just grows. To put things in perspective, I currently have between 3 and 10 rules for most small things, 10-20 for "normal" services, and around 40-70 for dovecot, amavis and postfix.

Another issue I have with using tenshi for us is that I'm conflicted about publishing the config we use.
I'm worried that an attacker might look at the config and try to stay under the radar and within any alerting limits we set. Then again, there are probably easier ways to attack us. Any opinions here are welcome.

I haven't used other solutions so far, so I welcome a discussion about this. In general, I think log monitoring could help us reduce future workload and make things more predictable, but yeah, it requires some investment at the beginning and some maintenance.

Florian