[arch-devops] Centralized log monitoring and alerting?
Hi all,

On 2018-07-28 there were some discussions in #archlinux-devops around setting up some sort of centralized logging/monitoring/alerting solution for the various services on Apollo (and maybe other) servers. I had mentioned possibly using the ELK[1] stack for this task. There was some back and forth about it potentially being a bit heavy-handed for what is needed and how we would most likely have to repurpose/dedicate something like nymeria to handle the stack. There was also the suggestion of using something like tenshi[2] if we're aiming for a low-overhead solution; however, that would involve writing a lot of regexes.

With that said, the purpose of this email is to have a more formal discussion around what we're trying to capture from the logs, the actions we want taken on what ends up being captured, and possibly to come to a consensus on what tool(s) we could leverage.

Thoughts?

Regards,
Andrew

[1] https://www.elastic.co/de/elk-stack
[2] https://github.com/inversepath/tenshi
On 10/09/2018 02.53, Andrew Crerar wrote:
Hi all,
On 2018-07-28 there were some discussions in #archlinux-devops around setting up some sort of centralized logging/monitoring/alerting solution for the various services on Apollo (and maybe other) servers. I had mentioned possibly using the ELK[1] stack for this task. There was some back and forth about it potentially being a bit heavy-handed for what is needed and how we would most likely have to repurpose/dedicate something like nymeria to handle the stack. There was also the suggestion of using something like tenshi[2] if we're aiming for a low-overhead solution; however, that would involve writing a lot of regexes.
With that said, the purpose of this email is to have a more formal discussion around what we're trying to capture from the logs, the actions we want taken on what ends up being captured, and possibly to come to a consensus on what tool(s) we could leverage.
Thoughts?
Regards,
Andrew
[1] https://www.elastic.co/de/elk-stack
[2] https://github.com/inversepath/tenshi
We already have centralized alerting with Zabbix. What are you trying to solve exactly? I don't recall anyone from the devops team complaining about having to ssh somewhere to run journalctl. Each new piece of gear added to the infra means more time spent maintaining it, while our goal is quite the opposite.

Bartłomiej
On Mon, Sep 10, 2018 at 08:30:33AM +0200, Bartłomiej Piotrowski via arch-devops <arch-devops@lists.archlinux.org> wrote:
We already have centralized alerting with Zabbix. What are you trying to solve exactly? I don't recall anyone from the devops team complaining about having to ssh somewhere to run journalctl. Each new piece of gear added to the infra means more time spent maintaining it, while our goal is quite the opposite.
Sure, Zabbix handles alerting for the stuff we have in there, but log monitoring can do much more. For example, we don't monitor error logs for web apps yet, but even if we did, system services might log errors directly to the journal. Sure, they may exit and systemd/zabbix will tell us about that, but they might also just continue running. Warnings in logs are similar: they generally don't cause an exit, so we'd miss them unless we have some tool to monitor for them.

I think log monitoring is something we should do to notice and deal with subtle problems early on. I regularly see deprecation warnings about configuration or language changes (mostly PHP deprecation warnings) and sometimes kernel warnings/errors about failing disks or other hardware weirdness thanks to this. I think log monitoring could help us prevent/reduce future problems and make the work easier to plan. Unlike other services, we can always simply stop doing it if we feel it requires too much time. It's just an internal thing without any outside users.

Personally I use tenshi on my servers and, while it's not ideal, it works fairly well. It does require regular maintenance though, since log messages sometimes change with updates. One problem I worry about is that I might filter interesting messages by accident, especially if the formats change and an old regex now matches some new message that I actually didn't want it to match. Also, after a while, the regex list can become quite long. You can split it into files and group them nicely, but I rarely actually bothered to review them and remove old stuff, so it mostly just grows. To put things in perspective, my current personal rule counts are between 3 and 10 for most small things, 10-20 for "normal" services and around 40-70 for dovecot, amavis and postfix.

Another issue I have with using tenshi for us is that I'm conflicted about publishing the config we use. I'm worried that an attacker might look at the config and try to stay under the radar and within any alerting limits we set. Then again, there are probably easier ways to attack us. Any opinions here are welcome.

I haven't used other solutions so far, so I welcome a discussion about this. In general I think log monitoring could help us reduce future workload and make things more predictable, but yeah, it requires some investment at the beginning and some maintenance.

Florian
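To give a concrete feel for what such a rule list looks like, here is a minimal sketch in tenshi's config format, loosely modelled on the example config shipped with the tool; the queue names, addresses, schedules and regexes below are invented for illustration and the exact syntax is best checked against the tenshi documentation:

    # where tenshi reads from and how often it flushes its queues (paths invented)
    set logfile    /var/log/everything.log
    set sleep      300
    set mailserver localhost
    set subject    tenshi report

    # queue name, envelope from, recipient, cron-style schedule (or "now")
    set queue mail     tenshi@localhost root@localhost [0 */12 * * *]
    set queue critical tenshi@localhost root@localhost [now]

    # one group per service; "trash" is the built-in discard queue
    group ^sshd
    critical ^sshd\[\d+\]: Accepted \S+ for root
    mail     ^sshd\[\d+\]: Accepted \S+ for
    trash    ^sshd\[\d+\]: (Connection closed|Received disconnect)
    group_end

The growth problem described above is visible even in a toy example like this: every new message variant that an update introduces either needs another line or ends up in the catch-all report.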
Hey,

On 10.09.18 - 11:06, Florian Pritz via arch-devops wrote:
Another issue I have with using tenshi for us is that I'm conflicted about publishing the config we use. I'm worried that an attacker might look at the config and try to stay under the radar and within any alerting limits we set. Then again, there are probably easier ways to attack us. Any opinions here are welcome.
I think it should be fairly easy to put the actual values/limits/thresholds as variables into an ansible vault, so they are encrypted within the public git repository. Just as an idea; I'm not sure if we're already using ansible vaults for things like that or if we want to strictly avoid those.

Cheers,
Thore

--
Thore Bödecker
GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
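For reference, a rough sketch of what that could look like with ansible-vault; the variable name, value and file paths below are made up and would depend on how the playbooks are organised:

    # encrypt a single value; the output block is pasted into a normal vars file
    ansible-vault encrypt_string --vault-password-file ~/.vault_pass \
        '42' --name 'tenshi_limit'

    # or keep all sensitive values together in one encrypted file
    ansible-vault create group_vars/all/vault.yml
    ansible-vault edit   group_vars/all/vault.yml

Either way the public repository only ever contains ciphertext, and templates can reference the variables like any other.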
On September 10, 2018, at 6:06, Florian Pritz via arch-devops wrote:
Personally I use tenshi on my servers and, while it's not ideal, it works fairly well. It does require regular maintenance though, since log messages sometimes change with updates. One problem I worry about is that I might filter interesting messages by accident, especially if the formats change and an old regex now matches some new message that I actually didn't want it to match. Also, after a while, the regex list can become quite long. You can split it into files and group them nicely, but I rarely actually bothered to review them and remove old stuff, so it mostly just grows. To put things in perspective, my current personal rule counts are between 3 and 10 for most small things, 10-20 for "normal" services and around 40-70 for dovecot, amavis and postfix.
I've never used tenshi, but I'd guess that, after we write the rules for one webapp, the others will be similar and easy to adapt.
Another issue I have with using tenshi for us is that I'm conflicted about publishing the config we use. I'm worried that an attacker might look at the config and try to stay under the radar and within any alerting limits we set. Then again, there are probably easier ways to attack us. Any opinions here are welcome.
This is a perfect example of security through obscurity that might actually make the life of the attacker slightly harder. I personally wouldn't lose sleep over this, but we can put it in a vault, like Thore suggested.
I haven't used other solutions so far, so I welcome a discussion about this. In general I think log monitoring could help us reduce future workload and make things more predictable, but yeah, it requires some investment at the beginning and some maintenance.
Since log monitoring has an intersection with security, perhaps we should include someone from the security team to weigh in on this as well? I don't think this is something we want to keep only on arch-devops.

Having said that, my personal opinion is that log monitoring is orders of magnitude more important for security than it is for detecting actual problems in the applications, because those problems will most likely trigger 50x errors and/or be reported by users. Obviously, some errors might give hints of what is to come, but if we don't act on them, the result is the same.

So, in essence, log monitoring will give us security insights and might help us take a more proactive stance toward problems that Zabbix currently doesn't/can't detect. I think it's worth taking some time to invest in this, but I still want to hear what the security guys have to say.

Regards,
Giancarlo Razzolini
On Sun, Sep 09, 2018 at 08:53:07PM -0400, Andrew Crerar wrote:
Hi all,
On 2018-07-28 there were some discussions in #archlinux-devops around setting up some sort of centralized logging/monitoring/alerting solution for the various services on Apollo (and maybe other) servers. I had mentioned possibly using the ELK[1] stack for this task. There was some back and forth about it potentially being a bit heavy-handed for what is needed and how we would most likely have to repurpose/dedicate something like nymeria to handle the stack. There was also the suggestion of using something like tenshi[2] if we're aiming for a low-overhead solution; however, that would involve writing a lot of regexes.
With that said, the purpose of this email is to have a more formal discussion around what we're trying to capture from the logs, the actions we want taken on what ends up being captured, and possibly to come to a consensus on what tool(s) we could leverage.
Thoughts?
Regards,
Andrew
[1] https://www.elastic.co/de/elk-stack
[2] https://github.com/inversepath/tenshi
Hi Andrew,

A whole ELK stack is pretty big. How many servers do we have? An ELK stack is not only big, it's also a lot of pain to keep up to date, because you have to update and manage every component of it. I've seen many companies switch to Graylog instead, but I still think that even Graylog is overkill here.

If you just want 'log gathering', there are small solutions for this, for example `systemd-journal-gatewayd`. Did somebody try this before?

Besides all of this: if you want to take care of it and take full responsibility for it, it's a little bit overkill and it increases the attack surface, but why not?

I am not part of the DevOps team, so don't take my opinion as something official; I'm just replying because I was asked for an opinion.

chris / shibumi
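For what it's worth, the journald gateway approach would look roughly like this, assuming the port is only reachable over an internal/VPN interface; the hostname below is a placeholder:

    # on each server: socket-activated, listens on TCP port 19531 by default
    systemctl enable --now systemd-journal-gatewayd.socket

    # from a central host: fetch entries as JSON, optionally following like tail -f
    curl -s -H 'Accept: application/json' \
        'http://apollo.internal:19531/entries?follow'

That only covers transport, though; whatever filters the entries and decides when to alert would still have to be built (or reused) on top of it.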
participants (6)
- Andrew Crerar
- Bartłomiej Piotrowski
- Christian Rebischke
- Florian Pritz
- Giancarlo Razzolini
- Thore Bödecker