[arch-devops] Broken disk on vostok
Hey all,

During a small routine check I noticed that we have a broken disk on vostok. In fact, according to the log, we've had it for at least 7 months, which is a bit embarrassing given that this is RAID1 only and it's also our primary backup box. This really goes to show that we need better monitoring and more meaningful alerts.

Anyway, I had Hetzner replace the disk and the array is now rebuilding.

Cheers,
Sven
The array is now resynced and appears to be happy. Yay!
On 27/06/2020 04:01, Sven-Hendrik Haase via arch-devops wrote:
The array is now resynced and appears to be happy. Yay!
Thanks for finding and handling the issue, Sven! However, how did our monitoring not catch this? Especially for gemini it would be nice to know :-)

Greetings,
Jelle van der Waa
On Sat, Jun 27, 2020, 19:37 Jelle van der Waa <jelle@vdwaa.nl> wrote:
Yeah, well I dunno. That's gonna get really embarrassing if we don't notice for too long at some point.
Quoting Sven-Hendrik Haase via arch-devops (2020-06-27 20:26:33)
I'm not familiar with your RAID and monitoring setup, so apologies if this info is not applicable, redundant or useless, but if you're fans of Prometheus, node_exporter exposes mdadm metrics by default about the state of all disks, so you could easily alert on the failed state, if we're talking about a mdadm RAID here.

If it's of interest to you, I'd be happy to help out with this stuff. It's what I do for a living, and Arch helps me make a living ;)

--
Simon Wydooghe
+32 (0)472 83 22 30
simon@wydooghe.com
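For reference, such an alert could look roughly like the sketch below. The metric and label names follow recent node_exporter releases (the md collector); older versions exposed slightly different names such as node_md_disks_active, and the alert name, group name and thresholds here are purely illustrative, so check your exporter's /metrics output before relying on this:

```yaml
# Hypothetical Prometheus alerting rule for a failed mdadm member disk.
# node_md_disks has one series per array device and disk state;
# any non-zero "failed" count means a degraded array.
groups:
  - name: mdadm
    rules:
      - alert: MdadmDiskFailed
        expr: node_md_disks{state="failed"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "mdadm array {{ $labels.device }} on {{ $labels.instance }} has a failed disk"
```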
On Sun, 28 Jun 2020 at 10:46, Simon Wydooghe via arch-devops < arch-devops@lists.archlinux.org> wrote:
Hi Simon,

Help is always appreciated, though we obviously have to be careful with handing out access. Why don't you come hang out with us in #archlinux-devops on Freenode? There we can coordinate any help and I can answer your questions and provide guidance in real-time. I'm sure we can find something for you to do.

Sven
Quoting Sven-Hendrik Haase via arch-devops (2020-06-29 01:13:57)
Help is always appreciated though we obviously have to be careful with handing out access.
I understand completely, I'd be worried if you weren't more careful ;)
Why don't you come hang out with us in #archlinux-devops on Freenode? There we can coordinate any help and I can answer your questions/provide guidance in real-time. I'm sure we can find something for you to do.
I will drop by, thanks.

--
Simon Wydooghe
+32 (0)472 83 22 30
simon@wydooghe.com
On 28/06/2020 10:46, Simon Wydooghe via arch-devops wrote:
I'm not familiar with your RAID and monitoring setup, so apologies if this info is not applicable, redundant or useless, but if you're fans of Prometheus, node_exporter exposes mdadm metrics by default about the state of all disks, so you could easily alert on the failed state, if we're talking about a mdadm RAID here.
We use btrfs RAID, but we are interested in switching to Prometheus; it's currently just a manpower issue. So contributions of knowledge, Ansible roles or Prometheus Alertmanager rules would certainly be welcome.

Greetings,
Jelle
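For btrfs there is no mdadm state to scrape, but `btrfs device stats <mountpoint>` keeps per-device error counters that should all stay at zero on a healthy array, so a check could be built around that. A minimal sketch of the counting logic, assuming the usual stats output format (the sample lines and the `count_errors` helper name are illustrative; in production you would pipe in live `btrfs device stats` output instead):

```shell
#!/bin/sh
# Hedged sketch: count non-zero btrfs device error counters.
# btrfs device stats prints one counter per line, e.g.:
#   [/dev/sda].write_io_errs   0
count_errors() {
    # Any line whose counter (second field) is non-zero indicates trouble.
    awk '$2 != 0 { n++ } END { print n + 0 }'
}

# Captured sample output so the logic is self-contained here.
sample='[/dev/sda].write_io_errs   0
[/dev/sda].read_io_errs    3
[/dev/sda].flush_io_errs   0
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0'

printf '%s\n' "$sample" | count_errors   # prints 1 (one non-zero counter)
```

Recent btrfs-progs also accept `btrfs device stats --check`, which exits non-zero if any counter is non-zero, which is handy for a cron-driven alert without any parsing.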
On 29/06/20, Jelle van der Waa wrote:
We use btrfs RAID, [..]
There was a thread recently on linux-btrfs regarding RAID usage and recommendations. In case you missed it: https://www.spinics.net/lists/linux-btrfs/msg102645.html

Apologies if you have already seen it.

--
Leonidas Spyropoulos

A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
participants (4)
- Jelle van der Waa
- Leonidas Spyropoulos
- Simon Wydooghe
- Sven-Hendrik Haase