Hardware tests and SSDs - Was: When I upgrade to dbus-broker should I expect anything to break?
On Thu, 2024-01-11 at 09:34 +0000, pete wrote:
For the record, the drives are all OK and the memory tested out OK. I'll see what happens if it starts acting up again.
Hi,

while I've been using only internal SSDs, no HDDs anymore, for many years, I still have no experience with SSD failures. My desktop PC is up quasi 24/7/365, with just short interruptions.

Is your Arch install on an SSD? If so, there's probably an issue related to the "flash retention time".

If an SSD is not supplied with power, it loses data relatively quickly even in the best condition. You can sometimes read that six months could already be too long. I believe that when SSDs get old, the period of time in which they can safely retain data without a power supply shortens. Reading up on this just happens to be on my to-do list, just like booting up my old desktop PC that hasn't been connected to the mains for months, maybe even longer than half a year.

As it's on my to-do list, I don't have any knowledge yet. It's just a guess that an old or somehow damaged SSD might look good and pass write and read tests, but the margin in which it can hold data without power might be very, very short at some point.

Regards,
Ralf
On Thu, 11 Jan 2024 15:25:57 +0100 Ralf Mardorf <ralf-mardorf@riseup.net> wrote:
On Thu, 2024-01-11 at 09:34 +0000, pete wrote:
For the record, the drives are all OK and the memory tested out OK. I'll see what happens if it starts acting up again.
Hi,
while I've been using only internal SSDs, no HDDs anymore, for many years, I still have no experience with SSD failures. My desktop PC is up quasi 24/7/365, with just short interruptions.
Is your Arch install on an SSD? If so, there's probably an issue related to the "flash retention time".
If an SSD is not supplied with power, it loses data relatively quickly even in the best condition. You can sometimes read that six months could already be too long. I believe that when SSDs get old, the period of time in which they can safely retain data without a power supply shortens. Reading up on this just happens to be on my to-do list, just like booting up my old desktop PC that hasn't been connected to the mains for months, maybe even longer than half a year.
As it's on my to-do list, I don't have any knowledge yet. It's just a guess that an old or somehow damaged SSD might look good and pass write and read tests, but the margin in which it can hold data without power might be very, very short at some point.
Regards, Ralf
Hi Ralf,

No, it is all 3.5" HDDs: three 2 TB and two 1 TB. A bit slow by modern standards, but UK pensions make it difficult to justify SSDs right now. I would like to replace my graphics card as well, but something even remotely like an ATI R390 is way out of my pocket.

Cheers,
Pete
Hello,
Is your Arch install on an SSD? If so, there's probably an issue related to the "flash retention time".
If an SSD is not supplied with power, it loses data relatively quickly even in the best condition. You can sometimes read that six months could already be too long. I believe that when SSDs get old, the period of time in which they can safely retain data without a power supply shortens. Reading up on this just happens to be on my to-do list, just like booting up my old desktop PC that hasn't been connected to the mains for months, maybe even longer than half a year.
Also bear in mind bit rot. HDDs used to rot due to magnetic interference (hence do not pass your HDDs through a metal detector), but there is always background EMF; it messes up packets too, so don't go running power lines next to your data lines :P SSDs, however, store bits as charge levels; it is very easy for bits to flip, and holding onto a bit whose charge has drifted can be difficult, which is why SSDs are considered only useful for short-term storage and why HDDs and tape can't be replaced. The point is, there is a possibility that your system has slowly rotted over the years and the wrong thing rotted.

ext4 has no bit rot detection. I know a lot of people say ext4 is old and insecure, but bear in mind the overhead of the newer filesystem features in, say, btrfs. btrfs does have detection, as it does checksumming at the filesystem level; however, this only detects the bit rot, it can't fix it, which requires redundant storage or backups. I believe the standard setup for bit rot protection is to run two SSDs, both with btrfs, and RAID 1 them; when a checksum fails, btrfs pulls the block from the other SSD, allowing decent data integrity. Correct me if I am wrong, I am bit-rot insecure as I don't store any data on my daily driver.

Anyway, it's said all the time for good reason: always back up your data. Configurations are a must to back up too; there is nothing worse than losing your firewall configuration and needing to rewrite it all (as an example).
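As a rough sketch of that kind of setup (illustrative only; /dev/sdX, /dev/sdY and the mount point are placeholders, adjust them to your system):

    # create a two-device btrfs filesystem with data and metadata mirrored (RAID 1)
    mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
    mount /dev/sdX /mnt/data

    # periodically scrub: read everything, verify checksums, and repair a bad
    # copy from the good mirror where one exists
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data

    # per-device error counters (read/write/corruption) accumulated so far
    btrfs device stats /mnt/data

A scrub on a btrfs RAID 1 can repair a bad copy from the good mirror because the checksums say which copy is correct; on a single-device btrfs it can only report the corruption.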
As it's on my to-do list, I don't have any knowledge yet. It's just a guess that an old or somehow damaged SSD might look good and pass write and read tests, but the margin in which it can hold data without power might be very, very short at some point.
I am not a data expert, but I have done some reading on this in the past. SSDs, unlike HDDs, have no clear sign of degradation. HDDs sound funny or begin to tick or crackle when they are dying, and their IOPS drop considerably, but these are only signs and HDDs can still die at random. SSD cells will die, but the drives are designed to move data to redundant cells; this happens at the firmware level and the drive handles it itself.

S.M.A.R.T. checking is a good prevention method, but it is not flawless: it will flag when a drive isn't acting as expected, but some drives can fail S.M.A.R.T. and still last for years, so take it with a pinch of salt. I believe some SSD firmware allows you to check the number of redundant cells remaining, but I am not sure. Data corruption, data disappearing, or the SSD disappearing entirely are good signs, but by that time your data is gone already...

I recommend reading the S.M.A.R.T. ArchWiki page [1]. It never hurts to check periodically to see how your drives are doing; some drives even tell you the total reads/writes. Maybe it's just me being obsessive about numbers, but I love seeing how these numbers gradually increase over time (along with my battery charge cycles, and how the total capacity decreases).

As I said, I am not a data expert; the knowledge here is what I have read and discussed with others who are experienced in this field. Feel free to correct me if I made a mistake.

Take care,
--
Polarian
GPG signature: 0770E5312238C760
Website: https://polarian.dev
JID/XMPP: polarian@icebound.dev

[1] https://wiki.archlinux.org/title/S.M.A.R.T.
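A minimal smartmontools sketch along those lines (the device /dev/sda is only an example; NVMe drives usually appear as /dev/nvme0, and not every attribute exists on every drive):

    # overall health verdict reported by the drive
    smartctl -H /dev/sda

    # full SMART attribute table, including totals such as power-on hours
    # and, on many SSDs, total LBAs written / wear indicators
    smartctl -A /dev/sda

    # run a short self-test, then read back the results a few minutes later
    smartctl -t short /dev/sda
    smartctl -a /dev/sda

These come from the smartmontools package; smartd from the same package can be left running to log or mail warnings periodically.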
Hi,

I can confirm that one of two things usually happens before an HDD fails completely. It makes unusual noises: these noises are so unusual that you don't need to ask anyone to confirm that it really sounds unusual. The other is voodoo that makes many people think their computer has been hacked. In very few cases has the computer actually been enchanted or hacked. It cannot be ruled out that one of the OP's hard drives is close to giving up the ghost.

The smartctl database is often not suitable for checking SSDs. In such cases, the SMART information has to be presented in an understandable way by the manufacturer's software. The annoyance can be that the SSDs are one day very old and the manufacturer has discontinued support for the corresponding Linux software.

This is what I experienced: a program that causes no problems for virtually everyone else, and which I have used a lot, causes voodoo on both my old and my new computer. So at first I thought an SSD had broken for the first time. Not so, sometimes there really is witchcraft. I had to stop using the program to get rid of the voodoo. There are no issues so far with any of the SSDs.

Regards,
Ralf

--
Voodoo Computer
https://www.youtube.com/watch?v=qFfnlYbFEiE

No dead cat was used for any of the microphones. This was kind of a "Stairway to Heaven" in the Ruhrgebiet's guitar shops; today they probably play something djenty on 8-string fanfret guitars.
On Thu, 2024-01-11 at 19:49 +0100, Ralf Mardorf wrote:
It cannot be ruled out that one of the OP's hard drives is close to giving up the ghost.
My apologies, the OP of the original thread is David, but it's Pete suffering from unexplainable issues.
On Thu, 11 Jan 2024 19:54:46 +0100 Ralf Mardorf <ralf-mardorf@riseup.net> wrote:
On Thu, 2024-01-11 at 19:49 +0100, Ralf Mardorf wrote:
It cannot be ruled out that one of the OP's hard drives is close to giving up the ghost.
My apologies, the OP of the original thread is David, but it's Pete suffering from unexplainable issues.
Yes, apologies for hijacking the thread. Anyhow, thanks to all; everything seems fine now. The only thing I can't get to behave yet is the GQRX RTL-SDR software.

OK on the noise from drives, Ralf: these things are almost silent, I sometimes have to have a good listen to check whether they are still spinning.

Cheers all,
Pete
On 1/11/24 11:52, Polarian wrote:
ext4 has no bit rot detection. I know a lot of people say ext4 is old and insecure, but bear in mind the overhead of the newer filesystem features in, say, btrfs.
btrfs does have detection, as it does checksumming at the filesystem level; however, this only detects the bit rot, it can't fix it, which requires redundant storage or backups. I believe the standard setup for bit rot protection is to run two SSDs, both with btrfs, and RAID 1 them; when a checksum fails, btrfs pulls the block from the other SSD, allowing decent data integrity. Correct me if I am wrong, I am bit-rot insecure as I don't store any data on my daily driver.
Good, I don't feel like such a dinosaur with RAID1 ext4 on spinning rust -- which has been incredibly reliable:

# mdadm -D /dev/md{0,1,2,4}
<snip>
/dev/md2:
           Version : 1.2
     Creation Time : Thu Aug 20 23:46:24 2015
        Raid Level : raid1
        Array Size : 921030656 (878.36 GiB 943.14 GB)
     Used Dev Size : 921030656 (878.36 GiB 943.14 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Jan 12 23:20:49 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : archiso:2
              UUID : 73a0a0b5:fa3629e1:7c1a7c87:23044fc8
            Events : 7421

    Number   Major   Minor   RaidDevice   State
       0       8        7        0        active sync   /dev/sda7
       1       8       23        1        active sync   /dev/sdb7
<snip>

Though I do keep a close eye on the drives given their age, and have a spare available -- and knock on wood often :)

--
David C. Rankin, J.D., P.E.
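For keeping that close eye on things, a small sketch (the array name comes from the output above; the mail address is a placeholder):

    # quick one-off status of all arrays
    cat /proc/mdstat
    mdadm --detail /dev/md2

    # let mdadm watch the arrays and send mail on failure or degraded events
    # (a MAILADDR line in /etc/mdadm.conf plus the mdmonitor service does the same job)
    mdadm --monitor --scan --daemonise --mail=admin@example.com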
Hello,

The issue with this, however, is that ext4 can't detect when bits have rotted, as there is no checksumming. All mdadm knows is that the blocks differ, and it tries to resync them, which can just as well sync the bit rot between them (you are basically rolling a die). What it does protect against is entire drive failures: you have a complete copy of the data, but that data can and will rot.

For daily use this is not a big deal, and I doubt most people would be too fussed if some random file they have backed up rots on their local machine, but in a server environment (such as storing backups or tax forms), it is a big deal. Take a look at the ArchWiki article [1], which explains this.

Take care,
--
Polarian
GPG signature: 0770E5312238C760
Website: https://polarian.dev
JID/XMPP: polarian@icebound.dev

[1] https://wiki.archlinux.org/title/RAID#Scrubbing
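For reference, this is roughly what a plain md consistency check looks like (md2 is the array from David's output; the paths are the standard md sysfs interface and the commands need root):

    # read every block of both mirrors and compare them
    echo check > /sys/block/md2/md/sync_action

    # progress is visible here while the check runs
    cat /proc/mdstat

    # afterwards: count of sectors whose two copies disagreed -- md knows
    # they differ, but without checksums it cannot tell which copy is right
    cat /sys/block/md2/md/mismatch_cnt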
Hello Polarian,
The issue with this, however, is that ext4 can't detect when bits have rotted, as there is no checksumming.
Could we have a little bit of quoted context in a reply? I'm unclear what 'this' is above, for example, though I'm generally interested. Yes, I could go and explore the list archive, but that's too much overhead, so I don't. And there's one writer but many readers, so the overhead multiplies. Thanks.

https://wiki.archlinux.org/title/General_guidelines#Quoting

--
Cheers, Ralph.
Apologies, I was replying to the following from David Rankin:
Good, I don't feel like such a dinosaur with RAID1 ext4 on spinning rust -- which has been incredibly reliable:
# mdadm -D /dev/md{0,1,2,4}
<snip>
/dev/md2:
           Version : 1.2
     Creation Time : Thu Aug 20 23:46:24 2015
        Raid Level : raid1
        Array Size : 921030656 (878.36 GiB 943.14 GB)
     Used Dev Size : 921030656 (878.36 GiB 943.14 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Jan 12 23:20:49 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : archiso:2
              UUID : 73a0a0b5:fa3629e1:7c1a7c87:23044fc8
            Events : 7421

    Number   Major   Minor   RaidDevice   State
       0       8        7        0        active sync   /dev/sda7
       1       8       23        1        active sync   /dev/sdb7
<snip>
Though I do keep a close eye on the drives given their age, and have a spare available -- and knock on wood often :)
It should make sense now :)

Take care,
--
Polarian
GPG signature: 0770E5312238C760
Website: https://polarian.dev
JID/XMPP: polarian@icebound.dev
participants (5)
- David C. Rankin
- pete
- Polarian
- Ralf Mardorf
- Ralph Corderoy