Message-ID: <3231980.BbEtxjAFS5@merkaba>
Date:   Tue, 11 Apr 2017 12:37:43 +0200
From:   Martin Steigerwald <martin@...htvoll.de>
To:     Tejun Heo <tj@...nel.org>
Cc:     Henrique de Moraes Holschuh <hmh@....eng.br>,
        linux-kernel@...r.kernel.org, linux-scsi@...r.kernel.org,
        linux-ide@...r.kernel.org, Hans de Goede <hdegoede@...hat.com>
Subject: Re: Race to power off harming SATA SSDs

On Tuesday, 11 April 2017, 08:52:06 CEST, Tejun Heo wrote:
> > Evidently, how often the SSD will lose the race depends on the platform
> > and SSD combination, and also on how often the system is powered off.
> > A sluggish firmware that takes its time to cut power can save the day...
> > 
> > 
> > Observing the effects:
> > 
> > An unclean SSD power-off will be signaled by the SSD device through an
> > increase in a specific S.M.A.R.T. attribute.  These SMART attributes can
> > be read using the smartmontools package from www.smartmontools.org,
> > which should be available in just about every Linux distro.
> > 
> > smartctl -A /dev/sd#
> > 
> > The SMART attribute related to unclean power-off is vendor-specific, so
> > one might have to track down the SSD datasheet to know which attribute a
> > particular SSD uses.  The naming of the attribute also varies.
> > 
> > For a Crucial M500 SSD with up-to-date firmware, this would be attribute
> > 174 "Unexpect_Power_Loss_Ct", for example.
> > 
> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty, but I'm a bit
> skeptical about this being an actual issue, because if that were true,
> the device would be bricking itself from any sort of power loss, be that
> an actual power loss, battery rundown, or hard power off after a crash.

The write-up by Henrique has been a very informative and interesting read for 
me. I wondered about the same question, though.

I do have a Crucial M500, and I do see an increase in that counter:

martin@...kaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       1
smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       67
smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       105
smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       148
smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       201
smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       272
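
(For reference: I simply redirect the attribute output into date-stamped
files every now and then, roughly like this; /dev/sda stands in for the
actual device:

merkaba:~> smartctl -A /dev/sda > smartctl-a-$(date +%F).txt

date +%F produces the YYYY-MM-DD part of the filenames above.)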


I mostly didn't notice anything, except for one time when I indeed had a 
BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD (which 
also has an attribute for unclean shutdowns that increases).

I blogged about this in German quite some time ago:

https://blog.teamix.de/2015/01/19/btrfs-raid-1-selbstheilung-in-aktion/

(I think it's easy enough to get the point of the blog post even without 
understanding German.)

Result of scrub:

   scrub started at Thu Oct  9 15:52:00 2014 and finished after 564 seconds
        total bytes scrubbed: 268.36GiB with 60 errors
        error details: csum=60
        corrected errors: 60, uncorrectable errors: 0, unverified errors: 0

Device errors were on:

merkaba:~> btrfs device stats /home
[/dev/mapper/msata-home].write_io_errs   0
[/dev/mapper/msata-home].read_io_errs    0
[/dev/mapper/msata-home].flush_io_errs   0
[/dev/mapper/msata-home].corruption_errs 60
[/dev/mapper/msata-home].generation_errs 0
[…]

(that's the Crucial M500)
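
(If one wants to spot new errors after such an incident, the counters can
also be reset while reading them, if I recall the option correctly:

merkaba:~> btrfs device stats -z /home
)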


I didn't have any explanation for this, but I suspected some unclean 
shutdown, even though I remembered no unclean shutdown. I take good care to 
always have a battery in this ThinkPad T520, due to unclean shutdown issues 
with the Intel SSD 320 (bricked device which reports 8 MiB as capacity, 
probably fixed by the firmware update I applied back then).

Henrique's write-up gave me the idea that maybe it wasn't a user-triggered 
unclean shutdown that caused the issue, but an unclean shutdown triggered by 
the Linux kernel's SSD shutdown procedure.

Of course, I don't know whether this is the case, and I think there is no 
way to prove or falsify it years after it happened. I never had this happen 
again.

Thanks,
-- 
Martin
