linux-kernel - Re: Some hints needed how to handle SATA ALPM failures

Message-ID: <20110218161640.GR21209@htj.dyndns.org>
Date:	Fri, 18 Feb 2011 17:16:40 +0100
From:	Tejun Heo <tj@...nel.org>
To:	Stefan Bader <stefan.bader@...onical.com>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-ide@...r.kernel.org, Jeff Garzik <jgarzik@...ox.com>,
	Andy Whitcroft <apw@...onical.com>
Subject: Re: Some hints needed how to handle SATA ALPM failures

Hello,

On Fri, Feb 18, 2011 at 04:55:45PM +0100, Stefan Bader wrote:
> Sorry that was not specific enough. It is remounting ro, which can
> leave the fs in a better or worse state.

I see and, nope, that shouldn't lead to a corrupted filesystem on a
journaled filesystem.  I agree it sucks though.  This shouldn't be
happening with newer kernels unless the hardware completely shuts
down, which some very early SATA hard drives did but which shouldn't
happen with most modern devices.  Backporting the fix isn't difficult.

> > Also, the whole LPM thing got revamped several releases ago.  Can you
> > please test how the recent kernels behave?  There will be failures as
> > not all hardware can handle LPM well but those failures shouldn't lead
> > to any catastrophic failures like ro remounting of filesystem.
> 
> The example output given as footnotes in the original post were taken from the
> latest re-test someone did on a 2.6.38-rc5 kernel (same user also reported bad
> experience with a 2.6.35 based kernel). The comment we got on that was:
> 
> "Here's what i get - the drive led lights continuously for about 10 seconds
> during which any hdd access results in hanging process:"
> 
> [12348.040077] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x150000 action 0x6 frozen
> [12348.040086] ata3: SError: { PHYRdyChg CommWake Dispar }
> [12348.040091] ata3.00: failed command: READ FPDMA QUEUED
> [12348.040099] ata3.00: cmd 60/10:00:b0:94:c5/00:00:03:00:00/40 tag 0 ncq 8192 in
> [12348.040101] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [12348.040104] ata3.00: status: { DRDY }
> [12348.040112] ata3: hard resetting link
> [12348.390082] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [12348.404414] ata3.00: configured for UDMA/133
> [12348.404550] ata3.00: device reported invalid CHS sector 0
> [12348.404570] ata3: EH complete
> 
> I believe the details of the failures varied but "READ FPDMA QUEUED" and a
> timeout were usually involved.
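
For reference, the register values in the log above can be decoded by
hand.  What follows is only a minimal sketch in Python, assuming the
SError diagnostic bit positions from the SATA specification and the
short names libata prints for them; the function names are made up for
illustration and are not part of any kernel interface.

```python
# Assumed mapping: SError diagnostic bits (16..26) to the short names
# that libata prints in "SError: { ... }" lines.
SERR_DIAG_BITS = {
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of the diagnostic bits set in an SError value."""
    return [name for bit, name in sorted(SERR_DIAG_BITS.items())
            if value & (1 << bit)]


def decode_link_fields(value):
    """Split an SStatus or SControl value into its DET, SPD and IPM nibbles."""
    return {
        "det": value & 0xF,         # device detection
        "spd": (value >> 4) & 0xF,  # link speed generation
        "ipm": (value >> 8) & 0xF,  # interface power management
    }


# Values copied from the log; the kernel prints them in hex without "0x".
print(decode_serror(0x150000))
print(decode_link_fields(0x113))  # SStatus
print(decode_link_fields(0x300))  # SControl
```

With the values from the log, SError 0x150000 has bits 16, 18 and 20
set, which matches the "PHYRdyChg CommWake Dispar" line.  SStatus 113
splits into DET=3, SPD=1, IPM=1, and SControl 300 into DET=0, SPD=0,
IPM=3.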

It's on NVIDIA ahci, right?  This shouldn't be happening with Intel
and JMB ones, which were used during implementation.  The problem is
most likely controller dependent.  One possibility is that the
controller is not happy with DIPM.  Does specifying "medium_power"
instead make the problem go away?  Can the bug reporter try some
kernel patches?
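
One way to try "medium_power" is through the per-host
link_power_management_policy attribute in sysfs.  The following is a
rough sketch in Python, not a tested tool: it assumes the attribute
lives under /sys/class/scsi_host/hostN/, that the three policy names
listed are the ones the running kernel accepts, and that it is run as
root.  The helper name set_lpm_policy is hypothetical.

```python
import glob
import os

# Assumption: policy names accepted by kernels of this era; other
# kernel versions may accept a different set.
POLICIES = ("min_power", "medium_power", "max_performance")


def set_lpm_policy(policy, sysfs_root="/sys/class/scsi_host"):
    """Write `policy` to every host that exposes the LPM attribute.

    Returns the list of attribute files that were written.
    """
    if policy not in POLICIES:
        raise ValueError("unknown LPM policy: %r" % (policy,))

    written = []
    pattern = os.path.join(sysfs_root, "host*",
                           "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        # Hosts without LPM support do not have this file, so the glob
        # only matches ports where the policy can be set.
        with open(path, "w") as attr:
            attr.write(policy)
        written.append(path)
    return written
```

Calling set_lpm_policy("medium_power") as root applies the policy to
all matching hosts.  The sysfs_root parameter exists only so the
function can be pointed at a scratch directory for a dry run.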

Thanks.

-- 
tejun