linux-kernel - Re: Intel ICH9M/M-E SATA error-handling/reset problems

Date:	Mon, 16 Feb 2009 08:17:16 -0800
From:	Serguei Miridonov <mirsev@...ese.mx>
To:	Tejun Heo <tj@...nel.org>
Cc:	Robert Hancock <hancockrwd@...il.com>,
	linux-kernel@...r.kernel.org, Jeff Garzik <jeff@...zik.org>
Subject: Re: Intel ICH9M/M-E SATA error-handling/reset problems

Hello,

On Sunday 15 February 2009, Tejun Heo wrote:
> Please try shorter (or different) cable. 

I will, maybe in a few days.

> >> I agree with you completely. Nevertheless, something like 10
> >> errors per 2GB transfer can not be the reason to give up. Vista,
> >> at least, recovers and continues the data transfer. Linux simply
> >> can not return the interface or connected device into operating
> >> mode. Do you think it is normal?
>
> Well, there isn't much point in keeping retrying if the same
> command fails consecutively. 

I'm not talking about the _same_ transfer command. I mean intermittent 
errors, on average 10 parity errors per 2GB file. Let me repeat myself 
from another post:

... my very strong opinion, based just on general physics, is that 
the error rate on SATA can be (and will be) much higher than on 
PATA. PATA operates at lower frequencies and its cables are much 
shorter. eSATA cables are longer and work at up to 3Gb/s. Moreover, 
consider all these consumer-grade connectors, cables, etc. So CRC 
errors could be quite common, and software needs to handle them 
properly to keep transfers fast and maintain communication with a 
device.

And remember USB bulk transfers? Who takes care of the CRC check and 
retries there?
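
To illustrate the distinction drawn above between the same command 
failing consecutively and isolated errors spread over a long transfer, 
here is a minimal sketch in C. It is not libata code: struct 
retry_state, record_result() and MAX_CONSECUTIVE_FAILURES are 
hypothetical names, and the limit of 3 is an arbitrary assumption.

```c
#include <stdbool.h>

/* Hypothetical limit; a real driver would choose its own policy. */
#define MAX_CONSECUTIVE_FAILURES 3

/* Hypothetical per-device bookkeeping, for illustration only. */
struct retry_state {
	unsigned int consecutive_failures; /* failures since last success */
	unsigned int total_errors;         /* all errors seen so far */
};

/*
 * Record the outcome of one command.
 * Returns true if the transfer should go on (retrying if needed),
 * false if the failures are consecutive enough to give up.
 */
static bool record_result(struct retry_state *st, bool ok)
{
	if (ok) {
		/* A success shows the link still works; forget the streak. */
		st->consecutive_failures = 0;
		return true;
	}

	st->total_errors++;
	st->consecutive_failures++;
	return st->consecutive_failures < MAX_CONSECUTIVE_FAILURES;
}
```

With this policy, ten isolated errors spread over a large transfer 
never reach the limit, because each success resets the streak. Only 
consecutive failures do.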

> The problem was the broken speed down
> logic, so all the retries failed and FS eventually received IO
> failure.  Should have been fixed with recent changes.

Slowing down may help to reduce the number of errors, but it may 
turn out that they cannot be avoided completely.

> In the log, ata2.00 went down after a timeout.  The reset per-se
> isn't the problem and is the RTTD after a timeout as the controller
> and device states are unknown.  The situations like yours in the
> log often happens because an ATAPI device shuts down completely
> after certain transmission problems.  When this happens, there's
> nothing much the driver can do and soft reboot wouldn't recover the
> device either.

So it is the kernel's job to keep things working, not break them :-)

> But seeing you're on dv5, I think you might be experiencing
> something else.  Please take a look at the following bz.
>
>   http://bugzilla.kernel.org/show_bug.cgi?id=12276

Yes, I tried to suspend to RAM, and when the laptop woke up it failed 
to communicate with the hard drive. So I use hibernate instead.

> ... I'm trying to
> contact HP about this but hasn't gotten anywhere yet.

Please let us know if they reply.

Thank you.
