linux-kernel - Re: Intel ICH9M/M-E SATA error-handling/reset problems

Date:	Thu, 19 Feb 2009 15:29:55 +0900
From:	Tejun Heo <tj@...nel.org>
To:	Serguei Miridonov <mirsev@...ese.mx>
CC:	Robert Hancock <hancockrwd@...il.com>,
	linux-kernel@...r.kernel.org, Jeff Garzik <jeff@...zik.org>
Subject: Re: Intel ICH9M/M-E SATA error-handling/reset problems

Hello, Serguei.

Serguei Miridonov wrote:
>>>> I agree with you completely. Nevertheless, something like 10
>>>> errors per 2GB transfer cannot be the reason to give up. Vista,
>>>> at least, recovers and continues the data transfer. Linux simply
>>>> cannot return the interface or connected device to operating
>>>> mode. Do you think it is normal?
>> Well, there isn't much point in keeping retrying if the same
>> command fails consecutively. 
> 
> I'm not talking about the _same_ transfer command. I mean intermittent 
> errors, average 10 parity errors per 2GB file. Let me repeat myself 
> from another post:
> 
> ... my very strong opinion based just on general physics is that 
> the error rate on SATA can be (and will be) much higher than that on 
> PATA. PATA operates at lower frequencies and cables are much shorter. 
> eSATA cables are longer and work at up to 3Gb/s. Moreover, consider 
> all these consumer-grade connectors, cables, etc. So, CRC errors could 
> be quite common and software needs to handle them properly to keep 
> transfers fast and maintain the communication with a device.

The kernel doesn't give up after intermittent errors.

> And, remember USB bulk transfer? Who is taking care of the CRC check 
> and retries there?

What you're describing is already handled.  No need to worry about it.

>> The problem was the broken speed down
>> logic, so all the retries failed and FS eventually received IO
>> failure.  Should have been fixed with recent changes.
> 
> Slowing down may help to reduce the number of errors, but it may 
> happen that they cannot be avoided completely.
> 
>> In the log, ata2.00 went down after a timeout.  The reset per se
>> isn't the problem; it is the right thing to do after a timeout, as
>> the controller and device states are unknown.  Situations like
>> yours in the log often happen because an ATAPI device shuts down
>> completely after certain transmission problems.  When this happens,
>> there's not much the driver can do, and a soft reboot wouldn't
>> recover the device either.
> 
> So, it is the kernel's job to keep things working, not break them :-)

Yeah, and other than the hardware quirkiness on your machine, it
already works fine.

>> But seeing you're on dv5, I think you might be experiencing
>> something else.  Please take a look at the following bz.
>>
>>   http://bugzilla.kernel.org/show_bug.cgi?id=12276
> 
> Yes, I tried to suspend to RAM and when the laptop woke up it failed 
> to communicate with the hard drive. So, I use hibernate instead.

Can you please take a look at the kernel log after the kernel resumes
and see whether you're actually seeing the same problem?
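
A minimal sketch of one way to pull the libata messages out of the log
after resume, assuming dmesg is available and that the messages carry
the usual "ataN:" or "ataN.MM:" prefix (as in "ata2.00" above).  The
function names are illustrative only, and reading the ring buffer may
require root on some systems.

```python
import re
import subprocess

# Matches libata port/device prefixes such as "ata2:" or "ata2.00:".
ATA_PREFIX = re.compile(r"\bata\d+(\.\d+)?:")


def filter_ata_lines(log_text):
    """Return the log lines that carry a libata port or device prefix."""
    return [line for line in log_text.splitlines() if ATA_PREFIX.search(line)]


def read_kernel_log():
    """Read the kernel ring buffer by running dmesg (assumed to be installed)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Print only the libata-related lines, in their original order.
    for line in filter_ata_lines(read_kernel_log()):
        print(line)
```

Running it right after resume and comparing the output with the log in
the bz above should show whether the symptoms match.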

Thanks.

-- 
tejun