linux-kernel - Re: SATA exceptions with 2.6.20-rc5

Message-id: <45B563C6.5070505@shaw.ca>
Date:	Mon, 22 Jan 2007 19:24:22 -0600
From:	Robert Hancock <hancockr@...w.ca>
To:	Björn Steinbrink <B.Steinbrink@....de>,
	Robert Hancock <hancockr@...w.ca>,
	Jeff Garzik <jeff@...zik.org>, Chr <chunkeey@....de>,
	Alistair John Strachan <s0348365@....ed.ac.uk>,
	linux-kernel@...r.kernel.org, htejun@...il.com,
	jens.axboe@...cle.com, lwalton@...l.com, pomac@...or.com
Subject: Re: SATA exceptions with 2.6.20-rc5

Björn Steinbrink wrote:
>>> Running a kernel with the return statement replaced by a line that prints
>>> the irq_stat instead.
>>>
>>> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>> 40 minutes of stress testing now and no exception yet. What's interesting is
>> that ata1 saw exactly one interrupt with irq_stat 0x0; all others that
>> might have gotten dropped are as above.
>> I'll keep it running for some time and will then re-enable the return
>> statement to see if there's a relation between the irq_stat 0x0 and the
>> exception.
> 
> No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.

After various trials and tribulations, the only conclusion I can come
to is that this controller really doesn't like having the
NV_INT_STATUS_CK804 register looked at in ADMA mode. I tried adding
some debug code to the qc_issue function that would check whether the
BUSY flag in altstatus went high, or that register showed an interrupt,
within a certain time afterwards. However, that really seemed to hose
things: the system wouldn't even boot.

Try out this patch. It just calls the ata_host_intr function where
appropriate, without going through nv_host_intr, which looks at the
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
Mr. Mysterious NVIDIA Person did, and I'm guessing there may be a reason
for that. With this patch I can get through a whole bonnie++ run with
the repeated IDENTIFY requests running without seeing the error.
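
To make the idea concrete, here is a small userspace toy model of the
two dispatch shapes. It is not the attached patch and not sata_nv code:
every toy_* name is made up for illustration, and the "register" is just
a counter that records how often it gets read. The assumption is simply
that ADMA ports can be serviced without touching the shared status
register.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PORTS 2

/* Hypothetical toy model; none of these names exist in sata_nv or libata. */
struct toy_port {
	bool adma_mode;   /* port is running in ADMA mode */
	bool cmd_active;  /* a command is outstanding and has completed */
	int  handled;     /* completions the driver has processed */
};

struct toy_host {
	struct toy_port ports[TOY_NUM_PORTS];
	int int_status_reads;  /* how often the shared register was read */
};

/* Stand-in for reading NV_INT_STATUS_CK804; it only counts the access. */
static unsigned int toy_read_int_status(struct toy_host *h)
{
	h->int_status_reads++;
	return 0;
}

/* Stand-in for ata_host_intr(): returns 1 if it completed a command. */
static int toy_port_intr(struct toy_port *p)
{
	if (!p->cmd_active)
		return 0;
	p->cmd_active = false;
	p->handled++;
	return 1;
}

/* Old shape: every port goes through a path that reads the register. */
static int toy_intr_via_status(struct toy_host *h)
{
	int handled = 0;

	for (int i = 0; i < TOY_NUM_PORTS; i++) {
		(void)toy_read_int_status(h);
		handled += toy_port_intr(&h->ports[i]);
	}
	return handled;
}

/* New shape: ADMA ports skip the register and call the handler directly. */
static int toy_intr_direct(struct toy_host *h)
{
	int handled = 0;

	for (int i = 0; i < TOY_NUM_PORTS; i++) {
		struct toy_port *p = &h->ports[i];

		if (!p->adma_mode)
			(void)toy_read_int_status(h);
		handled += toy_port_intr(p);
	}
	return handled;
}
```

With both ports in ADMA mode and a completed command on each, both
functions report two handled commands, but only toy_intr_via_status
touches the status register.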

As an aside, there seems to be some dubious code in nv_host_intr: if
ata_host_intr returns 0 (i.e. not handled) when a command is
outstanding, it goes and calls ata_check_status anyway. This is rather
dangerous, since if an interrupt showed up right after ata_host_intr
but before ata_check_status, the ata_check_status would clear it and we
would forget about it. I tried fixing just that issue, however, and
still had this problem. I suspect that code is truly broken and needs
further thought, but this patch avoids calling it in the ADMA case, at
any rate.
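
The window described above can be shown with a userspace toy model. This
is only a sketch under stated assumptions: the toy_* names are
hypothetical, the device is reduced to a single "interrupt pending"
flag, and reading the status register is assumed to acknowledge (clear)
that flag, as ata_check_status does on real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy device; not libata code. */
struct toy_dev {
	bool irq_pending;       /* raised by the "hardware" on completion */
	int  completions_seen;  /* completions the driver accounted for */
};

/* Stand-in for ata_check_status(): the read clears the pending interrupt. */
static unsigned char toy_check_status(struct toy_dev *d)
{
	d->irq_pending = false;
	return 0x50;  /* DRDY set, BSY clear */
}

/* Stand-in for ata_host_intr(): returns 1 if it handled a completion. */
static int toy_host_intr(struct toy_dev *d)
{
	if (!d->irq_pending)
		return 0;
	toy_check_status(d);
	d->completions_seen++;
	return 1;
}

/*
 * The pattern in question: if the handler reports "not handled", read
 * the status register anyway. irq_in_window simulates the device
 * completing a command between the two steps.
 */
static int toy_nv_host_intr(struct toy_dev *d, bool irq_in_window)
{
	int handled = toy_host_intr(d);

	if (irq_in_window)
		d->irq_pending = true;  /* completion arrives right here */

	if (!handled)
		toy_check_status(d);    /* clears it without recording it */

	return handled;
}

/* Completions raised minus those either processed or still pending. */
static int toy_lost_interrupts(bool irq_in_window)
{
	struct toy_dev d = { false, 0 };
	int raised = irq_in_window ? 1 : 0;

	toy_nv_host_intr(&d, irq_in_window);
	return raised - d.completions_seen - (d.irq_pending ? 1 : 0);
}
```

In this model toy_lost_interrupts(true) returns 1, because the
completion is acknowledged but never counted, while
toy_lost_interrupts(false) returns 0.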

As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@...pamshaw.ca
Home Page: http://www.roberthancock.com/

View attachment "sata_nv-dont-check-ck804-int-status-in-adma.patch" of type "text/plain" (673 bytes)