Date:	Wed, 02 Jan 2008 18:21:45 -0600
From:	Robert Hancock <hancockr@...w.ca>
To:	Allen Martin <AMartin@...dia.com>
Cc:	Jeff Garzik <jeff@...zik.org>, Tejun Heo <htejun@...il.com>,
	Gabor Gombas <gombasg@...aki.hu>, linux-kernel@...r.kernel.org,
	linux-ide@...r.kernel.org, Kuan Luo <kluo@...dia.com>,
	Peer Chen <pchen@...dia.com>
Subject: Re: sata_nv + ADMA + Samsung disk problem

Allen Martin wrote:
>> The software definitely provides that guarantee for all NCQ-capable 
>> controllers.
>>
> 
> Well if that's not it, it must be some problem entering ADMA legacy
> mode.  Here's what the Windows driver does:
> 
> 
> ADMACtrl.aGO = 0
> ADMACtrl.aEIEN = 0
> poll {
>   until ADMAStatus.aLGCY = 1 || timeout
> }

What we're doing to enter legacy mode is essentially:

-wait until ADMA status indicates IDLE bit set (max wait of 1 microsecond)
-clear GO bit in control register
-wait until status indicates LEGACY bit set (max wait of 1 microsecond)

and to enter ADMA mode:

-set GO bit in control register
-wait until status indicates LEGACY bit cleared and IDLE bit set (max 
wait of 1 microsecond)
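
The two sequences above can be sketched roughly as follows. This is only
an illustration against a toy register model, not the sata_nv code: the
names fake_port, read_status, write_ctl, wait_status, enter_legacy and
enter_adma are hypothetical, the LEGACY and GO bit positions are
assumptions, and max_polls merely stands in for the 1 microsecond
budget. The toy hardware reacts instantly to control writes, which real
hardware need not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are assumptions for illustration. IDLE = 0x400 and
 * STOPPED = 0x100 are consistent with the 0x500/0x400 status values
 * mentioned later in this message; LEGACY and GO are placeholders. */
#define ADMA_STAT_STOPPED 0x0100u
#define ADMA_STAT_LEGACY  0x0200u
#define ADMA_STAT_IDLE    0x0400u
#define ADMA_CTL_GO       0x0080u

/* Toy stand-in for one port's ADMA control/status registers. */
struct fake_port {
    uint16_t ctl;
    uint16_t status;
    int polls;              /* number of status reads performed */
};

uint16_t read_status(struct fake_port *p)
{
    p->polls++;
    return p->status;
}

/* Toy model: the "hardware" switches mode as soon as GO changes. */
void write_ctl(struct fake_port *p, uint16_t val)
{
    p->ctl = val;
    if (val & ADMA_CTL_GO)
        p->status = ADMA_STAT_IDLE;
    else
        p->status = ADMA_STAT_IDLE | ADMA_STAT_LEGACY;
}

/* Poll until (status & mask) == want, giving up after max_polls reads.
 * A real driver would insert a short delay between reads. */
bool wait_status(struct fake_port *p, uint16_t mask, uint16_t want,
                 int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if ((read_status(p) & mask) == want)
            return true;
    }
    return false;
}

/* Wait for IDLE, clear GO, wait for LEGACY. The GO bit is cleared even
 * if the IDLE wait timed out; the return value reports any timeout. */
bool enter_legacy(struct fake_port *p, int max_polls)
{
    bool idle = wait_status(p, ADMA_STAT_IDLE, ADMA_STAT_IDLE, max_polls);
    write_ctl(p, (uint16_t)(p->ctl & ~ADMA_CTL_GO));
    bool legacy = wait_status(p, ADMA_STAT_LEGACY, ADMA_STAT_LEGACY,
                              max_polls);
    return idle && legacy;
}

/* Set GO, then wait for LEGACY cleared and IDLE set. */
bool enter_adma(struct fake_port *p, int max_polls)
{
    write_ctl(p, (uint16_t)(p->ctl | ADMA_CTL_GO));
    return wait_status(p, ADMA_STAT_IDLE | ADMA_STAT_LEGACY,
                       ADMA_STAT_IDLE, max_polls);
}
```

Returning false on a timeout, rather than aborting the switch, mirrors
the description above only loosely; what the caller should do on a
timeout is left open here.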

The 1 microsecond timeout is admittedly pretty aggressive, but it 
apparently isn't being exceeded (the only timeouts I've seen when 
switching modes are during error handling, after a command timeout has 
already occurred). What timeout value is the Windows driver using?

Also, I see you are clearing the aEIEN bit when in register mode, while 
we're not. Is that important/necessary?

Aside from all this though, in the case of NCQ writes followed by a 
cache flush, that sequence of commands won't put us into legacy mode at 
all. The cache flush is a no-data command, which we should be able to 
handle in ADMA mode, from my understanding (correct me if I'm wrong). 
So I don't imagine the legacy/ADMA mode switch could be the cause of 
this problem.

I also saw in my previous investigation that a flush immediately 
followed by a write could cause the write to time out as well.

From some of the traces I took previously (posted on LKML as "sata_nv 
ADMA controller lockup investigation" way back in Feb 2007), what seems 
to occur is this: when the second command is issued very rapidly after 
the previous command's completion (within roughly 20 microseconds, and 
possibly longer), the ADMA status changes from 0x500 (STOPPED and IDLE) 
to 0x400 (just IDLE) as it typically does, but then it sticks there, no 
interrupt is ever raised, and the CPB response flags remain at 0.
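
For anyone reading traces, the status values above can be decoded with
a couple of masks. The sketch below is only illustrative: the mask
values are inferred from the 0x500 and 0x400 readings quoted above, and
the helper names (adma_is_stopped, adma_is_idle,
adma_matches_hang_signature, adma_check_masks) are hypothetical. Note
that 0x400 is also the normal status right after a command is issued,
so the signature only indicates a hang if it persists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inferred from the text: 0x500 = STOPPED | IDLE, 0x400 = IDLE only. */
#define ADMA_STAT_STOPPED 0x0100u
#define ADMA_STAT_IDLE    0x0400u

bool adma_is_stopped(uint16_t status)
{
    return (status & ADMA_STAT_STOPPED) != 0;
}

bool adma_is_idle(uint16_t status)
{
    return (status & ADMA_STAT_IDLE) != 0;
}

/* True when a snapshot matches the stuck state described above:
 * status is exactly IDLE, no interrupt was seen, and the CPB response
 * flags are still 0. Callers must check that it persists over time. */
bool adma_matches_hang_signature(uint16_t status, bool irq_seen,
                                 uint8_t cpb_resp_flags)
{
    return status == ADMA_STAT_IDLE && !irq_seen && cpb_resp_flags == 0;
}

/* Sanity check: the two masks reproduce the values quoted in the text. */
void adma_check_masks(void)
{
    assert((ADMA_STAT_STOPPED | ADMA_STAT_IDLE) == 0x500);
    assert(ADMA_STAT_IDLE == 0x400);
}
```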

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@...pamshaw.ca
Home Page: http://www.roberthancock.com/