Date:	Mon, 08 Jun 2009 17:38:16 +0000
From:	James Bottomley <James.Bottomley@...senPartnership.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Chris Clayton <chris2553@...glemail.com>,
	Jaswinder Singh Rajput <jaswinder@...nel.org>,
	NeilBrown <neilb@...e.de>, linux-kernel@...r.kernel.org,
	scsi <linux-scsi@...r.kernel.org>, Tejun Heo <tj@...nel.org>,
	Arjan van de Ven <arjan@...ux.intel.com>
Subject: Re: 2.6.30-rc8 Oops whilst booting

On Mon, 2009-06-08 at 10:21 -0700, Linus Torvalds wrote:
> 
> On Mon, 8 Jun 2009, James Bottomley wrote:
> > 
> > The root cause is a reordering of the devices caused by the async code.
> 
> That's NULL information.
> 
> OF COURSE the root cause is the async code. We know that. We're looking 
> for the specifics.
> 
> In particular, before that commit, at most you will wait for too _much_. 
> In other words, it's a "good" wait. 
> 
> Your commit caused it to wait for less, and that then showed a bug. Not 
> all that surprising - it's now not waiting enough.

right ... my question was whether this exposed an existing bug that was
hidden by the waiting too much.  Actually, I audited all the async code
and that's impossible: we don't actually have any async domains at all
(except for the spurious superblock s_async_list, which never gets
anything added to its runqueue), so it must be a bug in the code.

> You tried to avoid a deadlock situation of waiting for too much, but you 
> avoided the deadlock by now waiting for too little. 
> 
> I also think that your code is simply buggy. As far as I can tell, in the 
> case of having both running and pending events, you'll always pick the 
> pending cookie. But it's the _running_ cookie that has the lower event 
> number, isn't it?

Yes, see later fix.  Assuming we get confirmation from the reporter, we
should be good to go.
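
To illustrate the point about cookie ordering, here is a minimal sketch in plain C. It is not the kernel's code: the struct, the field names, and both function names are hypothetical, and it assumes each list is kept sorted by ascending cookie, so that the first entry of each is its lowest. The first function lets a pending entry override a running one, which is the behaviour described above; the second takes the minimum of the two heads.

```c
#include <stddef.h>

typedef unsigned long long cookie_t;

/* Hypothetical stand-in for one async domain. Both arrays are assumed
 * to be sorted by ascending cookie, so element 0 is the lowest. */
struct domain_sketch {
	const cookie_t *running;	/* entries currently executing */
	size_t n_running;
	const cookie_t *pending;	/* entries queued but not started */
	size_t n_pending;
	cookie_t next_cookie;		/* cookie the next submission would get */
};

/* Flawed variant: when both lists are non-empty, the pending head
 * overwrites the running head, even though the running entry was
 * submitted earlier and so carries the lower cookie. */
cookie_t lowest_in_flight_buggy(const struct domain_sketch *d)
{
	cookie_t lowest = d->next_cookie;

	if (d->n_running)
		lowest = d->running[0];
	if (d->n_pending)
		lowest = d->pending[0];
	return lowest;
}

/* Corrected variant: start from next_cookie (nothing in flight) and
 * lower the result with whichever list head is smaller. */
cookie_t lowest_in_flight(const struct domain_sketch *d)
{
	cookie_t lowest = d->next_cookie;

	if (d->n_running && d->running[0] < lowest)
		lowest = d->running[0];
	if (d->n_pending && d->pending[0] < lowest)
		lowest = d->pending[0];
	return lowest;
}
```

A waiter that synchronizes up to the value returned by the flawed variant can return while a lower-numbered running entry is still executing, which is the "waiting for too little" case.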

> I dunno. It all looks very fishy to me.

Well, the other option is to revert the fix ... since there is no other
separated domain, there's nothing really to fix ... the original code
that showed the problem was a SCSI feature tree conversion of our
current async scanning code to the async infrastructure which used a
separate domain.

James

