Message-Id: <20071016182634.8956423BCED@adsl-69-226-248-13.dsl.pltn13.pacbell.net>
Date:	Tue, 16 Oct 2007 11:26:34 -0700
From:	David Brownell <david-b@...bell.net>
To:	davem@...emloft.net
Cc:	stern@...land.harvard.edu, linux-usb-users@...ts.sourceforge.net,
	linux-kernel@...r.kernel.org, greg@...ah.com
Subject: Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...

> > > Bad news, even with the rwsem after a lot more testing I can still
> > > trigger the hang in ohci_hub_control() :-(
> > >
> > > I think we need to go back to considering the total serialization
> > > approach to this problem.
> > 
> > We shouldn't need that.  What happens if you add an msleep(5)
> > before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?
>
> What happens is the heisenbug will go away for another week.

Not if what I suggested is really what's happening.
(Quoted next.)

It's got to be just a *simple* hardware race, and the msleep would
reliably prevent it since the switch takes a finite amount of time
to do its job.  I've had to struggle with real heisenbugs, and this
doesn't have enough conflicting behaviors inside the silicon (or
poor enough design) to qualify.
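
To make the reasoning concrete, here is a minimal userspace sketch of that race. It is a model, not driver code: `struct cf_switch`, `model_ehci_run()`, `model_switch_settled()` and the 1000 usec settle time used in the usage note are all hypothetical, chosen only for illustration. The one assumption it encodes is that the switch completes a finite time after CF is set, so a companion-controller reset is safe only if that time has passed when the lock is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the port-routing switch: it completes settle_us
 * microseconds after CF is set.  The real settle time is not known here. */
struct cf_switch {
	bool cf_set;
	uint64_t cf_set_at_us;
	uint64_t settle_us;
};

/* Model of the start-up path: set CF at now_us, wait delay_us (standing in
 * for the suggested msleep), then drop the lock.  Returns the simulated
 * time at which the lock is released. */
uint64_t model_ehci_run(struct cf_switch *sw, uint64_t now_us, uint64_t delay_us)
{
	assert(sw != NULL);
	sw->cf_set = true;
	sw->cf_set_at_us = now_us;
	return now_us + delay_us;
}

/* True once the modeled switch has finished, i.e. a port reset on the
 * companion controller would no longer overlap the switch. */
bool model_switch_settled(const struct cf_switch *sw, uint64_t now_us)
{
	return sw->cf_set && now_us - sw->cf_set_at_us >= sw->settle_us;
}
```

With a hypothetical settle time of 1000 usec, `model_ehci_run(&sw, 0, 0)` releases the lock before the switch has settled, while a 5000 usec delay releases it afterwards. In this model any delay at least as long as the settle time closes the window every time, which is the sense in which the fix would be reliable rather than merely making the race rarer.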


> > The theory there being that the switch triggered by setting CF
> > doesn't take effect instantaneously, contrary to the effective
> > assumption of that code.  A delay of 5 msec seems like it should
> > be more than enough, but that's kind of a guess ... it's good to
> > keep that low, since unfortunately that's in the critical path
> > for OLPC "resume from idle".
>
> I want to help with this, but if I even breathe on the kernel the bug
> goes away.  The race just gets harder to trigger, and if we just keep
> adding things it'll make the problem go away but for the absolutely
> wrong reasons.

So, you're unwilling to explore whether that suggestion addresses
this problem.


> The only way we will provably fix this is to make sure EHCI initializes
> fully, first, regardless of kernel config or what userland does.

As Alan noted: no can do, in general.  That's why I've not griped
harder at the distro vendors who are ignoring the fairly simple
recommendation that's been around for six years now:  load EHCI
before other USB controller drivers.  

Admittedly, until you turned up this glitch there was no downside
known beyond the boot slowdown.


> Also, David, you haven't done anything with the feedback I gave to the
> most recent revision of the OHCI hub reset anti-wedge patch.

It's in a different mailbox, sorry.


> You
> removed the debug logging when the outer-loop timeout expires, and I
> asked that you put that back so that if it happens there is some
> chance to know that this is what happened.  If it's not supposed to
> happen, there is no harm in putting the debugging log message there
> so that if the impossible does happen we find out about it.

It will exit by the inner loop (with diagnostic) before it exits
from the outer one.  Then the hub logic and other code will give
even more messages.
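
As a rough illustration of that loop structure, here is a simplified userspace model. It is not the patch itself: `struct sim_port`, `sim_reset_in_progress()`, `sim_port_reset()`, the limits and the message text are all hypothetical. It shows a bounded outer loop that issues reset pulses and a bounded inner loop that waits for a pulse in flight to finish, printing a diagnostic if that wait times out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum reset_result { RESET_DONE, RESET_INNER_TIMEOUT, RESET_OUTER_LIMIT };

/* Hypothetical simulated port: after each pulse it reports "reset in
 * progress" for busy_polls polls; a stuck port never clears the bit. */
struct sim_port {
	bool stuck;
	int busy_polls;
	int remaining;
};

bool sim_reset_in_progress(struct sim_port *p)
{
	if (p->stuck)
		return true;
	if (p->remaining > 0) {
		p->remaining--;
		return true;
	}
	return false;
}

enum reset_result sim_port_reset(struct sim_port *p, int outer_limit,
				 int inner_limit, int pulses_needed)
{
	int pulses = 0;

	assert(p != NULL && inner_limit > 0);
	for (int outer = 0; outer < outer_limit; outer++) {
		int polls = 0;

		/* Inner loop: wait for any reset pulse in flight to finish. */
		while (sim_reset_in_progress(p)) {
			if (++polls >= inner_limit) {
				fprintf(stderr, "sim: port reset timeout after %d polls\n",
					polls);
				return RESET_INNER_TIMEOUT;
			}
		}
		if (pulses == pulses_needed)
			return RESET_DONE;
		/* Start the next pulse. */
		p->remaining = p->busy_polls;
		pulses++;
	}
	return RESET_OUTER_LIMIT;
}
```

In this model a port whose reset bit never clears leaves through the inner loop, with its message, on the first pass, so the outer bound is never reached for a wedged port. The separate `RESET_OUTER_LIMIT` value exists only so that a caller of the model can tell the two exits apart.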


> I really don't think it's appropriate for that bug fix to sit yet
> another week.

The version I sent should just merge.

- Dave
