linux-kernel - [regression] usb: sometimes dead keyboard after boot (was: new errors during device detection)

Message-Id: <200808262103.50984.elendil@planet.nl>
Date:	Tue, 26 Aug 2008 21:03:50 +0200
From:	Frans Pop <elendil@...net.nl>
To:	Alan Stern <stern@...land.harvard.edu>
Cc:	linux-kernel@...r.kernel.org,
	Kernel Testers List <kernel-testers@...r.kernel.org>,
	linux-usb@...r.kernel.org
Subject: [regression] usb: sometimes dead keyboard after boot (was: new errors during device detection)

Sorry for the delay since the last message, but I needed to digest this a
bit and also did not see any point in letting the (so far very useful)
discussion degenerate into a yes-no-yes-no flamefest.

Below are some comments, plus what looks to be a real regression.

On Thursday 07 August 2008, Alan Stern wrote:
> On Wed, 6 Aug 2008, Frans Pop wrote:
> > I do find it a bit strange though that EHCI is allowed to grab bus 3
> > when UHCI has already identified a low speed device on that bus.
>
> Here's how it works.  An EHCI controller contains a bank of switches,
> one for each port.  By default, the switches are set so that each port
> connects to the companion UHCI (or OHCI) controller; that way you get
> USB-1.1 functionality even if ehci-hcd isn't loaded.
>
> But when the driver loads, it resets the switches so that the ports
> connect to the EHCI controller.  There is no way for the driver to tell
> which ports have devices attached and which don't, so it has to reset
> all the switches.  Thus, any device which was connected to the UHCI
> controller is now connected to the EHCI controller.  As far as uhci-hcd
> is concerned, it appears that all the devices were suddenly unplugged.
>
> As each device is enumerated, ehci-hcd determines whether it can run at
> high speed.  If not, the corresponding switch is set so the device
> connects back to the UHCI controller and it runs at full/low speed.
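
To make the mechanism described above concrete, here is a minimal sketch
in Python. It is a toy model under stated assumptions, not kernel code:
the names Port, load_ehci and enumerate_port are invented for
illustration, and the model ignores timing entirely.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Port:
    """One port with its routing switch (hypothetical toy model)."""
    device_speed: Optional[str] = None  # "low", "full", "high", or None if empty
    owner: str = "uhci"                 # default: routed to the companion controller

def load_ehci(ports):
    """ehci-hcd loads: every switch is flipped, attached device or not."""
    for port in ports:
        port.owner = "ehci"

def enumerate_port(port):
    """EHCI enumerates a port; a device that cannot run at high speed
    is switched back to the companion (UHCI) controller."""
    if port.device_speed is not None and port.device_speed != "high":
        port.owner = "uhci"
    return port.owner

# A low-speed keyboard, a high-speed disk, and an empty port.
ports = [Port("low"), Port("high"), Port()]
load_ehci(ports)
owners_after_load = [p.owner for p in ports]
owners_after_enum = [enumerate_port(p) for p in ports]
print(owners_after_load)   # every port now belongs to EHCI
print(owners_after_enum)   # only the low-speed device went back to UHCI
```

The point of the sketch is the window between the two steps: between
load_ehci and enumerate_port the low-speed device is owned by EHCI, which
is when uhci-hcd sees it vanish.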

Thanks a lot for the explanation, Alan. I get the general idea, and it all
sounds somewhat logical if you accept as a given that EHCI can be loaded
at any random time after [UO]HCI. But _that_ still seems to me (admittedly
a relative outsider, not hindered by any actual technical knowledge ;-)
to be what is fundamentally broken in this sequence.

It also seems to be fragile in practice. On two occasions since your last
mail my system has come up with a dead USB keyboard, and it looks like
this issue is the root cause.

Attached is a full diff between the dmesg output of two consecutive boots:
on the first the keyboard is missing; after a reboot it is detected. The
actual difference is fairly small and clearly shows that usb 3-1 is not
handed off correctly, probably due to a small difference in timing.

Note that I've never seen this problem with earlier kernels.

> > > If you want to prevent all errors of this sort, all you have to do
> > > is insure that ehci-hcd is loaded before either uhci-hcd or
> > > ohci-hcd during system startup.
> >
> > Hmmm. Also not sure that I'm ready to agree with this conclusion.
>
> It follows directly from the description above.  If ehci-hcd is loaded
> first then all the switches will be reset before any device has a
> chance to register with the UHCI driver.  Hence uhci-hcd will never see them
> suddenly disconnect and will not generate those error messages.
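
The load-order argument above can be sketched the same way. This is a
deliberately simplified Python model, assuming that a device counts as
"registered" the moment uhci-hcd loads and that devices handed back by
EHCI show up as ordinary new connections; the function name and the port
label are illustrative only.

```python
def disconnects_seen_by_uhci(load_order, attached_ports):
    """Count devices uhci-hcd had registered when ehci-hcd took the ports away."""
    registered = set()   # ports whose devices uhci-hcd currently knows about
    disconnects = 0
    for driver in load_order:
        if driver == "uhci-hcd":
            # uhci-hcd registers whatever is attached when it loads.
            registered |= set(attached_ports)
        elif driver == "ehci-hcd":
            # All switches are reset: every registered device disappears.
            disconnects += len(registered)
            registered.clear()
    return disconnects

uhci_first = disconnects_seen_by_uhci(["uhci-hcd", "ehci-hcd"], ["3-1"])
ehci_first = disconnects_seen_by_uhci(["ehci-hcd", "uhci-hcd"], ["3-1"])
print(uhci_first, ehci_first)
```

With uhci-hcd loaded first the model reports one unexpected disconnect;
with ehci-hcd loaded first it reports none.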

I still feel that individual users should not have to "force" something
like this by manually messing with their initramfs or /etc/modules. If
loading EHCI first is the right thing to do (and it seems to me that it
is), then the kernel itself should ensure that that is what happens.

> > Shouldn't the kernel itself be smart enough to prevent error messages
> > in apparently predictable and probably fairly common scenarios?
>
> It's somewhat difficult to synchronize activities between two different
> drivers, especially when they can be in separate modules (so that one
> might be present in memory and the other not).

Ack.

> As for whether these messages really _should_ be suppressed...  That
> depends on the circumstances.  In your case, yes.  But suppose for some
> reason ehci-hcd was loaded much later, at a time when the devices
> connected to the UHCI controller were already in use.  In that case it
> seems reasonable to log some error messages when the devices stop
> working.

From an end-user PoV (which is basically mine) I personally don't think it
is reasonable to have _any_ error messages in situations that are expected
and part of a "normal" boot sequence. For me, error messages always
indicate that something is wrong or broken and needs to be followed up on
and fixed. So, if this driver hand-off is really necessary, expected and
safe, it should be done with only informational messages, not errors.

Even in the case where ehci-hcd is loaded much later I don't think error
messages would be right, at least assuming that the kernel can guarantee
that the driver hand-off is done cleanly (without risk of damaging
interruptions to devices that are already connected). And if it cannot
guarantee that, then maybe it should just refuse to load ehci-hcd at all!


Side note.
Both as a Debian Developer and as a kernel tester I probably pay more
attention than most users to my console and logs, but in principle I try
to follow up on any message that does not seem to belong, especially ones
that are "new".
I boot kernels with 'quiet', so any error during boot is immediately
visible (and disturbing). I also run logcheck on all my systems, so I see
any unexpected log messages during normal operation. Finally, as boot logs
are noisy by definition, I diff the old and new boot-time dmesg after most
new (rc) kernel builds.
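
For anyone wanting to do the same, a boot-log comparison of this kind can
be sketched in a few lines of Python. This is only one possible approach:
stripping the leading timestamps before diffing is an assumption, and the
sample log lines below are made up for illustration rather than taken
from the attached diff.

```python
import difflib
import re

# Leading printk timestamp such as "[    1.234567] "; these differ on every
# boot and would otherwise make every line show up in the diff.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")

def strip_timestamps(lines):
    """Remove the leading timestamp from each dmesg line, if present."""
    return [TIMESTAMP.sub("", line) for line in lines]

def dmesg_diff(old_lines, new_lines):
    """Unified diff of two boot logs with timestamps ignored."""
    return list(difflib.unified_diff(
        strip_timestamps(old_lines),
        strip_timestamps(new_lines),
        fromfile="dmesg.old",
        tofile="dmesg.new",
        lineterm="",
    ))

# Invented sample data: the second boot lacks the keyboard line.
old = [
    "[    1.234567] usb 3-1: new low speed USB device",
    "[    1.500000] input: USB Keyboard",
]
new = [
    "[    1.250001] usb 3-1: new low speed USB device",
]
result = dmesg_diff(old, new)
for line in result:
    print(line)
```

In real use the two lists would come from saved dmesg output of the two
boots, for example via open(path).read().splitlines().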

Call it my contribution to quality assurance.

Cheers,
FJP


View attachment "dmesg.no_keyboard.diff" of type "text/x-diff" (36607 bytes)