linux-kernel - Re: Highly critical bug in XHCI Controller

Message-ID: <20241118222355.482cf783@foxbook>
Date: Mon, 18 Nov 2024 22:23:55 +0100
From: Michał Pecio <michal.pecio@...il.com>
To: linuxusb.ml@...dtek.de
Cc: linux-kernel@...r.kernel.org, linux-usb@...r.kernel.org,
 stern@...land.harvard.edu
Subject: Re: Highly critical bug in XHCI Controller

> In my experience with USB anything that is a 'temporary' failure can
> be considered as 'permanent' failure and I've really seen a lot over
> the last 1 1/2 decades.
> However issues are mostly related to immature controllers / missing
> quirks for some controllers.
> Our devices in the field since 2008 usually pump around 100-300mbit
> through the USB 2 link,
> streaming  devices which usually run for a long period of time (up to
> months / years).
> 'retrying' something on a link where something has gone wrong for sure
> never worked properly for me, it would have continued with another
> followup issue at some point.

You may have simply seen hardware going dead or buggy drivers failing
to recover from recoverable errors.

Random bit errors really happen and (excepting isochronous endpoints)
can be recovered from. But if you get -EPROTO on a bulk endpoint, for
example, it means the endpoint halted and should be reset. Few Linux
drivers seem to bother with such things.
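
To make the recovery path concrete, here is a minimal userspace sketch in C. It is only a model, not kernel code: `struct fake_ep`, `fake_submit()`, `fake_clear_halt()` and `bulk_transfer_with_recovery()` are hypothetical names invented for illustration, and the retry limit is an arbitrary assumption. In a real driver the reset step would be a call to usb_clear_halt().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a bulk endpoint; not a kernel structure. */
struct fake_ep {
    bool halted;          /* set once a transaction error occurs */
    int  errors_pending;  /* number of bit errors the "link" will inject */
};

/* Stand-in for submitting a transfer and waiting for its completion. */
int fake_submit(struct fake_ep *ep)
{
    if (ep->halted)
        return -EPROTO;   /* model assumption: fails until the endpoint is reset */
    if (ep->errors_pending > 0) {
        ep->errors_pending--;
        ep->halted = true;    /* the error halts the endpoint */
        return -EPROTO;
    }
    return 0;
}

/* Stand-in for usb_clear_halt() in a real driver. */
int fake_clear_halt(struct fake_ep *ep)
{
    ep->halted = false;
    return 0;
}

/* Retry a transfer, resetting the endpoint after each -EPROTO. */
int bulk_transfer_with_recovery(struct fake_ep *ep, int max_retries)
{
    int ret = fake_submit(ep);

    for (int attempt = 0; ret == -EPROTO && attempt < max_retries; attempt++) {
        int err = fake_clear_halt(ep);
        if (err)
            return err;       /* could not reset: give up */
        ret = fake_submit(ep);
    }
    return ret;
}
```

The point of the sketch is the ordering: the endpoint is reset before the transfer is resubmitted, and the loop gives up after a bounded number of attempts so that a truly dead device still produces an error.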

I even think xHCI's handling of halted endpoints and usb_clear_halt()
is broken, but it looks like fixing it would break all the buggy class
drivers which are currently "sort of functional".

> Anyway can you give a particular example where this 'retrying'
> mechanism and reloading the endpoint size solves or solved a problem?

It seems to happen when you insert the plug slowly or at an angle, and
contact is briefly lost while the device is being initialized.

The first three lines below come from hub_port_init(), which looks like
it is being called by hub_port_connect() in a loop.

[81169.840924] usb 5-1: new full-speed USB device number 61 using ohci-pci
[81170.387927] usb 5-1: device not accepting address 61, error -62
[81170.742931] usb 5-1: new full-speed USB device number 62 using ohci-pci
[81170.901914] usb 5-1: New USB device found, idVendor=067b, idProduct=2303, bcdDevice= 3.00
[81170.901919] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[81170.901921] usb 5-1: Product: USB-Serial Controller
[81170.901922] usb 5-1: Manufacturer: Prolific Technology Inc.
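
The pattern in this log (one failed attempt with error -62, which is -ETIME on Linux, then success under a new device number) can be modelled with a short userspace sketch in C. This is not the kernel's code: `model_port_init()`, `model_port_connect()`, `struct model_device` and the retry limit `MODEL_INIT_TRIES` are hypothetical and exist only to illustrate a bounded retry loop.

```c
#include <errno.h>

#define MODEL_INIT_TRIES 4   /* assumed retry limit for this model */

/* Hypothetical device that ignores its first `deaf_attempts` init attempts,
 * e.g. because contact was briefly lost or it is still starting up. */
struct model_device {
    int deaf_attempts;
};

/* Stand-in for a single port initialization attempt. */
int model_port_init(struct model_device *dev)
{
    if (dev->deaf_attempts > 0) {
        dev->deaf_attempts--;
        return -ETIME;       /* appears as "error -62" in Linux logs */
    }
    return 0;
}

/* Stand-in for the connect loop: every attempt uses a fresh device number.
 * Returns the number that worked, or a negative errno if all tries failed. */
int model_port_connect(struct model_device *dev, int first_devnum)
{
    int ret = -ENODEV;

    for (int i = 0; i < MODEL_INIT_TRIES; i++) {
        int devnum = first_devnum + i;

        ret = model_port_init(dev);
        if (ret == 0)
            return devnum;
    }
    return ret;
}
```

With `deaf_attempts = 1` and `first_devnum = 61`, the model fails once and then returns 62, matching the sequence of device numbers in the log.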

Another example which could trigger retries is a device which includes
a permanent "presence detect" resistor (such as PL2303, coincidentally)
but takes a long time to initialize itself and start responding.

Regards,
Michal