linux-kernel - Re: [syzbot] INFO: rcu detected stall in tx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <12088413-2f7d-a1e5-5e8a-25876d85d18a@synopsys.com>
Date:   Wed, 19 May 2021 19:38:52 +0000
From:   Thinh Nguyen <Thinh.Nguyen@...opsys.com>
To:     Alan Stern <stern@...land.harvard.edu>,
        Guido Kiener <Guido.Kiener@...de-schwarz.com>
CC:     dave penkler <dpenkler@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        syzbot <syzbot+e2eae5639e7203360018@...kaller.appspotmail.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "lee.jones@...aro.org" <lee.jones@...aro.org>,
        USB list <linux-usb@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        "dwmw@...zon.co.uk" <dwmw@...zon.co.uk>,
        "hpa@...or.com" <hpa@...or.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "luto@...nel.org" <luto@...nel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "syzkaller-bugs@...glegroups.com" <syzkaller-bugs@...glegroups.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [syzbot] INFO: rcu detected stall in tx

Alan Stern wrote:
> On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
>>> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
>>>> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@...land.harvard.edu> wrote:
>>>>>
>>>>> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
>>>>>> When the host driver detects a protocol error while processing an
>>>>>> URB it completes the URB with EPROTO status and marks the endpoint
>>>>>> as halted.
>>>>>
>>>>> Not true.  It does not mark the endpoint as halted, not unless it
>>>>> receives a STALL handshake from the device.  A STALL is not a
>>>>> protocol error.
>>>>>
>>>>>> When the class driver resubmits the URB and the if the host driver
>>>>>> finds the endpoint still marked as halted it should return EPIPE
>>>>>> status on the resubmitted URB
>>>>>
>>>>> Irrelevant.
>>>> Not at all. The point is that when an application is talking to an
>>>> instrument over the usbtmc driver, the underlying host controller and
>>>> its driver will detect and silence a babbling endpoint.
>>>
>>> No, they won't.  That is, they will detect a babble error and return an error status, but
>>> they won't silence the endpoint.  What makes you think they will?
>>
>> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
>> "EPROTO is a link level issue and needs to be handled by the host driver.
>> When the host driver detects a protocol error while processing an
>> URB it SHOULD complete the URB with EPROTO status
> 
> The host controller drivers _do_ complete URBs with -EPROTO (or similar) 
> status when a link-level error occurs...
> 
>> and SHOULD mark the endpoint
>> as halted."
> 
> but they don't mark the endpoint as halted.  Even if they did, it 
> wouldn't fix anything because the kernel allows URBs to be submitted to 
> halted endpoints.  In fact, it doesn't even keep track of which 
> endpoints are or are not halted.
> 
>> Is this a realistic fix for all host drivers?
> 
> No, it isn't.
> 
> An endpoint shouldn't be marked as halted unless it really is halted.  
> Otherwise a driver might be tempted to clear the Halt feature, and 
> some devices do not like to receive a Clear-Halt request for an endpoint 
> that isn't halted.
> 
> What we could do is what you suggested earlier: Note the fact that the 
> endpoint is in some sort of fault condition and disallow further 
> communication with the endpoint until the fault condition has been 
> cleared.  (It isn't entirely obvious exactly what actions should clear 
> such a fault...  I guess resetting or re-enabling the endpoint, or 
> resetting the entire device.)
> 
> Alan Stern
> 

Hi Alan,

Sorry if this diverges from the thread, but I've been wondering whether
to add a change for this also.

For xHCI hosts, after transactions errors, the endpoint will enter
halted state. The driver will attempt a few soft-retries before giving
up. According to the xHCI spec (section 4.6.8), a host may send a
ClearFeature(endpoint_halt) to recover and restart the transfer (see
"reset a pipe" in xhci spec), and the class driver can handle this after
receiving something like -EPROTO from xhci.

However, as you've pointed out, some devices don't like
ClearFeature(ep_halt) and may not properly synchronize with the host on
where it should restart.

Some OS (such as Windows) do this. Not sure if we also want this?
Currently the recovery is just a timeout and a port reset from the class
driver, but the timeout is usually defaulted to a long time (e.g. 30
seconds for storage class driver).

Thanks,
Thinh