linux-kernel - Re: [syzbot] INFO: rcu detected stall in tx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <74b2133b-2f77-c86f-4c8b-1189332617d3@synopsys.com>
Date:   Thu, 20 May 2021 20:30:01 +0000
From:   Thinh Nguyen <Thinh.Nguyen@...opsys.com>
To:     Alan Stern <stern@...land.harvard.edu>,
        Thinh Nguyen <Thinh.Nguyen@...opsys.com>,
        Mathias Nyman <mathias.nyman@...el.com>
CC:     Guido Kiener <Guido.Kiener@...de-schwarz.com>,
        dave penkler <dpenkler@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        syzbot <syzbot+e2eae5639e7203360018@...kaller.appspotmail.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "lee.jones@...aro.org" <lee.jones@...aro.org>,
        USB list <linux-usb@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        "dwmw@...zon.co.uk" <dwmw@...zon.co.uk>,
        "hpa@...or.com" <hpa@...or.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "luto@...nel.org" <luto@...nel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "syzkaller-bugs@...glegroups.com" <syzkaller-bugs@...glegroups.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [syzbot] INFO: rcu detected stall in tx

+Mathias

Alan Stern wrote:
> On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
>> Hi Alan,
>>
>> Sorry if this diverges from the thread, but I've been wondering whether
>> to add a change for this also.
>>
>> For xHCI hosts, after transactions errors, the endpoint will enter
>> halted state.
> 
> No.  You are misreading the xHCI spec.  Section 4.6.8 says:
> 
> 	... the state of the associated Endpoint Context is set to 
> 	Halted...
> 
> Note this carefully.  It says "Endpoint Context", not "endpoint".
> 
> The endpoint is part of the device, whereas the endpoint context is part 
> of the host controller.  The device doesn't know when a transaction 
> error has occurred; consequently such errors do not affect the endpoint.  
> The host controller does know, and consequently such errors do affect 
> the endpoint context.
> 

You're right, my mistake here.

>> The driver will attempt a few soft-retries before giving
>> up. According to the xHCI spec (section 4.6.8), a host may send a
>> ClearFeature(endpoint_halt) to recover and restart the transfer (see
> 
> Not quite.  The section of the spec you're talking about says:
> 
> 	Software shall execute the following sequence to “reset a 
> 	pipe”....  Issue a ClearFeature(ENDPOINT_HALT) request to 
> 	device.
> 
> It does not say the host controller will do this; it says that software 
> will do it.

Sorry for being unclear. I meant from the class driver, see my next
sentence.

> 
>> "reset a pipe" in xhci spec), and the class driver can handle this after
>> receiving something like -EPROTO from xhci.
>>
>> However, as you've pointed out, some devices don't like
>> ClearFeature(ep_halt) and may not properly synchronize with the host on
>> where it should restart.
>>
>> Some OS (such as Windows) do this. Not sure if we also want this?
> 
> In general we should do the same thing as Windows does, because most 
> hardware designers test their equipment on Windows systems but 
> relatively few test on Linux systems.
> 
>> Currently the recovery is just a timeout and a port reset from the class
> 
> This depends on the driver.  Some perform no recovery at all.
> 
>> driver, but the timeout is usually defaulted to a long time (e.g. 30
>> seconds for storage class driver).
> 
> That 30-second timeout in the mass-storage driver applies in situations 
> where a command fails to complete, not in situations where it completes 
> quickly but with a -EPROTO or -EPIPE error.

Hm... looks like we have a couple of issues in the uas storage class
driver and the xhci driver.

We may need to fix that in the uas storage driver because it doesn't
seem to handle it. (check uas_data_cmplt() in uas.c).

As for the xhci driver, there maybe a case where the stream URB never
gets to complete because the transaction err_count is not properly
updated. The err_count for transaction error is stored in ep_ring, but
the xhci driver may not be able to lookup the correct ep_ring based on
TRB address for streams. There are cases for streams where the event
TRBs have their TRB pointer field cleared to '0' (xhci spec section
4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
it automatically does a soft-retry. This is seen from one of our
testings that the driver was repeatedly doing soft-retry until the class
driver timed out.

Hi Mathias, maybe you have some comment on this? Thanks.

> 
> The fact is that only a small percentage of -EPROTO errors are 
> recoverable.  Some of them can be handled by a port reset, which can be 
> pretty awkward to perform but does occasionally work.  A lot of them 
> occur because the USB cable has been unplugged; obviously there's no way 
> to recover from that.  With only a few exceptions, the best and simplest 
> approach is not to try to recover at all.

If the cable is unplugged, then we should get a connection change event
and the driver can handle it properly.

Yes, it's probably simplest to do a port reset and let the transfer be
incomplete/corrupted. However, I think we should give
ClearFeature(ep_halt) some more thoughts as I think it can be a recovery
mechanism for storage class driver, even though that it may not be
foolproof.

> 
> For the case in question (the syzbot bug report that started this 
> thread), the class driver doesn't try to perform any recovery.  It just 
> resubmits the URB, getting into a tight retry loop which consumes too 
> much CPU time.  Simply giving up would be preferable.
> 
> Alan Stern
> 

I see. By giving up, you mean doing port reset right? Otherwise it needs
some other mechanism to synchronize with the device side.

Thanks,
Thinh