lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Wed, 20 Oct 2021 22:56:42 +0200 From: Krzysztof Kozlowski <krzysztof.kozlowski@...onical.com> To: Alan Stern <stern@...land.harvard.edu>, Felipe Balbi <balbi@...nel.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org> Cc: syzbot <syzbot+abd2e0dafb481b621869@...kaller.appspotmail.com>, linux-kernel@...r.kernel.org, netdev@...r.kernel.org, syzkaller-bugs@...glegroups.com, Pavel Skripkin <paskripkin@...il.com>, Thierry Escande <thierry.escande@...labora.com>, Andrey Konovalov <andreyknvl@...il.com> Subject: Re: [syzbot] INFO: task hung in port100_probe On 22/07/2021 16:47, Alan Stern wrote: > On Thu, Jul 22, 2021 at 04:20:10PM +0200, Krzysztof Kozlowski wrote: >> On 22/06/2021 17:43, syzbot wrote: >>> Hello, >>> >>> syzbot found the following issue on: >>> >>> HEAD commit: fd0aa1a4 Merge tag 'for-linus' of git://git.kernel.org/pub.. >>> git tree: upstream >>> console output: https://syzkaller.appspot.com/x/log.txt?x=13e1500c300000 >>> kernel config: https://syzkaller.appspot.com/x/.config?x=7ca96a2d153c74b0 >>> dashboard link: https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869 >>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1792e284300000 >>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=13ad9d48300000 >>> >>> IMPORTANT: if you fix the issue, please add the following tag to the commit: >>> Reported-by: syzbot+abd2e0dafb481b621869@...kaller.appspotmail.com >>> >>> INFO: task kworker/0:1:7 blocked for more than 143 seconds. >>> Not tainted 5.13.0-rc6-syzkaller #0 >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>> task:kworker/0:1 state:D stack:25584 pid: 7 ppid: 2 flags:0x00004000 >>> Workqueue: usb_hub_wq hub_event >>> Call Trace: >>> context_switch kernel/sched/core.c:4339 [inline] >>> __schedule+0x916/0x23e0 kernel/sched/core.c:5147 >>> schedule+0xcf/0x270 kernel/sched/core.c:5226 >>> schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 >>> do_wait_for_common kernel/sched/completion.c:85 [inline] >>> __wait_for_common kernel/sched/completion.c:106 [inline] >>> wait_for_common kernel/sched/completion.c:117 [inline] >>> wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 >>> port100_send_cmd_sync drivers/nfc/port100.c:923 [inline] >>> port100_get_command_type_mask drivers/nfc/port100.c:1008 [inline] >>> port100_probe+0x9e4/0x1340 drivers/nfc/port100.c:1554 >>> usb_probe_interface+0x315/0x7f0 drivers/usb/core/driver.c:396 > ... > >> Cc: Thierry, Alan, Andrey, >> >> The issue is reproducible immediately on QEMU instance with >> USB_DUMMY_HCD and USB_RAW_GADGET. I don't know about real port100 NFC >> device. >> >> I spent some time looking into this and have no clue, except that it >> looks like an effect of a race condition. >> >> 1. When using syskaller reproducer against one USB device (In the C >> reproducer change the loop in main() to use procid=0) - issue does not >> happen. >> >> 2. With two threads or more talking to separate Dummy USB devices, the >> issue appears. The more of them, the better... >> >> 3. The reported problem is in missing complete. The correct flow is like: >> port100_probe() >> port100_get_command_type_mask() >> port100_send_cmd_sync() >> port100_send_cmd_async() >> port100_submit_urb_for_ack() >> port100_send_complete() >> [ 63.363863] port100 2-1:0.0: NFC: Urb failure (status -71) >> port100_recv_ack() >> [ 63.369942] port100 2-1:0.0: NFC: Urb failure (status -71) >> >> and schedule_work() which completes and unblocks port100_send_cmd_sync >> >> However in the failing case (hung task) the port100_recv_ack() is never >> called. It looks like USB core / HCD / gadget does not send the Ack/URB >> complete. >> >> I don't know why. The port100 NFC driver code looks OK, except it is not >> prepared for missing ack/urb so it waits indefinitely. I could try to >> convert it to wait_for_completion_timeout() but it won't be trivial and >> more important - I am not sure if this is the problem. Somehow the ACK >> with Urb failure is not sent back to the port100 device. Therefore I am >> guessing that the race condition is somwhere in USB stack, not in >> port100 driver. >> >> The lockdep and other testing tools did not find anything here. >> >> Anyone hints where the issue could be? > > Here's what I wrote earlier: "It looks like the problem stems from the fact > that port100_send_frame_async() submits two URBs, but > port100_send_cmd_sync() only waits for one of them to complete. The other > URB may then still be active when the driver tries to reuse it." > > Of course, there may be more than one problem, so we may not be talking > about the same thing. Hi Alan, Felipe, Greg and others, This is an old issue reported by syzkaller for NFC port100 driver [1]. There is something similar for pn533 [2]. I was looking at it some time ago, took a break and now I am trying to fix it again. Without success. The issue is reproducible via USB gadget on QEMU, not on real HW. I looked and debugged the code and I think previously mentioned double-URB-submit is not the reason here. Or I miss how the USB works (which is quite probable...). 1. The port100 driver calls port100_send_cmd_sync() which eventually goes to port100_send_frame_async(). After it, it waits for "sync" completion. 2. In port100_send_frame_async(), driver indeed first submits "out_urb" which quite fast is being processed by dummy_hcd with "no ep configured" and -EPROTO. 3. Then (or sometimes before -EPROTO response from (2) above) the port100_send_frame_async() submits "in_urb" via port100_submit_urb_for_ack() and waits for its completion. Completion of "in_urb" (or the "ack") in port100_recv_ack() would schedule work to complete the (1) above - the sync completion. 4. Usually, when reproducer works fine (does not trigger issue), the dummy_timer() from gadget responds with the same "no ep configured for urb" for this "in_urb" (3). This completes "in_urb", which eventually completes (1) and probe finishes with error. Error is expected, because it's random junk-gadget... The syzkaller reproducer fails if >1 of threads are running these usb gadgets. When this happens, no "in_urb" completion happens. No this "ack" port100_recv_ack(). I added some debugs and simply dummy_hcd dummy_timer() is woken up on enqueuing in_urb and then is looping crazy on a previous URB (some older URB, coming from before port100 driver probe started). The dummy_timer() loop never reaches the second "in_urb" to process it, I think. The pn533 NFC driver has similar design, but I have now really doubts it is a NFC driver issue. Instead an issue in dummy gadget HCD is somehow triggered by the reproducer. Reproduction - just follow [1] or [2]. Eventually I slightly tweaked the code and put here: https://github.com/krzk/tools/tree/master/tests-var/nfc/port100_probe $ make $ sudo ./port100_probe [1] https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869 [2] https://syzkaller.appspot.com/bug?extid=1dc8b460d6d48d7ef9ca Best regards, Krzysztof
Powered by blists - more mailing lists