linux-kernel - Re: hid-related 5.2-rc1 boot hang

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO-hwJ+ZBuXtuk+i1Q9DwO=uLXDh4oToQyMWhK-8t+ZTS-jUjA@mail.gmail.com>
Date:   Tue, 4 Jun 2019 14:25:07 +0200
From:   Benjamin Tissoires <benjamin.tissoires@...hat.com>
To:     Hans de Goede <hdegoede@...hat.com>
Cc:     Jiri Kosina <jikos@...nel.org>,
        Dave Hansen <dave.hansen@...el.com>,
        "open list:HID CORE LAYER" <linux-input@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: hid-related 5.2-rc1 boot hang

On Tue, Jun 4, 2019 at 12:50 PM Hans de Goede <hdegoede@...hat.com> wrote:
>
> Hi,
>
> On 04-06-19 12:08, Benjamin Tissoires wrote:
> > On Tue, Jun 4, 2019 at 9:51 AM Benjamin Tissoires
> > <benjamin.tissoires@...hat.com> wrote:
> >>
> >> On Mon, Jun 3, 2019 at 4:17 PM Hans de Goede <hdegoede@...hat.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 03-06-19 15:55, Benjamin Tissoires wrote:
> >>>> On Mon, Jun 3, 2019 at 11:51 AM Hans de Goede <hdegoede@...hat.com> wrote:
> >>>>>
> >>>>> Hi Again,
> >>>>>
> >>>>> On 03-06-19 11:11, Hans de Goede wrote:
> >>>>> <snip>
> >>>>>
> >>>>>>> not sure about the rest of logitech issues yet) next week.
> >>>>>>
> >>>>>> The main problem seems to be the request_module patches. Although I also
> >>>>
> >>>> Can't we use request_module_nowait() instead, and set a reasonable
> >>>> timeout that we detect only once to check if userspace is compatible:
> >>>>
> >>>> In pseudo-code:
> >>>> if (!request_module_checked) {
> >>>>     request_module_nowait(name);
> >>>>     use_request_module = wait_event_timeout(wq,
> >>>>           first_module_loaded, 10 seconds in jiffies);
> >>>>     request_module_checked = true;
> >>>> } else if (use_request_module) {
> >>>>     request_module(name);
> >>>> }
> >>>
> >>> Well looking at the just attached dmesg , the modprobe
> >>> when triggered by udev from userspace succeeds in about
> >>> 0.5 seconds, so it seems that the modprobe hangs happens
> >>> when called from within the kernel rather then from within
> >>> userspace.
> >>>
> >>> What I do not know if is the hang is inside userspace, or
> >>> maybe it happens when modprobe calls back into the kernel,
> >>> if the hang happens when modprobe calls back into the kernel,
> >>> then other modprobes (done from udev) likely will hang too
> >>> since I think only 1 modprobe can happen at a time.
> >>>
> >>> I really wish we knew what distinguished working systems
> >>> from non working systems :|
> >>>
> >>> I cannot find a common denominator; other then the systems
> >>> are not running Fedora. So far we've reports from both Ubuntu 16.04
> >>> and Tumbleweed, so software version wise these 2 are wide apart.
> >>
> >> I am trying to reproduce the lock locally, and installed an opensuse
> >> Tumbleweed in a VM. When forwarding a Unifying receiver to the VM, I
> >> do not see the lock with either my vanilla compiled kernel and the rpm
> >> found in http://download.opensuse.org/repositories/Kernel:/HEAD/standard/x86_64/
> >>
> >> Next step is install Tumbleweed on bare metal, but I do not see how
> >> this could introduce a difference (maybe USB2 vs 3).
> >
> > Making progress here.
> >
> > The difference between Ubuntu/Tumbleweed and Fedora: usbhid is shipped
> > as a module while in Fedora usbhid is included in the kernel.
> >
> > If I rmmod hid_* and usbhid, then modprobe usbhid, the command hangs
> > for 3 minutes.
> > If usbhid is already loaded, inserting a receiver is immediate
> > regarding the loading of the external modules.
> >
> > So my assumption is that when the device gets detected at boot, usbhid
> > gets loaded by the kernel event, which in turns attempts to call
> > __request_module, but the modprobe can't be fulfilled because it's
> > already waiting for the initial usbhid modprobe to finish.
> >
> > Still don't know how to solve that, but I thought I should share.
>
> Hmm, we may be hitting the scenario described in the big comment
> around line 3500 of kernel/module.c.

Well, we are not locking during do_init_module(), but in waiting for
the completion of request_module(). So as I read the trace, we wait
for userspace to call/terminate modprobe.

>
> But I'm not sure that is what is happening here.
>
> Maybe you can put a WARN_ON(1) in request_module and look at the
> backtrace ?  That may help to figure out what is going on; or
> alternatively it might help to find some way to detect this and
> if it happens skip the request_module...

Ftrace is much easier to deal with:
---
/sys/kernel/debug/tracing # cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 4/4   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
        modprobe-23236 [003] ....  9191.880917: __request_module
<-hid_add_device
        modprobe-23236 [003] ....  9191.880937: <stack trace>
 => 0xffffffffc0851061
 => __request_module
 => hid_add_device
 => usbhid_probe
 => usb_probe_interface
 => really_probe
 => driver_probe_device
 => device_driver_attach
 => __driver_attach
 => bus_for_each_dev
 => bus_add_driver
 => driver_register
 => usb_register_driver
 => hid_init
 => do_one_initcall
 => do_init_module
 => load_module
 => __do_sys_finit_module
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
        modprobe-23236 [003] ....  9191.892765: __request_module
<-hid_add_device
        modprobe-23236 [003] ....  9191.892784: <stack trace>
 => 0xffffffffc0851061
 => __request_module
 => hid_add_device
 => usbhid_probe
 => usb_probe_interface
 => really_probe
 => driver_probe_device
 => device_driver_attach
 => __driver_attach
 => bus_for_each_dev
 => bus_add_driver
 => driver_register
 => usb_register_driver
 => hid_init
 => do_one_initcall
 => do_init_module
 => load_module
 => __do_sys_finit_module
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
---

This is what happen with the logitech receiver plugged in and while
modprobing usbhid. The modprobe command hangs, and I took the trace
here.
If I hit Ctrl-C, the modprobe loading continue properly.

There is nothing special here in the stack trace, except that there
are 2 calls to request_module here (one for the touchscreen I guess
and one for the logitech receiver).

I have attached the call graph of the same sequence:
wait_for_completion_killable() seems to be the culprit, we are waiting
for userspace to notify it has done calling modprobe.

My idea would be to defer any call to hid_add_device() into a
workqueue and see if that unlocks the situation.

Cheers,
Benjamin

View attachment "function_graph.txt" of type "text/plain" (21638 bytes)