[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO-hwJ+ZBuXtuk+i1Q9DwO=uLXDh4oToQyMWhK-8t+ZTS-jUjA@mail.gmail.com>
Date: Tue, 4 Jun 2019 14:25:07 +0200
From: Benjamin Tissoires <benjamin.tissoires@...hat.com>
To: Hans de Goede <hdegoede@...hat.com>
Cc: Jiri Kosina <jikos@...nel.org>,
Dave Hansen <dave.hansen@...el.com>,
"open list:HID CORE LAYER" <linux-input@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: hid-related 5.2-rc1 boot hang
On Tue, Jun 4, 2019 at 12:50 PM Hans de Goede <hdegoede@...hat.com> wrote:
>
> Hi,
>
> On 04-06-19 12:08, Benjamin Tissoires wrote:
> > On Tue, Jun 4, 2019 at 9:51 AM Benjamin Tissoires
> > <benjamin.tissoires@...hat.com> wrote:
> >>
> >> On Mon, Jun 3, 2019 at 4:17 PM Hans de Goede <hdegoede@...hat.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 03-06-19 15:55, Benjamin Tissoires wrote:
> >>>> On Mon, Jun 3, 2019 at 11:51 AM Hans de Goede <hdegoede@...hat.com> wrote:
> >>>>>
> >>>>> Hi Again,
> >>>>>
> >>>>> On 03-06-19 11:11, Hans de Goede wrote:
> >>>>> <snip>
> >>>>>
> >>>>>>> not sure about the rest of logitech issues yet) next week.
> >>>>>>
> >>>>>> The main problem seems to be the request_module patches. Although I also
> >>>>
> >>>> Can't we use request_module_nowait() instead, and set a reasonable
> >>>> timeout that we detect only once to check if userspace is compatible:
> >>>>
> >>>> In pseudo-code:
> >>>> if (!request_module_checked) {
> >>>> request_module_nowait(name);
> >>>> use_request_module = wait_event_timeout(wq,
> >>>> first_module_loaded, 10 seconds in jiffies);
> >>>> request_module_checked = true;
> >>>> } else if (use_request_module) {
> >>>> request_module(name);
> >>>> }
> >>>
> >>> Well looking at the just attached dmesg , the modprobe
> >>> when triggered by udev from userspace succeeds in about
> >>> 0.5 seconds, so it seems that the modprobe hangs happens
> >>> when called from within the kernel rather then from within
> >>> userspace.
> >>>
> >>> What I do not know if is the hang is inside userspace, or
> >>> maybe it happens when modprobe calls back into the kernel,
> >>> if the hang happens when modprobe calls back into the kernel,
> >>> then other modprobes (done from udev) likely will hang too
> >>> since I think only 1 modprobe can happen at a time.
> >>>
> >>> I really wish we knew what distinguished working systems
> >>> from non working systems :|
> >>>
> >>> I cannot find a common denominator; other then the systems
> >>> are not running Fedora. So far we've reports from both Ubuntu 16.04
> >>> and Tumbleweed, so software version wise these 2 are wide apart.
> >>
> >> I am trying to reproduce the lock locally, and installed an opensuse
> >> Tumbleweed in a VM. When forwarding a Unifying receiver to the VM, I
> >> do not see the lock with either my vanilla compiled kernel and the rpm
> >> found in http://download.opensuse.org/repositories/Kernel:/HEAD/standard/x86_64/
> >>
> >> Next step is install Tumbleweed on bare metal, but I do not see how
> >> this could introduce a difference (maybe USB2 vs 3).
> >
> > Making progress here.
> >
> > The difference between Ubuntu/Tumbleweed and Fedora: usbhid is shipped
> > as a module while in Fedora usbhid is included in the kernel.
> >
> > If I rmmod hid_* and usbhid, then modprobe usbhid, the command hangs
> > for 3 minutes.
> > If usbhid is already loaded, inserting a receiver is immediate
> > regarding the loading of the external modules.
> >
> > So my assumption is that when the device gets detected at boot, usbhid
> > gets loaded by the kernel event, which in turns attempts to call
> > __request_module, but the modprobe can't be fulfilled because it's
> > already waiting for the initial usbhid modprobe to finish.
> >
> > Still don't know how to solve that, but I thought I should share.
>
> Hmm, we may be hitting the scenario described in the big comment
> around line 3500 of kernel/module.c.
Well, we are not locking during do_init_module(), but in waiting for
the completion of request_module(). So as I read the trace, we wait
for userspace to call/terminate modprobe.
>
> But I'm not sure that is what is happening here.
>
> Maybe you can put a WARN_ON(1) in request_module and look at the
> backtrace ? That may help to figure out what is going on; or
> alternatively it might help to find some way to detect this and
> if it happens skip the request_module...
Ftrace is much easier to deal with:
---
/sys/kernel/debug/tracing # cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 4/4 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
modprobe-23236 [003] .... 9191.880917: __request_module
<-hid_add_device
modprobe-23236 [003] .... 9191.880937: <stack trace>
=> 0xffffffffc0851061
=> __request_module
=> hid_add_device
=> usbhid_probe
=> usb_probe_interface
=> really_probe
=> driver_probe_device
=> device_driver_attach
=> __driver_attach
=> bus_for_each_dev
=> bus_add_driver
=> driver_register
=> usb_register_driver
=> hid_init
=> do_one_initcall
=> do_init_module
=> load_module
=> __do_sys_finit_module
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
modprobe-23236 [003] .... 9191.892765: __request_module
<-hid_add_device
modprobe-23236 [003] .... 9191.892784: <stack trace>
=> 0xffffffffc0851061
=> __request_module
=> hid_add_device
=> usbhid_probe
=> usb_probe_interface
=> really_probe
=> driver_probe_device
=> device_driver_attach
=> __driver_attach
=> bus_for_each_dev
=> bus_add_driver
=> driver_register
=> usb_register_driver
=> hid_init
=> do_one_initcall
=> do_init_module
=> load_module
=> __do_sys_finit_module
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
---
This is what happen with the logitech receiver plugged in and while
modprobing usbhid. The modprobe command hangs, and I took the trace
here.
If I hit Ctrl-C, the modprobe loading continue properly.
There is nothing special here in the stack trace, except that there
are 2 calls to request_module here (one for the touchscreen I guess
and one for the logitech receiver).
I have attached the call graph of the same sequence:
wait_for_completion_killable() seems to be the culprit, we are waiting
for userspace to notify it has done calling modprobe.
My idea would be to defer any call to hid_add_device() into a
workqueue and see if that unlocks the situation.
Cheers,
Benjamin
View attachment "function_graph.txt" of type "text/plain" (21638 bytes)
Powered by blists - more mailing lists