[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200818135045.GA495837@kroah.com>
Date: Tue, 18 Aug 2020 15:50:45 +0200
From: Greg KH <gregkh@...uxfoundation.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Michal Hocko <mhocko@...nel.org>,
Linux-MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Michal Hocko <mhocko@...e.com>
Subject: Re: [RFC PATCH] mm: silence soft lockups from unlock_page
On Wed, Aug 05, 2020 at 10:46:12PM -0700, Hugh Dickins wrote:
> On Mon, 27 Jul 2020, Greg KH wrote:
> >
> > Linus just pointed me at this thread.
> >
> > If you could run:
> > echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> > and run the same workload to see if anything shows up in the log when
> > xhci crashes, that would be great.
>
> Thanks, I tried that, and indeed it did have a story to tell:
>
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> a very large number of lines like the above, then
> Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
> // Ding dong!
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> Stopped on No-op or Link TRB for slot 1 ep 0
> xhci_drop_endpoint called for udev 000000005bc07fa6
> drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
> add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
> xhci_check_bandwidth called for udev 000000005bc07fa6
> // Ding dong!
> Successful Endpoint Configure command
> Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
> // Ding dong!
> Stopped on No-op or Link TRB for slot 1 ep 2
> Removing canceled TD starting at 0x0 (dma).
> list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
> but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> RIP: 0010:__list_del_entry_valid+0x8e/0xb0
> Call Trace:
> <IRQ>
> handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
> xhci_irq+0x242/0x1ea0 [xhci_hcd]
> xhci_msi_irq+0x11/0x20 [xhci_hcd]
> __handle_irq_event_percpu+0x48/0x2c0
> handle_irq_event_percpu+0x32/0x80
> handle_irq_event+0x4a/0x80
> handle_edge_irq+0xd8/0x1b0
> handle_irq+0x2b/0x50
> do_IRQ+0xb6/0x1c0
> common_interrupt+0x90/0x90
> </IRQ>
>
> Info provided for your interest, not expecting any response.
> The list_del info in there is non-standard, from a patch of mine:
> I find hashed addresses in debug output less than helpful.
Thanks for this, that is really odd.
> >
> > Although if you are using an "older version" of the driver, there's not
> > much I can suggest except update to a newer one :)
>
> Yes, I was reluctant to post any info, since really the ball is at our
> end of the court, not yours. I did have a go at bringing in the latest
> xhci driver instead, but quickly saw that was not a sensible task for
> me. And I did scan the git log of xhci changes (especially xhci-ring.c
> changes): thought I saw a likely relevant and easily applied fix commit,
> but in fact it made no difference here.
>
> I suspect it's in part a hardware problem, but driver not recovering
> correctly. I've replaced the machine (but also noticed that the same
> crash has occasionally been seen on other machines). I'm sure it has
> no relevance to this unlock_page() thread, though it's quite possible
> that it's triggered under stress, and Linus's changes allowed greater
> stress.
I will be willing to blame hardware problems for this as well, but will
save this report in case something else shows up in the future, thanks!
greg k-h
Powered by blists - more mailing lists