Message-ID: <20201119125357.GA2084963@elver.google.com>
Date: Thu, 19 Nov 2020 13:53:57 +0100
From: Marco Elver <elver@...gle.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Anders Roxell <anders.roxell@...aro.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Alexander Potapenko <glider@...gle.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Jann Horn <jannh@...gle.com>,
Mark Rutland <mark.rutland@....com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
kasan-dev <kasan-dev@...glegroups.com>, rcu@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
Tejun Heo <tj@...nel.org>,
Lai Jiangshan <jiangshanlai@...il.com>
Subject: Re: [PATCH] kfence: Avoid stalling work queue task without allocations

On Wed, Nov 18, 2020 at 03:38PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 18, 2020 at 11:56:21PM +0100, Marco Elver wrote:
> > [...]
> > I think I figured out one piece of the puzzle. Bisection keeps pointing
> > me at some -rcu merge commit, which kept throwing me off. Nor did it
> > help that reproduction is a bit flaky. However, I think there are 2
> > independent problems, but the manifestation of 1 problem triggers the
> > 2nd problem:
> >
> > 1. problem: slowed forward progress (workqueue lockup / RCU stall reports)
> >
> > 2. problem: DEADLOCK which causes complete system lockup
> >
> > | ...
> > | CPU0
> > | ----
> > | lock(rcu_node_0);
> > | <Interrupt>
> > | lock(rcu_node_0);
> > |
> > | *** DEADLOCK ***
> > |
> > | 1 lock held by event_benchmark/105:
> > | #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: print_other_cpu_stall kernel/rcu/tree_stall.h:493 [inline]
> > | #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: check_cpu_stall kernel/rcu/tree_stall.h:652 [inline]
> > | #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_pending kernel/rcu/tree.c:3752 [inline]
> > | #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_sched_clock_irq+0x428/0xd40 kernel/rcu/tree.c:2581
> > | ...
> >
> > Problem 2 can with reasonable confidence (5 trials) be fixed by reverting:
> >
> > rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
> >
> > At which point the system always boots to user space -- albeit with a
> > bunch of warnings still (attached). The supposed "good" version doesn't
> > end up with all those warnings deterministically, so I couldn't say if
> > the warnings are expected due to recent changes or not (Arm64 QEMU
> > emulation, 1 CPU, and lots of debugging tools on).
> >
> > Does any of that make sense?
>
> Marco, it makes all too much sense! :-/
>
> Does the patch below help?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 444ef3bbd0f243b912fdfd51f326704f8ee872bf
> Author: Peter Zijlstra <peterz@...radead.org>
> Date: Sat Aug 29 10:22:24 2020 -0700
>
> sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled

My assumption is that this is a replacement for "rcu: Don't invoke
try_invoke_on_locked_down_task() with irqs disabled", right?

With the same test setup, it gives the same result as only reverting
"rcu: Don't invoke...": there are still a bunch of workqueue lockup
warnings and RCU stall warnings, but the system boots to user space. I
attached a log. If the warnings are expected (are they?), then it looks
fine to me.

(And just in case: with "rcu: Don't invoke..." and "sched/core:
Allow..." both applied I still get DEADLOCKs -- but that's probably
expected.)

Thanks,
-- Marco

View attachment "log" of type "text/plain" (7314 bytes)