Message-ID: <20201118225621.GA1770130@elver.google.com>
Date: Wed, 18 Nov 2020 23:56:21 +0100
From: Marco Elver <elver@...gle.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Anders Roxell <anders.roxell@...aro.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Alexander Potapenko <glider@...gle.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Jann Horn <jannh@...gle.com>,
Mark Rutland <mark.rutland@....com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
kasan-dev <kasan-dev@...glegroups.com>, rcu@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
Tejun Heo <tj@...nel.org>,
Lai Jiangshan <jiangshanlai@...il.com>
Subject: Re: [PATCH] kfence: Avoid stalling work queue task without allocations
On Tue, Nov 17, 2020 at 10:29AM -0800, Paul E. McKenney wrote:
[...]
> But it would be good to get the kcompactd() people to look at this (not
> immediately seeing who they are in MAINTAINERS). Perhaps preemption is
> disabled somehow and I am failing to see it.
>
> Failing that, maybe someone knows of a way to check for overly long
> timeout handlers.
I think I figured out one piece of the puzzle. Bisection keeps pointing
me at some -rcu merge commit, which kept throwing me off, and it did not
help that reproduction is a bit flaky. However, I think there are 2
independent problems, where the manifestation of problem 1 triggers
problem 2:
1. Slowed forward progress (workqueue lockup / RCU stall reports).
2. A DEADLOCK, which causes a complete system lockup:
| ...
| CPU0
| ----
| lock(rcu_node_0);
| <Interrupt>
| lock(rcu_node_0);
|
| *** DEADLOCK ***
|
| 1 lock held by event_benchmark/105:
| #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: print_other_cpu_stall kernel/rcu/tree_stall.h:493 [inline]
| #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: check_cpu_stall kernel/rcu/tree_stall.h:652 [inline]
| #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_pending kernel/rcu/tree.c:3752 [inline]
| #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_sched_clock_irq+0x428/0xd40 kernel/rcu/tree.c:2581
| ...
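The scenario in the report above (lock(rcu_node_0), then <Interrupt>,
then lock(rcu_node_0) again on the same CPU) can be illustrated with a
small userspace toy model. This is only a sketch under stated
assumptions: ToyCpu, stall_check() and the "deferred"/"handled"/"deadlock"
outcomes are hypothetical names made up for illustration, none of this is
kernel code, and it does not model the commit discussed in this thread.
It only shows why taking a non-recursive lock with interrupts enabled is
a problem when an interrupt handler on the same CPU needs the same lock.

```python
import threading


class ToyCpu:
    """Hypothetical single-CPU model: one non-recursive lock and an irq flag."""

    def __init__(self):
        self.node_lock = threading.Lock()  # stands in for rcu_node_0
        self.irqs_enabled = True
        self.pending_irq = False

    def interrupt(self):
        """Deliver an interrupt whose handler needs node_lock.

        Returns "deferred", "handled", or "deadlock".
        """
        if not self.irqs_enabled:
            # With interrupts disabled the handler cannot run yet.
            self.pending_irq = True
            return "deferred"
        # The handler runs on top of the interrupted code. A real spinlock
        # would spin forever here; a non-blocking acquire lets the model
        # report the deadlock instead of hanging.
        if not self.node_lock.acquire(blocking=False):
            return "deadlock"
        self.node_lock.release()
        return "handled"


def stall_check(cpu, disable_irqs):
    """Hold node_lock, receive an interrupt meanwhile, and report the outcome."""
    if disable_irqs:
        cpu.irqs_enabled = False
    cpu.node_lock.acquire()
    outcome = cpu.interrupt()  # interrupt arrives while the lock is held
    cpu.node_lock.release()
    if disable_irqs:
        cpu.irqs_enabled = True
        if cpu.pending_irq:
            # The deferred interrupt is delivered once irqs are back on.
            cpu.pending_irq = False
            outcome = cpu.interrupt()
    return outcome
```

In this model, stall_check(ToyCpu(), disable_irqs=False) hits the
"deadlock" branch, while disable_irqs=True defers the handler until the
lock has been released.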
With reasonable confidence (5 trials), problem 2 can be fixed by reverting:
rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
At which point the system always boots to user space -- albeit with a
bunch of warnings still (attached). The supposed "good" version doesn't
end up with all those warnings deterministically, so I couldn't say if
the warnings are expected due to recent changes or not (Arm64 QEMU
emulation, 1 CPU, and lots of debugging tools on).
Does any of that make sense?
Thanks,
-- Marco
View attachment "log" of type "text/plain" (8805 bytes)