linux-kernel - Re: [PATCH] kfence: Avoid stalling work queue task without allocations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201119125357.GA2084963@elver.google.com>
Date:   Thu, 19 Nov 2020 13:53:57 +0100
From:   Marco Elver <elver@...gle.com>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Steven Rostedt <rostedt@...dmis.org>,
        Anders Roxell <anders.roxell@...aro.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Potapenko <glider@...gle.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Jann Horn <jannh@...gle.com>,
        Mark Rutland <mark.rutland@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        kasan-dev <kasan-dev@...glegroups.com>, rcu@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Tejun Heo <tj@...nel.org>,
        Lai Jiangshan <jiangshanlai@...il.com>
Subject: Re: [PATCH] kfence: Avoid stalling work queue task without
 allocations

On Wed, Nov 18, 2020 at 03:38PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 18, 2020 at 11:56:21PM +0100, Marco Elver wrote:
> > [...]
> > I think I figured out one piece of the puzzle. Bisection keeps pointing
> > me at some -rcu merge commit, which kept throwing me off. Nor did it
> > help that reproduction is a bit flaky. However, I think there are 2
> > independent problems, but the manifestation of 1 problem triggers the
> > 2nd problem:
> > 
> > 1. problem: slowed forward progress (workqueue lockup / RCU stall reports)
> > 
> > 2. problem: DEADLOCK which causes complete system lockup
> > 
> > 	| ...
> > 	|        CPU0
> > 	|        ----
> > 	|   lock(rcu_node_0);
> > 	|   <Interrupt>
> > 	|     lock(rcu_node_0);
> > 	| 
> > 	|  *** DEADLOCK ***
> > 	| 
> > 	| 1 lock held by event_benchmark/105:
> > 	|  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: print_other_cpu_stall kernel/rcu/tree_stall.h:493 [inline]
> > 	|  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: check_cpu_stall kernel/rcu/tree_stall.h:652 [inline]
> > 	|  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_pending kernel/rcu/tree.c:3752 [inline]
> > 	|  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_sched_clock_irq+0x428/0xd40 kernel/rcu/tree.c:2581
> > 	| ...
> > 
> > Problem 2 can with reasonable confidence (5 trials) be fixed by reverting:
> > 
> > 	rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
> > 
> > At which point the system always boots to user space -- albeit with a
> > bunch of warnings still (attached). The supposed "good" version doesn't
> > end up with all those warnings deterministically, so I couldn't say if
> > the warnings are expected due to recent changes or not (Arm64 QEMU
> > emulation, 1 CPU, and lots of debugging tools on).
> > 
> > Does any of that make sense?
> 
> Marco, it makes all too much sense!  :-/
> 
> Does the patch below help?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 444ef3bbd0f243b912fdfd51f326704f8ee872bf
> Author: Peter Zijlstra <peterz@...radead.org>
> Date:   Sat Aug 29 10:22:24 2020 -0700
> 
>     sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled

My assumption is that this is a replacement for "rcu: Don't invoke
try_invoke_on_locked_down_task() with irqs disabled", right?

That seems to have the same result (same test setup) as only reverting
"rcu: Don't invoke..." does: still results in a bunch of workqueue
lockup warnings and RCU stall warnings, but boots to user space. I
attached a log. If the warnings are expected (are they?), then it looks
fine to me.

(And just in case: with "rcu: Don't invoke..." and "sched/core:
Allow..." both applied I still get DEADLOCKs -- but that's probably
expected.)

Thanks,
-- Marco

View attachment "log" of type "text/plain" (7314 bytes)