linux-kernel - Re: [PATCH] arm64/smp: Move rcu_cpu

Message-ID: <ec2de23c04e400266fcf98dfd282da0b173a68c3.camel@redhat.com>
Date:   Thu, 05 Nov 2020 21:15:24 -0500
From:   Qian Cai <cai@...hat.com>
To:     paulmck@...nel.org
Cc:     Will Deacon <will@...nel.org>, catalin.marinas@....com,
        kernel-team@...roid.com, Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

On Thu, 2020-11-05 at 15:28 -0800, Paul E. McKenney wrote:
> On Thu, Nov 05, 2020 at 06:02:49PM -0500, Qian Cai wrote:
> > On Thu, 2020-11-05 at 22:22 +0000, Will Deacon wrote:
> > > On Fri, Oct 30, 2020 at 04:33:25PM +0000, Will Deacon wrote:
> > > > On Wed, 28 Oct 2020 14:26:14 -0400, Qian Cai wrote:
> > > > > The call to rcu_cpu_starting() in secondary_start_kernel() is not early
> > > > > enough in the CPU-hotplug onlining process, which results in lockdep
> > > > > splats as follows:
> > > > > 
> > > > >  WARNING: suspicious RCU usage
> > > > >  -----------------------------
> > > > >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > > > > 
> > > > > [...]
> > > > 
> > > > Applied to arm64 (for-next/fixes), thanks!
> > > > 
> > > > [1/1] arm64/smp: Move rcu_cpu_starting() earlier
> > > >       https://git.kernel.org/arm64/c/ce3d31ad3cac
> > > 
> > > Hmm, this patch has caused a regression in the case that we fail to
> > > online a CPU because it has incompatible CPU features and so we park it
> > > in cpu_die_early(). We now get an endless spew of RCU stalls because the
> > > core will never come online, but is being tracked by RCU. So I'm tempted
> > > to revert this and live with the lockdep warning while we figure out a
> > > proper fix.
> > > 
> > > What's the correct way to undo rcu_cpu_starting(), given that we cannot
> > > invoke the full hotplug machinery here? Is it correct to call
> > > rcutree_dying_cpu() on the bad CPU and then rcutree_dead_cpu() from the
> > > CPU doing cpu_up(), or should we do something else?
> > It looks to me that rcu_report_dead() does the opposite of
> > rcu_cpu_starting(), so lift rcu_report_dead() out of CONFIG_HOTPLUG_CPU
> > and use it there to rewind, Paul?
> 
> Yes, rcu_report_dead() should do the trick.  Presumably the earlier
> online-time CPU-hotplug notifiers are also unwound?
I don't think that is an issue here. cpu_die_early() sets CPU_STUCK_IN_KERNEL,
and then __cpu_up() will see a timeout waiting for the AP to come online and
then deal with CPU_STUCK_IN_KERNEL accordingly. Thus, something like this? I
don't see anything in rcu_report_dead() that depends on CONFIG_HOTPLUG_CPU=y.
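
To illustrate the idea only, here is a user-space toy model (not kernel code)
of why the early-registered CPU needs to be unwound. It assumes RCU's tracking
can be pictured as a single bit mask, loosely after ->qsmaskinitnext. All
toy_* names are hypothetical and exist only for this sketch; the real
rcu_cpu_starting() and rcu_report_dead() do considerably more (locking,
quiescent-state reporting, sequence counts).

```c
#include <stdbool.h>

/* Toy stand-in for an rcu_node's ->qsmaskinitnext mask (NOT kernel code). */
static unsigned long toy_qsmaskinitnext;

/* Loosely models rcu_cpu_starting(): RCU begins tracking this CPU. */
static void toy_cpu_starting(unsigned int cpu)
{
	toy_qsmaskinitnext |= 1UL << cpu;
}

/* Loosely models rcu_report_dead(): RCU stops tracking this CPU. */
static void toy_report_dead(unsigned int cpu)
{
	toy_qsmaskinitnext &= ~(1UL << cpu);
}

/* True if a later grace period would still wait on this CPU. */
static bool toy_rcu_tracks(unsigned int cpu)
{
	return (toy_qsmaskinitnext & (1UL << cpu)) != 0;
}

/*
 * Hypothetical bring-up path: the CPU is registered with RCU early, then
 * fails its feature checks and is parked.  With unwind == false this
 * mirrors the regression; with unwind == true it mirrors the proposal.
 * Returns whether RCU is still tracking the parked CPU afterwards.
 * The cpu argument must be smaller than the bit width of unsigned long.
 */
static bool toy_failed_bringup(unsigned int cpu, bool unwind)
{
	toy_cpu_starting(cpu);
	if (unwind)
		toy_report_dead(cpu);
	/* The real code would enter cpu_park_loop() here and never return. */
	return toy_rcu_tracks(cpu);
}
```

In this model, toy_failed_bringup(cpu, false) leaves the bit set, which
corresponds to the parked CPU being waited on forever (the stall spew), while
toy_failed_bringup(cpu, true) clears it again before parking.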

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 09c96f57818c..10729d2d6084 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -421,6 +421,8 @@ void cpu_die_early(void)
 
 	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
 
+	rcu_report_dead(cpu);
+
 	cpu_park_loop();
 }
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a52f42f64b6..bd04b09b84b3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4077,7 +4077,6 @@ void rcu_cpu_starting(unsigned int cpu)
 	smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
 /*
  * The outgoing function has no further need of RCU, so remove it from
  * the rcu_node tree's ->qsmaskinitnext bit masks.
@@ -4117,6 +4116,7 @@ void rcu_report_dead(unsigned int cpu)
 	rdp->cpu_started = false;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
 /*
  * The outgoing CPU has just passed through the dying-idle state, and we
  * are being invoked from the CPU that was IPIed to continue the offline