Message-ID: <CAOUHufabSWTZ+cBjXEDTRh61GeALL5b6uh0M76=2ninZP3KAzQ@mail.gmail.com>
Date: Mon, 28 Oct 2024 16:11:52 -0600
From: Yu Zhao <yuzhao@...gle.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Catalin Marinas <catalin.marinas@....com>,
Muchun Song <muchun.song@...ux.dev>, Thomas Gleixner <tglx@...utronix.de>,
Will Deacon <will@...nel.org>, Douglas Anderson <dianders@...omium.org>,
Mark Rutland <mark.rutland@....com>, Nanyong Sun <sunnanyong@...wei.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@...nel.org> wrote:
>
> On Mon, 21 Oct 2024 05:22:16 +0100,
> Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > time, and then reliably resume them when the local CPU exits critical
> > sections that preclude the execution of remote CPUs.
> >
> > A typical example of such critical sections is BBM on kernel PTEs.
> > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason:
> >
> > This is deemed UNPREDICTABLE by the Arm architecture without a
> > break-before-make sequence (make the PTE invalid, TLBI, write the
> > new valid PTE). However, such sequence is not possible since the
> > vmemmap may be concurrently accessed by the kernel.
> >
> > Supporting BBM on kernel PTEs is one of the approaches that can make
> > HVO theoretically safe on arm64.
>
> Is the safety only theoretical? I would have expected that we'd use an
> approach that is absolutely rock-solid.
We've been trying to construct a repro against the original HVO
(missing BBM), but so far no success. Hopefully a repro does exist,
and then we'd be able to demonstrate the effectiveness of this series,
which is only theoretical at the moment.
> > Note that it is still possible for the paused CPUs to perform
> > speculative translations. Such translations would cause spurious
> > kernel PFs, which should be properly handled by
> > is_spurious_el1_translation_fault().
>
> Speculative translation faults are never reported, that'd be a CPU
> bug.
Right, I meant to say "speculative accesses that cause translations".
> *Spurious* translation faults can be reported if the CPU doesn't
> implement FEAT_ETS2, for example, and that has to do with the ordering
> of memory access wrt page-table walking for the purpose of translations.
Just want to make sure I fully understand: after the local CPU issues
the TLBI (followed by DSB & ISB), FEAT_ETS2 would act like a DSB & ISB
on the remote CPUs when they perform the invalidation, and therefore
speculative accesses on the remote CPUs are ordered as well.
Also, I'm assuming that IPIs on remote CPUs don't act as a full
barrier, and whatever they interrupt can still be speculatively
executed, even though the IPI handler itself doesn't access the
vmemmap area undergoing BBM. Is this correct?
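
For concreteness, the window I'm asking about looks roughly like this
(hvo_remap_pte() is a made-up caller, not code from this series; the
locking just mirrors the lockdep asserts in pause_remote_cpus()):

	static void hvo_remap_pte(unsigned long addr, pte_t *ptep, pte_t new_pte)
	{
		cpus_read_lock();
		preempt_disable();
		/* Remote CPUs spin in the pseudo-NMI handler from here on. */
		pause_remote_cpus();

		/* Break: paused CPUs can still perform speculative translations. */
		pte_clear(&init_mm, addr, ptep);
		/* DSB; broadcast TLBI; DSB; ISB */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/* Make: install the new mapping. */
		set_pte_at(&init_mm, addr, ptep, new_pte);

		resume_remote_cpus();
		preempt_enable();
		cpus_read_unlock();
	}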
> > Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> > ---
> > arch/arm64/include/asm/smp.h | 3 ++
> > arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> > 2 files changed, 88 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > index 2510eec026f7..cffb0cfed961 100644
> > --- a/arch/arm64/include/asm/smp.h
> > +++ b/arch/arm64/include/asm/smp.h
> > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> > extern void crash_smp_send_stop(void);
> > extern bool smp_crash_stop_failed(void);
> >
> > +void pause_remote_cpus(void);
> > +void resume_remote_cpus(void);
> > +
> > #endif /* ifndef __ASSEMBLY__ */
> >
> > #endif /* ifndef __ASM_SMP_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 3b3f6b56e733..68829c6de1b1 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> > static int nr_ipi __ro_after_init = NR_IPI;
> > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> >
> > -static bool crash_stop;
> > +enum {
> > + SEND_STOP = BIT(0),
> > + CRASH_STOP = BIT(1),
> > +};
> > +
> > +static unsigned long stop_in_progress;
> >
> > static void ipi_setup(int cpu);
> >
> > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > #endif
> > }
> >
> > +static DEFINE_SPINLOCK(cpu_pause_lock);
>
> PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> you are dealing with kernel mappings?
Right, it should be a raw spinlock -- the caller disables preemption,
which, as you said, is required when dealing with kernel mappings.
> > +static cpumask_t paused_cpus;
> > +static cpumask_t resumed_cpus;
> > +
> > +static void pause_local_cpu(void)
> > +{
> > + int cpu = smp_processor_id();
> > +
> > + cpumask_clear_cpu(cpu, &resumed_cpus);
> > + /*
> > + * Paired with pause_remote_cpus() to confirm that this CPU not only
> > + * will be paused but also can be reliably resumed.
> > + */
> > + smp_wmb();
> > + cpumask_set_cpu(cpu, &paused_cpus);
> > + /* paused_cpus must be set before waiting on resumed_cpus. */
> > + barrier();
>
> I'm not sure what this is trying to enforce. Yes, the compiler won't
> reorder the set and the test.
Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
contain a compiler barrier?
My understanding is that the compiler is free to reorder the set and
the test on those two independent variables, turning it into this:

	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();
	cpumask_set_cpu(cpu, &paused_cpus);

So the CPU that sent the IPI would keep waiting for paused_cpus to be
set, and this CPU would keep waiting for resumed_cpus to be set, which
would end up in a deadlock.
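
For contrast, the order the patch intends -- barrier() only constrains
the compiler here; whether the CPU also preserves the order is the
separate question below:

	cpumask_set_cpu(cpu, &paused_cpus);
	/* Compile-time only: keep the set before the test below. */
	barrier();
	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();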
> But your comment seems to indicate that you also need to make sure
> the CPU preserves that ordering, and short of a DMB, the test below
> could be reordered.
If this CPU reorders the set and test like above, it wouldn't be a
problem because the set would eventually appear on the other CPU that
sent the IPI.
> > + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > + cpu_relax();
> > + /* A typical example for sleep and wake-up functions. */
>
> I'm not sure this is "typical",...
Sorry, this full barrier isn't needed. Apparently I didn't properly
remove it after the previous attempt, which used wfe()/sev() to make
this function the sleeper for resume_remote_cpus() to wake up.
> > + smp_mb();
> > + cpumask_clear_cpu(cpu, &paused_cpus);
> > +}
> > +
> > +void pause_remote_cpus(void)
> > +{
> > + cpumask_t cpus_to_pause;
> > +
> > + lockdep_assert_cpus_held();
> > + lockdep_assert_preemption_disabled();
> > +
> > + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
>
> This bitmap is manipulated outside of your cpu_pause_lock. What
> guarantees you can't have two CPUs stepping on each other here?
Do you mean cpus_to_pause? If so, that's a local bitmap.
> > +
> > + spin_lock(&cpu_pause_lock);
> > +
> > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > +
> > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > +
> > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > + cpu_relax();
>
> This can be a lot of things to compare, specially that you are
> explicitly mentioning large systems. Why can't this be implemented as
> a counter instead?
Agreed - that'd be sufficient and simpler.
> Overall, this looks like stop_machine() in disguise. Why can't this
> use the existing infrastructure?
This came up during the previous discussion [1]. There are
similarities. The main concern with reusing stop_machine() (or part of
its current implementation) is that it may not meet the performance
requirements. Refactoring might be possible; however, it seems to me
(after checking the code again) that such a refactoring is unlikely to
end up with a cleaner or simpler codebase, especially after I got rid
of the CPU masks, as you suggested.
[1] https://lore.kernel.org/all/ZbKjHHeEdFYY1xR5@arm.com/

Here is the updated version, with a counter and a raw spinlock:
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..b672af2441a3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -917,6 +922,64 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
 #endif
 }
 
+static DEFINE_RAW_SPINLOCK(cpu_pause_lock);
+static bool __cacheline_aligned_in_smp cpu_paused;
+static atomic_t __cacheline_aligned_in_smp nr_cpus_paused;
+
+static void pause_local_cpu(void)
+{
+	atomic_inc(&nr_cpus_paused);
+
+	while (READ_ONCE(cpu_paused))
+		cpu_relax();
+
+	atomic_dec(&nr_cpus_paused);
+}
+
+void pause_remote_cpus(void)
+{
+	cpumask_t cpus_to_pause;
+	int nr_cpus_to_pause = num_online_cpus() - 1;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	if (!nr_cpus_to_pause)
+		return;
+
+	cpumask_copy(&cpus_to_pause, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WARN_ON_ONCE(cpu_paused);
+	WARN_ON_ONCE(atomic_read(&nr_cpus_paused));
+
+	cpu_paused = true;
+
+	smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
+
+	while (atomic_read(&nr_cpus_paused) != nr_cpus_to_pause)
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+	if (!cpu_paused)
+		return;
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WRITE_ONCE(cpu_paused, false);
+
+	while (atomic_read(&nr_cpus_paused))
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
 static void arm64_backtrace_ipi(cpumask_t *mask)
 {
 	__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);