[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8634ke3dn4.wl-maz@kernel.org>
Date: Tue, 29 Oct 2024 19:36:15 +0000
From: Marc Zyngier <maz@...nel.org>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Catalin Marinas <catalin.marinas@....com>,
Muchun Song <muchun.song@...ux.dev>,
Thomas Gleixner <tglx@...utronix.de>,
Will Deacon <will@...nel.org>,
Douglas Anderson <dianders@...omium.org>,
Mark Rutland <mark.rutland@....com>,
Nanyong Sun <sunnanyong@...wei.com>,
linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
On Mon, 28 Oct 2024 22:11:52 +0000,
Yu Zhao <yuzhao@...gle.com> wrote:
>
> On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Mon, 21 Oct 2024 05:22:16 +0100,
> > Yu Zhao <yuzhao@...gle.com> wrote:
> > >
> > > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > > time, and then reliably resume them when the local CPU exits critical
> > > sections that preclude the execution of remote CPUs.
> > >
> > > A typical example of such critical sections is BBM on kernel PTEs.
> > > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason:
> > >
> > > This is deemed UNPREDICTABLE by the Arm architecture without a
> > > break-before-make sequence (make the PTE invalid, TLBI, write the
> > > new valid PTE). However, such sequence is not possible since the
> > > vmemmap may be concurrently accessed by the kernel.
> > >
> > > Supporting BBM on kernel PTEs is one of the approaches that can make
> > > HVO theoretically safe on arm64.
> >
> > Is the safety only theoretical? I would have expected that we'd use an
> > approach that is absolutely rock-solid.
>
> We've been trying to construct a repro against the original HVO
> (missing BBM), but so far no success. Hopefully a repro does exist,
> and then we'd be able to demonstrate the effectiveness of this series,
> which is only theoretical at the moment.
That wasn't my question.
Just because your HW doesn't show you a failure mode doesn't mean the
issue doesn't exist. Or that someone will eventually make use of the
relaxed aspects of the architecture and turn the behaviour you are
relying on into the perfect memory corruption You absolutely cannot
rely on an implementation-defined behaviour for this stuff.
Hence my question: is your current approach actually safe? In a
non-theoretical manner?
>
> > > Note that it is still possible for the paused CPUs to perform
> > > speculative translations. Such translations would cause spurious
> > > kernel PFs, which should be properly handled by
> > > is_spurious_el1_translation_fault().
> >
> > Speculative translation faults are never reported, that'd be a CPU
> > bug.
>
> Right, I meant to say "speculative accesses that cause translations".
>
> > *Spurious* translation faults can be reported if the CPU doesn't
> > implement FEAT_ETS2, for example, and that has to do with the ordering
> > of memory access wrt page-table walking for the purpose of translations.
>
> Just want to make sure I fully understand: after the local CPU sends
> TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
> remote CPUs when they perform the invalidation, and therefore
> speculative accesses are ordered as well on remote CPUs.
ETS2 has nothing to do with TLBI. It has to do with non-cacheable
(translation, access and address size) faults, and whether an older
update to a translation is visible to a younger memory access without
extra synchronisation. It definitely has nothing to do with remote
CPUs (and the idea of a remote ISB, while cute, doesn't exist).
> Also I'm assuming IPIs on remote CPUs don't act like a full barrier,
> and whatever they interrupt still can be speculatively executed even
> though the IPI hander itself doesn't access the vmemmap area
> undergoing BBM. Is this correct?
Full barrier *for what*? An interrupt is a CSE, which has the effects
described in the ARM ARM, but that's about it.
>
> > > Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> > > ---
> > > arch/arm64/include/asm/smp.h | 3 ++
> > > arch/arm64/kernel/smp.c | 92 +++++++++++++++++++++++++++++++++---
> > > 2 files changed, 88 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > > index 2510eec026f7..cffb0cfed961 100644
> > > --- a/arch/arm64/include/asm/smp.h
> > > +++ b/arch/arm64/include/asm/smp.h
> > > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> > > extern void crash_smp_send_stop(void);
> > > extern bool smp_crash_stop_failed(void);
> > >
> > > +void pause_remote_cpus(void);
> > > +void resume_remote_cpus(void);
> > > +
> > > #endif /* ifndef __ASSEMBLY__ */
> > >
> > > #endif /* ifndef __ASM_SMP_H */
> > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > index 3b3f6b56e733..68829c6de1b1 100644
> > > --- a/arch/arm64/kernel/smp.c
> > > +++ b/arch/arm64/kernel/smp.c
> > > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> > > static int nr_ipi __ro_after_init = NR_IPI;
> > > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> > >
> > > -static bool crash_stop;
> > > +enum {
> > > + SEND_STOP = BIT(0),
> > > + CRASH_STOP = BIT(1),
> > > +};
> > > +
> > > +static unsigned long stop_in_progress;
> > >
> > > static void ipi_setup(int cpu);
> > >
> > > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > > #endif
> > > }
> > >
> > > +static DEFINE_SPINLOCK(cpu_pause_lock);
> >
> > PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> > you are dealing with kernel mappings?
>
> Right, it should be a raw spinlock -- the caller disabled preemption,
> which as you said is required when dealing with the kernel mappings.
>
> > > +static cpumask_t paused_cpus;
> > > +static cpumask_t resumed_cpus;
> > > +
> > > +static void pause_local_cpu(void)
> > > +{
> > > + int cpu = smp_processor_id();
> > > +
> > > + cpumask_clear_cpu(cpu, &resumed_cpus);
> > > + /*
> > > + * Paired with pause_remote_cpus() to confirm that this CPU not only
> > > + * will be paused but also can be reliably resumed.
> > > + */
> > > + smp_wmb();
> > > + cpumask_set_cpu(cpu, &paused_cpus);
> > > + /* paused_cpus must be set before waiting on resumed_cpus. */
> > > + barrier();
> >
> > I'm not sure what this is trying to enforce. Yes, the compiler won't
> > reorder the set and the test.
>
> Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already
> contain a compiler barrier?
>
> My understanding is that the compiler is free to reorder the set and
> test on those two independent variables, and make it like this:
>
> while (!cpumask_test_cpu(cpu, &resumed_cpus))
> cpu_relax();
> cpumask_set_cpu(cpu, &paused_cpus);
>
> So the CPU sent the IPI would keep waiting on paused_cpus being set,
> and this CPU would keep waiting on resumed_cpus being set, which would
> end up with a deadlock.
>
> > But your comment seems to indicate that
> > also need to make sure the CPU preserves that ordering
> > and short of a
> > DMB, the test below could be reordered.
>
> If this CPU reorders the set and test like above, it wouldn't be a
> problem because the set would eventually appear on the other CPU that
> sent the IPI.
I don't get it. You are requiring that the compiler doesn't reorder
things, but you're happy that the CPU reorders the same things. Surely
this leads to the same outcome...
>
> > > + while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > > + cpu_relax();
> > > + /* A typical example for sleep and wake-up functions. */
> >
> > I'm not sure this is "typical",...
>
> Sorry, this full barrier isn't needed. Apparently I didn't properly
> fix this from the previous attempt to use wfe()/sev() to make this
> function the sleeper for resume_remote_cpus() to wake up.
>
> > > + smp_mb();
> > > + cpumask_clear_cpu(cpu, &paused_cpus);
> > > +}
> > > +
> > > +void pause_remote_cpus(void)
> > > +{
> > > + cpumask_t cpus_to_pause;
> > > +
> > > + lockdep_assert_cpus_held();
> > > + lockdep_assert_preemption_disabled();
> > > +
> > > + cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
> >
> > This bitmap is manipulated outside of your cpu_pause_lock. What
> > guarantees you can't have two CPUs stepping on each other here?
>
> Do you mean cpus_to_pause? If so, that's a local bitmap.
Ah yes, sorry.
>
> > > +
> > > + spin_lock(&cpu_pause_lock);
> > > +
> > > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > > +
> > > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > > +
> > > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > > + cpu_relax();
> >
> > This can be a lot of things to compare, specially that you are
> > explicitly mentioning large systems. Why can't this be implemented as
> > a counter instead?
>
> Agreed - that'd be sufficient and simpler.
>
> > Overall, this looks like stop_machine() in disguise. Why can't this
> > use the existing infrastructure?
>
> This came up during the previous discussion [1]. There are
> similarities. The main concern with reusing stop_machine() (or part of
> its current implementation) is that it may not meet the performance
> requirements. Refactoring might be possible, however, it seems to me
> (after checking the code again) that such a refactoring unlikely ends
> up cleaner or simpler codebase, especially after I got rid of CPU
> masks, as you suggested.
I would have expected to see an implementation handling faults during
the break window. IPIs are slow, virtualised extremely badly, and
pNMIs are rarely enabled, due to the long history of issues with it.
At this stage, I'm not sure what you have here is any better than
stop_machine(). On the other hand, faults should be the rarest thing,
M.
--
Without deviation from the norm, progress is not possible.
Powered by blists - more mailing lists