Message-ID: <b0a3e152-22bb-4502-a0a0-4b2513bfbec8@efficios.com>
Date: Tue, 5 Mar 2024 15:01:55 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "levi.yun" <yeoreum.yun@....com>, catalin.marinas@....com,
will@...nel.org, mark.rutland@....com, peterz@...radead.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
nd@....com, stable@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>
Subject: Re: [PATCH] arm64/mm: Add memory barrier for mm_cid
On 2024-03-05 09:53, levi.yun wrote:
> Currently arm64's switch_mm() doesn't always have an smp_mb()
> which the core scheduler code has depended upon since commit:
>
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
>
> If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
> can unset the actively used cid when it fails to observe the active task after it
> sets lazy_put.
>
> Adding an smp_mb() in arm64's check_and_switch_context()
> guarantees that the active task is observed after sched_mm_cid_remote_clear()
> succeeds in setting lazy_put.
The commit message and comment from the original implementation of
membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED state that membarrier
originally needed a full barrier between the store to rq->curr and the
return to userspace:
commit 22e4ebb9758 ("membarrier: Provide expedited private command")
commit message:
* Our TSO archs can do RELEASE without being a full barrier. Look at
x86 spin_unlock() being a regular STORE for example. But for those
archs, all atomics imply smp_mb and all of them have atomic ops in
switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
barrier.
* From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
the rest does indeed do smp_mb(), so there the spin_unlock() is a full
barrier and we're good.
* ARM64 has a very heavy barrier in switch_to(), which suffices.
* PPC just removed its barrier from switch_to(), but appears to be
talking about adding something to switch_mm(). So add a
smp_mb__after_unlock_lock() for now, until this is settled on the PPC
side.
associated code:
+ /*
+ * The membarrier system call requires each architecture
+ * to have a full memory barrier after updating
+ * rq->curr, before returning to user-space. For TSO
+ * (e.g. x86), the architecture must provide its own
+ * barrier in switch_mm(). For weakly ordered machines
+ * for which spin_unlock() acts as a full memory
+ * barrier, finish_lock_switch() in common code takes
+ * care of this barrier. For weakly ordered machines for
+ * which spin_unlock() acts as a RELEASE barrier (only
+ * arm64 and PowerPC), arm64 has a full barrier in
+ * switch_to(), and PowerPC has
+ * smp_mb__after_unlock_lock() before
+ * finish_lock_switch().
+ */
Which got updated to this by
commit 306e060435d ("membarrier: Document scheduler barrier requirements")
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
+ * rq->curr, before returning to user-space.
+ *
+ * Here are the schemes providing that barrier on the
+ * various architectures:
+ * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+ * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+ * - finish_lock_switch() for weakly-ordered
+ * architectures where spin_unlock is a full barrier,
+ * - switch_to() for arm64 (weakly-ordered, spin_unlock
+ * is a RELEASE barrier),
*/
However, rseq mm_cid has stricter requirements: the barrier needs to be
issued between store to rq->curr and switch_mm_cid(), which happens
earlier than:
- spin_unlock(),
- switch_to().
So it's fine when the architecture's switch_mm() happens to have that barrier
already, but less so when the architecture only provides the full barrier
in switch_to() or spin_unlock().
The issue is therefore not specific to arm64; it's actually a bug in the
rseq switch_mm_cid() implementation. All architectures that don't have
memory barriers in switch_mm(), but rather have the full barrier either in
finish_lock_switch() or switch_to(), issue it too late for the needs of
switch_mm_cid().
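To make the ordering requirement concrete, here is a minimal user-space
sketch of the store-buffering pattern involved, written with C11 atomics
rather than kernel primitives. The names curr_uses_mm, cid_lazy_put,
reset_model(), sched_side() and remote_clear_side() are hypothetical
stand-ins for rq->curr, the lazy-put flag and the two code paths, and
atomic_thread_fence(memory_order_seq_cst) is assumed to play the role of
smp_mb(). It illustrates the pattern only; it is not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel data structures. */
static atomic_int curr_uses_mm;  /* models "rq->curr is a task using this mm" */
static atomic_int cid_lazy_put;  /* models the lazy-put flag on the per-CPU cid */

/* Put the model back into its initial state. */
static void reset_model(void)
{
	atomic_store_explicit(&curr_uses_mm, 0, memory_order_relaxed);
	atomic_store_explicit(&cid_lazy_put, 0, memory_order_relaxed);
}

/*
 * Context-switch side: store to "rq->curr", full barrier, then load the
 * lazy-put flag. Returns true if the lazy-put store was observed.
 */
static bool sched_side(void)
{
	atomic_store_explicit(&curr_uses_mm, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* models smp_mb() */
	return atomic_load_explicit(&cid_lazy_put, memory_order_relaxed) != 0;
}

/*
 * Remote-clear side: store the lazy-put flag, full barrier, then load
 * "rq->curr". Returns true if the active task was observed.
 */
static bool remote_clear_side(void)
{
	atomic_store_explicit(&cid_lazy_put, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* models smp_mb() */
	return atomic_load_explicit(&curr_uses_mm, memory_order_relaxed) != 0;
}
```

With a full fence on both sides, at least one side observes the other's
store when the two run concurrently. If the fence on the context-switch
side only comes later (in switch_to() or spin_unlock()), both loads may
miss, which is the case where the remote clear unsets a cid that is in
active use.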
I would recommend one of three approaches here:
A) Add smp_mb() in switch_mm_cid() for all architectures that lack that
barrier in switch_mm().
B) Figure out if we can move switch_mm_cid() further down in the scheduler
without breaking anything (within switch_to(), at the very end of
finish_lock_switch() for instance). I'm not sure we can do that though
because switch_mm_cid() touches the "prev" which is tricky after
switch_to().
C) Add barriers in switch_mm() within all architectures that are missing them.
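
As a rough user-space illustration of what approach A could look like,
the sketch below issues the extra barrier from generic code unless the
architecture declares that its switch_mm() already implies one. The knob
ARCH_SWITCH_MM_IS_FULL_BARRIER and the helpers model_smp_mb(),
barrier_after_switch_mm() and model_switch_mm_cid() are hypothetical
names made up for this example, and the counter exists only so the
behaviour can be checked:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical per-architecture knob: an architecture whose switch_mm()
 * already implies a full barrier would define this to 1.
 */
#ifndef ARCH_SWITCH_MM_IS_FULL_BARRIER
#define ARCH_SWITCH_MM_IS_FULL_BARRIER 0
#endif

/* Counts barriers added by the generic code (for illustration only). */
static int extra_barriers;

/* Models smp_mb() with a sequentially consistent fence. */
static inline void model_smp_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Issue the barrier only where switch_mm() does not already provide it. */
static inline void barrier_after_switch_mm(void)
{
#if !ARCH_SWITCH_MM_IS_FULL_BARRIER
	model_smp_mb();
	extra_barriers++;
#endif
}

/* Models the entry of switch_mm_cid(): order the rq->curr store first. */
static void model_switch_mm_cid(void)
{
	barrier_after_switch_mm();
	/* The cid hand-over between prev and next would follow here. */
}
```

Architectures that already pay for a barrier in switch_mm() would not
pay twice under this scheme, while the others get the ordering at the
point where switch_mm_cid() needs it.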
Thoughts?
Thanks,
Mathieu
>
> Signed-off-by: levi.yun <yeoreum.yun@....com>
> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> Cc: <stable@...r.kernel.org> # 6.4.x
> Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Catalin Marinas <catalin.marinas@....com>
> Cc: Mark Rutland <mark.rutland@....com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Aaron Lu <aaron.lu@...el.com>
> ---
> I'm really sorry if you got this multiple times.
> I had some problems with the SMTP server...
>
> arch/arm64/mm/context.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 188197590fc9..7a9e8e6647a0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -268,6 +268,11 @@ void check_and_switch_context(struct mm_struct *mm)
> */
> if (!system_uses_ttbr0_pan())
> cpu_switch_mm(mm->pgd, mm);
> +
> + /*
> + * See the comments on switch_mm_cid describing user -> user transition.
> + */
> + smp_mb();
> }
>
> unsigned long arm64_mm_context_get(struct mm_struct *mm)
> --
> LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com