Message-ID: <efd2475c-3748-48cc-9918-1cb305f3f581@arm.com>
Date: Tue, 5 Mar 2024 21:07:44 +0000
From: "levi.yun" <yeoreum.yun@....com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
catalin.marinas@....com, will@...nel.org, mark.rutland@....com,
peterz@...radead.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
nd@....com, stable@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>
Subject: Re: [PATCH] arm64/mm: Add memory barrier for mm_cid
Hi Mathieu!
On 05/03/2024 20:01, Mathieu Desnoyers wrote:
> On 2024-03-05 09:53, levi.yun wrote:
>> Currently arm64's switch_mm() doesn't always have an smp_mb()
>> which the core scheduler code has depended upon since commit:
>>
>> commit 223baf9d17f25 ("sched: Fix performance regression
>> introduced by mm_cid")
>>
>> If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
>> can unset the actively used cid when it fails to observe the active
>> task after it sets lazy_put.
>>
>> Adding an smp_mb() in arm64's check_and_switch_context()
>> guarantees that the active task is observed after
>> sched_mm_cid_remote_clear() succeeds in setting lazy_put.
>
> This comment from the original implementation of membarrier
> MEMBARRIER_CMD_PRIVATE_EXPEDITED states that the original need from
> membarrier was to have a full barrier between storing to rq->curr and
> return to userspace:
>
> commit 22e4ebb9758 ("membarrier: Provide expedited private command")
>
> commit message:
>
> * Our TSO archs can do RELEASE without being a full barrier. Look at
>   x86 spin_unlock() being a regular STORE for example. But for those
>   archs, all atomics imply smp_mb and all of them have atomic ops in
>   switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a
>   full barrier.
> * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
>   the rest does indeed do smp_mb(), so there the spin_unlock() is a
>   full barrier and we're good.
> * ARM64 has a very heavy barrier in switch_to(), which suffices.
> * PPC just removed its barrier from switch_to(), but appears to be
>   talking about adding something to switch_mm(). So add a
>   smp_mb__after_unlock_lock() for now, until this is settled on the
>   PPC side.
>
> associated code:
>
> + /*
> + * The membarrier system call requires each architecture
> + * to have a full memory barrier after updating
> + * rq->curr, before returning to user-space. For TSO
> + * (e.g. x86), the architecture must provide its own
> + * barrier in switch_mm(). For weakly ordered machines
> + * for which spin_unlock() acts as a full memory
> + * barrier, finish_lock_switch() in common code takes
> + * care of this barrier. For weakly ordered machines for
> + * which spin_unlock() acts as a RELEASE barrier (only
> + * arm64 and PowerPC), arm64 has a full barrier in
> + * switch_to(), and PowerPC has
> + * smp_mb__after_unlock_lock() before
> + * finish_lock_switch().
> + */
>
> Which got updated to this by
>
> commit 306e060435d ("membarrier: Document scheduler barrier
> requirements")
>
> /*
> * The membarrier system call requires each architecture
> * to have a full memory barrier after updating
> + * rq->curr, before returning to user-space.
> + *
> + * Here are the schemes providing that barrier on the
> + * various architectures:
> + * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> + * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> + * - finish_lock_switch() for weakly-ordered
> + * architectures where spin_unlock is a full barrier,
> + * - switch_to() for arm64 (weakly-ordered, spin_unlock
> + * is a RELEASE barrier),
> */
>
> However, rseq mm_cid has stricter requirements: the barrier needs to be
> issued between store to rq->curr and switch_mm_cid(), which happens
> earlier than:
>
> - spin_unlock(),
> - switch_to().
>
> So it's fine when the architecture switch_mm happens to have that barrier
> already, but less so when the architecture only provides the full barrier
> in switch_to() or spin_unlock().
>
> The issue is therefore not specific to arm64, it's actually a bug in the
> rseq switch_mm_cid() implementation. All architectures that don't have
> memory barriers in switch_mm(), but rather have the full barrier either
> in finish_lock_switch() or switch_to(), have them too late for the needs
> of switch_mm_cid().
Thanks for the detailed explanation!
>
> I would recommend one of three approaches here:
>
> A) Add smp_mb() in switch_mm_cid() for all architectures that lack that
> barrier in switch_mm().
>
> B) Figure out if we can move switch_mm_cid() further down in the
> scheduler without breaking anything (within switch_to(), at the very
> end of finish_lock_switch() for instance). I'm not sure we can do that
> though because switch_mm_cid() touches the "prev" which is tricky after
> switch_to().
>
> C) Add barriers in switch_mm() within all architectures that are
> missing it.
>
> Thoughts ?
IMHO, A) looks good to me.
My concern with B) is this: if you assume that spin_unlock() for rq->lock
is a full memory barrier, I'm not sure that holds for the architectures
that use queued_spin_unlock(). Looking at queued_spin_unlock()'s
implementation, it uses smp_store_release().
But going by the MULTICOPY ATOMICITY section of memory-barriers.txt,
if smp_mb__after_atomic() is implemented with smp_mb(), the observation
might fail.
Am I wrong?
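In case it helps the discussion, here is how I picture the ordering that
option A has to provide. It is only a user-space sketch, assuming C11
atomics as stand-ins for the kernel primitives: atomic_thread_fence() plays
the role of smp_mb(), and curr_active, lazy_put and all function names are
made up for illustration, not kernel code. One side stores "task is active"
and then loads lazy_put; the other side stores lazy_put and then loads
"task is active". With a full fence on both sides, at least one of them
must see the other's store.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins, not kernel symbols. */
static atomic_int curr_active;  /* models the store to rq->curr          */
static atomic_int lazy_put;     /* models the lazy_put flag on the cid   */
static int saw_lazy_put;        /* what the context-switch side observed */
static int saw_active;          /* what the remote-clear side observed   */

/* Context-switch side: publish the new task, then check for lazy_put. */
static void *switch_side(void *arg)
{
    (void)arg;
    atomic_store_explicit(&curr_active, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* stands in for smp_mb() */
    saw_lazy_put = atomic_load_explicit(&lazy_put, memory_order_relaxed);
    return NULL;
}

/* Remote-clear side: set lazy_put, then check for an active task. */
static void *remote_clear_side(void *arg)
{
    (void)arg;
    atomic_store_explicit(&lazy_put, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* stands in for smp_mb() */
    saw_active = atomic_load_explicit(&curr_active, memory_order_relaxed);
    return NULL;
}

/* Run one round; returns 1 if both sides missed each other's store. */
static int run_once(void)
{
    pthread_t a, b;
    int rc;

    atomic_store(&curr_active, 0);
    atomic_store(&lazy_put, 0);

    rc = pthread_create(&a, NULL, switch_side, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, remote_clear_side, NULL);
    assert(rc == 0);
    (void)rc;

    /* Joining makes the plain int results safe to read here. */
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    return saw_lazy_put == 0 && saw_active == 0;
}

/* Count the rounds in which both sides missed the other's store. */
static int count_both_missed(int rounds)
{
    int bad = 0;

    for (int i = 0; i < rounds; i++)
        bad += run_once();
    return bad;
}
```

Built with -pthread and called from a small main(), count_both_missed()
should return 0 as long as both fences are present, since C11 forbids the
"both missed" outcome in that case. Without the fence on the switch side
that outcome is permitted on a weakly ordered machine, which is how I
understand the mm_cid problem.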
Many thanks!