Message-ID: <0a84c0e0-2571-4c7f-82ae-a429f467a16b@efficios.com>
Date: Tue, 28 Nov 2023 13:39:15 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrea Parri <parri.andrea@...il.com>
Cc: paulmck@...nel.org, palmer@...belt.com, paul.walmsley@...ive.com,
aou@...s.berkeley.edu, mmaas@...gle.com, hboehm@...gle.com,
striker@...ibm.com, charlie@...osinc.com, rehn@...osinc.com,
linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] membarrier: riscv: Provide core serializing command

On 2023-11-28 10:13, Andrea Parri wrote:
>> I am concerned about the possibility that this change lacks two barriers in the
>> following scenario:
>>
>> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to
>> another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent
>> membarrier sync-core is done on [CPU 1]:
>>
>> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers
>> associated with these stores.
>>
>> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C]
>> within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes
>> cpu_rq(0)->curr->mm != mm, so it skips the IPI.
>>
>> - This means membarrier relies on switch_mm() to issue the sync-core.
>>
>> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm()
>> may incorrectly skip the sync-core.
>>
>> AFAIU, [C] can be reordered before [A] because there is no barrier between those
>> operations within membarrier. I suspect it can cause the switch_mm() code to skip
>> a needed sync-core.
>>
>> AFAIU, [D] can be reordered before [B] because there is no documented barrier
>> between those operations within the scheduler, which can also cause switch_mm()
>> to skip a needed sync-core.
>>
>> We possibly have a similar scenario for uthread->uthread when the scheduler
>> switches between mm -> !mm.
>>
>> One way to fix this would be to add the following barriers:
>>
>> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in
>> prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
>> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in
>> flush_icache_deferred(), with appropriate comments.
>>
>> Am I missing something ?
>
> Thanks for the detailed analysis.
>
> AFAIU, the following barrier (in membarrier_private_expedited())
>
> /*
> * Matches memory barriers around rq->curr modification in
> * scheduler.
> */
> smp_mb(); /* system call entry is not a mb. */
>
> can serve the purpose of ordering [A] before [C] (to be documented in v2).

Agreed. Yes, it should be documented.

>
> But I agree that [B] and [D] are unordered /missing suitable synchronization.
> Worse, RISC-V has currently no full barrier after [B] and before returning to
> user-space: I'm thinking (inspired by the PowerPC implementation),

If RISC-V currently supports the membarrier private command and lacks the
appropriate smp_mb() in switch_mm(), then it's a bug. This initial patch
should be a "Fix" and fast-tracked as such.

Indeed, looking at how ASIDs are used to implement switch_mm(), it appears
not to require a full smp_mb() at all as long as there are no ASID
rollovers.

>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 217fd4de61342..f63222513076d 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> if (unlikely(prev == next))
> return;
>
> +#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
> + /*
> + * The membarrier system call requires a full memory barrier
> + * after storing to rq->curr, before going back to user-space.
> + *
> + * Only need the full barrier when switching between processes:
> + * barrier when switching from kernel to userspace is not
> + * required here, given that it is implied by mmdrop(); barrier
> + * when switching from userspace to kernel is not needed after
> + * store to rq->curr.
> + */
> + if (unlikely(atomic_read(&next->membarrier_state) &
> + (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> + MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
> + smp_mb();
> +#endif

The approach looks good. Please implement it within a separate
membarrier_arch_switch_mm() as done on powerpc.

> +
> /*
> * Mark the current MM context as inactive, and the next as
> * active. This is at least used by the icache flushing
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a708d225c28e8..a1c749fddd095 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> *
> * Here are the schemes providing that barrier on the
> * various architectures:
> - * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> - * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> + * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
> + * RISC-V. switch_mm() relies on membarrier_arch_switch_mm()
> + * on PowerPC.
> * - finish_lock_switch() for weakly-ordered
> * architectures where spin_unlock is a full barrier,
> * - switch_to() for arm64 (weakly-ordered, spin_unlock
>
> The silver lining is that similar changes (probably as a separate/preliminary
> patch) also restore the desired order between [B] and [D] AFAIU; so with them,
> 2/2 would just need additions to document the above SYNC_CORE scenario.
Exactly.
> Thoughts?
I think we should be OK with the changes you suggest.
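
For illustration, the [A]..[D] scenario quoted above has the shape of the
classic store-buffering pattern: each side stores to one location and then
loads the other. Below is a minimal user-space sketch that enumerates every
interleaving of that pattern in a deliberately simplified store-buffer model.
It is not the RISC-V memory model and not kernel code. All names in it
(struct state, bad_outcome_reachable(), sync_core_can_be_missed()) are
hypothetical, and the FENCE step merely stands in for smp_mb().

```c
#include <stdbool.h>

/* Each thread runs: store 1 to its own flag, optional fence, load the other flag. */
enum { STORE, FENCE, LOAD, DONE };

/* Hypothetical thread indices; thread t stores to mem[t] and loads mem[1 - t]. */
enum { MEMBARRIER, SCHEDULER };

struct cpu {
	int pc;		/* next step: STORE, FENCE, LOAD or DONE */
	bool buffered;	/* store issued but not yet visible in memory */
	int loaded;	/* value returned by the load */
};

struct state {
	int mem[2];	/* mem[MEMBARRIER]: stale-mask bit [A], mem[SCHEDULER]: rq->curr is mm [B] */
	struct cpu cpu[2];
};

/*
 * Depth-first search over all interleavings. Returns true if some execution
 * ends with both loads ([C] and [D]) having observed the old value 0, i.e.
 * the IPI is skipped and switch_mm() also skips the sync-core.
 */
static bool bad_outcome_reachable(struct state s, bool with_fence)
{
	if (s.cpu[0].pc == DONE && s.cpu[1].pc == DONE)
		return s.cpu[0].loaded == 0 && s.cpu[1].loaded == 0;

	for (int t = 0; t < 2; t++) {
		struct state n;

		/* A buffered store may become visible at any point. */
		if (s.cpu[t].buffered) {
			n = s;
			n.mem[t] = 1;
			n.cpu[t].buffered = false;
			if (bad_outcome_reachable(n, with_fence))
				return true;
		}

		n = s;
		switch (s.cpu[t].pc) {
		case STORE:
			n.cpu[t].buffered = true;
			n.cpu[t].pc = with_fence ? FENCE : LOAD;
			break;
		case FENCE:
			if (s.cpu[t].buffered)
				continue;	/* fence waits for the store to drain */
			n.cpu[t].pc = LOAD;
			break;
		case LOAD:
			n.cpu[t].loaded = n.mem[1 - t];
			n.cpu[t].pc = DONE;
			break;
		default:
			continue;	/* this thread has finished */
		}
		if (bad_outcome_reachable(n, with_fence))
			return true;
	}
	return false;
}

/* Start from "mask bit clear, rq->curr not in mm" and search all executions. */
static bool sync_core_can_be_missed(bool with_fence)
{
	struct state init = { .mem = { 0, 0 } };

	return bad_outcome_reachable(init, with_fence);
}
```

In this model, sync_core_can_be_missed(false) finds an execution where both
loads see the old values, while sync_core_can_be_missed(true) finds none. That
matches the reasoning above: a full barrier is needed on both sides, between
[A] and [C] and between [B] and [D].
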
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com