Message-ID: <a4063a20-7d82-4013-b7cb-b3fd059cc8ba@linux.ibm.com>
Date: Fri, 1 Aug 2025 07:48:33 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc: linux-kernel@...r.kernel.org, Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Vishal Chourasia <vishalc@...ux.ibm.com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH] powerpc/mm: Switch MMU context on hash MMU if SLB preload
cache is aged
Hi Ritesh,
On 7/31/25 11:40 PM, Ritesh Harjani (IBM) wrote:
> Donet Tom <donettom@...ux.ibm.com> writes:
>
>> On systems using the hash MMU, there is a software SLB preload cache that
>> mirrors the entries loaded into the hardware SLB buffer. This preload
>> cache is subject to periodic eviction — typically after every 256 context
>> switches — to remove old entries.
>>
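For context, the aging path described above lives in switch_slb(). A
simplified sketch of the pre-patch code (structure taken from the hunk in
the diff below; the preload_add() call is from my reading of the upstream
switch_slb(), not from the hunk itself; thread.load_slb is a u8, so it
wraps every 256 context switches):

	/* in switch_slb(), arch/powerpc/mm/book3s64/slb.c (simplified) */
	tsk->thread.load_slb++;
	if (!tsk->thread.load_slb) {		/* wrapped: 256th switch */
		unsigned long pc = KSTK_EIP(tsk);

		preload_age(ti);	/* evict the oldest preload entry */
		preload_add(ti, pc);	/* re-add the current PC */
	}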
>> Currently, the kernel skips the MMU context switch in switch_mm_irqs_off()
>> if the prev and next mm_struct are the same, as an optimization. However,
>> this behavior can lead to problems on hash MMU systems.
>>
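And the optimization in question, in switch_mm_irqs_off(), is simply this
(again simplified from the pre-patch side of the hunk in the diff below):

	/* in switch_mm_irqs_off(), arch/powerpc/mm/mmu_context.c */
	switch_mm_pgdir(tsk, next);

	/* Nothing else to do if we aren't actually switching */
	if (prev == next)
		return;
	/* ... otherwise fall through to switch_mmu_context() ... */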
> Let's also add a detailed flow of events, as this was not an easy
> problem to catch.
>
> CPU 0                                     CPU 1
>
> Process P
>   exec                                    swapper/1
>    load_elf_binary
>     begin_new_exec
>      activate_mm
>       switch_mm_irqs_off
>        switch_mmu_context
>         switch_slb
>         /*
>          * This invalidates all the
>          * entries in the HW SLB and sets
>          * up the new HW SLB entries as
>          * per the preload cache.
>          */
>   sched_switch
>    sched_migrate_task migrates process P to cpu-1
>
> Process swapper/0                         context switch (to process P)
> (uses mm_struct of Process P)             switch_mm_irqs_off()
>                                            switch_slb
>                                             load_slb++
>                                             /*
>                                              * load_slb becomes 0 here and
>                                              * we evict an entry from the
>                                              * preload cache with
>                                              * preload_age(). We still keep
>                                              * the HW SLB and the preload
>                                              * cache in sync, because all HW
>                                              * SLB entries get evicted anyway
>                                              * in switch_slb during SLBIA.
>                                              * We then add back into the HW
>                                              * SLB only those entries which
>                                              * are currently present in the
>                                              * preload cache (after eviction).
>                                              */
>                                            load_elf_binary continues...
>                                             setup_new_exec()
>                                              slb_setup_new_exec()
>
>                                            sched_switch event
>                                             sched_migrate_task migrates
>                                             process P to cpu-0
>
> context_switch from swapper/0 to Process P
>  switch_mm_irqs_off()
>  /*
>   * Since both the prev and next mm_structs are the same, we don't call
>   * switch_mmu_context(). This causes the HW SLB and the SW preload
>   * cache to go out of sync in preload_new_slb_context, because an SLB
>   * entry was evicted from both the HW SLB and the preload cache on
>   * cpu-1. Later, in preload_new_slb_context(), when we try to add the
>   * same preload entry again, we add it to the SW preload cache and then
>   * to the HW SLB. Since this entry was never invalidated on cpu-0,
>   * adding it to the HW SLB causes an SLB multi-hit error.
>   */
> load_elf_binary continues...
>  START_THREAD
>   start_thread
>    preload_new_slb_context
>    /*
>     * This tries to add a new EA to the preload cache which was earlier
>     * evicted from both the HW SLB and the preload cache of cpu-1. This
>     * leaves the HW SLB of cpu-0 out of sync with the SW preload cache.
>     * When we context-switched back on cpu-0, we should ideally have
>     * called switch_mmu_context(), which would have brought the HW SLB
>     * entries on cpu-0 in sync with the SW preload cache entries by
>     * setting up the mmu context properly. But we didn't, because the
>     * prev mm_struct running on cpu-0 was the same as the next mm_struct
>     * (which is true for swapper / kernel threads). So now, when we try
>     * to add this new entry into the HW SLB of cpu-0, we hit an SLB
>     * multi-hit error.
>     */
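To connect this with the warning below: the multi-hit is caught at insert
time. slb_insert_entry() (see the call trace) asserts that the EA is not
already present in the HW SLB before executing slbmte. A rough sketch from
memory, not the verbatim kernel code; the signature and ordering are my
recollection of upstream:

	/* arch/powerpc/mm/book3s64/slb.c (heavily simplified) */
	static void slb_insert_entry(unsigned long ea, unsigned long context,
				     unsigned long flags, int ssize, bool kernel)
	{
		unsigned long vsid_data, esid_data;

		/* ... compute vsid_data / esid_data for this EA ... */

		/*
		 * If the EA is already present in the HW SLB, inserting
		 * it again would create a multi-hit; this is the check
		 * that fires the WARNING below.
		 */
		assert_slb_presence(false, ea);

		asm volatile("slbmte %0, %1"
			     : : "r" (vsid_data), "r" (esid_data));
	}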
>
> WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62 assert_slb_presence+0x2c/0x50
> Modules linked in:
> CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12 VOLUNTARY
> Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
> NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
> REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
> MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
> CFAR: c0000000001543b0 IRQMASK: 3
> <...>
> NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
> LR [c0000000001543b4] slb_insert_entry+0x124/0x390
> Call Trace:
> 0x7fffceb5ffff (unreliable)
> preload_new_slb_context+0x100/0x1a0
> start_thread+0x26c/0x420
> load_elf_binary+0x1b04/0x1c40
> bprm_execve+0x358/0x680
> do_execveat_common+0x1f8/0x240
> sys_execve+0x58/0x70
> system_call_exception+0x114/0x300
> system_call_common+0x160/0x2c4
>
>
>> Consider the following scenario: a process is running on CPU A and gets
>> context-switched to CPU B. During this time, one of its SLB preload
>> cache entries is evicted. Later, the process is rescheduled on CPU A,
>> which had been running swapper with the same mm_struct in the meantime.
>> Because prev == next, the kernel skips the MMU context switch. As a
>> result, the hardware SLB buffer still contains the entry, but the
>> software preload cache does not.
>>
>> The absence of the entry in the preload cache causes the kernel to try
>> to reload it into the SLB. However, since the entry is already present
>> in the hardware SLB, this leads to an SLB multi-hit error.
>>
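(As a quick sanity check of the 256-switch interval: a tiny userspace demo
of the u8 wrap-around. The variable names and the 1024-iteration loop are
illustrative only, not kernel code:)

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint8_t load_slb = 0;	/* mirrors thread.load_slb's type */
		int aged = 0;

		for (int ctx_switch = 1; ctx_switch <= 1024; ctx_switch++) {
			load_slb++;
			if (!load_slb)	/* wrapped back to 0 */
				aged++;
		}
		/* prints 4: one aging event every 256 switches */
		printf("aging events in 1024 switches: %d\n", aged);
		return 0;
	}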
> Can we use the detailed commit message from above instead of these two
> paragraphs? It is easier to visualize and document that way.
Yes, that makes sense. I'll add this and send V2.
>
>
>> To fix this issue, always switch the MMU context on the hash MMU when
>> the SLB preload cache has aged. With this change, the SLB multi-hit
>> error no longer occurs.
>>
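Putting the two hunks below together: aging now happens when the counter
reaches U8_MAX, and the prev == next shortcut is refused on exactly that
switch, so switch_mmu_context() runs and resyncs the HW SLB with the
preload cache. Roughly (paraphrasing the diff, not verbatim):

	/* switch_mm_irqs_off(), after this patch (paraphrase) */
	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
		return;	/* safe to skip: preload cache was not just aged */
	/* else: fall through and let switch_mmu_context() resync */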
>> Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
> CC: stable@...r.kernel.org
>
> Otherwise LGTM.
Thank you.
>
>> Suggested-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
>> Signed-off-by: Donet Tom <donettom@...ux.ibm.com>
>> ---
>> arch/powerpc/mm/book3s64/slb.c | 2 +-
>> arch/powerpc/mm/mmu_context.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> index 6b783552403c..08daac3f978c 100644
>> --- a/arch/powerpc/mm/book3s64/slb.c
>> +++ b/arch/powerpc/mm/book3s64/slb.c
>> @@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>> * SLB preload cache.
>> */
>> tsk->thread.load_slb++;
>> - if (!tsk->thread.load_slb) {
>> + if (tsk->thread.load_slb == U8_MAX) {
>> unsigned long pc = KSTK_EIP(tsk);
>>
>> preload_age(ti);
>> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
>> index 3e3af29b4523..d7b9ac8c9971 100644
>> --- a/arch/powerpc/mm/mmu_context.c
>> +++ b/arch/powerpc/mm/mmu_context.c
>> @@ -84,7 +84,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> switch_mm_pgdir(tsk, next);
>>
>> /* Nothing else to do if we aren't actually switching */
>> - if (prev == next)
>> + if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
>> return;
>>
>> /*
>> --
>> 2.50.1