Message-ID: <292abea1-59b7-13d4-0b27-ac00f7e7f20e@ghiti.fr>
Date:   Mon, 24 Jul 2023 11:05:40 +0200
From:   Alexandre Ghiti <alex@...ti.fr>
To:     Guo Ren <guoren@...nel.org>
Cc:     palmer@...osinc.com, paul.walmsley@...ive.com, falcon@...ylab.org,
        bjorn@...nel.org, conor.dooley@...rochip.com,
        linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-riscv@...ts.infradead.org, Guo Ren <guoren@...ux.alibaba.com>
Subject: Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr


On 22/07/2023 01:59, Guo Ren wrote:
> On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@...ti.fr> wrote:
>>
>> On 21/07/2023 18:08, Guo Ren wrote:
>>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@...ti.fr> wrote:
>>>> On 21/07/2023 16:51, guoren@...nel.org wrote:
>>>>> From: Guo Ren <guoren@...ux.alibaba.com>
>>>>>
>>>>> RISC-V specification permits the caching of PTEs whose V (Valid)
>>>>> bit is clear. Operating systems must be written to cope with this
>>>>> possibility, but implementers are reminded that eagerly caching
>>>>> invalid PTEs will reduce performance by causing additional page
>>>>> faults.
>>>>>
>>>>> So we must keep vmalloc_fault to handle the spurious page faults on
>>>>> kernel virtual addresses that an OoO machine can generate.
>>>>>
>>>>> Signed-off-by: Guo Ren <guoren@...ux.alibaba.com>
>>>>> Signed-off-by: Guo Ren <guoren@...nel.org>
>>>>> ---
>>>>>     arch/riscv/mm/fault.c | 3 +--
>>>>>     1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>>>>> index 85165fe438d8..f662c9eae7d4 100644
>>>>> --- a/arch/riscv/mm/fault.c
>>>>> +++ b/arch/riscv/mm/fault.c
>>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
>>>>>          * only copy the information from the master page table,
>>>>>          * nothing more.
>>>>>          */
>>>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
>>>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
>>>>> +     if (unlikely(addr >= TASK_SIZE)) {
>>>>>                 vmalloc_fault(regs, code, addr);
>>>>>                 return;
>>>>>         }
>>>> Can you share what you are trying to fix here?
>>> We hit a spurious page fault panic on an OoO machine:
>>>
>>> 1. Speculative execution brings V=0 entries for kernel virtual
>>> addresses into the TLB.
>>> 2. The Linux kernel then installs the mapping for the kernel virtual
>>> address, with V=1 in the page table.
>>> 3. When kernel code accesses that virtual address, it raises a page
>>> fault because of the stale V=0 entry in the TLB.
>>> 4. With no vmalloc_fault to replay the access, the kernel panics.
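The four-step race above can be modeled in plain C. Everything below is a purely illustrative userspace stand-in (the one-entry "TLB", the "page table", and the function names are hypothetical, not kernel code), just to make the ordering of the steps concrete:

```c
/* Userspace model of the spurious-fault sequence; all names and
 * structures here are illustrative stand-ins, not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct pte { bool valid; };

static struct pte page_table = { .valid = false }; /* master copy     */
static struct pte tlb;                             /* cached copy     */
static bool tlb_filled = false;                    /* TLB entry held? */

/* Step 1: speculation may legally cache a V=0 entry in the TLB. */
static void speculative_fill(void) { tlb = page_table; tlb_filled = true; }

/* Step 2: the kernel maps the page (V=1) in the page table only. */
static void kernel_map(void) { page_table.valid = true; }

/* Step 3: an access served by the stale V=0 TLB entry faults,
 * even though the master page table now says the mapping is valid. */
static bool access_faults(void) {
    struct pte e = tlb_filled ? tlb : page_table;
    return !e.valid;
}

/* Step 4 remedy: a local TLB flush (sfence.vma in the model's real
 * counterpart) drops the stale entry, so the replayed access succeeds. */
static void local_flush(void) { tlb_filled = false; }
```

With the panic described above, step 4's flush never happens because nothing replays the fault, which is exactly what vmalloc_fault() used to do.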
>>>
>>>> I have a fix (currently running in our CI) for commit 7d3332be011e
>>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
>>>> implements flush_cache_vmap(), since we lost the vmalloc_fault path.
>>> Could you share that patch?
>>
>> Here we go:
>>
>>
>> Author: Alexandre Ghiti <alexghiti@...osinc.com>
>> Date:   Fri Jul 21 08:43:44 2023 +0000
>>
>>       riscv: Implement flush_cache_vmap()
>>
>>       The RISC-V kernel needs an sfence.vma after a page table
>>       modification: we used to rely on the vmalloc fault handling to
>>       emit an sfence.vma, but commit 7d3332be011e ("riscv: mm:
>>       Pre-allocate PGD entries for vmalloc/modules area") got rid of
>>       this path for 64-bit kernels, so now we need to explicitly emit
>>       an sfence.vma in flush_cache_vmap().
>>
>>       Note that we don't need to implement flush_cache_vunmap() as the
>>       generic code should emit a TLB flush after unmapping a vmalloc
>>       region.
>>
>>       Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>>       vmalloc/modules area")
>>       Signed-off-by: Alexandre Ghiti <alexghiti@...osinc.com>
>>
>> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
>> index 8091b8bf4883..b93ffddf8a61 100644
>> --- a/arch/riscv/include/asm/cacheflush.h
>> +++ b/arch/riscv/include/asm/cacheflush.h
>> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>>    #define flush_icache_user_page(vma, pg, addr, len) \
>>           flush_icache_mm(vma->vm_mm, 0)
>>
>> +#ifdef CONFIG_64BIT
>> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
>> +#endif
> I don't want that: flush_tlb_kernel_range() is currently just
> flush_tlb_all(). In addition, it sends IPIs, which is a performance killer.


At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but 
that needs to be fixed, see my last patchset 
https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@rivosinc.com/.

But can you at least check that this fixes your issue? It would be 
interesting to see if the problem comes from vmalloc or something else.


> What's the problem with spurious fault replay? It only costs a local
> TLB flush of the faulting vaddr.


We had this exact discussion internally this week, and fault replay does 
seem like a solution. But it needs to be thought through carefully: the 
vmalloc fault path was removed for a reason (see Bjorn's commit log). 
Tracing functions can use vmalloc() in the path of the vmalloc fault, 
causing an infinite trap loop, and this patch simply re-enables that 
problem. In addition, it makes vmalloc_fault() catch *all* faults in the 
kernel address space, so any genuine kernel fault would loop forever in 
vmalloc_fault().
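The "catches all kernel faults" concern follows directly from the patch's new dispatch condition. A minimal userspace sketch of that check (the TASK_SIZE value below is hypothetical, not the real RV64 layout):

```c
/* Sketch of the fault dispatch the patch introduces; the constant is
 * an illustrative user/kernel split, not the actual RISC-V one. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TASK_SIZE 0x0000004000000000ULL /* hypothetical split address */

/* With the patch, *every* fault at a kernel address, spurious TLB
 * fault or genuine bad access alike, is routed to vmalloc_fault();
 * a genuine bad access would just fault again after the replay. */
static bool routed_to_vmalloc_fault(uint64_t addr)
{
    return addr >= TASK_SIZE;
}
```

The old condition additionally required the address to fall in [VMALLOC_START, VMALLOC_END), which is what kept genuine kernel faults out of the replay path.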


For now, the simplest solution is to implement flush_cache_vmap(), 
because riscv needs an sfence.vma when adding a new mapping. If that 
really is a "performance killer", let's measure it and then implement 
something like what this patch is trying to do. I may be wrong, but 
there aren't many new kernel mappings that would require a call to 
flush_cache_vmap(), so I disagree with the performance-killer argument, 
but I'm happy to be proven otherwise!
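As a rough model of the proposal: once flush_tlb_kernel_range() stops being flush_tlb_all(), flush_cache_vmap() amounts to one local flush per page of the newly mapped range. Everything below (names, page size, the flush counter) is an illustrative userspace sketch, and a real implementation would also have to flush remote harts:

```c
/* Toy model of a ranged flush_cache_vmap(); names and PAGE_SIZE are
 * illustrative, and remote-hart flushing is deliberately ignored. */
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

static unsigned long local_flushes; /* counts modeled sfence.vma */

/* Ranged flush: one local flush per page, instead of a global
 * flush_tlb_all() plus IPIs to every other hart. */
static void flush_tlb_kernel_range(uint64_t start, uint64_t end)
{
    for (uint64_t a = start; a < end; a += PAGE_SIZE)
        local_flushes++; /* stands in for sfence.vma on address a */
}

/* flush_cache_vmap() as proposed: just a ranged kernel TLB flush. */
#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
```

The cost is therefore proportional to the size of each new vmalloc mapping, which is the quantity worth measuring before calling it a performance killer.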

Thanks,

Alex


>
>> +
>>    #ifndef CONFIG_SMP
>>
>>    #define flush_icache_all() local_flush_icache_all()
>>
>>
>> Let me know if that works for you!
>>
>>
>>>
> --
> Best Regards
>   Guo Ren
