linux-kernel - Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <bf1aa0a4-08de-443f-a1a3-aa6c05bab38c@os.amperecomputing.com>
Date: Thu, 4 Sep 2025 10:47:56 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Ard Biesheuvel <ardb@...nel.org>, scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Subject: Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block
 mapping when rodata=full

On 9/4/25 6:16 AM, Ryan Roberts wrote:
> On 04/09/2025 14:14, Ryan Roberts wrote:
>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>
>>>>>> I am wondering whether we can just have a warn_on_once or something for the
>>>>>> case
>>>>>> when we fail to allocate a pagetable page. Or, Ryan had
>>>>>> suggested in an off-the-list conversation that we can maintain a cache of PTE
>>>>>> tables for every PMD block mapping, which will give us
>>>>>> the same memory consumption as we do today, but not sure if this is worth it.
>>>>>> x86 can already handle splitting but due to the callchains
>>>>>> I have described above, it has the same problem, and the code has been working
>>>>>> for years :)
>>>>> I think it's preferable to avoid having to keep a cache of pgtable memory if we
>>>>> can...
>>>> Yes, I agree. We simply don't know how many pages we need to cache, and it
>>>> still can't guarantee 100% allocation success.
>>> This is wrong... We can know how many pages will be needed for splitting linear
>>> mapping to PTEs for the worst case once linear mapping is finalized. But it may
>>> require a few hundred megabytes memory to guarantee allocation success. I don't
>>> think it is worth for such rare corner case.
>> Indeed, we know exactly how much memory we need for pgtables to map the linear
>> map by pte - that's exactly what we are doing today. So we _could_ keep a cache.
>> We would still get the benefit of improved performance but we would lose the
>> benefit of reduced memory.
>>
>> I think we need to solve the vm_reset_perms() problem somehow, before we can
>> enable this.
> Sorry I realise this was not very clear... I am saying I think we need to fix it
> somehow. A cache would likely work. But I'd prefer to avoid it if we can find a
> better solution.

Took a deeper look at vm_reset_perms(). It was introduced by commit 
868b104d7379 ("mm/vmalloc: Add flag for freeing of special 
permsissions"). The VM_FLUSH_RESET_PERMS flag is supposed to be set if 
the vmalloc memory is RO and/or ROX. So set_memory_ro() or 
set_memory_rox() is supposed to follow up vmalloc(). So the page table 
should be already split before reaching vfree(). I think this why 
vm_reset_perms() doesn't not check return value.

I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set. The 
most of them has set_memory_ro() or set_memory_rox() followed. But there 
are 3 places I don't see set_memory_ro()/set_memory_rox() is called.

1. BPF trampoline allocation. The BPF trampoline calls 
arch_protect_bpf_trampoline(). The generic implementation does call 
set_memory_rox(). But the x86 and arm64 implementation just simply 
return 0. For x86, it is because execmem cache is used and it does call 
set_memory_rox(). ARM64 doesn't need to split page table before this 
series, so it should never fail. I think we just need to use the generic 
implementation (remove arm64 implementation) if this series is merged.

2. BPF dispatcher. It calls execmem_alloc which has VM_FLUSH_RESET_PERMS 
set. But it is used for rw allocation, so VM_FLUSH_RESET_PERMS should be 
unnecessary IIUC. So it doesn't matter even though vm_reset_perms() fails.

3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also 
called set_memory_rox() before switching to execmem cache. The execmem 
cache calls set_memory_rox(). I don't know why ARM64 doesn't call it.

So I think we just need to fix #1 and #3 per the above analysis. If this 
analysis look correct to you guys, I will prepare two patches to fix them.

Thanks,
Yang

>
>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>