Message-ID: <a2fdcb87-9223-4091-8bda-9d9316c84cf4@arm.com>
Date: Wed, 6 Aug 2025 14:23:56 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org
Cc: ryan.roberts@....com, willy@...radead.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, catalin.marinas@....com, will@...nel.org,
Liam.Howlett@...cle.com, lorenzo.stoakes@...cle.com, vbabka@...e.cz,
jannh@...gle.com, anshuman.khandual@....com, peterx@...hat.com,
joey.gouly@....com, ioworker0@...il.com, baohua@...nel.org,
kevin.brodsky@....com, quic_zhenhuah@...cinc.com,
christophe.leroy@...roup.eu, yangyicong@...ilicon.com,
linux-arm-kernel@...ts.infradead.org, hughd@...gle.com,
yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
On 06/08/25 1:38 pm, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Note that PTE
>> batching here will save a few function calls, and this strategy in
>> certain cases (not this one) batches atomic operations in general, so
>> we have a performance win for all arches. This patch paves the way for
>> patch 7 which will help us elide the TLBI per contig block on arm64.
>>
>> The correctness of this patch lies in the correctness of setting the
>> new ptes based upon information only from the first pte of the batch
>> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>>
>> Observe that the flag combination we pass to mprotect_folio_pte_batch()
>> guarantees that the batch is uniform w.r.t. the soft-dirty bit and the
>> writable bit. Therefore, the only bits which may differ are the a/d
>> bits, so we only need to worry about code which is concerned with the
>> a/d bits of the PTEs.
>>
>> Setting extra a/d bits on the new ptes where previously they were not
>> set is fine - setting the access bit when it was not set is not a
>> correctness problem, but will only possibly delay the reclaim of the
>> page mapped by the pte (which is in fact intended, because the kernel
>> just operated on this region via mprotect()!). Setting the dirty bit
>> when it was not set is again not a correctness problem, but will only
>> possibly force an unnecessary writeback.
>>
>> So now we need to reason about whether something can go wrong via
>> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
>> and userfaultfd_pte_wp cases are solved due to uniformity in the
>> corresponding bits guaranteed by the flag combination. The ptes all
>> belong to the same VMA (since callers guarantee that [start, end) will
>> lie within the VMA); therefore the conditional based on the VMA is also
>> safe to batch around.
>>
>> Since the dirty bit on the PTE really is just an indication that the
>> folio got written to, the wp-fault optimization can be made even if
>> the PTE itself is not dirty but one of the other PTEs in the batch is.
>> Therefore, it is safe to batch around pte_dirty() in
>> can_change_shared_pte_writable() (in fact this is better, since
>> without batching it may happen that some ptes aren't changed to
>> writable just because they are not dirty, even though the other ptes
>> mapping the same large folio are dirty).
>>
>> To batch around the PageAnonExclusive case, we must check the
>> corresponding condition for every single page. Therefore, from the
>> large folio batch, we determine a sub-batch of ptes mapping pages with
>> the same PageAnonExclusive condition, process that sub-batch, then
>> determine and process the next sub-batch, and so on. Note that this
>> does not cause any extra overhead; if, say, the size of the folio
>> batch is 512, then the sub-batch processing in total will take 512
>> iterations, which is the same as what we would have done before.
>>
>> For pte_needs_flush():
>>
>> ppc does not care about the a/d bits.
>>
>> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
>> get cleared; since we can only have extra a/d bits due to batching,
>> we will only have an extra flush, not a case where we elide a flush due
>> to batching when we shouldn't have.
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>
>
> I wanted to review this, but it looks like it's already upstream and I
> suspect it's buggy (see the upstream report I cc'ed you on).
Thank you for CCing me, and for quickly spotting the problem!
>
> [...]
>
>> +
>> +/*
>> + * This function is a result of trying our very best to retain the
>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>> + * if the vma is a private vma, and we cannot determine whether to change
>> + * the pte to writable just from the vma and the pte, we then need to look
>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>> + * the pte batch; hence in this case it is incorrect to decide to change or
>> + * not change the ptes to writable just by using information from the first
>> + * pte of the batch. Therefore, we must individually check all pages and
>> + * retrieve sub-batches.
>> + */
>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +	struct page *first_page = folio_page(folio, 0);
>
> Who says that we have the first page of the folio mapped into the
> first PTE of the batch?
Oops, I thought I had taken care of this. We derive the folio from
vm_normal_folio(vma, addr, oldpte), which computes the struct page from
the actual pte, but I missed that page_folio() will then reset to the
head page - I was under the impression it would retain the original
page. And indeed, if it didn't do that, the purpose of vm_normal_folio()
would be defeated; I should have thought about that :( Let me send a
fix patch.
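
The idea for the fix would be to derive the first page of the batch from
the pte actually being processed, instead of assuming it is the head page
of the folio - roughly something like this (untested sketch, the actual
fix patch may end up looking different):

-	struct page *first_page = folio_page(folio, 0);
+	/* Page actually mapped by the first pte of this batch. */
+	struct page *first_page = vm_normal_page(vma, addr, oldpte);
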
>
>> +	bool expected_anon_exclusive;
>> +	int sub_batch_idx = 0;
>> +	int len;
>> +
>> +	while (nr_ptes) {
>> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>> +		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
>> +				first_page, expected_anon_exclusive);
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
>> +				sub_batch_idx, expected_anon_exclusive, tlb);
>> +		sub_batch_idx += len;
>> +		nr_ptes -= len;
>> +	}
>> +}
>> +
>> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +	bool set_write;
>> +
>> +	if (vma->vm_flags & VM_SHARED) {
>> +		set_write = can_change_shared_pte_writable(vma, ptent);
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +				/* idx = */ 0, set_write, tlb);
>> +		return;
>> +	}
>> +
>> +	set_write = maybe_change_pte_writable(vma, ptent) &&
>> +		    (folio && folio_test_anon(folio));
>> +	if (!set_write) {
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +				/* idx = */ 0, set_write, tlb);
>> +		return;
>> +	}
>> +	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>> +}
>> +
>>  static long change_pte_range(struct mmu_gather *tlb,
>>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  		nr_ptes = 1;
>>  		oldpte = ptep_get(pte);
>>  		if (pte_present(oldpte)) {
>> +			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> -			struct folio *folio;
>> +			struct folio *folio = NULL;
>>  			pte_t ptent;
>>  			/*
>> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  					/* determine batch to skip */
>>  					nr_ptes = mprotect_folio_pte_batch(folio,
>> -						pte, oldpte, max_nr_ptes);
>> +						pte, oldpte, max_nr_ptes, /* flags = */ 0);
>>  					continue;
>>  				}
>>  			}
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
>> +						max_nr_ptes, flags);
>> +
>>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>  			ptent = pte_modify(oldpte, newprot);
>> @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  			 * COW or special handling is required.
>>  			 */
>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			    !pte_write(ptent))
>> +				set_write_prot_commit_flush_ptes(vma, folio,
>> +				addr, pte, oldpte, ptent, nr_ptes, tlb);
>
> While staring at this:
>
> Very broken indentation.
Indeed, but then to fix the indentation, if we start positioning the
arguments just below the opening bracket, it will take at least four
lines :)
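
Roughly like this, just to illustrate what I mean (the exact wrapping
is still to be decided):

				set_write_prot_commit_flush_ptes(vma, folio,
								 addr, pte,
								 oldpte, ptent,
								 nr_ptes, tlb);
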
>
>> +			else
>> +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>> +					nr_ptes, /* idx = */ 0, /* set_write = */ false,
>> +					tlb);
>
> Semi-broken indentation.
Yup, I can position nr_ptes below vma and set_write below nr_ptes. But
this should probably not be part of the fix patch.
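
That is, something along these lines (again, just illustrative):

				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
						       nr_ptes, /* idx = */ 0,
						       /* set_write = */ false, tlb);
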
>
>> +			pages += nr_ptes;
>>  		} else if (is_swap_pte(oldpte)) {
>>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>  			pte_t newpte;
>
>