Message-ID: <9552ddff-d778-4934-9349-37c7237cbb78@lucifer.local>
Date: Wed, 6 Aug 2025 10:12:48 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org,
ryan.roberts@....com, willy@...radead.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, catalin.marinas@....com, will@...nel.org,
Liam.Howlett@...cle.com, vbabka@...e.cz, jannh@...gle.com,
anshuman.khandual@....com, peterx@...hat.com, joey.gouly@....com,
ioworker0@...il.com, baohua@...nel.org, kevin.brodsky@....com,
quic_zhenhuah@...cinc.com, christophe.leroy@...roup.eu,
yangyicong@...ilicon.com, linux-arm-kernel@...ts.infradead.org,
hughd@...gle.com, yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
> > Signed-off-by: Dev Jain <dev.jain@....com>
>
>
> I wanted to review this, but looks like it's already upstream and I suspect
> it's buggy (see the upstream report I cc'ed you on)
>
> [...]
>
> > +
> > +/*
> > + * This function is a result of trying our very best to retain the
> > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> > + * if the vma is a private vma, and we cannot determine whether to change
> > + * the pte to writable just from the vma and the pte, we then need to look
> > + * at the actual page pointed to by the pte. Unfortunately, if we have a
> > + * batch of ptes pointing to consecutive pages of the same anon large folio,
> > + * the anon-exclusivity (or the negation) of the first page does not guarantee
> > + * the anon-exclusivity (or the negation) of the other pages corresponding to
> > + * the pte batch; hence in this case it is incorrect to decide to change or
> > + * not change the ptes to writable just by using information from the first
> > + * pte of the batch. Therefore, we must individually check all pages and
> > + * retrieve sub-batches.
> > + */
> > +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> > + struct folio *folio, unsigned long addr, pte_t *ptep,
> > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> > +{
> > + struct page *first_page = folio_page(folio, 0);
>
> Who says that we have the first page of the folio mapped into the first PTE
> of the batch?
Yikes, missed this, sorry. Got too tied up in the algorithm here.
You mean _this_ PTE of the batch, right? As we're invoking these on each part
of the PTE table.
I mean I guess we can simply do:
struct page *first_page = pte_page(ptent);
Right?
Presuming ptent is the PTE entry we are currently examining (and applying
the PAE sub-batching to, etc.)
[snip]
> > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
> > /* determine batch to skip */
> > nr_ptes = mprotect_folio_pte_batch(folio,
> > - pte, oldpte, max_nr_ptes);
> > + pte, oldpte, max_nr_ptes, /* flags = */ 0);
> > continue;
> > }
> > }
> > + if (!folio)
> > + folio = vm_normal_folio(vma, addr, oldpte);
> > +
> > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > +
> > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> > ptent = pte_modify(oldpte, newprot);
> > @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
> > * COW or special handling is required.
> > */
> > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> > - !pte_write(ptent) &&
> > - can_change_pte_writable(vma, addr, ptent))
> > - ptent = pte_mkwrite(ptent, vma);
> > -
> > - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> > - if (pte_needs_flush(oldpte, ptent))
> > - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> > - pages++;
> > + !pte_write(ptent))
> > + set_write_prot_commit_flush_ptes(vma, folio,
> > + addr, pte, oldpte, ptent, nr_ptes, tlb);
>
> While staring at this:
>
> Very broken indentation.
Hmm... I wonder if checkpatch.pl would have picked this up actually.
Yeah, that 2nd line is just wrong...
>
> > + else
> > + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> > + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>
> Semi-broken indentation.
Because of the else, then the 2 lines after it?
>
> > + pages += nr_ptes;
> > } else if (is_swap_pte(oldpte)) {
> > swp_entry_t entry = pte_to_swp_entry(oldpte);
> > pte_t newpte;
>
>
> --
> Cheers,
>
> David / dhildenb
>
I think on this series I was so happy to see the implementation evolve into
something so much nicer than it originally was that I did not
scrutinise the impl. details of this one closely enough, but as Jaime
Lannister said, "there are always lessons in failures" :)
Cheers, Lorenzo