Message-ID: <5900e978-6349-4a3d-a384-758889b678a0@lucifer.local>
Date: Tue, 1 Jul 2025 09:15:03 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org, david@...hat.com,
willy@...radead.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
catalin.marinas@....com, will@...nel.org, Liam.Howlett@...cle.com,
vbabka@...e.cz, jannh@...gle.com, anshuman.khandual@....com,
peterx@...hat.com, joey.gouly@....com, ioworker0@...il.com,
baohua@...nel.org, kevin.brodsky@....com, quic_zhenhuah@...cinc.com,
christophe.leroy@...roup.eu, yangyicong@...ilicon.com,
linux-arm-kernel@...ts.infradead.org, hughd@...gle.com,
yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
On Tue, Jul 01, 2025 at 09:03:27AM +0100, Ryan Roberts wrote:
> >
> > ////// end of Lorenzo's suggestion //////
> >
> > You can obviously modify this to change other stuff like whether you feed back
> > the PAE or not in the private case for use in your code.
>
> This suggestion for this part of the problem looks much cleaner!
Thanks :)
>
> Sorry; this whole struct tristate thing was my idea. I never really liked it but
> I was more focussed on trying to illustrate the big picture flow that I thought
> would work well with a batch and sub-batches (which it seems below that you
> hate... but let's talk about that down there).
Yeah, this is fiddly stuff so I get it as a sort of pseudocode, but as code
obviously I've made my feelings known haha.
It seems that we can apply the fundamental underlying idea without needing to do
it this way at any rate so we should be good.
> >>
> >> static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> >> - pte_t *ptep, pte_t pte, int max_nr_ptes)
> >> + pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
> >
> > This last parameter is pretty horrible. It's a negative mask so now you're
> > passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
> > just planting land mines.
> >
> > Obviously David's flag changes will also alter all this.
> >
> > Just add a boolean re: soft dirty.
>
> Dev had a boolean for this in the last round. I've seen various functions expand
> over time with increasing numbers of bool flags. So I asked to convert to a
> flags parameter and just pass in the flags we need. Then it's a bit more future
> proof and self documenting. (For the record I dislike the "switch_off_flags"
> approach taken here).
Yeah, but we can change this when it needs to be changed. When it comes to
internal non-uAPI stuff I don't think we need to be too worried about
future-proofing like this, at least not just yet.
Do not fear the future churn... ;)
I mean I guess David's new flags will make this less egregious anyway.
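To be concrete, the kind of thing I had in mind was just the below - rough,
untested sketch only, the parameter name is made up, and the fpb flag names
will no doubt shift with David's series anyway:

static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
		pte_t *ptep, pte_t pte, int max_nr_ptes,
		bool ignore_soft_dirty)
{
	fpb_t flags = FPB_IGNORE_DIRTY;

	if (ignore_soft_dirty)
		flags |= FPB_IGNORE_SOFT_DIRTY;

	/* ... rest as before, passing flags on to folio_pte_batch() ... */
}

That way the caller states positively what it wants ignored rather than
passing a mask of things to switch off.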
> >> oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >> ptent = pte_modify(oldpte, newprot);
> >>
> >> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
> >> * example, if a PTE is already dirty and no other
> >> * COW or special handling is required.
> >> */
> >> - if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >> - !pte_write(ptent) &&
> >> - can_change_pte_writable(vma, addr, ptent))
> >> - ptent = pte_mkwrite(ptent, vma);
> >> -
> >> - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> >> - if (pte_needs_flush(oldpte, ptent))
> >> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> >> - pages++;
> >> + set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >> + !pte_write(ptent);
> >> + if (set_write)
> >> + set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
> >> +
> >> + while (nr_ptes) {
> >> + if (set_write == TRI_MAYBE) {
> >> + sub_nr_ptes = anon_exclusive_batch(folio,
> >> + pgidx, nr_ptes, &sub_set_write);
> >> + } else {
> >> + sub_nr_ptes = nr_ptes;
> >> + sub_set_write = (set_write == TRI_TRUE);
> >> + }
> >> +
> >> + if (sub_set_write)
> >> + newpte = pte_mkwrite(ptent, vma);
> >> + else
> >> + newpte = ptent;
> >> +
> >> + modify_prot_commit_ptes(vma, addr, pte, oldpte,
> >> + newpte, sub_nr_ptes);
> >> + if (pte_needs_flush(oldpte, newpte))
> >> + tlb_flush_pte_range(tlb, addr,
> >> + sub_nr_ptes * PAGE_SIZE);
> >> +
> >> + addr += sub_nr_ptes * PAGE_SIZE;
> >> + pte += sub_nr_ptes;
> >> + oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> >> + ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> >> + nr_ptes -= sub_nr_ptes;
> >> + pages += sub_nr_ptes;
> >> + pgidx += sub_nr_ptes;
> >> + }
> >
> > I hate hate hate having this loop here, let's abstract this please.
> >
> > I mean I think we can just use mprotect_folio_pte_batch() no? It's not
> > abstracting much here, and we can just do all this handling there. Maybe have to
> > pass in a bunch more params, but it saves us having to do all this.
>
> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> return the batch size considering all the relevant PTE bits AND the
> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> really disliked changing that core function (I think others did too?).
Yeah let's not change the core function.
My suggestion is to have mprotect_folio_pte_batch() do this.
>
> So barring that approach, we are really only left with the batch and sub-batch
> approach - although, yes, it could be abstracted more. We could maintain a
> context struct that persists across all calls to mprotect_folio_pte_batch() and
> it can use that to keep its state to remember if we are in the middle of a
> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> But that started to feel overly abstracted to me.
Having this nested batch/sub-batch loop really feels worse. You just get lost in
the complexity here very easily.
But i"m also not sure we need to maintain _that_ much state?
We're already looping over all of the PTEs here, so abstracting _the entire
loop_ and all the sub-batch stuff into another function (sensibly
mprotect_folio_pte_batch(), I think) so it handles all of this for you makes a
ton of sense.
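Something like the below is roughly what I'm picturing - completely untested,
the name and parameter list are purely illustrative, it leans on the
anon_exclusive_batch() helper from your suggestion, and I've collapsed the
tristate into a plain bool for brevity (so it loses the 'definitely writable,
skip the per-page walk' distinction):

static void commit_prot_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma,
		struct folio *folio, int pgidx, unsigned long addr,
		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
		bool try_set_write)
{
	while (nr_ptes) {
		int sub_nr_ptes = nr_ptes;
		bool sub_set_write = try_set_write;
		pte_t newpte;

		/*
		 * Only walk the per-page AnonExclusive bits if making the
		 * PTEs writable is actually in question; a 'definitely
		 * writable' fast path could skip this walk entirely.
		 */
		if (try_set_write)
			sub_nr_ptes = anon_exclusive_batch(folio, pgidx,
					nr_ptes, &sub_set_write);

		newpte = sub_set_write ? pte_mkwrite(ptent, vma) : ptent;

		modify_prot_commit_ptes(vma, addr, ptep, oldpte, newpte,
					sub_nr_ptes);
		if (pte_needs_flush(oldpte, newpte))
			tlb_flush_pte_range(tlb, addr,
					sub_nr_ptes * PAGE_SIZE);

		addr += sub_nr_ptes * PAGE_SIZE;
		ptep += sub_nr_ptes;
		oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
		ptent = pte_advance_pfn(ptent, sub_nr_ptes);
		pgidx += sub_nr_ptes;
		nr_ptes -= sub_nr_ptes;
	}
}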
>
> This loop approach, as written, felt more straightforward for the reader to
> understand (i.e. the least-worst option). Is the context approach what you are
> suggesting or do you have something else in mind?
>
See above.
> >
> > Alternatively, we could add a new wrapper function, but yeah definitely not
> > this.
> >
> > Also the C programming language asks... etc etc. ;)
> >
> > Overall since you always end up processing folio_nr_pages(folio) you can just
> > have the batch function or a wrapper return this and do updates as necessary
> > here on that basis, and leave the 'sub' batching to that function.
>
> Sorry I don't understand this statement - could you clarify? Especially the bit
> about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
> various corner cases the size of the folio has no relationship to the way the
> PTEs are mapped.
Right yeah, I put this badly. Obviously you can have all sorts of fun with
large folios partially mapped, page-table splits, etc. etc.
I should have said 'always process nr_ptes'.
The idea is to abstract this sub-batch stuff to another function, fundamentally.
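i.e. with a helper along the lines of the sketch above, the call site in
change_pte_range() collapses to something like (again purely illustrative):

		commit_prot_ptes(tlb, vma, folio, pgidx, addr, pte,
				 oldpte, ptent, nr_ptes, set_write);
		pages += nr_ptes;

and all of the sub-batch accounting lives in one place.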