lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ae01c6bc-019a-4263-8083-8ab073e72729@lucifer.local>
Date: Thu, 7 Aug 2025 21:58:14 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Pedro Falcato <pfalcato@...e.de>, Barry Song <baohua@...nel.org>,
        Dev Jain <dev.jain@....com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on
 mremap folio pte batch

On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
> > +cc Ryan for ContPTE stuff.
>
> Appologies, I was aware of the other thread and on-going issues but haven't had
> the bandwidth to follow too closely.
>
> >
> > On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
> >> Acked-by: David Hildenbrand <david@...hat.com>
> >
> > Thanks!
> >
> >>
> >> Wondering whether we could then just use the patch hint instead of going via
> >> the folio.
> >>
> >> IOW,
> >>
> >> return pte_batch_hint(ptep, pte);
> >
> > Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
> > have some conflicting flags potentially. The check is empirical:
> >
> > static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> > {
> > 	if (!pte_valid_cont(pte))
> > 		return 1;
> >
> > 	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> > }
> >
> > So it's 'the most number of PTEs that _might_ coalesce'.
>
> No that's not correct; It's "at least this number of ptes _do_ coalesce".
> folio_pte_batch() may end up returning a larger batch, but never smaller.

Yup David explained.

I suggest you rename this from 'hint', because that's not what hint means
:) unless I'm really misunderstanding what this word means (it's 10pm and I
started work at 6am so it's currently rather possible).

I understand the con PTE bit is a 'hint' but as I recall you saying at
LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
from, but that's kinda deceptive.

Anyway the reason I was emphatic here is on the basis that I believe I had
this explained to met his way, which obviously I or whoever it was (don't
recall) must have misunderstood. Or perhaps I hallucinated it... :)

I see that folio_pte_batch() can get _more_, is this on the basis of there
being adjacent, physically contiguous contPTE entries that can also be
batched up?

>
> This function is looking to see if ptep is inside a conpte mapping, and if it
> is, it's returning the number of ptes to the end of the contpte mapping (which
> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
> if the physical memory is appropriately aligned/sized and all belongs to a
> single folio.
>
> >
> > (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
> >
> > I suppose we could not even bother with checking if same folio and _just_ check
> > if PTEs have consecutive PFNs, which is not very likely if different folio
> > but... could that break something?
>
> Yes something could break; the batch must *all* belong to the same folio.
> Functions like set_ptes() require that in their documentation, and arm64 depends
> upon it in order not to screw up the access/dirty bits.

Turning this around - is a cont pte range guaranteed to belong to only one
folio?

If so then we can just limit the range to one batched block for the sake of
mremap that perhaps doesn't necessarily hugely benefit from further
batching anyway?

Let's take the time to check performance on arm64 hardware.

Are we able to check to see how things behave if we have small folios only
in the tested range on arm64

>
> >
> > It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
> > thing' for a contPTE batch (I may be missing something - please correct me if so
> > Dev/Ryan).
>
> It will all do the right thing functionally no matter how you call it. But if
> you can set_ptes() (and friends) on full contpte mappings, things are more
> efficient.

Yup this is what I was... hinting at ;)

>
> >
> > So actually do we even really care that much about folio?
>
> From arm64's perspective, we're happy enough with batches the size of
> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
> for this location.

OK, so I think we should definitely refactor this.

David pointed out off-list we are duplicating the a/d handing _anyway_ in
get_and_clear_ptes(). So that bit is just wasted effort, so there's really
no need to do much that.

>
> For the record, I'm pretty sure I was the person pushing for protecting
> vm_normal_folio() with pte_batch_hint() right at the start of this process :)

I think you didn't give your hint clearly enough ;)

>
> Thanks,
> Ryan
>
> >
> >>
> >>
> >> Not sure if that was discussed at some point before we went into the
> >> direction of using folios. But there really doesn't seem to be anything
> >> gained for other architectures here (as raised by Jann).
> >
> > Yup... I wonder about the other instances of this... ruh roh.
>
> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
> already needed the folio. I haven't actually looked at how mprotect ended up,
> but maybe worth checking to see if it should protect with pte_batch_hint() too.

mprotect didn't? I mean let's check.

We definitely need to be careful about other arches.

>
> Thanks,
> Ryan

Cheers, Lorenzo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ