linux-kernel - Re: [PATCH v3 08/22] mm: Allow page table accessors to be non-idempotent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bc637516-d013-4f3b-b1a8-33bc1d9f52d1@lucifer.local>
Date: Thu, 18 Dec 2025 17:27:13 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Samuel Holland <samuel.holland@...ive.com>,
        "David Hildenbrand (Red Hat)" <david@...nel.org>,
        Palmer Dabbelt <palmer@...belt.com>, Paul Walmsley <pjw@...nel.org>,
        linux-riscv@...ts.infradead.org,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        devicetree@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
        linux-kernel@...r.kernel.org, Mike Rapoport <rppt@...nel.org>,
        Michal Hocko <mhocko@...e.com>, Conor Dooley <conor@...nel.org>,
        Krzysztof Kozlowski <krzk+dt@...nel.org>,
        Alexandre Ghiti <alex@...ti.fr>,
        Emil Renner Berthing <kernel@...il.dk>,
        Rob Herring <robh+dt@...nel.org>, Vlastimil Babka <vbabka@...e.cz>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [PATCH v3 08/22] mm: Allow page table accessors to be
 non-idempotent

On Tue, Dec 16, 2025 at 05:46:57PM +0000, Ryan Roberts wrote:
> On 16/12/2025 10:29, Lorenzo Stoakes wrote:
> > On Thu, Dec 11, 2025 at 01:59:53PM +0000, Ryan Roberts wrote:
> >> On 11/12/2025 00:33, Samuel Holland wrote:
> >>> On 2025-11-28 2:47 AM, David Hildenbrand (Red Hat) wrote:
> >>>> On 11/27/25 17:57, Ryan Roberts wrote:
> >>>>> On 13/11/2025 01:45, Samuel Holland wrote:
> >>>>>> Currently, some functions such as pte_offset_map() are passed both
> >>>>>> pointers to hardware page tables, and pointers to previously-read PMD
> >>>>>> entries on the stack. To ensure correctness in the first case, these
> >>>>>> functions must use the page table accessor function (pmdp_get()) to
> >>>>>> dereference the supplied pointer. However, this means pmdp_get() is
> >>>>>> called twice in the second case. This double call must be avoided if
> >>>>>> pmdp_get() applies some non-idempotent transformation to the value.
> >>>>>>
> >>>>>> Avoid the double transformation by calling set_pmd() on the stack
> >>>>>> variables where necessary to keep set_pmd()/pmdp_get() calls balanced.
> >>>>>
> >>>>> I don't think this is a good solution.
> >>>>
> >>>> Agreed,
> >>>>
> >>>>     set_pmd(&pmd, pmd);
> >>>>
> >>>> is rather horrible.
> >>> I agree that this patch is ugly. The only way I see to avoid code like this is
> >>> to refactor (or duplicate) the functions so no function takes pointers to both
> >>> hardware page tables and on-stack page table entries. Is that sort of
> >>> refactoring the right direction to go for v4?
> >>
> >> From a quick look at the code, I think that some cases are solvable by
> >> refactoring to pass the value instead of the pointer, and leave it to the higher
> >> level decide how to read the value from the pointer - it knows if it is pointing
> >> to HW pgtable or if it's a (e.g) stack value.
> >>
> >> But the more I look at the code, the more instances I find where pointers to
> >> stack variables are being passed to arch pgtable helpers as if they are HW
> >> pgtable entry pointers. (Mainly pmd level).
> >
> > Ugh. We do need to nip this in the bud I think!
> >
> >>
> >> I wonder if we need to bite the bullet and explicitly separate the types? At
> >> each level, we have:
> >>
> >>  1. page table entry value
> >>  2. pointer to page table entry _value_ (e.g. pointer to pXX_t on stack)
> >>  3. pointer to page table entry in HW pgtable
> >>
> >> Today, 1 is represented by pte_t, pmd_t, etc. 2 and 3 are represented by the
> >> same type; pte_t*, pmd_t*, etc.
> >>
> >> If we create a new type for 3, it will both document and enforce when type 2 or
> >> type 3 is required.
> >>
> >> e.g:
> >>
> >> // pte_t: defined by arch.
> >> typedef unsigned long pte_t;
> >>
> >> // ptep_t: new opaque type that can't be dereferenced.
> >> struct __ptep_t;
> >> typedef struct __ptep_t *ptep_t;
> >>
> >> // getter/setter responsible for cast & deref as appropriate.
> >> pte_t ptep_get(ptep_t ptep)
> >> {
> >> 	return READ_ONCE(*(pte_t *)ptep);
> >> }
> >
> > I think we've got ourselves in a jumble with pte vs ptep vs pteval
> > vs. etc. etc. and we aren't always consistent with it.
> >
> > So I think we ought to put 'hw' somewhere in the name.
>
> Yeah that would probably give the best end result. I was thinking that ptep/pXdp
> might reduce churn since we already use it in function names. Stack pointers are
> definitely a small minority so easier to change their convention.
>
> But having an explicit "hw" would probably be best.

Yeah, unfortunately I think the existing inconsistencies call for a clean
break.

>
> >
> > And we can also now get away from the overly abbreviated names given we have the
> > chance :)
> >
> > So something like 'pte_hw_t' perhaps?
>
> sounds good. Or hw_pte_t? Then "pte_t" is still greppable and it better matches
> the hw_pte_get() that you have below.

Sure hw_pte_t works!

>
> >
> > I like the general concept, though I think it's kinda gross to hide the fact
> > that it's a pointer in the typedef _that_ directly.
>
> My aim was to have a type that you can't dereference, can't accidentally pass as
> a pte_t* and also can't increment/decrement/index.

Right.

>
> This gives us a framework where the architecture can completely virtualize the
> pgtable if it likes; and the compiler can enforce that higher level code can't
> accidentally work around it.

Yeah it's nice to be able to hide properties like this, I did think the use
of an incomplete type pointer was quite nice there :)

>
> I have a couple of use cases where these properties will come in handy; arm64
> now defines 128-bit pgtables. We have a prototype adding it to the kernel, but
> at the moment it has to be a compile time descision so that all the generic code
> knows the size of the entries. It would be preferable to make it a boot time
> decision, based on the HW capabilities (or cmdline, ...). But for that, we need
> to better insulate the physical pgtables from the core-mm. This concept solves that.

NIce.

>
> >
> > So perhaps be less horrid if it was something like:
> >
> > typedef struct {
> > 	pte_t *ptr;
> > } pte_hw_t;
> >
> > Perhaps?
>
> I _think_ that works for the above, and it is certainly nicer.
>
> pte_hw_t pte_table = pte_alloc(...);
> for (i = 0; i < PTRS_PER_PTE; i++)
> 	hw_pte_get(pte_table[i]);
>
> I think that would fail to compile, right? Which is what we want. Instead we
> need something like:

Yeah that wouldn't work of course.

But wouldn't it also not work if you had an opaque type? As you'd then be
dereferencing it? So it'd be something to solve for either approach right?

(An aside - this brings me back to the whole thing of differentiating
between page _tables_ and _entries_, another thing we do badly at.)

>
> pte_hw_t hwpte = pte_alloc(...);
> for (i = 0; i < PTRS_PER_PTE; i++, hwpte = hw_pte_next(hwpte))
> 	hw_pte_get(hwpte);
>
> >
> > Then could have:
> >
> > pte_t hw_pte_get(pte_hw_t pte_hw)
> > {
> > 	return READ_ONCE(*pte_hw.ptr);
> > }
> >

Yeah could do something like that.

It might actually be nice to abstract the iterator-like behaviour too to
the arch for e.g. your 128-bit entry stuff?

> >>
> >> int do_stuff(void)
> >> {
> >> 	// value on stack: ok
> >> 	pte_t mypte;
> >>
> >> 	// pointer to value on stack: ok
> >> 	pte_t *pmypte = &mypte;
> >>
> >> 	// handle to entry on stack: not allowed by compiler!
> >> 	ptep_t myptep = &mypte;
> >>
> >> 	// handle to entry in pgtable: ok
> >> 	ptep_t myptep = pte_offset_kernel(...);
> >>
> >> 	// read value of pgtable entry: ok
> >> 	pte_t val = ptep_get(myptep);
> >>
> >> 	// attempt to pass pointer to stack variable: not allowed by compiler!
> >> 	pte_t val = ptep_get(&mypte);
> >>
> >> 	// attempt to directly dereference ptep: not allowed by compiler!
> >> 	pte_t val = *myptep;
> >> }
> >>
> >>
> >> We could do this incrementally by initially typedefing ptep_t to be:
> >> typedef pte_t *ptep_t;
> >
> > Hm yeah still hate the idea of typedef'ing this as a ptr so directly.
>
> What's the problem with it?

It's the fact you are making what looks like not-a-pointer into a pointer.

It's even more egregious for the non-opaque default thing of typedef pte_t
*hw_pte_t - as then you truly are just making hw_pte_t not look like a
pointer but behave like one.


>
> >
> > Obviously this would make an incremental thing a little harder, but could have
> > this stuff on top of existing logic and make it incremental by changing logic
> > bit-by-bit?
>
> I'm not sure... sounds to me like we would need to convert absolutely everything
> including arches that are difficult for most people to test. With the "typedef
> pte_t *hw_pte_t;" approach, you can go bit-by-bit and leave those unloved arches
> alone entirely.

Couldn't we have a generic thing for arches that don't yet use the type
potentially? Might be a pain.

>
> >
> >>
> >> Then we could flip the switch arch-by-arch to enable the stronger checking. We
> >> likely wouldn't need to convert arches that don't care.
> >>
> >> I think by doing this, it will expose all the current issues and force us to fix
> >> them properly.
> >
> > Yeah this is nice.
> >
> >>
> >> On the related subject of conversion to pXXp_get(); I've been looking into this
> >> and personally, I think we should have 2 helper flavours at each level:
> >>
> >>  - pXXd_get()      optimizable by compiler; defaults to C dereference
> >>  - pXXd_get_once() single-copy-atomic and unmovable by compiler
> >
> > Yes!
> >
> > There is some _real_ confusion about when and when we don't need to do this.
>
> I think the requirements are basically this:
>
> "_ONCE" gives 2 properties; single-copy-atomicity and compiler can't move/remove
> instructions.
>
> SCA is important to prevent tearing if the HW can update the PTE concurrently
> with SW reading it (i.e. HW access/dirty) or if SW is reading without holding
> the PTL. But there are times when it doesn't matter if the PTE gets torn because
> you are only comparing the result using pte_none() or pte_same().
>
> The "compiler can't move/remove" property (I'm sure there is a formal name for
> this?) is important for lockless walkers. Probably you could achieve similar
> results by using barrier()?

OK makes sense.

>
> >
> > Though wouldn't having a typdef separating out a hw entry imply that hw entry ->
> > get once, sw entry -> get?
>
> No; you should never use the accessor methods for sw entries - they are just
> variables - read them normally. If all HW accesses use "once", then the compiler
> can't optimize away the accesses for the folded pgtable case.

Right, yes makes sense.

>
> >
> >>
> >> It simplifies the converstion process, and reduces the risk of bugs
> >> significantly (go read about the arm32 issues discussed in Anshuman's series if
> >> you haven't done already).
> >
> > Though I haven't read this series so can't remember actually if there were cases
> > where even with hw entries we wanted to sometimes READ_ONCE() and sometimes not?
>
> Yes exactly that.

Yeah I guess based on whether the properties desired match the above or
not.

>
> >
> >>
> >> I appreciate this is all probably a lot more work than you would prefer to sign
> >> up for, I'd be happy to collaborate if we get concensus that this approach makes
> >> sense. What do you think?
> >>
> >> Thanks,
> >> Ryan
> >>
> >
> > Thanks for this concept, overall very much in agreement!
>
> I think it would work but it's going to be a lot of churn. But if you think it's
> worth it and we'll be able to get it in, then I'm up for working on it in the
> background.

Well I am never averse to churn :)

I think at this point we have so much of a confusing mess of things going
on in the page table logic that it makes sense to pay the churn price for this.

>
> >
> > Cheers, Lorenzo
>

Thanks for looking into this,

Cheers, Lorenzo