Message-ID: <871poz2299.wl-maz@kernel.org>
Date: Mon, 25 Aug 2025 10:12:50 +0100
From: Marc Zyngier <maz@...nel.org>
To: Sam Edwards <cfsworks@...il.com>
Cc: Ard Biesheuvel <ardb@...nel.org>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Anshuman Khandual <anshuman.khandual@....com>,
	Ryan Roberts <ryan.roberts@....com>,
	Baruch Siach <baruch@...s.co.il>,
	Kevin Brodsky <kevin.brodsky@....com>,
	Joey Gouly <joey.gouly@....com>,
	linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use

On Mon, 25 Aug 2025 00:43:08 +0100,
Sam Edwards <cfsworks@...il.com> wrote:
> 
> Hi, Marc! It's been a while; hope you're well.
> 
> On Sun, Aug 24, 2025 at 1:55 AM Marc Zyngier <maz@...nel.org> wrote:
> >
> > Hi Sam,
> >
> > On Sun, 24 Aug 2025 04:05:05 +0100,
> > Sam Edwards <cfsworks@...il.com> wrote:
> > >
> > > On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@...nel.org> wrote:
> > > >
> >
> > [...]
> >
> > > > Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
> > >
> > > I might be doing my math wrong, but wouldn't 52-bit VA with 4K
> > > granules and 5 levels result in this?
> >
> > No. 52bit VA at 4kB granule results in levels 0-3 each resolving 9
> > bits, and level -1 resolving 4 bits. That's a total of 40 bits, plus
> > the 12 bits coming directly from the VA making for the expected 52.
> 
> Thank you, that makes it clear: I made an off-by-one mistake in my
> counting of the levels.
> 
> > > Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
> > > (this is level 3)
> >
> > That's where you got it wrong. The architecture is pretty clear that
> > each level resolves PAGE_SHIFT-3 bits, hence the computation
> > above. The bottom PAGE_SHIFT bits are directly extracted from the VA,
> > without any translation.
> 
> Bear with me a moment while I unpack which part of that I got wrong:
> A PTE is the terminal entry of the MMU walk, so I believe I'm correct
> (in this example, and assuming no hugepages) that each PTE represents
> 4K of virtual memory: that means the final step of computing a PA
> takes a (valid) PTE and the low 12 bits of the VA, then just adds
> those bits to the physical frame address.
> It sounds like what you're saying is "That isn't a *level* though:
> that's just concatenation. A 'level' always takes a bitslice of the VA
> and uses it as an index into a table of word-sized entries. PTEs don't
> point to a further table: they have all of the final information
> encoded directly."

That's mostly it, yes. Each valid descriptor has an output address,
which points either to another table or to actual memory, the latter
being further indexed by the remaining bits of the VA (for 4kB pages:
12 bits for a level-3, 21 bits for a level-2...). Level-3 descriptors
(aka PTEs in x86 parlance) are always final.
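
The "remaining bits" arithmetic can be sketched as follows. This is only
an illustration of the rule that each level resolves PAGE_SHIFT-3 bits;
the function name offset_bits is made up for this example and is not a
kernel or architecture identifier.

```python
def offset_bits(level: int, page_shift: int = 12) -> int:
    """Low VA bits that index directly into memory when the walk
    ends with a final descriptor at `level` (hypothetical helper)."""
    bits_per_level = page_shift - 3  # a page holds 2^(page_shift-3) 8-byte descriptors
    return page_shift + (3 - level) * bits_per_level


# 4kB pages: level 3 leaves 12 bits, level 2 leaves 21, level 1 leaves 30.
for level in (3, 2, 1):
    bits = offset_bits(level)
    print(f"level {level}: {bits} offset bits, maps {1 << bits} bytes")
```

With page_shift=12 this gives 4kB, 2MB and 1GB of memory per final
descriptor at levels 3, 2 and 1 respectively.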

> That makes a lot more sense to me, but contradicts how I read this
> comment from pgtable-hwdef.h:
>  * Level 3 descriptor (PTE).
> I took this as, "a PTE describes how to perform level 3 of the
> translation." But because in fact there are no "levels" after a PTE,
> it must actually be saying "Level 3 of the translation is a lookup
> into an array of PTEs."? The problem with that latter reading is that
> this comment...
>  * Level -1 descriptor (PGD).
> ...when read the same way, is saying "Level -1 of the translation is a
> lookup into an array of PGDs." An "array of PGDs" is nonsense, so I
> reverted back to my earlier readings: "PGD describes how to do level
> -1." and "PTE describes how to do level 3."

The initial level of lookup *is* an array: you take the base address
from TTBR, index it with the correct slice of bits from the VA, read
the value at that address, and you have the information needed for the
next level. The only difference is that you obtain that initial
address from a register instead of getting it from memory.
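
As a rough sketch of that indexing, starting from the MSBs of the VA: the
code below only computes which slice of the VA indexes the table at each
level, and does not model descriptors or memory. The name walk_indices
and its parameters are assumptions made for this example.

```python
def walk_indices(va: int, va_bits: int = 52, page_shift: int = 12):
    """Return ([(level, index), ...] in walk order, page_offset).

    Hypothetical helper: the first entry indexes the table whose base
    comes from TTBR, each later entry indexes a table found in memory.
    """
    bits_per_level = page_shift - 3
    va &= (1 << va_bits) - 1
    # Number of lookups needed to resolve everything above the page offset.
    levels = -(-(va_bits - page_shift) // bits_per_level)  # ceiling division
    start = 4 - levels  # the walk always ends at level 3
    steps = []
    for level in range(start, 4):
        shift = page_shift + (3 - level) * bits_per_level
        width = min(bits_per_level, va_bits - shift)  # initial level may be narrower
        steps.append((level, (va >> shift) & ((1 << width) - 1)))
    return steps, va & ((1 << page_shift) - 1)


va = (0xA << 48) | (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x123
print(walk_indices(va))
```

For a 52bit VA at 4kB granule the first step is level -1 with a 4-bit
index, followed by four 9-bit indices and the 12-bit page offset.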

> 
> This smells like a classic "fencepost problem": The "PXX" Linuxisms
> refer to the *nodes* along the MMU walk, while the "levels" in ARM
> parlance are the actual steps of the walk taken by hardware -- edges,
> not nodes, getting us from fencepost to fencepost. A fence with five
> segments needs six posts, but we only have five currently.
> 
> So: where do the terms P4D, PUD, and PMD fit in here? And which one's
> our missing fencepost?
> PGD ----> ??? ----> ??? ----> ??? ----> ??? ----> PTE (|| low VA bits
> = final PA)

I'm struggling to see what you consider a problem, really. For me, the
original mistake is that you seem to have started counting from the
LSBs of the VA, instead of the MSBs.

As for the naming, the comments in pgtable-hwdef.h do apply, except
that they only match a full 5-level walk, while the kernel can be
configured for as few as 2 levels. Hence the macro hell of folding
levels to hide the fact that we don't have 5 levels in most cases.

I find it much easier to reason about a start level (anywhere from -1
to 2, depending on the page size and the number of VA bits), and the
walk to always finish at level 3. The x86 naming is just compatibility
cruft that I tend to ignore.
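
A small sketch of the start-level view, assuming only the rule that each
level resolves PAGE_SHIFT-3 bits and that the walk ends at level 3. The
helper names start_level and bits_resolved are invented for this example
and do not correspond to kernel macros.

```python
def start_level(va_bits: int, page_shift: int) -> int:
    """Level at which the walk begins (hypothetical helper)."""
    bits_per_level = page_shift - 3
    levels = -(-(va_bits - page_shift) // bits_per_level)  # ceiling division
    return 4 - levels


def bits_resolved(va_bits: int, page_shift: int) -> list:
    """VA bits resolved at each level, from the start level down to level 3."""
    bits_per_level = page_shift - 3
    first = start_level(va_bits, page_shift)
    # The initial level takes whatever is left above the full-width levels.
    top = (va_bits - page_shift) - (3 - first) * bits_per_level
    return [top] + [bits_per_level] * (3 - first)


for va_bits, page_shift in ((52, 12), (48, 12), (39, 12), (42, 16)):
    print(va_bits, page_shift, start_level(va_bits, page_shift),
          bits_resolved(va_bits, page_shift))
```

For 52bit VA at 4kB granule this yields a start level of -1 and a 4-bit
initial level, so that initial table holds 16 8-byte descriptors.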

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.
