Message-ID: <CAH5Ym4h+2w6aayzsVu__3qu3-6ETq1HK7u18yGzOrRqZ--2H9w@mail.gmail.com>
Date: Sat, 23 Aug 2025 20:05:05 -0700
From: Sam Edwards <cfsworks@...il.com>
To: Ard Biesheuvel <ardb@...nel.org>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>, 
	Marc Zyngier <maz@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	Anshuman Khandual <anshuman.khandual@....com>, Ryan Roberts <ryan.roberts@....com>, 
	Baruch Siach <baruch@...s.co.il>, Kevin Brodsky <kevin.brodsky@....com>, 
	Joey Gouly <joey.gouly@....com>, linux-arm-kernel@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use

On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@...nel.org> wrote:
>
> On Sun, 24 Aug 2025 at 09:56, Sam Edwards <cfsworks@...il.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@...nel.org> wrote:
> > >
> > > Hi Sam,
> > >
> > > On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@...il.com> wrote:
> > > >
> > > > In early boot, Linux creates identity virtual->physical address mappings
> > > > so that it can enable the MMU before full memory management is ready.
> > > > To ensure some available physical memory to back these structures,
> > > > vmlinux.lds reserves some space (and defines marker symbols) in the
> > > > middle of the kernel image. However, because they are defined outside of
> > > > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > > > is concerned.
> > > >
> > > > In the typical case, this isn't actually a problem: the boot image is
> > > > prepared with objcopy, which zero-fills the gaps, so these structures
> > > > are incidentally zero-initialized (an all-zeroes entry is considered
> > > > absent, so zero-initialization is appropriate).
> > > >
> > > > However, that is just a happy accident: the `vmlinux` ELF output
> > > > authoritatively represents the state of memory at entry. If the ELF
> > > > says a region of memory isn't initialized, we must treat it as
> > > > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > > > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > > > and therefore do not initialize the gaps. This results in the early boot
> > > > code crashing when it attempts to create identity mappings.
> > > >
> > > > Therefore, add boot-time zero-initialization for the following:
> > > > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > > > - idmap_pg_dir
> > > > - reserved_pg_dir
> > >
> > > I don't think this is the right approach.
> > >
> > > If the ELF representation is inaccurate, it should be fixed, and this
> > > should be achievable without impacting the binary image at all.
> >
> > Hi Ard,
> >
> > I don't believe I can declare the ELF output "inaccurate" per se,
> > since it's the linker's final determination about the state of memory
> > at kernel entry -- including which regions are not the loader's
> > responsibility to initialize (and should therefore be initialized at
> > runtime, e.g. .bss). But, I think I understand your meaning: you would
> > prefer consistent load-time zero-initialization over run-time. I'm
> > open to that approach if that's the consensus here, but it will make
> > `vmlinux` dozens of KBs larger (even though it keeps `Image` the same
> > size).
> >
>
> Indeed, I'd like the ELF representation to be such that only the tail
> end of the image needs explicit clearing. A bit of bloat of vmlinux is
> tolerable IMO.

Since the explicit clearing region already includes the entirety of
__pi_init_pg_dir, would it make sense for me to instead move the other
pg_dir items (except __pi_init_idmap_pg_dir) inside that region too,
both to keep them all grouped and to ensure that they're all cleared
in one go? I'd still need to handle __pi_init_idmap_pg_dir, and
it would mean that reserved_pg_dir is first installed in TTBR1_EL1 a
few cycles before being zeroed, but beyond those two drawbacks it
sounds simpler to me, reduces the image size by a few pages, and meets
the "only clear the tail end" goal.

> Note that your fix is not complete: stores to memory done with the MMU
> and caches disabled need to be invalidated from the D-caches too, or
> they could carry stale clean lines. This is precisely the reason why
> manipulation of memory should be limited to the bare minimum until the
> ID map is enabled in the MMU.

ACK. ARM64 caches are one of those things that I understand in
principle but I'm still learning all of the gotchas. I appreciate that
you shared this insight despite rejecting the overall approach!

> > >
> > > > - tramp_pg_dir # Already done, but this patch corrects the size
> > > >
> > >
> > > What is wrong with the size?
> >
> > On higher-VABIT targets, that memset is overflowing by writing
> > PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in
> > size.
>
> Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?

I might be doing my math wrong, but wouldn't 52-bit VA with 4K
granules and 5 levels result in this?

- Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
  (this is level 3).
- Each PMD has 512 PTEs, the index of which covers VA bits [20:12]
  (this is level 2).
- Each PUD references 512 PMDs, the index covering VA bits [29:21]
  (this is level 1).
- Each P4D references 512 PUDs, indexed by VA bits [38:30]
  (this is level 0).
- The PGD, at level -1, therefore has to cover VA bits [51:39], which
  means it has a 13-bit index: 8192 entries of 8 bytes each would make
  it 16 pages in size.
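
A quick way to sanity-check this arithmetic is the sketch below. It
rests on assumptions that are not stated in this thread: that VA bits
[11:0] are the in-page offset and are not indexed by any table, that
descriptors are 8 bytes, and that every non-root level is a full
one-page table (9 index bits with 4K granules). The function names are
made up for illustration. The sketch says nothing about how the kernel
defines PGD_SIZE.

```python
def index_ranges(va_bits, page_shift, levels, desc_size=8):
    """Return (high_bit, low_bit) of the index at each level, leaf first.

    Assumes desc_size is a power of two and that every level below the
    root is a full page of descriptors.
    """
    # A one-page table holds page_size / desc_size entries.
    bits_per_level = page_shift - (desc_size.bit_length() - 1)
    ranges = []
    low = page_shift  # bits below this are the offset within the page
    for _ in range(levels - 1):
        ranges.append((low + bits_per_level - 1, low))
        low += bits_per_level
    # The root level indexes whatever VA bits remain.
    ranges.append((va_bits - 1, low))
    return ranges


def root_table_bytes(va_bits, page_shift, levels, desc_size=8):
    """Size in bytes of the root table under the same assumptions."""
    high, low = index_ranges(va_bits, page_shift, levels, desc_size)[-1]
    return (1 << (high - low + 1)) * desc_size


if __name__ == "__main__":
    for high, low in index_ranges(52, 12, 5):
        print(f"index covers VA bits [{high}:{low}]")
    print("root table bytes:", root_table_bytes(52, 12, 5))
```

Under those assumptions the leaf index starts at bit 12 rather than
bit 0, so every range shifts up by one level compared with the list
above: the root index is [51:48], which is 16 entries or 128 bytes. For
comparison, 48-bit VA with 4 levels gives a 9-bit root index, which is
exactly one 4K page.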

> Note that at stage 1, arm64 does not support page table concatenation,
> and so the root page table is never larger than a page.

Doesn't PGD_SIZE refer to the size used for userspace PGDs after the
boot progresses beyond stage 1? (What do you mean by "never" here?
"Under no circumstances is it larger than a page at stage 1"? Or
"during the entire lifecycle of the system, there is no time at which
it's larger than a page"?)

Thanks for your time and attention to this,
Sam
