Message-ID: <aXd2rud1E0Gff8-2@willie-the-truck>
Date: Mon, 26 Jan 2026 14:14:06 +0000
From: Will Deacon <will@...nel.org>
To: Yang Shi <yang@...amperecomputing.com>
Cc: Ryan Roberts <ryan.roberts@....com>, catalin.marinas@....com,
	cl@...two.org, linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo

On Thu, Jan 22, 2026 at 01:59:54PM -0800, Yang Shi wrote:
> On 1/22/26 6:43 AM, Ryan Roberts wrote:
> > On 21/01/2026 22:44, Yang Shi wrote:
> > > On 1/21/26 9:23 AM, Ryan Roberts wrote:
> > But it looks like all the higher-level users will only ever unplug at the same
> > granularity that was plugged in (I might be wrong but that's the sense I get).
> > 
> > arm64 adds the constraint that it won't unplug any memory that was present at
> > boot - see prevent_bootmem_remove_notifier().
> > 
> > So in practice this is probably safe, though perhaps brittle.
> > 
> > Some options:
> > 
> >   - leave it as is and worry about it if/when something shifts and hits the
> >     problem.
> 
> Seems like the simplest way :-)
> 
> >   - Enhance prevent_bootmem_remove_notifier() to reject unplugging memory blocks
> >     whose boundaries are within leaf mappings.
> 
> I don't quite get why we should enhance prevent_bootmem_remove_notifier().
> If I read the code correctly, it just simply rejects offlining boot memory.
> Offlining a single memory block is fine. If you check the boundaries there,
> will it prevent offlining a single memory block?
> 
> I think you would need to enhance try_remove_memory() instead. But the kernel
> may unmap the linear mapping in memory-block granularity if an altmap is used.
> So you would need an extra page table walk, with the start and size of the
> unplugged DIMM, before removing the memory, to tell whether the boundaries
> fall within leaf mappings, IIUC. Can it be done in arch_remove_memory()? It
> seems not, because arch_remove_memory() may be called at memory block
> granularity if an altmap is used.
> 
> >   - For non-bbml2_noabort systems, map hotplug memory with a new flag to ensure
> >     that leaf mappings are always <= memory_block_size_bytes(). For
> >     bbml2_noabort, split at the block boundaries before doing the unmapping.
> 
> The linear mapping would then be at most 128M (with a 4K page size), which
> sounds suboptimal IMHO.
> 
> > Given I don't think this can happen in practice, probably the middle option is
> > the best? There is no runtime impact and it will give us a warning if it ever
> > does happen in future.
> > 
> > What do you think?
> 
> I agree it can't happen in practice, so why not just take option #1 given
> the complexity added by option #2?

It still looks broken in the case where a region mapped with the
contiguous bit is then unmapped. The sequence seems to iterate over
each contiguous PTE, zapping the entry and doing the TLBI while the
other entries in the contiguous range remain intact. I don't think
that's sufficient to guarantee that you don't have stale TLB entries
once you've finished processing the whole range.
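
(Roughly the sequence as I read it; a sketch with illustrative helper
names such as __pte_clear() and __tlbi_va(), not the actual arm64 code:)

	/*
	 * Per-PTE teardown of a 16-entry contiguous range: each entry
	 * is zapped and invalidated while its siblings still have the
	 * contiguous bit set.
	 */
	for (i = 0; i < CONT_PTES; i++, addr += PAGE_SIZE, ptep++) {
		__pte_clear(ptep);	/* zap one entry           */
		__tlbi_va(addr);	/* invalidate just that VA */
		/*
		 * Window: the walker can still consume one of the
		 * remaining contiguous entries and allocate a TLB
		 * entry covering VAs we have already invalidated.
		 */
	}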

For example, imagine you have an L1 TLB that only supports 4k entries
and an L2 TLB that supports 64k entries. Let's say that the contiguous
range is mapped by pte0 ... pte15 and we've zapped and invalidated
pte0 ... pte14. At that point, I think the hardware is permitted to use
the last remaining contiguous pte (pte15) to allocate a 64k entry in the
L2 TLB covering the whole range. A (speculative) walk via one of the
virtual addresses translated by pte0 ... pte14 could then hit that entry
and fill a 4k entry into the L1 TLB. So at the end of the sequence, you
could presumably still access the first 60k of the range thanks to stale
entries in the L1 TLB?
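
(For contrast, a sketch of an ordering that would avoid the window,
again with illustrative helper names: take every entry in the
contiguous block out of the page table before issuing any
invalidation, then invalidate the whole range once:)

	/*
	 * 1. Clear all 16 entries first, so no live PTE with the
	 *    contiguous bit remains visible to the walker.
	 */
	for (i = 0; i < CONT_PTES; i++)
		__pte_clear(ptep + i);

	/*
	 * 2. Then invalidate the whole 64k range in one go. The walker
	 *    can no longer refill an entry (4k or 64k) for any VA in
	 *    the range, so nothing stale survives in either TLB level.
	 */
	flush_tlb_kernel_range(addr, addr + CONT_PTE_SIZE);

That's essentially break-before-make applied to the whole contiguous
block rather than to each PTE individually.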

So it looks broken to me. What do you think? If you agree, then let's
fix this problem first before adding the new /proc/meminfo stuff.

Will
