Message-ID: <04f816f9-5533-4fe1-99b0-cd405caac485@os.amperecomputing.com>
Date: Mon, 26 Jan 2026 09:55:06 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Will Deacon <will@...nel.org>
Cc: Ryan Roberts <ryan.roberts@....com>, catalin.marinas@....com,
 cl@...two.org, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo



On 1/26/26 6:14 AM, Will Deacon wrote:
> On Thu, Jan 22, 2026 at 01:59:54PM -0800, Yang Shi wrote:
>> On 1/22/26 6:43 AM, Ryan Roberts wrote:
>>> On 21/01/2026 22:44, Yang Shi wrote:
>>>> On 1/21/26 9:23 AM, Ryan Roberts wrote:
>>> But it looks like all the higher level users will only ever unplug in the same
>>> granularity that was plugged in (I might be wrong but that's the sense I get).
>>>
>>> arm64 adds the constraint that it won't unplug any memory that was present at
>>> boot - see prevent_bootmem_remove_notifier().
>>>
>>> So in practice this is probably safe, though perhaps brittle.
>>>
>>> Some options:
>>>
>>>    - leave it as is and worry about it if/when something shifts and hits the
>>>      problem.
>> Seems like the most simple way :-)
>>
>>>    - Enhance prevent_bootmem_remove_notifier() to reject unplugging memory blocks
>>>      whose boundaries are within leaf mappings.
>> I don't quite get why we should enhance prevent_bootmem_remove_notifier().
>> If I read the code correctly, it simply rejects offlining boot memory.
>> Offlining a single memory block is fine. If you check the boundaries there,
>> will it prevent offlining a single memory block?
>>
>> I think you need to enhance try_remove_memory(). But the kernel may unmap
>> the linear mapping per memory block if altmap is used. So you would need an
>> extra page table walk over the start and size of the unplugged DIMM before
>> removing the memory, to tell whether the boundaries fall within leaf
>> mappings, IIUC. Can it be done in arch_remove_memory()? It seems not,
>> because arch_remove_memory() may be called at memory block granularity if
>> altmap is used.
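
(Just to illustrate what I meant by the extra page table walk, a rough
sketch of a boundary check; the function name is made up and a real check
would also have to consider contiguous (CONT_PTE/CONT_PMD) mappings:)

/*
 * Rough sketch only: returns true if 'addr' falls strictly inside a
 * block (leaf) mapping of the linear map, i.e. removing memory at this
 * boundary would have to split a leaf mapping.
 */
static bool addr_splits_leaf_mapping(unsigned long addr)
{
	pgd_t *pgdp = pgd_offset_k(addr);
	p4d_t *p4dp;
	pud_t *pudp;
	pmd_t *pmdp;
	pud_t pud;
	pmd_t pmd;

	if (pgd_none(READ_ONCE(*pgdp)))
		return false;

	p4dp = p4d_offset(pgdp, addr);
	if (p4d_none(READ_ONCE(*p4dp)))
		return false;

	pudp = pud_offset(p4dp, addr);
	pud = READ_ONCE(*pudp);
	if (pud_none(pud))
		return false;
	if (pud_leaf(pud))
		return !IS_ALIGNED(addr, PUD_SIZE);

	pmdp = pmd_offset(pudp, addr);
	pmd = READ_ONCE(*pmdp);
	if (pmd_leaf(pmd))
		return !IS_ALIGNED(addr, PMD_SIZE);

	return false;
}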
>>
>>>    - For non-bbml2_noabort systems, map hotplug memory with a new flag to ensure
>>>      that leaf mappings are always <= memory_block_size_bytes(). For
>>>      bbml2_noabort, split at the block boundaries before doing the unmapping.
>> The linear mapping would then be at most 128M (with 4K page size); that
>> sounds suboptimal IMHO.
>>
>>> Given I don't think this can happen in practice, probably the middle option is
>>> the best? There is no runtime impact and it will give us a warning if it ever
>>> does happen in future.
>>>
>>> What do you think?
>> I agree it can't happen in practice, so why not just take option #1 given
>> the complexity added by option #2?
> It still looks broken in the case that a region that was mapped with the
> contiguous bit is then unmapped. The sequence seems to iterate over
> each contiguous PTE, zapping the entry and doing the TLBI while the
> other entries in the contiguous range remain intact. I don't think
> that's sufficient to guarantee that you don't have stale TLB entries
> once you've finished processing the whole range.
>
> For example, imagine you have an L1 TLB that only supports 4k entries
> and an L2 TLB that supports 64k entries. Let's say that the contiguous
> range is mapped by pte0 ... pte15 and we've zapped and invalidated
> pte0 ... pte14. At that point, I think the hardware is permitted to use
> the last remaining contiguous pte (pte15) to allocate a 64k entry in the
> L2 TLB covering the whole range. A (speculative) walk via one of the
> virtual addresses translated by pte0 ... pte14 could then hit that entry
> and fill a 4k entry into the L1 TLB. So at the end of the sequence, you
> could presumably still access the first 60k of the range thanks to stale
> entries in the L1 TLB?
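
To make sure I'm reading the concern correctly, here is a minimal sketch of
the two orderings (the function and variable names are made up for
illustration; this is not the actual arm64 hot-remove code):

/*
 * Ordering being discussed: zap and invalidate one PTE of the
 * contiguous span at a time.
 */
static void zap_cont_range_per_pte(unsigned long addr, pte_t *ptep)
{
	int i;

	for (i = 0; i < CONT_PTES; i++) {
		pte_clear(&init_mm, addr + i * PAGE_SIZE, ptep + i);
		flush_tlb_kernel_range(addr + i * PAGE_SIZE,
				       addr + (i + 1) * PAGE_SIZE);
		/*
		 * The not-yet-cleared entries still have the contiguous
		 * bit set, so the walker can still install a large TLB
		 * entry covering the whole span and later fill stale
		 * small entries from it.
		 */
	}
}

/*
 * Alternative ordering: clear every entry in the contiguous span first,
 * then invalidate the whole range once nothing is live.
 */
static void zap_cont_range_batched(unsigned long addr, pte_t *ptep)
{
	int i;

	for (i = 0; i < CONT_PTES; i++)
		pte_clear(&init_mm, addr + i * PAGE_SIZE, ptep + i);

	flush_tlb_kernel_range(addr, addr + CONT_PTES * PAGE_SIZE);
}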

It is a little hard for me to understand how a (speculative) walk could
still happen by the time we reach this point.

Before we reach this point, IIUC the kernel has:

  * offlined all the page blocks. They are freed and isolated from the
buddy allocator, so even a pfn walk (for example, compaction) should not
reach them at all.
  * eliminated the vmemmap, so no struct page is available.

From the kernel's point of view, they are unreachable now. Did I miss and/or
misunderstand something?

Thanks,
Yang

>
> So it looks broken to me. What do you think? If you agree, then let's
> fix this problem first before adding the new /proc/meminfo stuff.
>
> Will

