Message-ID: <6dd16954-1b63-4179-9666-564d0a3090b6@os.amperecomputing.com>
Date: Tue, 27 Jan 2026 16:50:30 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, Will Deacon <will@...nel.org>,
 Anshuman Khandual <anshuman.khandual@....com>
Cc: catalin.marinas@....com, cl@...two.org,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo



On 1/27/26 12:57 AM, Ryan Roberts wrote:
> On 26/01/2026 20:50, Yang Shi wrote:
>>
>> On 1/26/26 10:58 AM, Will Deacon wrote:
>>> On Mon, Jan 26, 2026 at 09:55:06AM -0800, Yang Shi wrote:
>>>> On 1/26/26 6:14 AM, Will Deacon wrote:
>>>>> On Thu, Jan 22, 2026 at 01:59:54PM -0800, Yang Shi wrote:
>>>>>> On 1/22/26 6:43 AM, Ryan Roberts wrote:
>>>>>>> On 21/01/2026 22:44, Yang Shi wrote:
>>>>>>>> On 1/21/26 9:23 AM, Ryan Roberts wrote:
>>>>>>> But it looks like all the higher level users will only ever unplug at the
>>>>>>> same granularity that was plugged in (I might be wrong but that's the sense
>>>>>>> I get).
>>>>>>>
>>>>>>> arm64 adds the constraint that it won't unplug any memory that was present at
>>>>>>> boot - see prevent_bootmem_remove_notifier().
>>>>>>>
>>>>>>> So in practice this is probably safe, though perhaps brittle.
>>>>>>>
>>>>>>> Some options:
>>>>>>>
>>>>>>>      - leave it as is and worry about it if/when something shifts and hits the
>>>>>>>        problem.
>>>>>> Seems like the most simple way :-)
>>>>>>
>>>>>>>      - Enhance prevent_bootmem_remove_notifier() to reject unplugging memory
>>>>>>>        blocks whose boundaries are within leaf mappings.
>>>>>> I don't quite get why we should enhance prevent_bootmem_remove_notifier().
>>>>>> If I read the code correctly, it simply rejects offlining boot memory.
>>>>>> Offlining a single memory block is fine. If you check the boundaries there,
>>>>>> will it prevent offlining a single memory block?
>>>>>>
>>>>>> I think you need to enhance try_remove_memory(). But the kernel may unmap the
>>>>>> linear mapping by memory blocks if altmap is used. So you would need an extra
>>>>>> page table walk, with the start and the size of the unplugged DIMM, before
>>>>>> removing the memory to tell whether the boundaries are within leaf mappings or
>>>>>> not, IIUC. Can it be done in arch_remove_memory()? It seems not, because
>>>>>> arch_remove_memory() may be called at memory block granularity if altmap is
>>>>>> used.
>>>>>>
>>>>>>>      - For non-bbml2_noabort systems, map hotplug memory with a new flag to
>>>>>>>        ensure that leaf mappings are always <= memory_block_size_bytes(). For
>>>>>>>        bbml2_noabort, split at the block boundaries before doing the
>>>>>>>        unmapping.
>>>>>> The linear mapping would be at most 128M (with 4K page size), which sounds
>>>>>> suboptimal IMHO.
>>>>>>
>>>>>>> Given I don't think this can happen in practice, probably the middle option
>>>>>>> is the best? There is no runtime impact and it will give us a warning if it
>>>>>>> ever does happen in future.
>>>>>>>
>>>>>>> What do you think?
>>>>>> I agree it can't happen in practice, so why not just take option #1 given
>>>>>> the complexity added by option #2?
>>>>> It still looks broken in the case that a region that was mapped with the
>>>>> contiguous bit is then unmapped. The sequence seems to iterate over
>>>>> each contiguous PTE, zapping the entry and doing the TLBI while the
>>>>> other entries in the contiguous range remain intact. I don't think
>>>>> that's sufficient to guarantee that you don't have stale TLB entries
>>>>> once you've finished processing the whole range.
>>>>>
>>>>> For example, imagine you have an L1 TLB that only supports 4k entries
>>>>> and an L2 TLB that supports 64k entries. Let's say that the contiguous
>>>>> range is mapped by pte0 ... pte15 and we've zapped and invalidated
>>>>> pte0 ... pte14. At that point, I think the hardware is permitted to use
>>>>> the last remaining contiguous pte (pte15) to allocate a 64k entry in the
>>>>> L2 TLB covering the whole range. A (speculative) walk via one of the
>>>>> virtual addresses translated by pte0 ... pte14 could then hit that entry
>>>>> and fill a 4k entry into the L1 TLB. So at the end of the sequence, you
>>>>> could presumably still access the first 60k of the range thanks to stale
>>>>> entries in the L1 TLB?
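
For illustration, a minimal sketch of the per-entry teardown pattern described
above (unmap_contpte_block_racy is a hypothetical helper, not the mainline
arm64 unmap code): each PTE of a contiguous block is cleared and invalidated
on its own, so until the final entry is cleared the remaining valid entries
can still seed a larger TLB entry covering the whole block.

/*
 * Hypothetical sketch of the problematic sequence, assuming a single
 * contiguous-PTE block starting at addr; not the mainline helper.
 */
static void unmap_contpte_block_racy(pte_t *ptep, unsigned long addr)
{
	int i;

	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
		pte_clear(&init_mm, addr, ptep);
		/*
		 * Per-page TLBI: the not-yet-cleared entries of the same
		 * contiguous block remain valid, so the walker may still
		 * allocate a block-sized TLB entry (e.g. 64K) that covers
		 * addresses already invalidated above.
		 */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}
}
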
>>>> It is a little hard for me to understand how a (speculative) walk could
>>>> happen by the time we reach here.
>>>>
>>>> Before we reach here, IIUC kernel has:
>>>>
>>>>    * offlined all the page blocks. This means they are freed and isolated from
>>>> the buddy allocator; even a pfn walk (for example, compaction) should not
>>>> reach them at all.
>>>>    * vmemmap has been eliminated, so no struct page is available.
>>>>
>>>>   From the kernel's point of view, they are unreachable now. Did I miss and/or
>>>> misunderstand something?
>>> I'm talking about hardware speculation. It's mapped as normal memory so
>>> the CPU can speculate from it. We can't really reason about the bounds
>>> of that, especially in a world with branch predictors and history-based
>>> prefetchers.
>> OK. If it could happen, I think the suggestions from you and Ryan should work IIUC:
>>
>> Clear all the entries in the cont range, then invalidate the TLB for the whole range.
>>
>> I can come up with a patch, or would Ryan like to take it?
> Hi,
>
> There are 2 separate issues that have been raised here and I think we are
> conflating them a bit...
>
>
> 1: The contiguous range teardown + tlbi issue that Will raised. That is
> definitely a problem and needs to be fixed. (though I think prior to the BBML2
> dynamic linear block mapping support it would be rare in practice; probably it
> would only affect cont-pmd mappings for 16K and 64K base page configs. With
> BBML2 dynamic linear block mapping support, this can happen for contiguous
> mappings at all levels with all base page sizes).
>
> I roughed out a patch to hoist out the tlbis and issue them as a single range
> invalidation after clearing all the pgtable entries. I think this will be MUCH
> faster and will solve the contiguous issue too. The one catch is that this only
> works for the linear map, and the same helpers are used for the vmemmap. For the
> latter we also free the memory, so the tlbis need to happen before the freeing.
> But vmemmap doesn't use contiguous mappings, so I've added a warning checking
> that and used a different scheme based on whether we are freeing or not.
>
> Anshuman has kindly agreed to knock the patch into shape and do the testing.
> Hopefully he can post shortly.
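
To make the proposed ordering concrete, here is a rough sketch of the batched
scheme described above (unmap_pte_range_batched is a hypothetical name and the
sketch is simplified to the PTE level only; the actual patch is Anshuman's to
post): all leaf entries are cleared first, one range TLBI then covers the whole
span, and any freeing of backing pages (the vmemmap case) happens only after
the invalidation.

/*
 * Hypothetical sketch of the batched teardown, assuming a PTE-level mapping
 * with the starting ptep supplied by the caller; not the actual patch.
 */
static void unmap_pte_range_batched(pte_t *ptep, unsigned long start,
				    unsigned long end, bool free_mapped)
{
	unsigned long addr;

	/* 1. Clear every leaf entry in the range; no per-entry TLBI. */
	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++)
		pte_clear(&init_mm, addr, ptep);

	/*
	 * 2. A single range invalidation covers the whole span, including any
	 *    contiguous-bit blocks, so no partially-invalidated state is left.
	 */
	flush_tlb_kernel_range(start, end);

	/*
	 * 3. For the vmemmap path, backing pages are freed only after the
	 *    TLBI so that speculation can no longer walk into freed memory.
	 */
	if (free_mapped) {
		/* free the now-unmapped backing pages here */
	}
}
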
>
>
> 2: hot-unplugging a range that starts or terminates in the middle of a large
> leaf mapping. The low-level hot-unplug implementation allows unplugging any
> range of memory as long as it is section-size aligned (128M). So theoretically
> you could have a 1G PUD leaf mapping and try to unplug 128M from the middle of
> it. In practice this doesn't happen because all the users of the hot-unplug code
> group memory into devices. If you add a range, you can only remove that same
> range. When adding, we will guarantee that the leaf mappings exactly map the
> range, so the same guarantee can be given for hot-remove.
>
> BUT, that feels fragile to me. I'd like to add a check in
> prevent_bootmem_remove_notifier() to ensure that the proposed unplug range is
> exactly covered by leaf mappings, and if it isn't, warn and reject. This will
> allow us to fail safe for a tiny amount of overhead (which will be made up for
> many, many times over by hoisting the tlbis and batching the barriers in 1.).
>
> Anshuman has also kindly agreed to put a patch together for that.
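
As a rough sketch of where such a check could sit (heavily simplified and
hypothetical; range_covered_by_leaf_mappings() stands in for a linear-map walk
that confirms the range starts and ends exactly on leaf-mapping boundaries,
and the rest of the existing notifier is elided):

/*
 * Heavily simplified sketch; most of the existing notifier is elided and
 * range_covered_by_leaf_mappings() is a hypothetical helper.
 */
static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
					   unsigned long action, void *data)
{
	struct memory_notify *arg = data;
	phys_addr_t start = PFN_PHYS(arg->start_pfn);
	phys_addr_t end = start + PFN_PHYS(arg->nr_pages);

	if (action != MEM_GOING_OFFLINE)
		return NOTIFY_OK;

	/* Proposed addition: fail safe if the range would split a leaf mapping. */
	if (WARN_ON(!range_covered_by_leaf_mappings(start, end)))
		return NOTIFY_BAD;

	/* ... existing checks rejecting removal of boot memory follow ... */
	return NOTIFY_OK;
}
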

Thanks for the update. Looking forward to seeing the patches from Anshuman
soon.

Thanks,
Yang

>
>
> Thanks,
> Ryan
>
>
>> Thanks,
>> Yang
>>
>>> Will

