Message-ID: <67570fb2-bde3-43e8-8661-ab62444b2626@arm.com>
Date: Tue, 27 Jan 2026 08:57:26 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Yang Shi <yang@...amperecomputing.com>, Will Deacon <will@...nel.org>,
 Anshuman Khandual <anshuman.khandual@....com>
Cc: catalin.marinas@....com, cl@...two.org,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo

On 26/01/2026 20:50, Yang Shi wrote:
> 
> 
> On 1/26/26 10:58 AM, Will Deacon wrote:
>> On Mon, Jan 26, 2026 at 09:55:06AM -0800, Yang Shi wrote:
>>>
>>> On 1/26/26 6:14 AM, Will Deacon wrote:
>>>> On Thu, Jan 22, 2026 at 01:59:54PM -0800, Yang Shi wrote:
>>>>> On 1/22/26 6:43 AM, Ryan Roberts wrote:
>>>>>> On 21/01/2026 22:44, Yang Shi wrote:
>>>>>>> On 1/21/26 9:23 AM, Ryan Roberts wrote:
>>>>>> But it looks like all the higher level users will only ever unplug in the
>>>>>> same
>>>>>> granularity that was plugged in (I might be wrong but that's the sense I
>>>>>> get).
>>>>>>
>>>>>> arm64 adds the constraint that it won't unplug any memory that was present at
>>>>>> boot - see prevent_bootmem_remove_notifier().
>>>>>>
>>>>>> So in practice this is probably safe, though perhaps brittle.
>>>>>>
>>>>>> Some options:
>>>>>>
>>>>>>     - leave it as is and worry about it if/when something shifts and hits the
>>>>>>       problem.
>>>>> Seems like the most simple way :-)
>>>>>
>>>>>>     - Enhance prevent_bootmem_remove_notifier() to reject unplugging
>>>>>> memory blocks
>>>>>>       whose boundaries are within leaf mappings.
>>>>> I don't quite get why we should enhance prevent_bootmem_remove_notifier().
>>>>> If I read the code correctly, it just rejects offlining boot memory.
>>>>> Offlining a single memory block is fine. If you check the boundaries there,
>>>>> will it prevent offlining a single memory block?
>>>>>
>>>>> I think you need to enhance try_remove_memory(). But the kernel may unmap
>>>>> the linear mapping by memory blocks if altmap is used. So you would need an
>>>>> extra page table walk with the start and the size of the unplugged DIMM
>>>>> before removing the memory, to tell whether the boundaries are within leaf
>>>>> mappings or not, IIUC. Can it be done in arch_remove_memory()? It seems not,
>>>>> because arch_remove_memory() may be called at memory block granularity if
>>>>> altmap is used.
>>>>>
>>>>>>     - For non-bbml2_noabort systems, map hotplug memory with a new flag to
>>>>>> ensure
>>>>>>       that leaf mappings are always <= memory_block_size_bytes(). For
>>>>>>       bbml2_noabort, split at the block boundaries before doing the
>>>>>> unmapping.
>>>>> The linear mapping would be at most 128M (with 4K page size), which sounds
>>>>> suboptimal IMHO.
>>>>>
>>>>>> Given I don't think this can happen in practice, probably the middle
>>>>>> option is
>>>>>> the best? There is no runtime impact and it will give us a warning if it ever
>>>>>> does happen in future.
>>>>>>
>>>>>> What do you think?
>>>>> I agree it can't happen in practice, so why not just take option #1 given
>>>>> the complexity added by option #2?
>>>> It still looks broken in the case that a region that was mapped with the
>>>> contiguous bit is then unmapped. The sequence seems to iterate over
>>>> each contiguous PTE, zapping the entry and doing the TLBI while the
>>>> other entries in the contiguous range remain intact. I don't think
>>>> that's sufficient to guarantee that you don't have stale TLB entries
>>>> once you've finished processing the whole range.
>>>>
>>>> For example, imagine you have an L1 TLB that only supports 4k entries
>>>> and an L2 TLB that supports 64k entries. Let's say that the contiguous
>>>> range is mapped by pte0 ... pte15 and we've zapped and invalidated
>>>> pte0 ... pte14. At that point, I think the hardware is permitted to use
>>>> the last remaining contiguous pte (pte15) to allocate a 64k entry in the
>>>> L2 TLB covering the whole range. A (speculative) walk via one of the
>>>> virtual addresses translated by pte0 ... pte14 could then hit that entry
>>>> and fill a 4k entry into the L1 TLB. So at the end of the sequence, you
>>>> could presumably still access the first 60k of the range thanks to stale
>>>> entries in the L1 TLB?
>>> It is a little hard for me to understand how a (speculative) walk could
>>> happen by the time we reach here.
>>>
>>> Before we reach here, IIUC the kernel has:
>>>
>>>   * offlined all the page blocks. They are freed and isolated from the
>>> buddy allocator, so even a pfn walk (for example, compaction) should not
>>> reach them at all.
>>>   * eliminated the vmemmap, so no struct page is available.
>>>
>>> From the kernel's point of view, they are unreachable now. Did I miss
>>> and/or misunderstand something?
>> I'm talking about hardware speculation. It's mapped as normal memory so
>> the CPU can speculate from it. We can't really reason about the bounds
>> of that, especially in a world with branch predictors and history-based
>> prefetchers.
> 
> OK. If it could happen, I think the suggestions from you and Ryan should work IIUC:
> 
> Clear all the entries in the cont range, then invalidate TLB for the whole range.
> 
> I can come up with a patch, or would Ryan like to take it?

Hi,

There are 2 separate issues that have been raised here and I think we are
conflating them a bit...


1: The contiguous range teardown + tlbi issue that Will raised. That is
definitely a problem and needs to be fixed. (Though I think prior to the BBML2
dynamic linear block mapping support it would have been rare in practice; it
would probably only affect cont-pmd mappings for 16K and 64K base page configs.
With BBML2 dynamic linear block mapping support, this can happen for contiguous
mappings at all levels with all base page sizes.)

I roughed out a patch to hoist the tlbis out and issue a single ranged
invalidate after clearing all the pgtable entries. I think this will be MUCH
faster and will solve the contiguous issue too. The one catch is that this only
works for the linear map, and the same helpers are used for the vmemmap. For
the latter we also free the memory, so the tlbis need to happen before the
freeing. But the vmemmap doesn't use contiguous mappings, so I've added a
warning to check that, and use a different scheme based on whether we are
freeing or not. Roughly the shape of the idea is sketched below.
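To make that concrete, here is a minimal sketch only (not the actual patch;
the helper names like unmap_hotplug_pte_range() and free_hotplug_page_range()
and the exact pte accessors are just assumed from today's arch/arm64/mm/mmu.c):
clear every entry first, then issue one ranged tlbi for the linear map; when we
are also freeing the backing pages (vmemmap), keep the per-entry flush so the
invalidate still precedes the free, and warn if a contiguous mapping is ever
seen on that path.

static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
				    unsigned long end, bool free_mapped,
				    struct vmem_altmap *altmap)
{
	unsigned long start = addr;
	pte_t *ptep, pte;

	do {
		ptep = pte_offset_kernel(pmdp, addr);
		pte = ptep_get(ptep);
		if (pte_none(pte))
			continue;

		WARN_ON(!pte_present(pte));
		/* vmemmap is never cont-mapped; the linear map may be */
		WARN_ON(free_mapped && pte_cont(pte));

		pte_clear(&init_mm, addr, ptep);

		if (free_mapped) {
			/* must invalidate before handing the page back */
			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
			free_hotplug_page_range(pte_page(pte), PAGE_SIZE,
						altmap);
		}
	} while (addr += PAGE_SIZE, addr < end);

	/* linear map: one ranged invalidate once all entries are cleared */
	if (!free_mapped)
		flush_tlb_kernel_range(start, end);
}

The same pattern would apply one level up in the pmd/pud loops, with the flush
hoisted all the way out so the barriers get batched as well.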

Anshuman has kindly agreed to knock the patch into shape and do the testing.
Hopefully he can post shortly.


2: hot-unplugging a range that starts or terminates in the middle of a large
leaf mapping. The low-level hot-unplug implementation allows unplugging any
range of memory as long as it is section size aligned (128M). So theoretically
you could have a 1G PUD leaf mapping and try to unplug 128M from the middle of
it. In practice this doesn't happen because all the users of the hot-unplug code
group memory into devices. If you add a range, you can only remove that same
range. When adding, we guarantee that the leaf mappings exactly map the
range, so the same guarantee can be given for hot-remove.

BUT, that feels fragile to me. I'd like to add a check in
prevent_bootmem_remove_notifier() to ensure that the proposed unplug range is
exactly covered by leaf mappings, and if it isn't, warn and reject. This will
allow us to fail safe for a tiny amount of overhead (which will be made up for
many, many times over by hoisting the tlbis and batching the barriers in 1).
Something along the lines of the sketch below.
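For illustration only, a hypothetical helper (not the patch Anshuman will post;
it only covers the pud/pmd section cases and ignores the contiguous granules,
which would need the same treatment with CONT_PMD_SIZE/CONT_PTE_SIZE): a range
is exactly covered by leaf mappings iff neither of its two boundary addresses
lands inside a larger leaf.

/* Hypothetical: does a leaf mapping span across this linear-map address? */
static bool addr_splits_leaf_mapping(unsigned long addr)
{
	pgd_t *pgdp = pgd_offset_k(addr);
	p4d_t *p4dp;
	pud_t *pudp, pud;
	pmd_t *pmdp, pmd;

	if (pgd_none(READ_ONCE(*pgdp)))
		return false;

	p4dp = p4d_offset(pgdp, addr);
	if (p4d_none(READ_ONCE(*p4dp)))
		return false;

	pudp = pud_offset(p4dp, addr);
	pud = READ_ONCE(*pudp);
	if (pud_none(pud))
		return false;
	if (pud_sect(pud))
		return !IS_ALIGNED(addr, PUD_SIZE);

	pmdp = pmd_offset(pudp, addr);
	pmd = READ_ONCE(*pmdp);
	if (pmd_none(pmd))
		return false;
	if (pmd_sect(pmd))
		return !IS_ALIGNED(addr, PMD_SIZE);

	/* pte-level mappings can never span a section-aligned boundary */
	return false;
}

The notifier would then call this for __phys_to_virt() of both the start and
the end of the proposed range, and warn and return NOTIFY_BAD if either
boundary splits a leaf.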

Anshuman has also kindly agreed to put a patch together for that.


Thanks,
Ryan


> 
> Thanks,
> Yang
> 
>>
>> Will
> 

