Message-ID: <f48ca859-c65e-9b2d-2d33-b86edc77cebd@gmail.com>
Date:   Thu, 19 Jan 2023 14:33:35 -0800
From:   Doug Berger <opendmb@...il.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Jonathan Corbet <corbet@....net>,
        Mike Rapoport <rppt@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Neeraj Upadhyay <quic_neeraju@...cinc.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        Muchun Song <songmuchun@...edance.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>,
        KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        David Hildenbrand <david@...hat.com>,
        Oscar Salvador <osalvador@...e.de>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks

On 1/4/2023 7:43 AM, Mel Gorman wrote:
> On Wed, Dec 14, 2022 at 04:17:35PM -0800, Doug Berger wrote:
>> On 11/18/2022 9:05 AM, Mel Gorman wrote:
>>> On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:

[snip]

>> I was not familiar with page_alloc.shuffle, but it may very well have a role
>> to play here.
>>
> 
> It almost certainly does because unlike zones or CMA, it affects how
> free lists are arranged. IIRC, the original purpose was about improving
> performance of high-speed direct-mapped cache but it also serves a
> purpose in this case -- randomising allocations between two channels.
> It's still not perfect interleaving but better than none.

Agreed.
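
For readers following along, the mechanism is small: when a page is 
freed back to the buddy allocator, a random bit decides whether it is 
queued at the head or the tail of its free list, so the order in which 
pages come back out is randomized rather than strictly address-ordered. 
A rough userspace sketch of the idea behind shuffle_pick_tail() in 
mm/shuffle.c (not the kernel's actual code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t rand_bits;	/* cached batch of random bits */
static int rand_bits_left;

/* One random bit per free; a 64-bit batch is cached so the PRNG is
 * not hit on every call, mirroring the approach in mm/shuffle.c. */
static bool shuffle_pick_tail(void)
{
	if (rand_bits_left == 0) {
		rand_bits = ((uint64_t)random() << 32) | (uint32_t)random();
		rand_bits_left = 64;
	}
	rand_bits_left--;
	return (rand_bits >> rand_bits_left) & 1;
}

int main(void)
{
	int tails = 0;

	srandom(1);
	for (int i = 0; i < 1000; i++)
		tails += shuffle_pick_tail();
	printf("%d of 1000 frees would queue at the list tail\n", tails);
	return 0;
}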

>>>>> A
>>>>> major limitation of ZONE_MOVABLE is that there is no way of controlling
>>>>> access from userspace to restrict the high-speed memory to a designated
>>>>> application, only to all applications in general. The primary interface
>>>>> to control access to memory with different characteristics is mempolicies
>>>>> which is NUMA orientated, not zone orientated. So, if there is a special
>>>>> application that requires exclusive access, it's very difficult to configure
>>>>> based on zones.  Furthermore, page table pages mapping data located in the
>>>>> high-speed region are stored in the slower memory which potentially impacts
>>>>> the performance if the working set of the application exceeds TLB reach.
>>>>> Finally, while there is mention that Broadcom may have some special
>>>>> interface to determine what applications can use the high-speed region,
>>>>> it's hardware-specific as opposed to something that belongs in the core mm.
>>>>>
>>>>> I agree that keeping the high-speed memory in a local node and using "sticky"
>>>>> pageblocks or CMA has limitations of its own but in itself, that does not
>>>>> justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
>>>>> multiple controllers with equal distance and bandwidth (if I'm reading it
>>>>> correctly) but places them in different zones.... that's just a bit weird if
>>>>> there are no other addressing limitations. It's not obvious why ARM would do
>>>>> that, but it also does not matter because it shouldn't be a core mm concern.
>>>>
>>>> There appears to be some confusion regarding my explanation of multiple
>>>> memory controllers on a device like the BCM7278. There is no inherent
>>>> performance difference between the two memory controllers and their attached
>>>> DRAM. They merely provide the opportunity to perform memory accesses in
>>>> parallel for different physical address ranges. The physical address ranges
>>>> were selected by the SoC designers for reasons only known to them, but I'm
>>>> sure they had no consideration of zones in their decision making. The
>>>> selection of zones remains an artifact of the design of Linux.
>>>>
>>>
>>> Ok, so the channels are equal but the channels are not interleaved in
>>> hardware so basically you are trying to implement software-based memory
>>> channel interleaving?
>>
>> I suppose that could be a fair characterization of the objective, though the
>> approach taken here is very much a "poor man's" approach that attempts to
>> improve things without requiring the "heavy lifting" required for a more
>> complete solution.
>>
> 
> It's still unfortunate that the concept of zones being primarily about
> addressing or capability limitations changes.

Arguably, the ZONE_MOVABLE zone continues to be about a capability 
limitation (i.e. the page allocator cannot use the zone to satisfy 
requests for non-movable/pinnable memory). This limitation has value 
in several use cases. hugetlbfs benefits because data can be migrated 
out of the way, compacting free memory into higher-order free pages. 
Memory hotplug users benefit because data can be moved off memory 
before it is removed from the system. A "reusable" reserved memory 
implementation could benefit because data can be moved out of the 
range when the software that owns the reservation reclaims it.

A follow-on attribute of this limitation is that the zone is 
effectively prioritized for user-space allocations: the virtual 
address abstraction gives user-space pages physical address 
independence, which is what makes them movable. It is this 
prioritization, rather than the actual movability of the data, that 
matters for the multi-channel memory without hardware interleaving use 
case discussed here.
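
Concretely, that prioritization falls out of the gfp flags: anonymous 
user faults allocate with GFP_HIGHUSER_MOVABLE, and zone selection 
only considers ZONE_MOVABLE for requests that carry __GFP_MOVABLE. A 
rough userspace sketch of the idea (the real logic is gfp_zone() in 
include/linux/gfp.h; the flag value below is made up):

#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE };

#define __GFP_MOVABLE 0x08u	/* illustrative value only */

/* Simplified stand-in for the kernel's gfp_zone(): only movable
 * requests (e.g. GFP_HIGHUSER_MOVABLE anonymous user pages) may be
 * satisfied from ZONE_MOVABLE; everything else tops out at
 * ZONE_NORMAL or below. */
static enum zone_type gfp_zone_sketch(unsigned int gfp_flags)
{
	return (gfp_flags & __GFP_MOVABLE) ? ZONE_MOVABLE : ZONE_NORMAL;
}

int main(void)
{
	printf("anon user fault -> zone %d\n",
	       gfp_zone_sketch(__GFP_MOVABLE));
	printf("kernel alloc    -> zone %d\n", gfp_zone_sketch(0));
	return 0;
}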

The Designated Movable Blocks proposal is a generic mechanism for 
adding flexibility in determining which memory is included in the 
ZONE_MOVABLE zone, and as a result it could support any of these use 
cases. The memory hotplug developers proposed a similar mechanism 
early in the development of what ultimately became the movable_node 
implementation.
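
For illustration, the knob exercised in the tests below is just the 
extended movablecore= syntax from this series, movablecore=<size>@<start> 
with comma-separated entries. With made-up controller base addresses 
of 0x40000000 and 0x300000000, an evenly split movable region might 
look like:

movablecore=300M@0x40000000,300M@0x300000000 page_alloc.shuffle=1

(The addresses above are hypothetical; real values for a given board 
come from its memory map.)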

> It's also difficult to use as
> any user of it has to be very aware of the memory channel configuration of
> the machine and know how to match addresses to channels. Information from
> zoneinfo on start_pfns, spanned ranges and the like become less useful. It's
> relatively minor but splitting the zones also means there is a performance
> hit during compaction because pageblock_pfn_to_page is more expensive.

I agree that it requires special knowledge of the system to configure 
for the multi-channel memory without hardware interleaving use case, 
but that is a task for the system administrator who wants the 
performance benefit of this specific use case. Users don't actually 
need to be aware of it, and such a configuration never occurs 
automatically on a system that has not explicitly opted in. The memory 
hotplug developers were able to avoid this complexity using ACPI SRAT 
tables, which is why they withdrew their early proposed command line 
arguments, but those features are not currently available to Broadcom 
customers.

[snip]

>> What is of interest to Broadcom customers is to better distribute user space
>> accesses across each memory controller to improve the bandwidth available to
>> user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with
>> 1GB of memory on each memory controller will place the 1GB on the low
>> address memory controller in ZONE_DMA and the 1GB on the high address memory
>> controller in ZONE_NORMAL. With this layout movable allocation requests will
>> only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL
>> (high memory controller) is sufficiently depleted of free memory.
>>
>> Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
>> behavior does not improve this situation other than forcing more kernel
>> allocations off of the high memory controller. User space allocations are
>> even more likely to be on the high memory controller.
>>
> 
> But it's a weak promise that interleaving will happen. If only a portion
> of ZONE_MOVABLE is used, it might still be all on the same channel. This
> might improve over time if enough memory was used and the system was up
> for long enough.

A "lightly" loaded system is unlikely to see much, if any, benefit from 
this configuration, but such a system has much less competition for 
resources. As noted previously, it is the more "heavily" loaded system 
with multiple parallel user space intensive processes that can benefit 
by reducing the memory bottleneck created by the biasing of user space 
allocations to higher addressed zones. The page_alloc.shuffle feature 
does appear to remove the need for time to pass.

> 
>> The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be
>> located on the low memory controller to make it easier for user space
>> allocations to land on the low memory controller. If ZONE_MOVABLE is only
>> placed on the low memory controller then user space allocations can land in
>> ZONE_NORMAL on the high memory controller, but only through fallback after
>> ZONE_MOVABLE is sufficiently depleted of free memory which is just the
>> reverse of the existing situation. The Designated Movable Block mechanism
>> allows ZONE_MOVABLE memory to be located on each memory controller so that
>> user space allocations have equal access to each memory controller until the
>> ZONE_MOVABLE memory is depleted and fallback to other zones occurs.
>>
>> To my knowledge Broadcom customers that are currently using the Designated
>> Movable Block mechanism are relying on the somewhat random starting and
>> stopping of parallel user space processes to produce a more random
>> distribution of ZONE_MOVABLE allocations across multiple memory controllers,
>> but the page_alloc.shuffle mechanism seems like it would be a good addition
>> to promote this randomness. Even better, it appears that page_alloc.shuffle
>> is already enabled in the GKI configuration.
>>
> 
> The "random starting and stopping of parallel user space processes" is
> required for the mechanism to work. It's arbitrary and unknown if the
> interleaving happens where as shuffle has an immediate, if random, impact.

Yes, page_alloc.shuffle does improve things.

> 
>> You are of course correct that the access patterns make all of the
>> difference and it is almost certain that one memory controller or the other
>> will be saturated at any given time, but the intent is to increase the
>> opportunity to use more of the total bandwidth made available by the
>> multiple memory controllers.
>>
> 
> And shuffle should also provide that opportunity except it's trivial
> to configure and only requires the user to know the memory channels are
> not interleaved.

The problem with page_alloc.shuffle on its own is that the shuffling can 
only occur within a zone. As noted for the BCM7278 SoC described above, 
the low memory controller contains only ZONE_DMA memory and the high 
memory controller contains only ZONE_NORMAL memory. Shuffling the pages 
within a zone will not improve the random placement of allocations 
across the multiple memory controllers unless a zone spans all memory 
controllers. The creation of Designated Movable Blocks allows a 
ZONE_MOVABLE zone to be created that spans all memory controllers in the 
system with an equivalent footprint on each.
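
To make that concrete, an illustrative (not to scale) picture of the 
2GB layout described above, with a Designated Movable Block carved out 
of each controller:

MEMC0 (low) : [ ZONE_DMA    ][ DMB -> ZONE_MOVABLE ]
MEMC1 (high): [ ZONE_NORMAL ][ DMB -> ZONE_MOVABLE ]

Shuffle operates on each zone's free lists, so only the ZONE_MOVABLE 
zone, which has a footprint on both controllers, can randomize 
allocations across them.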

[snip]

>> I experimented with a
>> Broadcom BCM7278 system with 1GB on each memory controller (i.e. 2GB total
>> memory). The buffers were made large to render data caching meaningless and
>> to require several pages to be allocated to populate the buffer.
>>
>> With V3 of this patch set applied to a 6.1-rc1 kernel I observed these
>> results:
>> With no movablecore kernel parameter specified:
>> # time /tmp/thread_test
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>> Thread 4 returns: 0
>>
>> real    0m4.047s
>> user    0m14.183s
>> sys     0m1.215s
>>
>> With this additional kernel parameter "movablecore=600M":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m4.068s
>> user    0m14.402s
>> sys     0m1.117s
>>
>> With this additional kernel parameter "movablecore=600M@...0000000":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m4.010s
>> user    0m13.979s
>> sys     0m1.070s
>>
>> However, with these additional kernel parameters
>> "movablecore=300M@...0000000,300M@...20000000 page_alloc.shuffle=1":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m3.173s
>> user    0m11.175s
>> sys     0m1.067s
>>
> 
> What were the results with just
> "movablecore=300M@...0000000,300M@...20000000" on its own and
> page_alloc.shuffle=1 on its own?
> 
> For shuffle on its own, my expectations are that the results will be
> variable, sometimes good and sometimes bad, because it's at the mercy of
> the randomisation. It might be slightly improved if the initial top-level
> lists were ordered "1, n, 2, n-1, 3, n-2" optionally in __shuffle_zone or
> if shuffle_pick_tail was aware of the memory channels but a lot more work
> to implement.

With the kernel parameters
"movablecore=300M@...0000000,300M@...20000000"
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m3.562s
user    0m12.386s
sys     0m1.176s

The "movablecore=300M@...0000000,300M@...20000000" result is worse than 
when combined with the shuffle parameter, but may improve over time due 
to "random starting and stopping of parallel user space processes".

With the kernel parameters
"page_alloc.shuffle=1"
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m4.056s
user    0m14.680s
sys     0m1.060s

The shuffle-on-its-own result is no better than with no movablecore 
parameter because all of ZONE_NORMAL is on the high memory controller, 
so pages never get shuffled between controllers.
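
For reference, a minimal sketch of the kind of test used above 
(hypothetical; the actual /tmp/thread_test source is not in this 
thread): four threads each repeatedly fill a buffer much larger than 
the CPU caches, so run time is dominated by DRAM bandwidth. Built with 
something like gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define BUFSZ	(64UL << 20)	/* large enough to defeat caching */
#define PASSES	32

static void *worker(void *arg)
{
	char *buf = malloc(BUFSZ);

	(void)arg;
	if (!buf)
		return (void *)1;
	for (int i = 0; i < PASSES; i++)
		memset(buf, i, BUFSZ);	/* stream writes through DRAM */
	free(buf);
	return (void *)0;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++) {
		void *ret;

		pthread_join(t[i], &ret);
		printf("Thread %d returns: %ld\n", i, (long)ret);
	}
	return 0;
}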

Happy New Year!
     Doug
