Message-ID: <f48ca859-c65e-9b2d-2d33-b86edc77cebd@gmail.com>
Date: Thu, 19 Jan 2023 14:33:35 -0800
From: Doug Berger <opendmb@...il.com>
To: Mel Gorman <mgorman@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Jonathan Corbet <corbet@....net>,
Mike Rapoport <rppt@...nel.org>,
"Paul E. McKenney" <paulmck@...nel.org>,
Neeraj Upadhyay <quic_neeraju@...cinc.com>,
Randy Dunlap <rdunlap@...radead.org>,
Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
Muchun Song <songmuchun@...edance.com>,
Vlastimil Babka <vbabka@...e.cz>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...e.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Florian Fainelli <f.fainelli@...il.com>,
David Hildenbrand <david@...hat.com>,
Oscar Salvador <osalvador@...e.de>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
On 1/4/2023 7:43 AM, Mel Gorman wrote:
> On Wed, Dec 14, 2022 at 04:17:35PM -0800, Doug Berger wrote:
>> On 11/18/2022 9:05 AM, Mel Gorman wrote:
>>> On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:
[snip]
>> I was not familiar with page_alloc.shuffle, but it may very well have a role
>> to play here.
>>
>
> It almost certainly does because unlike zones or CMA, it affects how
> free lists are arranged. IIRC, the original purpose was about improving
> performance of high-speed direct-mapped cache but it also serves a
> purpose in this case -- randomising allocations between two channels.
> It's still not perfect interleaving but better than none.
Agreed.
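For anyone following along, the randomisation boils down to a per-page
coin flip applied when pages are freed back to a zone's free lists.
Lightly abridged from mm/shuffle.c (details may vary by kernel
version):

bool shuffle_pick_tail(void)
{
	static u64 rand;
	static u8 rand_bits;
	bool ret;

	/* refill the cached entropy 64 coin flips at a time */
	if (rand_bits == 0) {
		rand_bits = 64;
		rand = get_random_u64();
	}

	ret = rand & 1;
	rand_bits--;
	rand >>= 1;

	return ret;
}

/*
 * __free_one_page() uses the result to place a freed page at the
 * head or tail of its per-zone free list, so the randomisation
 * never crosses a zone boundary.
 */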
>>>>> A
>>>>> major limitation of ZONE_MOVABLE is that there is no way of controlling
>>>>> access from userspace to restrict the high-speed memory to a designated
>>>>> application, only to all applications in general. The primary interface
>>>>> to control access to memory with different characteristics is mempolicies
>>>>> which is NUMA orientated, not zone orientated. So, if there is a special
>>>>> application that requires exclusive access, it's very difficult to configure
>>>>> based on zones. Furthermore, page table pages mapping data located in the
>>>>> high-speed region are stored in the slower memory which potentially impacts
>>>>> the performance if the working set of the application exceeds TLB reach.
>>>>> Finally, while there is mention that Broadcom may have some special
>>>>> interface to determine what applications can use the high-speed region,
>>>>> it's hardware-specific as opposed to something that belongs in the core mm.
>>>>>
>>>>> I agree that keeping the high-speed memory in a local node and using "sticky"
>>>>> pageblocks or CMA has limitations of its own but in itself, that does not
>>>>> justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
>>>>> multiple controllers with equal distance and bandwidth (if I'm reading it
>>>>> correctly) but places them in different zones.... that's just a bit weird if
>>>>> there are no other addressing limitations. It's not obvious why ARM would do
>>>>> that, but it also does not matter because it shouldn't be a core mm concern.
>>>>
>>>> There appears to be some confusion regarding my explanation of multiple
>>>> memory controllers on a device like the BCM7278. There is no inherent
>>>> performance difference between the two memory controllers and their attached
>>>> DRAM. They merely provide the opportunity to perform memory accesses in
>>>> parallel for different physical address ranges. The physical address ranges
>>>> were selected by the SoC designers for reasons only known to them, but I'm
>>>> sure they had no consideration of zones in their decision making. The
>>>> selection of zones remains an artifact of the design of Linux.
>>>>
>>>
>>> Ok, so the channels are equal but the channels are not interleaved in
>>> hardware so basically you are trying to implement software-based memory
>>> channel interleaving?
>>
>> I suppose that could be a fair characterization of the objective, though the
>> approach taken here is very much a "poor man's" approach that attempts to
>> improve things without requiring the "heavy lifting" required for a more
>> complete solution.
>>
>
> It's still unfortunate that the concept of zones being primarily about
> addressing or capability limitations changes.
Arguably, ZONE_MOVABLE continues to be about a capability limitation
(i.e. the page allocator cannot use the zone to satisfy requests for
non-movable/pinnable memory). That limitation has value in several
use cases. hugetlbfs benefits by being able to move data to compact
free memory into higher-order free pages. Memory hotplug users
benefit by being able to move data before removing memory from the
system. A "reusable" reserved memory implementation could benefit by
being able to move data out of the range when it is reclaimed by the
software that owns the reservation.
A follow-on attribute of this capability limitation is that the zone
is prioritized for user-space allocations: the virtual address
abstraction of user space provides physical address independence,
which is what allows data to be moved. It is this attribute, rather
than the actual movability of the data, that matters for the
non-interleaved multi-channel memory use case discussed here.
The Designated Movable Blocks proposal is a generic mechanism for
flexibly determining which memory is included in the ZONE_MOVABLE
zone, and as a result it could support any of these use cases. The
memory hotplug developers proposed a similar mechanism early in the
development of what ultimately became the movable_node
implementation.
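For reference, the movablecore syntax extended by this series accepts
an optional physical base address per block and a comma-separated
list, as exercised in the experiments later in this message (base
addresses are shown generically here):

	movablecore=<size>                        (existing behavior)
	movablecore=<size>@<base>                 (one Designated Movable Block)
	movablecore=<size>@<base>,<size>@<base>   (e.g. one DMB per controller)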
> It's also difficult to use as
> any user of it has to be very aware of the memory channel configuration of
> the machine and know how to match addresses to channels. Information from
> zoneinfo on start_pfns, spanned ranges and the like become less useful. It's
> relatively minor but splitting the zones also means there is a performance
> hit during compaction because pageblock_pfn_to_page is more expensive.
I agree that configuring for non-interleaved multi-channel memory
requires special knowledge of the system, but that is a task for the
system administrator who wants the performance benefit of this
specific use case. The users don't actually need to be aware of it,
and such a configuration would never occur automatically on systems
that were not explicitly set up for it. The memory hotplug developers
were able to avoid this complexity using ACPI SRAT tables, which is
why they withdrew their early proposed command line arguments, but
those tables are not currently available to Broadcom customers.
[snip]
>> What is of interest to Broadcom customers is to better distribute user space
>> accesses across each memory controller to improve the bandwidth available to
>> user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with
>> 1GB of memory on each memory controller will place the 1GB on the low
>> address memory controller in ZONE_DMA and the 1GB on the high address memory
>> controller in ZONE_NORMAL. With this layout movable allocation requests will
>> only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL
>> (high memory controller) is sufficiently depleted of free memory.
>>
>> Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
>> behavior does not improve this situation other than forcing more kernel
>> allocations off of the high memory controller. User space allocations are
>> even more likely to be on the high memory controller.
>>
>
> But it's a weak promise that interleaving will happen. If only a portion
> of ZONE_MOVABLE is used, it might still be all on the same channel. This
> might improve over time if enough memory was used and the system was up
> for long enough.
A "lightly" loaded system is unlikely to see much, if any, benefit from
this configuration, but such a system has much less competition for
resources. As noted previously, it is the more "heavily" loaded system
with multiple parallel user space intensive processes that can benefit
by reducing the memory bottleneck created by the biasing of user space
allocations to higher addressed zones. The page_alloc.shuffle feature
does appear to remove the need for time to pass.
>
>> The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be
>> located on the low memory controller to make it easier for user space
>> allocations to land on the low memory controller. If ZONE_MOVABLE is only
>> placed on the low memory controller then user space allocations can land in
>> ZONE_NORMAL on the high memory controller, but only through fallback after
>> ZONE_MOVABLE is sufficiently depleted of free memory which is just the
>> reverse of the existing situation. The Designated Movable Block mechanism
>> allows ZONE_MOVABLE memory to be located on each memory controller so that
>> user space allocations have equal access to each memory controller until the
>> ZONE_MOVABLE memory is depleted and fallback to other zones occurs.
>>
>> To my knowledge Broadcom customers that are currently using the Designated
>> Movable Block mechanism are relying on the somewhat random starting and
>> stopping of parallel user space processes to produce a more random
>> distribution of ZONE_MOVABLE allocations across multiple memory controllers,
>> but the page_alloc.shuffle mechanism seems like it would be a good addition
>> to promote this randomness. Even better, it appears that page_alloc.shuffle
>> is already enabled in the GKI configuration.
>>
>
> The "random starting and stopping of parallel user space processes" is
> required for the mechanism to work. It's arbitrary and unknown if the
> interleaving happens, whereas shuffle has an immediate, if random, impact.
Yes, page_alloc.shuffle does improve things.
>
>> You are of course correct that the access patterns make all of the
>> difference and it is almost certain that one memory controller or the other
>> will be saturated at any given time, but the intent is to increase the
>> opportunity to use more of the total bandwidth made available by the
>> multiple memory controllers.
>>
>
> And shuffle should also provide that opportunity except it's trivial
> to configure and only requires the user to know the memory channels are
> not interleaved.
The problem with page_alloc.shuffle on its own is that the shuffling
can only occur within a zone. As noted for the BCM7278 SoC described
above, the low memory controller contains only ZONE_DMA memory and
the high memory controller contains only ZONE_NORMAL memory.
Shuffling the pages within a zone cannot improve the random placement
of allocations across multiple memory controllers unless a zone spans
all of the memory controllers. Designated Movable Blocks allow a
ZONE_MOVABLE zone to be constructed that spans all memory controllers
in the system with an equivalent footprint on each.
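To make that concrete for the 2GB BCM7278 example (sizes taken from
the experiment below; exact base addresses elided as above, and the
remainders are approximate):

Without DMBs:
	MEMC0 (low,  1GB): ZONE_DMA     <- kernel allocations, fallback only
	MEMC1 (high, 1GB): ZONE_NORMAL  <- user space biased here

With "movablecore=300M@<memc0 base>,300M@<memc1 base>":
	MEMC0: ~724M ZONE_DMA    + 300M ZONE_MOVABLE (DMB)
	MEMC1: ~724M ZONE_NORMAL + 300M ZONE_MOVABLE (DMB)

The single ZONE_MOVABLE zone now spans both controllers, so shuffling
its free lists randomizes movable allocations across them.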
[snip]
>> I experimented with a
>> Broadcom BCM7278 system with 1GB on each memory controller (i.e. 2GB total
>> memory). The buffers were made large to render data caching meaningless and
>> to require several pages to be allocated to populate the buffer.
>>
>> With V3 of this patch set applied to a 6.1-rc1 kernel I observed these
>> results:
>> With no movablecore kernel parameter specified:
>> # time /tmp/thread_test
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>> Thread 4 returns: 0
>>
>> real 0m4.047s
>> user 0m14.183s
>> sys 0m1.215s
>>
>> With this additional kernel parameter "movablecore=600M":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real 0m4.068s
>> user 0m14.402s
>> sys 0m1.117s
>>
>> With this additional kernel parameter "movablecore=600M@...0000000":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real 0m4.010s
>> user 0m13.979s
>> sys 0m1.070s
>>
>> However, with these additional kernel parameters
>> "movablecore=300M@...0000000,300M@...20000000 page_alloc.shuffle=1":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real 0m3.173s
>> user 0m11.175s
>> sys 0m1.067s
>>
>
> What were the results with just
> "movablecore=300M@...0000000,300M@...20000000" on its own and
> page_alloc.shuffle=1 on its own?
>
> For shuffle on its own, my expectations are that the results will be
> variable, sometimes good and sometimes bad, because it's at the mercy of
> the randomisation. It might be slightly improved if the initial top-level
> lists were ordered "1, n, 2, n-1, 3, n-2" optionally in __shuffle_zone or
> if shuffle_pick_tail was aware of the memory channels but a lot more work
> to implement.
With the kernel parameter "movablecore=300M@...0000000,300M@...20000000":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real 0m3.562s
user 0m12.386s
sys 0m1.176s

The "movablecore=300M@...0000000,300M@...20000000" result is worse than
when combined with the shuffle parameter, but may improve over time due
to "random starting and stopping of parallel user space processes".
With only the kernel parameter "page_alloc.shuffle=1":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real 0m4.056s
user 0m14.680s
sys 0m1.060s

Shuffle on its own is no better than having no movablecore parameter
at all, because all of ZONE_NORMAL is on the high memory controller,
so pages never get shuffled between controllers.
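For completeness, the test program is along the lines of the minimal
sketch below. This is a hypothetical reconstruction, not the actual
thread_test source (which was posted earlier in the thread); the
buffer size and iteration count are illustrative guesses.

/*
 * thread_test-style benchmark sketch: four threads each allocate
 * large buffers and repeatedly memcpy between them so that DRAM
 * bandwidth, not data caching, dominates the runtime.
 * Build with something like: gcc -O2 -pthread thread_test.c -o thread_test
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS 4
#define BUF_SIZE (64UL << 20)	/* 64MB: large enough to defeat caches */
#define NR_COPIES 64

static void *copy_worker(void *unused)
{
	char *src = malloc(BUF_SIZE);
	char *dst = malloc(BUF_SIZE);
	int i;

	if (!src || !dst)
		return (void *)1L;

	memset(src, 0x5a, BUF_SIZE);	/* fault in the source pages */
	for (i = 0; i < NR_COPIES; i++)
		memcpy(dst, src, BUF_SIZE);

	free(src);
	free(dst);
	return (void *)0L;
}

int main(void)
{
	pthread_t t[NR_THREADS];
	void *ret;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&t[i], NULL, copy_worker, NULL);
	for (i = 0; i < NR_THREADS; i++) {
		pthread_join(t[i], &ret);
		printf("Thread %d returns: %ld\n", i, (long)ret);
	}
	return 0;
}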
Happy New Year!
Doug