Message-ID: <76A363AB-13DD-4969-B58B-9A56BB4E409E@nvidia.com>
Date: Tue, 06 May 2025 08:59:43 -0400
From: Zi Yan <ziy@...dia.com>
To: Anshuman Khandual <anshuman.khandual@....com>
Cc: Juan Yescas <jyescas@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, tjmercier@...gle.com,
isaacmanjarres@...gle.com, surenb@...gle.com, kaleshsingh@...gle.com,
Vlastimil Babka <vbabka@...e.cz>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
David Hildenbrand <david@...hat.com>, Mike Rapoport <rppt@...nel.org>,
Minchan Kim <minchan@...nel.org>
Subject: Re: [PATCH v3] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
On 6 May 2025, at 2:53, Anshuman Khandual wrote:
> On 5/6/25 05:52, Juan Yescas wrote:
>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>> and this causes the CMA reservations to be larger than necessary.
>> This means that the system will have fewer available MIGRATE_UNMOVABLE and
>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fall back to them.
>>
>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>
>> For example, in ARM, the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>> -----------------------------------------------------------------------
>> 4KiB | 10 | 10 | 4KiB * (2 ^ 10) = 4MiB
>> 16KiB | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>
>> There are some extreme cases for the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
>> - CONFIG_HUGETLB_PAGE is NOT set
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>> ------------------------------------------------------------------------
>> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
>> 16KiB | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>
>> This affects the CMA reservations for the drivers. If a driver in a
>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>> reservation has to be 32MiB due to the alignment requirements:
>>
>> reserved-memory {
>> ...
>> cma_test_reserve: cma_test_reserve {
>> compatible = "shared-dma-pool";
>> size = <0x0 0x400000>; /* 4 MiB */
>> ...
>> };
>> };
>>
>> reserved-memory {
>> ...
>> cma_test_reserve: cma_test_reserve {
>> compatible = "shared-dma-pool";
>> size = <0x0 0x2000000>; /* 32 MiB */
>> ...
>> };
>> };
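For reference, the arithmetic behind the tables and the 4 MiB vs 32 MiB
reservations above can be sketched in plain userspace C. This is only an
illustration: the helper names cma_min_alignment_bytes() and
cma_min_reservation() are made up for this example and are not kernel APIs,
and the sketch assumes the alignment is PAGE_SIZE << pageblock_order, as in
the last column of the tables.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: minimum CMA alignment in bytes, assuming it equals
 * PAGE_SIZE << pageblock_order (the formula used in the tables above).
 */
static inline uint64_t cma_min_alignment_bytes(uint64_t page_size,
					       unsigned int pageblock_order)
{
	assert(pageblock_order < 32);
	return page_size << pageblock_order;
}

/*
 * Hypothetical helper: smallest size that is >= request and a multiple of
 * the alignment, i.e. the smallest reservation that satisfies it.
 */
static inline uint64_t cma_min_reservation(uint64_t request, uint64_t align)
{
	assert(align != 0);
	return (request + align - 1) / align * align;
}
```

With 16KiB pages, cma_min_alignment_bytes(16384, 11) is 32MiB, so a 4MiB
request grows to 32MiB, while with pageblock_order 7 the alignment is 2MiB
and the same request stays at 4MiB.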
>
> This indeed is a valid problem which reduces available memory for
> non-CMA page blocks on system required for general memory usage.
>
>>
>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>> allows setting the page block order on all architectures.
>> The maximum page block order will be given by
>> ARCH_FORCE_MAX_ORDER.
>>
>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>> value as ARCH_FORCE_MAX_ORDER. This will make sure that
>> current kernel configurations won't be affected by this
>> change. It is an opt-in change.
>
> Right.
>
>>
>> This patch makes it possible to have the same CMA alignment
>> requirements for large page sizes (16KiB, 64KiB) as in
>> 4KiB kernels by setting a lower pageblock_order.
>>
>> Tests:
>>
>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>> on 4k and 16k kernels.
>>
>> - Verified that Transparent Huge Pages work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>>
>> - Verified that dma-buf heaps allocations work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>
> The pageblock_order values 1, 7 and 10 are chosen to cover the entire
> possible range for ARCH_FORCE_MAX_ORDER. However, kernel CI should test
> all values in the range, because this now opens up different ranges for
> different platforms which were never tested earlier.
>
>>
>> Benchmarks:
>>
>> The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The
>> reason for the pageblock_order 7 is because this value makes the min
>> CMA alignment requirement the same as that in 4kb kernels (2MB).
>>
>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>> the # of instructions and page-faults on 16k kernels.
>> The benchmark was executed 10 times. The averages are below:
>>
>> # instructions | #page-faults
>> order 10 | order 7 | order 10 | order 7
>> --------------------------------------------------------
>> 13,891,765,770 | 11,425,777,314 | 220 | 217
>> 14,456,293,487 | 12,660,819,302 | 224 | 219
>> 13,924,261,018 | 13,243,970,736 | 217 | 221
>> 13,910,886,504 | 13,845,519,630 | 217 | 221
>> 14,388,071,190 | 13,498,583,098 | 223 | 224
>> 13,656,442,167 | 12,915,831,681 | 216 | 218
>> 13,300,268,343 | 12,930,484,776 | 222 | 218
>> 13,625,470,223 | 14,234,092,777 | 219 | 218
>> 13,508,964,965 | 13,432,689,094 | 225 | 219
>> 13,368,950,667 | 13,683,587,37 | 219 | 225
>> -------------------------------------------------------------------
>> 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
>>
>> There were 4.86% fewer #instructions when order was 7, in comparison
>> with order 10.
>>
>> 13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
>>
>> The average number of page faults for order 7 and order 10 was the same.
>>
>> These results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16kb kernels.
>>
>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>> on the 16k kernels with pageblock_order 7 and 10.
>>
>> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
>> -------------------------------------------------------------------
>> 15.8 | 16.4 | 0.6 | 3.80%
>> 16.4 | 16.2 | -0.2 | -1.22%
>> 16.6 | 16.3 | -0.3 | -1.81%
>> 16.8 | 16.3 | -0.5 | -2.98%
>> 16.6 | 16.8 | 0.2 | 1.20%
>> -------------------------------------------------------------------
>> 16.44 | 16.4 | -0.04 | -0.24% Averages
>>
>> The results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16kb kernels.
>>
>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>> Cc: Vlastimil Babka <vbabka@...e.cz>
>> Cc: Liam R. Howlett <Liam.Howlett@...cle.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>> Cc: David Hildenbrand <david@...hat.com>
>> CC: Mike Rapoport <rppt@...nel.org>
>> Cc: Zi Yan <ziy@...dia.com>
>> Cc: Suren Baghdasaryan <surenb@...gle.com>
>> Cc: Minchan Kim <minchan@...nel.org>
>> Signed-off-by: Juan Yescas <jyescas@...gle.com>
>> Acked-by: Zi Yan <ziy@...dia.com>
>> ---
>> Changes in v3:
>> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>> as per Matthew's suggestion.
>> - Update comments in pageblock-flags.h for pageblock_order
>> value when THP or HugeTLB are not used.
>>
>> Changes in v2:
>> - Add Zi's Acked-by tag.
>> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>> per Zi and Matthew suggestion so it is available to
>> all the architectures.
>> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>> ARCH_FORCE_MAX_ORDER is not available.
>>
>>
>>
>>
>> include/linux/pageblock-flags.h | 14 ++++++++++----
>> mm/Kconfig | 31 +++++++++++++++++++++++++++++++
>> 2 files changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>> index fc6b9c87cb0a..0c4963339f0b 100644
>> --- a/include/linux/pageblock-flags.h
>> +++ b/include/linux/pageblock-flags.h
>> @@ -28,6 +28,12 @@ enum pageblock_bits {
>> NR_PAGEBLOCK_BITS
>> };
>>
>> +#if defined(CONFIG_PAGE_BLOCK_ORDER)
>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>> +#else
>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>> +
>> #if defined(CONFIG_HUGETLB_PAGE)
>>
>> #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
>> @@ -41,18 +47,18 @@ extern unsigned int pageblock_order;
>> * Huge pages are a constant size, but don't exceed the maximum allocation
>> * granularity.
>> */
>> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>>
>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>
>> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>
>> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>
>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>> -#define pageblock_order MAX_PAGE_ORDER
>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>> +#define pageblock_order PAGE_BLOCK_ORDER
>>
>> #endif /* CONFIG_HUGETLB_PAGE */
>>
>
> These all look good.
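To make the effect of the hunk above explicit, here is a small userspace
sketch of how pageblock_order ends up being chosen in the constant-size
cases. The function pick_pageblock_order() and its parameters are
hypothetical names for this example only. The
CONFIG_HUGETLB_PAGE_SIZE_VARIABLE case, where pageblock_order is a runtime
variable, is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the constant-size branches in the patched
 * include/linux/pageblock-flags.h:
 *   HugeTLB enabled -> min(HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
 *   THP only        -> min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
 *   neither         -> PAGE_BLOCK_ORDER
 */
static inline unsigned int pick_pageblock_order(bool hugetlb, bool thp,
						unsigned int hugetlb_page_order,
						unsigned int hpage_pmd_order,
						unsigned int page_block_order)
{
	if (hugetlb)
		return min_uint(hugetlb_page_order, page_block_order);
	if (thp)
		return min_uint(hpage_pmd_order, page_block_order);
	return page_block_order;
}
```

For example, with THP only, an illustrative PMD order of 11 and
PAGE_BLOCK_ORDER set to 7, the result is 7, so the config value wins
whenever it is the smaller of the two.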
>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e113f713b493..c52be3489aa3 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -989,6 +989,37 @@ config CMA_AREAS
>>
>> If unsure, leave the default value "8" in UMA and "20" in NUMA.
>>
>> +#
>> +# Select this config option from the architecture Kconfig, if available, to set
>> +# the max page order for physically contiguous allocations.
>> +#
>> +config ARCH_FORCE_MAX_ORDER
>> + int
>
> ARCH_FORCE_MAX_ORDER needs to be defined here first before PAGE_BLOCK_ORDER
> can use it. But ARCH_FORCE_MAX_ORDER is already defined by various
> architectures in 'int' or 'int "<description>"' formats. So could there be
> a problem with this config being defined in both generic and platform code?
> Clearly ARCH_FORCE_MAX_ORDER still remains an arch-specific config.
>
> #git grep "config ARCH_FORCE_MAX_ORDER"
> arch/arc/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/arm/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/arm64/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/loongarch/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/m68k/Kconfig.cpu:config ARCH_FORCE_MAX_ORDER
> arch/mips/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/nios2/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/powerpc/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/sh/mm/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/sparc/Kconfig:config ARCH_FORCE_MAX_ORDER
> arch/xtensa/Kconfig:config ARCH_FORCE_MAX_ORDER
> mm/Kconfig:config ARCH_FORCE_MAX_ORDER
>
> arch/arc/
>
> config ARCH_FORCE_MAX_ORDER
> int "Maximum zone order"
>
> arch/arm/
>
> config ARCH_FORCE_MAX_ORDER
> int "Order of maximal physically contiguous allocations"
>
> arch/arm64/
>
> config ARCH_FORCE_MAX_ORDER
> int
> ...........
>
> arch/sparc/
>
> config ARCH_FORCE_MAX_ORDER
> int "Order of maximal physically contiguous allocations"
>
>> +
>> +# When ARCH_FORCE_MAX_ORDER is not defined, the default page block order is 10,
>
> Just wondering - why the default is 10 ?
For x86_64, MAX_PAGE_ORDER is 10. I wonder if it is related.
>
>> +# as per include/linux/mmzone.h.
>> +config PAGE_BLOCK_ORDER
>> + int "Page Block Order"
>> + range 1 10 if !ARCH_FORCE_MAX_ORDER
>
> Also why the range is restricted to 10 ?
>
>> + default 10 if !ARCH_FORCE_MAX_ORDER
>> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER
>> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER
>
> We still have the MAX_PAGE_ORDER which maps into ARCH_FORCE_MAX_ORDER
> when available or otherwise just falls back as 10.
>
> /* Free memory management - zoned buddy allocator. */
> #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
> #define MAX_PAGE_ORDER 10
> #else
> #define MAX_PAGE_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
> #endif
>
> Hence could PAGE_BLOCK_ORDER config description block be simplified as
>
> config PAGE_BLOCK_ORDER
> int "Page Block Order"
> range 1 MAX_PAGE_ORDER
> default MAX_PAGE_ORDER
Could this work? MAX_PAGE_ORDER is a macro defined in linux/mmzone.h.
Can Kconfig access it? I am not an expert on Kconfig.
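As far as I understand, Kconfig expressions only see Kconfig symbols and
constants, not C preprocessor macros, so MAX_PAGE_ORDER from mmzone.h would
not be visible there. If the range stays expressed through
ARCH_FORCE_MAX_ORDER as in the patch, a compile-time check on the C side
could still tie the two together. A minimal userspace sketch, where the two
CONFIG_ values are hypothetical stand-ins for what .config would provide
and the accessor functions exist only for this example:

```c
#include <assert.h>

/* Hypothetical stand-ins for values normally generated from .config. */
#define CONFIG_ARCH_FORCE_MAX_ORDER 11
#define CONFIG_PAGE_BLOCK_ORDER 7

/* Same fallback as the include/linux/mmzone.h snippet quoted above. */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
#define MAX_PAGE_ORDER 10
#else
#define MAX_PAGE_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif

/* Fail the build if the configured page block order is out of range. */
#if CONFIG_PAGE_BLOCK_ORDER > MAX_PAGE_ORDER
#error "CONFIG_PAGE_BLOCK_ORDER must not exceed MAX_PAGE_ORDER"
#endif

/* Example-only accessors so the values can be inspected at run time. */
static inline unsigned int page_block_order(void)
{
	return CONFIG_PAGE_BLOCK_ORDER;
}

static inline unsigned int max_page_order(void)
{
	return MAX_PAGE_ORDER;
}
```

This does not replace the Kconfig range; it only catches a mismatch at
build time if the two ever diverge.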
>
> As MAX_PAGE_ORDER could switch between ARCH_FORCE_MAX_ORDER and 10 as
> and when required.
If the above Kconfig code works, that would be great.
>
>> +
>> + help
>> + The page block order refers to the power of two number of pages that
>> + are physically contiguous and can have a migrate type associated to
>> + them. The maximum size of the page block order is limited by
>> + ARCH_FORCE_MAX_ORDER.
>
> s/ARCH_FORCE_MAX_ORDER/ARCH_FORCE_MAX_ORDER when available on the platform/ ?
>
> Also mention the max range when ARCH_FORCE_MAX_ORDER is not available.
>
>> +
>> + This option allows overriding the default setting when the page
>> + block order requires to be smaller than ARCH_FORCE_MAX_ORDER.
>> +
>> + Reducing pageblock order can negatively impact the THP generation
>> + success rate. If your workloads use THP heavily, please use this
>> + option with caution.
>
> Just wondering - could there be any other side effects besides THP? Would it
> be better to depend on CONFIG_EXPERT when selecting anything other than the
> default option, i.e. ARCH_FORCE_MAX_ORDER or 10, from the value range?
Another side effect (or maybe benefit) is that features such as virtio-balloon
free page reporting and virtio-mem, which operate on pageblocks, can work at a
finer granularity with a reduced pageblock size.
--
Best Regards,
Yan, Zi