Message-ID: <3e075035-fb74-4b5b-81e4-d32b832de44f@redhat.com>
Date: Tue, 3 Jun 2025 17:42:28 +0200
From: David Hildenbrand <david@...hat.com>
To: Zi Yan <ziy@...dia.com>
Cc: Juan Yescas <jyescas@...gle.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, tjmercier@...gle.com,
 isaacmanjarres@...gle.com, kaleshsingh@...gle.com, masahiroy@...nel.org,
 Minchan Kim <minchan@...nel.org>
Subject: Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block
 order

On 03.06.25 17:14, Zi Yan wrote:
> On 3 Jun 2025, at 10:55, Zi Yan wrote:
> 
>> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>>
>>> On 21.05.25 23:57, Juan Yescas wrote:
>>>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>>>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>>>> and this causes the CMA reservations to be larger than necessary.
>>>> This means that the system will have fewer available MIGRATE_UNMOVABLE and
>>>> MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back to them.
>>>>
>>>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>>>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>>>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>>>
>>>> For example, in ARM, the CMA alignment requirement when:
>>>>
>>>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>>>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>>>
>>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>>>> ----------------------------------------------------------------------------
>>>>    4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
>>>>   16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
>>>>   64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>>>
>>>> There are some extreme cases for the CMA alignment requirement when:
>>>>
>>>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>>>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
>>>> - CONFIG_HUGETLB_PAGE is NOT set
>>>>
>>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>>>> ----------------------------------------------------------------------------
>>>>    4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
>>>>   16KiB   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
>>>>   64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>>>
>>>> This affects the CMA reservations for the drivers. If a driver in a
>>>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>>>> reservation has to be 32MiB due to the alignment requirements:
>>>>
>>>> reserved-memory {
>>>>       ...
>>>>       cma_test_reserve: cma_test_reserve {
>>>>           compatible = "shared-dma-pool";
>>>>           size = <0x0 0x400000>; /* 4 MiB */
>>>>           ...
>>>>       };
>>>> };
>>>>
>>>> reserved-memory {
>>>>       ...
>>>>       cma_test_reserve: cma_test_reserve {
>>>>           compatible = "shared-dma-pool";
>>>>           size = <0x0 0x2000000>; /* 32 MiB */
>>>>           ...
>>>>       };
>>>> };
>>>>
>>>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>>>> allows setting the page block order in all the architectures.
>>>> The maximum page block order will be given by
>>>> ARCH_FORCE_MAX_ORDER.
>>>>
>>>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>>>> value as ARCH_FORCE_MAX_ORDER. This will make sure that
>>>> current kernel configurations won't be affected by this
>>>> change. It is an opt-in change.
>>>>
>>>> This patch makes it possible to have the same CMA alignment
>>>> requirements for large page sizes (16KiB, 64KiB) as
>>>> in 4KiB kernels by setting a lower pageblock_order.
>>>>
>>>> Tests:
>>>>
>>>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>>>> on 4k and 16k kernels.
>>>>
>>>> - Verified that Transparent Huge Pages work when pageblock_order
>>>> is 1, 7, 10 on 4k and 16k kernels.
>>>>
>>>> - Verified that dma-buf heaps allocations work when pageblock_order
>>>> is 1, 7, 10 on 4k and 16k kernels.
>>>>
>>>> Benchmarks:
>>>>
>>>> The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
>>>> pageblock_order 7 was chosen because this value makes the minimum
>>>> CMA alignment requirement the same as in 4KiB kernels (2MiB).
>>>>
>>>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>>>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>>>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>>>> the # of instructions and page-faults on 16k kernels.
>>>> The benchmark was executed 10 times. The averages are below:
>>>>
>>>>              # instructions         |     #page-faults
>>>>       order 10     |  order 7       | order 10 | order 7
>>>> --------------------------------------------------------
>>>>    13,891,765,770	 | 11,425,777,314 |    220   |   217
>>>>    14,456,293,487	 | 12,660,819,302 |    224   |   219
>>>>    13,924,261,018	 | 13,243,970,736 |    217   |   221
>>>>    13,910,886,504	 | 13,845,519,630 |    217   |   221
>>>>    14,388,071,190	 | 13,498,583,098 |    223   |   224
>>>>    13,656,442,167	 | 12,915,831,681 |    216   |   218
>>>>    13,300,268,343	 | 12,930,484,776 |    222   |   218
>>>>    13,625,470,223	 | 14,234,092,777 |    219   |   218
>>>>    13,508,964,965	 | 13,432,689,094 |    225   |   219
>>>>    13,368,950,667	 | 13,683,587,37  |    219   |   225
>>>> -------------------------------------------------------------------
>>>>    13,803,137,433  | 13,131,974,268 |    220   |   220    Averages
>>>>
>>>> There were 4.86% fewer instructions when the order was 7, in comparison
>>>> with order 10:
>>>>
>>>>        13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
>>>>
>>>> The average number of page faults was the same for orders 7 and 10.
>>>>
>>>> These results didn't show any significant regression when the
>>>> pageblock_order is set to 7 on 16kb kernels.
>>>>
>>>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>>>>    on the 16k kernels with pageblock_order 7 and 10.
>>>>
>>>> order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
>>>> -------------------------------------------------------------------
>>>>     15.8	 |  16.4    |         0.6        |     3.80%
>>>>     16.4	 |  16.2    |        -0.2        |    -1.22%
>>>>     16.6	 |  16.3    |        -0.3        |    -1.81%
>>>>     16.8	 |  16.3    |        -0.5        |    -2.98%
>>>>     16.6	 |  16.8    |         0.2        |     1.20%
>>>> -------------------------------------------------------------------
>>>>     16.44     16.4            -0.04	          -0.24%   Averages
>>>>
>>>> The results didn't show any significant regression when the
>>>> pageblock_order is set to 7 on 16kb kernels.
>>>>
>>>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>>>> Cc: Vlastimil Babka <vbabka@...e.cz>
>>>> Cc: Liam R. Howlett <Liam.Howlett@...cle.com>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>>>> Cc: David Hildenbrand <david@...hat.com>
>>>> CC: Mike Rapoport <rppt@...nel.org>
>>>> Cc: Zi Yan <ziy@...dia.com>
>>>> Cc: Suren Baghdasaryan <surenb@...gle.com>
>>>> Cc: Minchan Kim <minchan@...nel.org>
>>>> Signed-off-by: Juan Yescas <jyescas@...gle.com>
>>>> Acked-by: Zi Yan <ziy@...dia.com>
>>>> ---
>>>> Changes in v7:
>>>>     - Update alignment calculation to 2MiB as per David's
>>>>       observation.
>>>>     - Update page block order calculation in mm/mm_init.c for
>>>>       powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
>>>>
>>>> Changes in v6:
>>>>     - Applied the change provided by Zi Yan to fix
>>>>       the Kconfig. The change consists in evaluating
>>>>       to true or false in the if expression for range:
>>>>       range 1 <symbol> if <expression to eval true/false>.
>>>>
>>>> Changes in v5:
>>>>     - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
>>>>       ranges with config definitions don't work in Kconfig,
>>>>       for example (range 1 MY_CONFIG).
>>>>     - Add PAGE_BLOCK_ORDER_MANUAL config for the
>>>>       page block order number. The default value was not
>>>>       defined.
>>>>     - Fix typos reported by Andrew.
>>>>     - Test default configs in powerpc.
>>>>
>>>> Changes in v4:
>>>>     - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
>>>>       validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
>>>>       compile time.
>>>>     - This change fixes the warning in:
>>>>     https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
>>>>
>>>> Changes in v3:
>>>>     - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>>>>       as per Matthew's suggestion.
>>>>     - Update comments in pageblock-flags.h for pageblock_order
>>>>       value when THP or HugeTLB are not used.
>>>>
>>>> Changes in v2:
>>>>     - Add Zi's Acked-by tag.
>>>>     - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>>>>       per Zi's and Matthew's suggestion so it is available to
>>>>       all the architectures.
>>>>     - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>>>>       ARCH_FORCE_MAX_ORDER is not available.
>>>>
>>>>    include/linux/mmzone.h          | 16 ++++++++++++++++
>>>>    include/linux/pageblock-flags.h |  8 ++++----
>>>>    mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
>>>>    mm/mm_init.c                    |  2 +-
>>>>    4 files changed, 55 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 6ccec1bf2896..05610337bbb6 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -37,6 +37,22 @@
>>>>     #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
>>>>
>>>> +/* Defines the order for the number of pages that have a migrate type. */
>>>> +#ifndef CONFIG_PAGE_BLOCK_ORDER
>>>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>>>> +#else
>>>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>>>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>>>> +
>>>> +/*
>>>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
>>>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
>>>> + * which defines the order for the number of pages that can have a migrate type
>>>> + */
>>>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
>>>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>>>>     * costly to service.  That is between allocation orders which should
>>>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>>>> index fc6b9c87cb0a..e73a4292ef02 100644
>>>> --- a/include/linux/pageblock-flags.h
>>>> +++ b/include/linux/pageblock-flags.h
>>>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>>>>     * Huge pages are a constant size, but don't exceed the maximum allocation
>>>>     * granularity.
>>>>     */
>>>> -#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>>>> +#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>>>>     #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>>>     #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>>>
>>>> -#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>>>> +#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>>>     #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>
>>>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>>>> -#define pageblock_order		MAX_PAGE_ORDER
>>>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>>>> +#define pageblock_order		PAGE_BLOCK_ORDER
>>>>     #endif /* CONFIG_HUGETLB_PAGE */
>>>>
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index e113f713b493..13a5c4f6e6b6 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -989,6 +989,40 @@ config CMA_AREAS
>>>>     	  If unsure, leave the default value "8" in UMA and "20" in NUMA.
>>>>
>>>> +#
>>>> +# Select this config option from the architecture Kconfig, if available, to set
>>>> +# the max page order for physically contiguous allocations.
>>>> +#
>>>> +config ARCH_FORCE_MAX_ORDER
>>>> +	int
>>>> +
>>>> +#
>>>> +# When ARCH_FORCE_MAX_ORDER is not defined,
>>>> +# the default page block order is MAX_PAGE_ORDER (10) as per
>>>> +# include/linux/mmzone.h.
>>>> +#
>>>> +config PAGE_BLOCK_ORDER
>>>> +	int "Page Block Order"
>>>> +	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
>>>> +	default 10 if ARCH_FORCE_MAX_ORDER = 0
>>>> +	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>>> +	default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>>> +	help
>>>> +	  The page block order refers to the power of two number of pages that
>>>> +	  are physically contiguous and can have a migrate type associated to
>>>> +	  them. The maximum size of the page block order is limited by
>>>> +	  ARCH_FORCE_MAX_ORDER.
>>>> +
>>>> +	  This config allows overriding the default page block order when the
>>>> +	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
>>>> +	  or MAX_PAGE_ORDER.
>>>> +
>>>> +	  Reducing pageblock order can negatively impact THP generation
>>>> +	  success rate. If your workload uses THP heavily, please use this
>>>> +	  option with caution.
>>>> +
>>>> +	  Don't change if unsure.
>>>
>>>
>>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>>
>>> CONFIG_PAGE_BLOCK_ORDER=10
>>>
>>>
>>> But then, we'll do this
>>>
>>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>>
>>>
>>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>>
>>> Confusing.
>>>
>>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>>
>> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
>> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.
> 
> LIMIT might still be ambiguous, since it can be a lower limit or an upper limit.
> CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I came up with;
> if it looks good to you, I can send it out properly.

LGTM

-- 
Cheers,

David / dhildenb

