Message-ID: <70049abc-bf79-4d04-a0a8-dd3787195986@redhat.com>
Date: Mon, 4 Aug 2025 19:07:06 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Pankaj Raghav (Samsung)" <kernel@...kajraghav.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>,
Ryan Roberts <ryan.roberts@....com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Borislav Petkov <bp@...en8.de>,
Ingo Molnar <mingo@...hat.com>, "H . Peter Anvin" <hpa@...or.com>,
Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
Mike Rapoport <rppt@...nel.org>, Dave Hansen <dave.hansen@...ux.intel.com>,
Michal Hocko <mhocko@...e.com>, Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>,
Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, willy@...radead.org, x86@...nel.org,
linux-block@...r.kernel.org, Ritesh Harjani <ritesh.list@...il.com>,
linux-fsdevel@...r.kernel.org, "Darrick J . Wong" <djwong@...nel.org>,
mcgrof@...nel.org, gost.dev@...sung.com, hch@....de,
Pankaj Raghav <p.raghav@...sung.com>
Subject: Re: [PATCH 3/5] mm: add static huge zero folio
On 04.08.25 18:46, Lorenzo Stoakes wrote:
> On Mon, Aug 04, 2025 at 02:13:54PM +0200, Pankaj Raghav (Samsung) wrote:
>> From: Pankaj Raghav <p.raghav@...sung.com>
>>
>> There are many places in the kernel where we need to zero out larger
>> chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
>> is limited to PAGE_SIZE.
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in the block layer, it is much more efficient to send out
>> larger zero pages as part of a single bvec.
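(For context, the difference is roughly the following -- an untested
sketch with made-up helper names, assuming the bio already has enough
bvecs reserved and leaving out error handling:)

/* Today: zero-filling needs one bvec per PAGE_SIZE worth of zeroes. */
static void bio_zero_fill_small(struct bio *bio, unsigned int len)
{
	while (len) {
		unsigned int this_len = min_t(unsigned int, len, PAGE_SIZE);

		__bio_add_page(bio, ZERO_PAGE(0), this_len, 0);
		len -= this_len;
	}
}

/* With a PMD-sized zero folio, one bvec covers up to HPAGE_PMD_SIZE. */
static void bio_zero_fill_huge(struct bio *bio, unsigned int len,
			       struct folio *zero_folio)
{
	while (len) {
		unsigned int this_len = min_t(unsigned int, len,
					      HPAGE_PMD_SIZE);

		bio_add_folio_nofail(bio, zero_folio, this_len, 0);
		len -= this_len;
	}
}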
>>
>> This concern was raised during the review of adding LBS support to
>> XFS[1][2].
>>
>> Usually the huge_zero_folio is allocated on demand and deallocated by
>> the shrinker once no users are left. At the moment, the huge_zero_folio
>> refcount is tied to the lifetime of the process that created it. This
>> does not work for the bio layer, as completions can be asynchronous and
>> the process that created the huge_zero_folio might no longer be alive.
>> Also, one of the main points raised during the discussion was to have
>> something bigger than the zero page as a drop-in replacement for it.
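(To spell out the lifetime problem -- a sketch with a made-up
submit_zeroout() helper, error handling omitted:)

/*
 * The reference obtained via mm_get_huge_zero_folio() is dropped when
 * the mm goes away, but the bio completion can run after the submitting
 * task has already exited -- by then the shrinker may have freed the
 * folio out from under the in-flight I/O.
 */
static void submit_zeroout(struct block_device *bdev, sector_t sector,
			   struct mm_struct *mm)
{
	struct folio *folio = mm_get_huge_zero_folio(mm); /* ref tied to mm */
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE, GFP_KERNEL);

	bio->bi_iter.bi_sector = sector;
	bio_add_folio_nofail(bio, folio, HPAGE_PMD_SIZE, 0);
	submit_bio(bio); /* async: may complete after the task is gone */
}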
>>
>> Add a config option STATIC_HUGE_ZERO_FOLIO that allocates the huge zero
>> folio on first request, if not already allocated, and makes it static so
>> that it can never be freed. This allows the huge_zero_folio to be used
>> without passing any mm struct and does not tie the lifetime of the zero
>> folio to anything, making it a drop-in replacement for ZERO_PAGE.
>>
>> If the STATIC_HUGE_ZERO_FOLIO config option is enabled,
>> mm_get_huge_zero_folio() will simply return this folio instead of
>> dynamically allocating a new PMD-sized page.
>>
>> This option can waste memory on small systems or on systems with a 64k
>> base page size (a PMD covers 2M with 4k base pages, but 512M with 64k
>> base pages). So make it opt-in, and add a per-architecture option so
>> that we don't enable this feature on systems with a larger base page
>> size. Only x86 is enabled as part of this series; other architectures
>> can be enabled as a follow-up.
>>
>> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
>> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>>
>> Co-developed-by: David Hildenbrand <david@...hat.com>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
>> Signed-off-by: Pankaj Raghav <p.raghav@...sung.com>
>> ---
>> arch/x86/Kconfig | 1 +
>> include/linux/huge_mm.h | 18 ++++++++++++++++
>> mm/Kconfig | 21 +++++++++++++++++++
>> mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++-
>> 4 files changed, 85 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 0ce86e14ab5e..8e2aa1887309 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -153,6 +153,7 @@ config X86
>> select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
>> select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>> select ARCH_WANTS_THP_SWAP if X86_64
>> + select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
>> select ARCH_HAS_PARANOID_L1D_FLUSH
>> select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>> select BUILDTIME_TABLE_SORT
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7748489fde1b..78ebceb61d0e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -476,6 +476,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>>
>> extern struct folio *huge_zero_folio;
>> extern unsigned long huge_zero_pfn;
>> +extern atomic_t huge_zero_folio_is_static;
>
> I really don't love having globals like this - please can we have a helper
> function that tells you this rather than externing the variable?
>
> Also, we're not checking CONFIG_STATIC_HUGE_ZERO_FOLIO but are still
> exposing this value, which a helper function would avoid too.
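(FWIW, such a helper could look roughly like this -- just a sketch, the
name is made up; out-of-line so the atomic can stay private to
mm/huge_memory.c, at the cost of a function call where the current
inline fast path only does an atomic_read:)

bool huge_zero_folio_static(void)
{
	return IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO) &&
	       atomic_read(&huge_zero_folio_is_static);
}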
>
>>
>> static inline bool is_huge_zero_folio(const struct folio *folio)
>> {
>> @@ -494,6 +495,18 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
>>
>> struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
>> void mm_put_huge_zero_folio(struct mm_struct *mm);
>> +struct folio *__get_static_huge_zero_folio(void);
>
> Why are we declaring a static inline function prototype that we then
> implement immediately below?
>
>> +
>> +static inline struct folio *get_static_huge_zero_folio(void)
>> +{
>> +	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
>> +		return NULL;
>> +
>> +	if (likely(atomic_read(&huge_zero_folio_is_static)))
>> +		return huge_zero_folio;
>> +
>> +	return __get_static_huge_zero_folio();
>> +}
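(Usage aside: callers of this still have to be prepared for NULL -- a
sketch of the fallback pattern, with a made-up bio_add_zero_chunk()
helper and assuming a bio with enough bvecs reserved:)

/* Prefer the PMD-sized zero folio; fall back to the order-0 ZERO_PAGE. */
static void bio_add_zero_chunk(struct bio *bio)
{
	struct folio *zero_folio = get_static_huge_zero_folio();

	if (zero_folio)
		bio_add_folio_nofail(bio, zero_folio, HPAGE_PMD_SIZE, 0);
	else
		__bio_add_page(bio, ZERO_PAGE(0), PAGE_SIZE, 0);
}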
>>
>> static inline bool thp_migration_supported(void)
>> {
>> @@ -685,6 +698,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
>> {
>> 	return 0;
>> }
>> +
>> +static inline struct folio *get_static_huge_zero_folio(void)
>> +{
>> +	return NULL;
>> +}
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> static inline int split_folio_to_list_to_order(struct folio *folio,
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e443fe8cd6cf..366a6d2d771e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
>> config ARCH_WANTS_THP_SWAP
>> def_bool n
>>
>> +config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
>> + def_bool n
>> +
>> +config STATIC_HUGE_ZERO_FOLIO
>> + bool "Allocate a PMD sized folio for zeroing"
>> + depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
>> + help
>> + Without this config enabled, the huge zero folio is allocated on
>> + demand and freed under memory pressure once no longer in use.
>> + To detect remaining users reliably, references to the huge zero folio
>> + must be tracked precisely, so it is commonly only available for mapping
>> + it into user page tables.
>> +
>> + With this config enabled, the huge zero folio can also be used
>> + for other purposes that do not implement precise reference counting:
>> + it is still allocated on demand, but never freed, allowing for more
>> widespread use, for example, when performing I/O similar to the
>> + traditional shared zeropage.
>> +
>> + Not suitable for memory constrained systems.
>> +
>> config MM_ID
>> def_bool n
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ff06dee213eb..e117b280b38d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -75,6 +75,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> static bool split_underused_thp = true;
>>
>> static atomic_t huge_zero_refcount;
>> +atomic_t huge_zero_folio_is_static __read_mostly;
>> struct folio *huge_zero_folio __read_mostly;
>> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>> unsigned long huge_anon_orders_always __read_mostly;
>> @@ -266,6 +267,45 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
>> put_huge_zero_folio();
>> }
>>
>> +#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
>> +
>
> Extremely tiny silly nit - there's a blank line below this, but not under the
> #endif, so let's remove this line.
>
>> +struct folio *__get_static_huge_zero_folio(void)
>> +{
>> +	static unsigned long fail_count_clear_timer;
>> +	static atomic_t huge_zero_static_fail_count __read_mostly;
>> +
>> +	if (unlikely(!slab_is_available()))
>> +		return NULL;
>> +
>> +	/*
>> +	 * If we failed to allocate a huge zero folio, just refrain from
>> +	 * trying for one minute before retrying to get a reference again.
>> +	 */
>> +	if (atomic_read(&huge_zero_static_fail_count) > 1) {
>> +		if (time_before(jiffies, fail_count_clear_timer))
>> +			return NULL;
>> +		atomic_set(&huge_zero_static_fail_count, 0);
>> +	}
>
> Yeah I really don't like this. This seems overly complicated and too
> fiddly. Also if I want a static PMD, do I want to wait a minute for the
> next attempt?
>
> Also doing things this way we might end up:
>
> 0. Enabling CONFIG_STATIC_HUGE_ZERO_FOLIO
> 1. Not doing anything that needs a static PMD for a while + get fragmentation.
> 2. Do something that needs it - oops can't get order-9 page, and waiting 60
> seconds between attempts
> 3. This is silent so you think you have it switched on but are actually getting
> bad performance.
>
> I appreciate wanting to reuse this code, but we need to find a way to do this
> really early, and get rid of the timeout. It's very arbitrary, and we have
> no easy way of tracing how this might behave under a workload.
>
> Also we end up pinning an order-9 page either way, so no harm in getting it
> first thing?
What we could do, to avoid messing with memblock and having two ways of
initializing a huge zero folio early, is just disable the shrinker: the
reference we grab during boot is then never dropped, so the folio can never
get freed. The downside is that the folio is truly static (allocated even if
it is never actually used). I like it:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0ce86e14ab5e1..8e2aa18873098 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select BUILDTIME_TABLE_SORT
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b7..ccfa5c95f14b1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -495,6 +495,17 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
void mm_put_huge_zero_folio(struct mm_struct *mm);
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+		return NULL;
+
+	if (unlikely(!huge_zero_folio))
+		return NULL;
+
+	return huge_zero_folio;
+}
+
static inline bool thp_migration_supported(void)
{
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
@@ -685,6 +696,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
{
	return 0;
}
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	return NULL;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf2..366a6d2d771e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
config ARCH_WANTS_THP_SWAP
def_bool n
+config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
+ def_bool n
+
+config STATIC_HUGE_ZERO_FOLIO
+ bool "Allocate a PMD sized folio for zeroing"
+ depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
+ help
+ Without this config enabled, the huge zero folio is allocated on
+ demand and freed under memory pressure once no longer in use.
+ To detect remaining users reliably, references to the huge zero folio
+ must be tracked precisely, so it is commonly only available for mapping
+ it into user page tables.
+
+ With this config enabled, the huge zero folio can also be used
+ for other purposes that do not implement precise reference counting:
+ it is allocated statically and never freed, allowing for more
+ widespread use, for example, when performing I/O similar to the
+ traditional shared zeropage.
+
+ Not suitable for memory constrained systems.
+
config MM_ID
def_bool n
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff06dee213eb2..f65ba3e6f0824 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -866,9 +866,14 @@ static int __init thp_shrinker_init(void)
-	huge_zero_folio_shrinker->count_objects = shrink_huge_zero_folio_count;
-	huge_zero_folio_shrinker->scan_objects = shrink_huge_zero_folio_scan;
-	shrinker_register(huge_zero_folio_shrinker);
+	if (IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO)) {
+		if (!get_huge_zero_folio())
+			pr_warn("Allocating static huge zero folio failed\n");
+	} else {
+		huge_zero_folio_shrinker->count_objects = shrink_huge_zero_folio_count;
+		huge_zero_folio_shrinker->scan_objects = shrink_huge_zero_folio_scan;
+		shrinker_register(huge_zero_folio_shrinker);
+	}

 	deferred_split_shrinker->count_objects = deferred_split_count;
 	deferred_split_shrinker->scan_objects = deferred_split_scan;
 	shrinker_register(deferred_split_shrinker);
 	return 0;
 }
--
2.50.1
Now, one thing I do not like is that we have ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
but then also have a user-selectable option on top of it.
Should we just get rid of ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO?
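I.e., dropping the arch gate would leave us with just (sketch):

config STATIC_HUGE_ZERO_FOLIO
	bool "Allocate a PMD sized folio for zeroing"
	depends on TRANSPARENT_HUGEPAGE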
--
Cheers,
David / dhildenb