Message-ID: <558da90d-e43d-464d-a3b6-02f6ee0de035@intel.com>
Date: Tue, 5 Aug 2025 09:33:10 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: "Pankaj Raghav (Samsung)" <kernel@...kajraghav.com>,
Suren Baghdasaryan <surenb@...gle.com>, Ryan Roberts <ryan.roberts@....com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Borislav Petkov <bp@...en8.de>,
Ingo Molnar <mingo@...hat.com>, "H . Peter Anvin" <hpa@...or.com>,
Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
Mike Rapoport <rppt@...nel.org>, Dave Hansen <dave.hansen@...ux.intel.com>,
Michal Hocko <mhocko@...e.com>, David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>,
Jens Axboe <axboe@...nel.dk>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, willy@...radead.org,
x86@...nel.org, linux-block@...r.kernel.org,
Ritesh Harjani <ritesh.list@...il.com>, linux-fsdevel@...r.kernel.org,
"Darrick J . Wong" <djwong@...nel.org>, mcgrof@...nel.org,
gost.dev@...sung.com, hch@....de, Pankaj Raghav <p.raghav@...sung.com>
Subject: Re: [PATCH 3/5] mm: add static huge zero folio
On 8/4/25 05:13, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@...sung.com>
>
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
...
In x86-land, the rules are pretty clear about using imperative voice.
There are quite a few "we's" in the changelog and comments in this series.
I do think they're generally good to avoid and do lead to more clarity,
but I'm also not sure how important that is in mm-land these days.
> +static inline struct folio *get_static_huge_zero_folio(void)
> +{
> + if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
> + return NULL;
> +
> + if (likely(atomic_read(&huge_zero_folio_is_static)))
> + return huge_zero_folio;
> +
> + return __get_static_huge_zero_folio();
> +}
This seems like an ideal place to use 'struct static_key'.
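Something like this, roughly (huge_zero_noref_key is a name I'm making
up here; it would be a DEFINE_STATIC_KEY_FALSE() over in huge_memory.c):

	DECLARE_STATIC_KEY_FALSE(huge_zero_noref_key);

	static inline struct folio *get_static_huge_zero_folio(void)
	{
		if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
			return NULL;

		/* Runtime-patched jump label, not an atomic read: */
		if (static_branch_likely(&huge_zero_noref_key))
			return huge_zero_folio;

		return __get_static_huge_zero_folio();
	}

That keeps the fast path completely free of atomics.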
> static inline bool thp_migration_supported(void)
> {
> @@ -685,6 +698,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
> {
> return 0;
> }
> +
> +static inline struct folio *get_static_huge_zero_folio(void)
> +{
> + return NULL;
> +}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e443fe8cd6cf..366a6d2d771e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
> config ARCH_WANTS_THP_SWAP
> def_bool n
>
> +config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
> + def_bool n
> +
> +config STATIC_HUGE_ZERO_FOLIO
> + bool "Allocate a PMD sized folio for zeroing"
> + depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
> + help
> + Without this config enabled, the huge zero folio is allocated on
> + demand and freed under memory pressure once no longer in use.
> + To detect remaining users reliably, references to the huge zero folio
> + must be tracked precisely, so it is commonly only available for mapping
> + it into user page tables.
> +
> + With this config enabled, the huge zero folio can also be used
> + for other purposes that do not implement precise reference counting:
> + it is still allocated on demand, but never freed, allowing for more
> + wide-spread use, for example, when performing I/O similar to the
> + traditional shared zeropage.
> +
> + Not suitable for memory constrained systems.
IMHO, this is written like a changelog, not documentation for end users
trying to make sense of Kconfig options. I'd suggest keeping it short
and sweet:
config PERSISTENT_HUGE_ZERO_FOLIO
	bool "Allocate a persistent PMD-sized folio for zeroing"
	...
	help
	  Enable this option to reduce the runtime refcounting overhead
	  of the huge zero folio and expand the places in the kernel
	  that can use huge zero folios.

	  With this option enabled, the huge zero folio is allocated
	  once and never freed. It potentially wastes one huge page
	  worth of memory.

	  Say Y if your system has lots of memory. Say N if you are
	  memory constrained.
> config MM_ID
> def_bool n
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ff06dee213eb..e117b280b38d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -75,6 +75,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> static bool split_underused_thp = true;
>
> static atomic_t huge_zero_refcount;
> +atomic_t huge_zero_folio_is_static __read_mostly;
> struct folio *huge_zero_folio __read_mostly;
> unsigned long huge_zero_pfn __read_mostly = ~0UL;
> unsigned long huge_anon_orders_always __read_mostly;
> @@ -266,6 +267,45 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
> put_huge_zero_folio();
> }
>
> +#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
> +
> +struct folio *__get_static_huge_zero_folio(void)
> +{
> + static unsigned long fail_count_clear_timer;
> + static atomic_t huge_zero_static_fail_count __read_mostly;
> +
> + if (unlikely(!slab_is_available()))
> + return NULL;
> +
> + /*
> + * If we failed to allocate a huge zero folio, just refrain from
> + * trying for one minute before retrying to get a reference again.
> + */
> + if (atomic_read(&huge_zero_static_fail_count) > 1) {
> + if (time_before(jiffies, fail_count_clear_timer))
> + return NULL;
> + atomic_set(&huge_zero_static_fail_count, 0);
> + }
Any reason that this is an open-coded ratelimit instead of using
'struct ratelimit_state'?
I also find the 'huge_zero_static_fail_count' use pretty unintuitive.
This is fundamentally a slow path. Ideally, it's called once. In the
pathological case, it's called once a minute.
I'd probably just recommend putting a rate limit on this function, then
using a plain old mutex for the actual allocation to keep multiple
threads out.
Then the function becomes something like this:
	if (!__ratelimit(&huge_zero_alloc_ratelimit))
		return NULL;

	guard(mutex)(&huge_zero_mutex);

	/* A racing caller may have already done the allocation: */
	if (static_branch_likely(&huge_zero_noref_key))
		return huge_zero_folio;

	if (!get_huge_zero_folio())
		return NULL;

	static_branch_enable(&huge_zero_noref_key);
	return huge_zero_folio;
No atomic, no cmpxchg, no races on allocating.
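For reference, the supporting declarations would be something along
these lines (all three names invented for the example):

	/* Allow one allocation attempt per minute: */
	static DEFINE_RATELIMIT_STATE(huge_zero_alloc_ratelimit, 60 * HZ, 1);
	static DEFINE_MUTEX(huge_zero_mutex);
	DEFINE_STATIC_KEY_FALSE(huge_zero_noref_key);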
...
> static unsigned long shrink_huge_zero_folio_count(struct shrinker *shrink,
> struct shrink_control *sc)
> {
> @@ -277,7 +317,11 @@ static unsigned long shrink_huge_zero_folio_scan(struct shrinker *shrink,
> struct shrink_control *sc)
> {
> if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
> - struct folio *zero_folio = xchg(&huge_zero_folio, NULL);
> + struct folio *zero_folio;
> +
> + if (WARN_ON_ONCE(atomic_read(&huge_zero_folio_is_static)))
> + return 0;
> + zero_folio = xchg(&huge_zero_folio, NULL);
> BUG_ON(zero_folio == NULL);
> WRITE_ONCE(huge_zero_pfn, ~0UL);
> folio_put(zero_folio);
This seems like a hack to me. If you don't want the shrinker to run,
then deregister it. Keeping the refcount elevated is fine, but
repeatedly calling the shrinker to do atomic_cmpxchg() when you *know*
it will do nothing seems silly.
If you can't deregister the shrinker, at least use the static_key
approach and check the static key instead of doing futile cmpxchgs forever.
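Sketching that out with the same hypothetical key as above, the scan
function would just bail out up front:

	static unsigned long shrink_huge_zero_folio_scan(struct shrinker *shrink,
							 struct shrink_control *sc)
	{
		/* The static folio is never freed; nothing to reclaim. */
		if (static_branch_unlikely(&huge_zero_noref_key))
			return 0;
		...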