Message-ID: <dbebbba0-3c59-4ee1-b32c-4b9f6ed90d92@redhat.com>
Date: Wed, 30 Jul 2025 11:50:12 +0200
From: David Hildenbrand <david@...hat.com>
To: Balbir Singh <balbirs@...dia.com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, Karol Herbst <kherbst@...hat.com>,
 Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Jérôme Glisse <jglisse@...hat.com>,
 Shuah Khan <shuah@...nel.org>, Barry Song <baohua@...nel.org>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Ryan Roberts <ryan.roberts@....com>, Matthew Wilcox <willy@...radead.org>,
 Peter Xu <peterx@...hat.com>, Zi Yan <ziy@...dia.com>,
 Kefeng Wang <wangkefeng.wang@...wei.com>, Jane Chu <jane.chu@...cle.com>,
 Alistair Popple <apopple@...dia.com>, Donet Tom <donettom@...ux.ibm.com>,
 Ralph Campbell <rcampbell@...dia.com>, Mika Penttilä
 <mpenttil@...hat.com>, Matthew Brost <matthew.brost@...el.com>,
 Francois Dugast <francois.dugast@...el.com>
Subject: Re: [v2 01/11] mm/zone_device: support large zone device private
 folios

On 30.07.25 11:21, Balbir Singh wrote:
> Add routines to support allocation of large order zone device folios,
> along with helper functions for zone device folios: to check if a
> folio is device private and to set zone device data.
> 
> When large folios are used, the existing page_free() callback in
> pgmap is called when the folio is freed; this is true for both
> PAGE_SIZE and higher-order pages.
> 
> Cc: Karol Herbst <kherbst@...hat.com>
> Cc: Lyude Paul <lyude@...hat.com>
> Cc: Danilo Krummrich <dakr@...nel.org>
> Cc: David Airlie <airlied@...il.com>
> Cc: Simona Vetter <simona@...ll.ch>
> Cc: "Jérôme Glisse" <jglisse@...hat.com>
> Cc: Shuah Khan <shuah@...nel.org>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Barry Song <baohua@...nel.org>
> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@....com>
> Cc: Matthew Wilcox <willy@...radead.org>
> Cc: Peter Xu <peterx@...hat.com>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: Kefeng Wang <wangkefeng.wang@...wei.com>
> Cc: Jane Chu <jane.chu@...cle.com>
> Cc: Alistair Popple <apopple@...dia.com>
> Cc: Donet Tom <donettom@...ux.ibm.com>
> Cc: Ralph Campbell <rcampbell@...dia.com>
> Cc: Mika Penttilä <mpenttil@...hat.com>
> Cc: Matthew Brost <matthew.brost@...el.com>
> Cc: Francois Dugast <francois.dugast@...el.com>
> 
> Signed-off-by: Balbir Singh <balbirs@...dia.com>
> ---
>   include/linux/memremap.h | 10 ++++++++-
>   mm/memremap.c            | 48 +++++++++++++++++++++++++++++-----------
>   2 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 4aa151914eab..a0723b35eeaa 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -199,7 +199,7 @@ static inline bool folio_is_fsdax(const struct folio *folio)
>   }
>   
>   #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>   void memunmap_pages(struct dev_pagemap *pgmap);
>   void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -209,6 +209,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>   
>   unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	zone_device_folio_init(folio, 0);
> +}
> +
>   #else
>   static inline void *devm_memremap_pages(struct device *dev,
>   		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0ce0d8254bd..3ca136e7455e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -427,20 +427,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>   void free_zone_device_folio(struct folio *folio)
>   {
>   	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned int nr = folio_nr_pages(folio);
> +	int i;

"unsigned long" would be more future-proof here.

(folio_nr_pages() returns long and will probably soon return unsigned long)

[ I'd probably call it "nr_pages" ]
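
IOW, something like this (untested, just to illustrate the type and
naming; the loop variable would follow suit):

	unsigned long nr_pages = folio_nr_pages(folio);
	unsigned long i;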

>   
>   	if (WARN_ON_ONCE(!pgmap))
>   		return;
>   
>   	mem_cgroup_uncharge(folio);
>   
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
>   	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +		for (i = 0; i < nr; i++)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
> +	} else {
> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>   	}
>   
>   	/*
> @@ -464,11 +463,20 @@ void free_zone_device_folio(struct folio *folio)
>   
>   	switch (pgmap->type) {
>   	case MEMORY_DEVICE_PRIVATE:
> +		if (folio_test_large(folio)) {

Could do "nr > 1" if we already have that value around.

> +			folio_unqueue_deferred_split(folio);

I think I asked this already but may have missed the reply: should these 
folios ever be added to the deferred split queue, and is there any value 
in splitting them under memory pressure in the shrinker?

My gut feeling is "No", because the buddy cannot make use of these 
folios, but maybe there is an interesting case where we want that behavior?

> +
> +			percpu_ref_put_many(&folio->pgmap->ref, nr - 1);
> +		}
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put(&folio->pgmap->ref);

Could you simply do a

	percpu_ref_put_many(&folio->pgmap->ref, nr);

here, or would that be problematic?
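
IOW, combined with the "nr > 1" idea from above, the whole case could
maybe look like this (completely untested sketch, and assuming we keep
the deferred-split handling at all, see above):

	case MEMORY_DEVICE_PRIVATE:
		if (nr > 1)
			folio_unqueue_deferred_split(folio);
		pgmap->ops->page_free(&folio->page);
		percpu_ref_put_many(&folio->pgmap->ref, nr);
		folio->page.mapping = NULL;
		break;

For nr == 1 the percpu_ref_put_many() is equivalent to the old
percpu_ref_put(), so the small-folio behavior shouldn't change.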

> +		folio->page.mapping = NULL;
> +		break;
>   	case MEMORY_DEVICE_COHERENT:
>   		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>   			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> -		put_dev_pagemap(pgmap);
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put(&folio->pgmap->ref);
>   		break;
>   
>   	case MEMORY_DEVICE_GENERIC:
> @@ -491,14 +499,28 @@ void free_zone_device_folio(struct folio *folio)
>   	}
>   }
>   
> -void zone_device_page_init(struct page *page)
> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>   {
> +	struct page *page = folio_page(folio, 0);
> +
> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +
> +	/*
> +	 * Only PMD level migration is supported for THP migration
> +	 */

Talking about something that does not exist yet (and is very specific) 
sounds a bit weird.

Should this go into a different patch, or could we rephrase the comment 
to be a bit more generic?

In this patch itself, nothing would really object to "order" being an 
intermediate value.

(also, is this a device-private limitation? shouldn't that check then go 
somewhere specific to device-private folios?)
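
If the check stays here for now, maybe something more generic along
these lines would do (wording is just a suggestion):

	/*
	 * The current callers only allocate order-0 or PMD-order
	 * device-private folios; this is not a fundamental limitation
	 * of zone_device_folio_init() itself.
	 */
	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);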

> +	WARN_ON_ONCE(order && order != HPAGE_PMD_ORDER);
> +
>   	/*
>   	 * Drivers shouldn't be allocating pages after calling
>   	 * memunmap_pages().
>   	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>   	lock_page(page);
> +
> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}
>   }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(zone_device_folio_init);


-- 
Cheers,

David / dhildenb

