linux-kernel - Re: [PATCH v5 1/5] mm/zone_device: Reinitialize large zone device private folios

Message-ID: <aWpptcor29vqLqiJ@intel.com>
Date: Fri, 16 Jan 2026 11:39:17 -0500
From: Rodrigo Vivi <rodrigo.vivi@...el.com>
To: Matthew Brost <matthew.brost@...el.com>, Madhavan Srinivasan
	<maddy@...ux.ibm.com>, Nicholas Piggin <npiggin@...il.com>, Michael Ellerman
	<mpe@...erman.id.au>, "Christophe Leroy (CS GROUP)" <chleroy@...nel.org>,
	<linuxppc-dev@...ts.ozlabs.org>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, David Hildenbrand <david@...nel.org>, "Oscar
 Salvador" <osalvador@...e.de>, <linux-mm@...ck.org>, Andrew Morton
	<akpm@...ux-foundation.org>, Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky
	<leon@...nel.org>, Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann
	<tzimmermann@...e.de>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Dave Airlie <airlied@...il.com>, Simona Vetter <simona.vetter@...ll.ch>
CC: Alistair Popple <apopple@...dia.com>, Francois Dugast
	<francois.dugast@...el.com>, <intel-xe@...ts.freedesktop.org>,
	<dri-devel@...ts.freedesktop.org>, Zi Yan <ziy@...dia.com>, "Madhavan
 Srinivasan" <maddy@...ux.ibm.com>, Nicholas Piggin <npiggin@...il.com>,
	Michael Ellerman <mpe@...erman.id.au>, "Christophe Leroy (CS GROUP)"
	<chleroy@...nel.org>, Felix Kuehling <Felix.Kuehling@....com>, Alex Deucher
	<alexander.deucher@....com>, Christian König
	<christian.koenig@....com>, David Airlie <airlied@...il.com>, Simona Vetter
	<simona@...ll.ch>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
	Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>, "David
 Hildenbrand" <david@...nel.org>, Oscar Salvador <osalvador@...e.de>, "Andrew
 Morton" <akpm@...ux-foundation.org>, Jason Gunthorpe <jgg@...pe.ca>, "Leon
 Romanovsky" <leon@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
	<vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan
	<surenb@...gle.com>, Michal Hocko <mhocko@...e.com>, Balbir Singh
	<balbirs@...dia.com>, <linuxppc-dev@...ts.ozlabs.org>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <amd-gfx@...ts.freedesktop.org>,
	<nouveau@...ts.freedesktop.org>, <linux-mm@...ck.org>,
	<linux-cxl@...r.kernel.org>
Subject: Re: [PATCH v5 1/5] mm/zone_device: Reinitialize large zone device
 private folios

On Thu, Jan 15, 2026 at 10:35:56PM -0800, Matthew Brost wrote:
> On Thu, Jan 15, 2026 at 10:05:00PM +1100, Alistair Popple wrote:
> > On 2026-01-15 at 18:43 +1100, Matthew Brost <matthew.brost@...el.com> wrote...
> > > On Thu, Jan 15, 2026 at 06:07:08PM +1100, Alistair Popple wrote:
> > > > On 2026-01-15 at 17:18 +1100, Matthew Brost <matthew.brost@...el.com> wrote...
> > > > > On Wed, Jan 14, 2026 at 09:57:31PM -0800, Matthew Brost wrote:
> > > > > > On Thu, Jan 15, 2026 at 04:27:26PM +1100, Alistair Popple wrote:
> > > > > > > On 2026-01-15 at 06:19 +1100, Francois Dugast <francois.dugast@...el.com> wrote...
> > > > > > > > From: Matthew Brost <matthew.brost@...el.com>
> > > > > > > > 
> > > > > > > > Reinitialize metadata for large zone device private folios in
> > > > > > > > zone_device_page_init prior to creating a higher-order zone device
> > > > > > > > private folio. This step is necessary when the folio’s order changes
> > > > > > > > dynamically between zone_device_page_init calls to avoid building a
> > > > > > > > corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> > > > > > > > must be passed in from the caller because the pgmap stored in the folio
> > > > > > > > page may have been overwritten with a compound head.
> > > > > > > 
> > > > > > > Thanks for fixing, a couple of minor comments below.
> > > > > > > 
> > > > > > > > Cc: Zi Yan <ziy@...dia.com>
> > > > > > > > Cc: Alistair Popple <apopple@...dia.com>
> > > > > > > > Cc: Madhavan Srinivasan <maddy@...ux.ibm.com>
> > > > > > > > Cc: Nicholas Piggin <npiggin@...il.com>
> > > > > > > > Cc: Michael Ellerman <mpe@...erman.id.au>
> > > > > > > > Cc: "Christophe Leroy (CS GROUP)" <chleroy@...nel.org>
> > > > > > > > Cc: Felix Kuehling <Felix.Kuehling@....com>
> > > > > > > > Cc: Alex Deucher <alexander.deucher@....com>
> > > > > > > > Cc: "Christian König" <christian.koenig@....com>
> > > > > > > > Cc: David Airlie <airlied@...il.com>
> > > > > > > > Cc: Simona Vetter <simona@...ll.ch>
> > > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>
> > > > > > > > Cc: Maxime Ripard <mripard@...nel.org>
> > > > > > > > Cc: Thomas Zimmermann <tzimmermann@...e.de>
> > > > > > > > Cc: Lyude Paul <lyude@...hat.com>
> > > > > > > > Cc: Danilo Krummrich <dakr@...nel.org>
> > > > > > > > Cc: David Hildenbrand <david@...nel.org>
> > > > > > > > Cc: Oscar Salvador <osalvador@...e.de>
> > > > > > > > Cc: Andrew Morton <akpm@...ux-foundation.org>
> > > > > > > > Cc: Jason Gunthorpe <jgg@...pe.ca>
> > > > > > > > Cc: Leon Romanovsky <leon@...nel.org>
> > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@...cle.com>
> > > > > > > > Cc: Vlastimil Babka <vbabka@...e.cz>
> > > > > > > > Cc: Mike Rapoport <rppt@...nel.org>
> > > > > > > > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > > > > > > > Cc: Michal Hocko <mhocko@...e.com>
> > > > > > > > Cc: Balbir Singh <balbirs@...dia.com>
> > > > > > > > Cc: linuxppc-dev@...ts.ozlabs.org
> > > > > > > > Cc: kvm@...r.kernel.org
> > > > > > > > Cc: linux-kernel@...r.kernel.org
> > > > > > > > Cc: amd-gfx@...ts.freedesktop.org
> > > > > > > > Cc: dri-devel@...ts.freedesktop.org
> > > > > > > > Cc: nouveau@...ts.freedesktop.org
> > > > > > > > Cc: linux-mm@...ck.org
> > > > > > > > Cc: linux-cxl@...r.kernel.org
> > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@...el.com>
> > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@...el.com>
> > > > > > > > ---
> > > > > > > >  arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
> > > > > > > >  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
> > > > > > > >  drivers/gpu/drm/drm_pagemap.c            |  2 +-
> > > > > > > >  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
> > > > > > > >  include/linux/memremap.h                 |  9 ++++++---
> > > > > > > >  lib/test_hmm.c                           |  4 +++-
> > > > > > > >  mm/memremap.c                            | 20 +++++++++++++++++++-
> > > > > > > >  7 files changed, 32 insertions(+), 9 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > > > > > > index e5000bef90f2..7cf9310de0ec 100644
> > > > > > > > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > > > > > > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > > > > > > @@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
> > > > > > > >  
> > > > > > > >  	dpage = pfn_to_page(uvmem_pfn);
> > > > > > > >  	dpage->zone_device_data = pvt;
> > > > > > > > -	zone_device_page_init(dpage, 0);
> > > > > > > > +	zone_device_page_init(dpage, &kvmppc_uvmem_pgmap, 0);
> > > > > > > >  	return dpage;
> > > > > > > >  out_clear:
> > > > > > > >  	spin_lock(&kvmppc_uvmem_bitmap_lock);
> > > > > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > > > > > > index af53e796ea1b..6ada7b4af7c6 100644
> > > > > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > > > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > > > > > > @@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
> > > > > > > >  	page = pfn_to_page(pfn);
> > > > > > > >  	svm_range_bo_ref(prange->svm_bo);
> > > > > > > >  	page->zone_device_data = prange->svm_bo;
> > > > > > > > -	zone_device_page_init(page, 0);
> > > > > > > > +	zone_device_page_init(page, page_pgmap(page), 0);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  static void
> > > > > > > > diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> > > > > > > > index 03ee39a761a4..c497726b0147 100644
> > > > > > > > --- a/drivers/gpu/drm/drm_pagemap.c
> > > > > > > > +++ b/drivers/gpu/drm/drm_pagemap.c
> > > > > > > > @@ -201,7 +201,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
> > > > > > > >  					struct drm_pagemap_zdd *zdd)
> > > > > > > >  {
> > > > > > > >  	page->zone_device_data = drm_pagemap_zdd_get(zdd);
> > > > > > > > -	zone_device_page_init(page, 0);
> > > > > > > > +	zone_device_page_init(page, zdd->dpagemap->pagemap, 0);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  /**
> > > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > > > > > > index 58071652679d..3d8031296eed 100644
> > > > > > > > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > > > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > > > > > > > @@ -425,7 +425,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
> > > > > > > >  			order = ilog2(DMEM_CHUNK_NPAGES);
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > > -	zone_device_folio_init(folio, order);
> > > > > > > > +	zone_device_folio_init(folio, page_pgmap(folio_page(folio, 0)), order);
> > > > > > > >  	return page;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > > index 713ec0435b48..e3c2ccf872a8 100644
> > > > > > > > --- a/include/linux/memremap.h
> > > > > > > > +++ b/include/linux/memremap.h
> > > > > > > > @@ -224,7 +224,8 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > > > -void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > > > > > > > +			   unsigned int order);
> > > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > > @@ -234,9 +235,11 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
> > > > > > > >  
> > > > > > > >  unsigned long memremap_compat_align(void);
> > > > > > > >  
> > > > > > > > -static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
> > > > > > > > +static inline void zone_device_folio_init(struct folio *folio,
> > > > > > > > +					  struct dev_pagemap *pgmap,
> > > > > > > > +					  unsigned int order)
> > > > > > > >  {
> > > > > > > > -	zone_device_page_init(&folio->page, order);
> > > > > > > > +	zone_device_page_init(&folio->page, pgmap, order);
> > > > > > > >  	if (order)
> > > > > > > >  		folio_set_large_rmappable(folio);
> > > > > > > >  }
> > > > > > > > diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> > > > > > > > index 8af169d3873a..455a6862ae50 100644
> > > > > > > > --- a/lib/test_hmm.c
> > > > > > > > +++ b/lib/test_hmm.c
> > > > > > > > @@ -662,7 +662,9 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
> > > > > > > >  			goto error;
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > > -	zone_device_folio_init(page_folio(dpage), order);
> > > > > > > > +	zone_device_folio_init(page_folio(dpage),
> > > > > > > > +			       page_pgmap(folio_page(page_folio(dpage), 0)),
> > > > > > > > +			       order);
> > > > > > > >  	dpage->zone_device_data = rpage;
> > > > > > > >  	return dpage;
> > > > > > > >  
> > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > > index 63c6ab4fdf08..6f46ab14662b 100644
> > > > > > > > --- a/mm/memremap.c
> > > > > > > > +++ b/mm/memremap.c
> > > > > > > > @@ -477,10 +477,28 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > >  	}
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > -void zone_device_page_init(struct page *page, unsigned int order)
> > > > > > > > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > > > > > > > +			   unsigned int order)
> > > > > > > >  {
> > > > > > > > +	struct page *new_page = page;
> > > > > > > > +	unsigned int i;
> > > > > > > > +
> > > > > > > >  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> > > > > > > >  
> > > > > > > > +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> > > > > > > > +		struct folio *new_folio = (struct folio *)new_page;
> > > > > > > > +
> > > > > > > > +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > > > > > > 
> > > > > > > This seems odd to me, mainly due to the "magic" number. Why not just clear
> > > > > > > the flags entirely? Or at least explicitly just clear the flags you care about
> > > > > > > which would remove the need for the comment and should let you use the proper
> > > > > > > PageFlag functions.
> > > > > > > 
> > > > > > 
> > > > > > I'm copying this from folio_reset_order [1]. My paranoia about touching
> > > > > > anything related to struct page is high, so I did the same thing
> > > > > > folio_reset_order does here.
> > > > 
> > > > So why not just use folio_reset_order() below?
> > > > 
> > > > > > 
> > > > > > [1] https://elixir.bootlin.com/linux/v6.18.5/source/include/linux/mm.h#L1075
> > > > > > 
> > > > > 
> > > > > This immediately hangs my first SVM test...
> > > > > 
> > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > index 6f46ab14662b..ef8c56876cf5 100644
> > > > > --- a/mm/memremap.c
> > > > > +++ b/mm/memremap.c
> > > > > @@ -488,7 +488,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > > > >         for (i = 0; i < (1UL << order); ++i, ++new_page) {
> > > > >                 struct folio *new_folio = (struct folio *)new_page;
> > > > > 
> > > > > -               new_page->flags.f &= ~0xffUL;   /* Clear possible order, page head */
> > > > > +               new_page->flags.f = 0;
> > > > >  #ifdef NR_PAGES_IN_LARGE_FOLIO
> > > > >                 ((struct folio *)(new_page - 1))->_nr_pages = 0;
> > > > 
> > > > This seems wrong to me - I saw your reply to Balbir but for an order-0 page
> > > > isn't this going to access a completely different, possibly already allocated,
> > > > page?
> > > > 
> > > 
> > > No — it accesses itself (new_page). It just uses some odd memory tricks
> > > for this, which I agree isn’t the best thing I’ve ever written, but it
> > > was the least-worst idea I had. I didn’t design the folio/page field
> > > aliasing; I understand why it exists, but it still makes my head hurt.
> > 
> > And obviously mine, because I (was) still not getting it and had typed up a
> > whole response and code walk through to show what was wrong, in the hope it
> > would help settle the misunderstanding. Which it did, because I discovered
> > where I was getting things wrong. But I've left the analysis below because it's
> > probably useful for others following along:
> > 
> > Walking through the code we have:
> > 
> > void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > 			   unsigned int order)
> > {
> > 
> > The first argument, page, is the first in a set of 1 << order contiguous
> > struct page. In the simplest case order == 0, meaning this function should only
> > initialise (ie. touch) a single struct page pointer which is passed as the first
> > argument to the function.
> 
> Yes.
> 
> > 
> > 	struct page *new_page = page;
> > 
> > So now *new_page points to the single struct page we should touch.
> > 	
> > 	unsigned int i;
> > 
> > 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> > 
> > 	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> > 
> > order == 0, so this loop will only execute once.
> > 
> 
> Yes.
> 
> > 		struct folio *new_folio = (struct folio *)new_page;
> > 
> > new_page still points to the single page we're initialising, and new_folio
> > points to the same page. Ie: &new_folio->page == new_page. There is a hazard
> > here because new_folio->__page_1, __page_2, etc. all point to pages we shouldn't
> > touch.
> > 
> 
> Yes.
> 
> > 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > 
> > Clears the flags, makes sense.
> > 
> 
> +1
> 
> > #ifdef NR_PAGES_IN_LARGE_FOLIO
> > 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > 
> > If we break this down we have:
> > 
> > struct page *tmp_new_page = new_page - 1;
> > 
> > Which is the page before the one we're initialising and shouldn't be touched.
> > Then we cast to a folio:
> > 
> > struct folio *tmp_new_folio = (struct folio *) tmp_new_page;
> > 
> > And reset _nr_pages:
> > 
> > tmp_new_folio->_nr_pages = 0
> > 
> > And now I can see where I was confused - &tmp_new_folio->_nr_pages == &tmp_new_folio->__page_1.memcg_data == &new_page->memcg_data
> > 
> 
> Not 100% right, as _nr_pages is 4 bytes and memcg_data is 8, but the
> pointer base address is the same.
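
The aliasing discussed above can be checked with a small user-space sketch.
This is only an illustration under stated assumptions: struct mock_page and
struct mock_folio are hypothetical stand-ins, not the kernel's definitions,
and the 64-byte page with memcg_data as its last word is assumed purely so
that the offsets line up the way the thread describes. The helpers
nr_pages_owner() and clear_nr_pages_via_prev() are likewise made up for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock 64-byte page; only the position of memcg_data matters here. */
struct mock_page {
	uint64_t flags;
	uint64_t other[6];
	uint64_t memcg_data;	/* 8 bytes, last word of the page */
};

/* Mock folio: the head page followed by fields overlaying page + 1. */
struct mock_folio {
	struct mock_page page;
	uint64_t tail_flags;	/* overlays flags of page + 1 */
	uint64_t tail_other[6];
	uint32_t _nr_pages;	/* 4 bytes, overlays memcg_data of page + 1 */
};

/*
 * Index of the page whose storage holds _nr_pages when pages[head_idx]
 * is viewed as a folio. With this mock layout it is always head_idx + 1.
 */
static size_t nr_pages_owner(size_t head_idx)
{
	size_t byte_off = head_idx * sizeof(struct mock_page) +
			  offsetof(struct mock_folio, _nr_pages);

	return byte_off / sizeof(struct mock_page);
}

/*
 * Byte-wise stand-in for ((struct folio *)(new_page - 1))->_nr_pages = 0.
 * Requires idx >= 1 so the sketch never computes an address before the
 * start of the array.
 */
static void clear_nr_pages_via_prev(struct mock_page *pages, size_t idx)
{
	unsigned char *base = (unsigned char *)pages;

	assert(idx >= 1);
	memset(base + (idx - 1) * sizeof(struct mock_page) +
	       offsetof(struct mock_folio, _nr_pages), 0, sizeof(uint32_t));
}
```

In this mock, viewing page i - 1 as a folio lands _nr_pages inside page i,
and only 4 of the 8 memcg_data bytes are cleared, which matches the size
mismatch noted above. By the same arithmetic, viewing page i itself as a
folio would write into page i + 1.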
> 
> > So after both Balbir, probably yourself, and definitely myself scratching our
> > heads for way too long over this change I think we can conclude that the code as
> > is is way too confusing to merge without a lot more comments :-)
> > 
> 
> I think more comments is the way to go. More below.
> 
> > However why go through all this magic in the first place? Why not just treat
> > everything here as a page and just do
> > 
> > 	new_page->memcg_data = 0
> > 
> 
> Well, memcg_data is 8 bytes and _nr_pages is 4. They also have different
> #ifdef conditions around each field, etc.
> 
> I’ve also seen failures in our testing, and so has François, with the
> memcg_data change. I wish I had a stack trace to share or explain, but
> the times I hit the error I didn’t capture the dmesg, and I’ve been
> having issues with my dev machine today. If I catch the error again,
> I’ll reply with a stack trace and analysis.
> 
> > directly? That seems like the more straight forward approach. In fact given
> > all the confusion I wonder if it wouldn't be better to just do
> > memset(new_page, 0, sizeof(*new_page)) and reinitialise everything from
> > scratch.
> 
> I had considered this option too, but I’d be a little concerned about
> the performance. Reinitializing a zone page/folio is a hot path, as this
> is typically done in a GPU fault handler. I think adding verbose
> comments explaining why this works, plus some follow-up helpers, might
> be the better option.
> 
> > 
> > > folio->_nr_pages is page + 1 for reference (new_page after this math).
> > > Again, if I touched this memory directly in new_page, it’s most likely
> > > memcg_data, but that is hidden behind a Kconfig.
> > > 
> > > This is just blindly implementing the part of folio_reset_order which
> > > clears _nr_pages.
> > 
> > Yeah, I get it now. But I think just clearing memcg_data would be the easiest to
> > understand approach, especially if it had a comment explaining that it may have
> > previously been used for _nr_pages.
> > 
> 
> See above — the different sizes, the failure I’m seeing, and the
> conflicting #ifdefs are why this is not my preferred option.
> 
> > > > >  #endif
> > > > > 
> > > > > I can walk through exactly which flags need to be cleared, but my
> > > > > feeling is that likely any flag that the order field overloads and can
> > > > > possibly encode should be cleared—so bits 0–7 based on the existing
> > > > > code.
> > > > > 
> > > > > How about in a follow-up we normalize setting / clearing the order flag
> > > > > field with a #define and an inline helper?
> > > > 
> > > > Ie: Would something like the following work:
> > > > 
> > > > 		ClearPageHead(new_page);
> > > 
> > > Any of these bits could possibly be set by the order field in a folio,
> > > which modifies the flags field of page + 1.
> > > 
> > > 	PG_locked,		/* Page is locked. Don't touch. */
> > > 	PG_writeback,		/* Page is under writeback */
> > > 	PG_referenced,
> > > 	PG_uptodate,
> > > 	PG_dirty,
> > > 	PG_lru,
> > > 	PG_head,		/* Must be in bit 6 */
> > > 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
> > > 
> > > So a common order-9 (2MB) folio would have PG_locked | PG_uptodate set.
> > > Now we get stuck on the next page lock because PG_locked is set.
> > > Offhand, I don’t know if different orders—which set different bits—cause
> > > any nasty issues either. So I figured the safest thing was to clear any
> > > bits which folio order can set within subsequent page's memory flags
> > > like folio_reset_order does.
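
The order-9 example above is plain bit arithmetic: 9 is binary 1001, so
storing it in the low byte of a flags word sets bit 0 and bit 3. A minimal
sketch, assuming the bit positions exactly as listed in the thread; the
mock_ names and helpers below are hypothetical and are not the kernel's
page-flags API.

```c
#include <assert.h>

/* Bit positions as listed in the thread; a mock enum, not the kernel header. */
enum mock_pageflags {
	MOCK_PG_locked,		/* bit 0 */
	MOCK_PG_writeback,	/* bit 1 */
	MOCK_PG_referenced,	/* bit 2 */
	MOCK_PG_uptodate,	/* bit 3 */
	MOCK_PG_dirty,		/* bit 4 */
	MOCK_PG_lru,		/* bit 5 */
	MOCK_PG_head,		/* bit 6 */
	MOCK_PG_waiters,	/* bit 7 */
};

/* Low byte of the flags word, the same range the patch clears with ~0xffUL. */
#define MOCK_ORDER_MASK 0xffUL

/* Store a folio order in the low byte of a flags word. */
static unsigned long mock_store_order(unsigned long flags, unsigned int order)
{
	assert(order <= MOCK_ORDER_MASK);
	return (flags & ~MOCK_ORDER_MASK) | order;
}

/* Test one flag bit, the way a later page-flag check would see it. */
static int mock_test_flag(unsigned long flags, enum mock_pageflags bit)
{
	return (int)((flags >> bit) & 1UL);
}

/* Clear every bit a stored order could have set; upper bits are kept. */
static unsigned long mock_clear_order_bits(unsigned long flags)
{
	return flags & ~MOCK_ORDER_MASK;
}
```

With these helpers, mock_store_order(0, 9) yields a word in which the
MOCK_PG_locked and MOCK_PG_uptodate bits read as set, and
mock_clear_order_bits() removes them again while leaving bits above the low
byte untouched.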
> > 
> > Oh, I get the above. I was thinking folio_reset_order() below would clear the
> > flags, but I see the folly there - that resets the flags for the next page.
> > 
> 
> Correct.
> 
> > > 
> > > > 		clear_compound_head(new_page);
> > > > 		folio_reset_order(new_folio);
> > > > 
> > > > Which would also deal with setting _nr_pages.
> > > >
> > > 
> > > folio_reset_order(new_folio) would set _nr_pages in the memory that is
> > > new_page + 1. So let's say that page has a ref count + memcg_data, now
> > > that memory is corrupted and will crash the kernel.
> > 
> > Yep, I just noticed that. Thanks for pointing that out.
> > 
> > > All of the above is why it took me multiple hours to write 6 lines of
> > > code :).
> > 
> > And to review :) Good thing we don't get paid per SLOC of code right?
> >
> 
> I don’t think anyone would touch core MM if pay were based on SLOC — it
> would be a terrible career choice. :)
> 
> All joking aside, I think the next revision should use this version,
> plus more comments and helpers/defines in a follow-up—which I’ll commit
> to—along with fixing the branch mismatch Andrew pointed out between
> drm-tip (which this series is based on) and 6.19 (where this patch needs
> to apply).

I believe the best branch for this series would be drm-misc-next indeed.

But this patch in particular needs multiple acks to get through drm trees.
At least one from each block:

## arch/powerpc/kvm/book3s_hv_uvmem.c
Madhavan Srinivasan <maddy@...ux.ibm.com> (maintainer:KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc))
Nicholas Piggin <npiggin@...il.com> (reviewer:KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc))
Michael Ellerman <mpe@...erman.id.au> (maintainer:LINUX FOR POWERPC (32-BIT AND 64-BIT))
"Christophe Leroy (CS GROUP)" <chleroy@...nel.org> (reviewer:LINUX FOR POWERPC (32-BIT AND 64-BIT))

## include/linux/memremap.h
David Hildenbrand <david@...nel.org> (maintainer:MEMORY HOT(UN)PLUG)
Oscar Salvador <osalvador@...e.de> (maintainer:MEMORY HOT(UN)PLUG)

## lib/test_hmm.c
Andrew Morton <akpm@...ux-foundation.org> (maintainer:LIBRARY CODE)
Jason Gunthorpe <jgg@...pe.ca> (maintainer:HMM - Heterogeneous Memory Management)
Leon Romanovsky <leon@...nel.org> (maintainer:HMM - Heterogeneous Memory Management)

## mm/memremap.c
David Hildenbrand <david@...nel.org> (maintainer:MEMORY HOT(UN)PLUG)
Oscar Salvador <osalvador@...e.de> (maintainer:MEMORY HOT(UN)PLUG)
Andrew Morton <akpm@...ux-foundation.org> (maintainer:MEMORY MANAGEMENT)


On the other hand, we would also need Max to do one extra, last pull request
towards 7.0 after we get this merged, because our window in drm closed
earlier.

Or this patch goes to any regular mm tree, and we hold the drm ones
until after we backmerge 7.0-rc1.

> 
> Matt
> 
> >  - Alistair
> > 
> > > > > Matt
> > > > > 
> > > > > > > > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > > > > > > > +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > > > > > > > +#endif
> > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > > > > > > > +		new_folio->share = 0;   /* fsdax only, unused for device private */
> > > > > > > 
> > > > > > > It would be nice if the FS DAX code actually used this as well. Is there a
> > > > > > > reason that change was dropped from the series?
> > > > > > > 
> > > > > > 
> > > > > > I don't have a test platform for FS DAX. In prior revisions, I was just
> > > > > > moving existing FS DAX code to a helper, which I felt confident about.
> > > > > > 
> > > > > > This revision is slightly different, and I don't feel comfortable
> > > > > > modifying FS DAX code without a test platform. I agree we should update
> > > > > > FS DAX, but that should be done in a follow-up with coordinated testing.
> > > > 
> > > > Fair enough, I figured something like this might be your answer :-) You
> > > > could update it and ask people with access to such a system to test it though
> > > > (unfortunately my setup has bit-rotted beyond repair).
> > > > 
> > > > But I'm ok leaving it for a future change.
> > > >
> > > 
> > > I did a quick grep in fs/dax.c and don’t see zone_device_page_init
> > > called there. It probably could be used if it’s creating compound pages
> > > and drop the open-coded reinit when shared == 0, but yeah, that’s not
> > > something I can blindly code without testing.
> > > 
> > > I can try to put something together for people to test soonish.
> > > 
> > > Matt
> > > 
> > > > > > 
> > > > > > Matt 
> > > > > > 
> > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > >  	/*
> > > > > > > >  	 * Drivers shouldn't be allocating pages after calling
> > > > > > > >  	 * memunmap_pages().
> > > > > > > > -- 
> > > > > > > > 2.43.0
> > > > > > > >