linux-kernel - Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios

Message-ID: <y7dm2sqf5t5txirxkbu7hlmsfsnlbtdirgn4ts2l4st3z4kawo@qpa56ysy5v3t>
Date: Mon, 19 Jan 2026 16:41:42 +1100
From: Alistair Popple <apopple@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, 
	Francois Dugast <francois.dugast@...el.com>, intel-xe@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org, 
	Matthew Brost <matthew.brost@...el.com>, Zi Yan <ziy@...dia.com>, 
	Madhavan Srinivasan <maddy@...ux.ibm.com>, Nicholas Piggin <npiggin@...il.com>, 
	Michael Ellerman <mpe@...erman.id.au>, "Christophe Leroy (CS GROUP)" <chleroy@...nel.org>, 
	Felix Kuehling <Felix.Kuehling@....com>, Alex Deucher <alexander.deucher@....com>, 
	Christian König <christian.koenig@....com>, David Airlie <airlied@...il.com>, 
	Simona Vetter <simona@...ll.ch>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, 
	Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>, 
	David Hildenbrand <david@...nel.org>, Oscar Salvador <osalvador@...e.de>, 
	Andrew Morton <akpm@...ux-foundation.org>, Leon Romanovsky <leon@...nel.org>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>, 
	Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, Balbir Singh <balbirs@...dia.com>, linuxppc-dev@...ts.ozlabs.org, 
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org, amd-gfx@...ts.freedesktop.org, 
	nouveau@...ts.freedesktop.org, linux-mm@...ck.org, linux-cxl@...r.kernel.org
Subject: Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device
 private folios

On 2026-01-17 at 11:19 +1100, Jason Gunthorpe <jgg@...dia.com> wrote...
> On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> > >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >> +		/*
> > >> +		 * This pointer math looks odd, but new_page could have been
> > >> +		 * part of a previous higher order folio, which sets _nr_pages
> > >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> > >> +		 * correctly locate the _nr_pages bits within new_page which
> > >> +		 * could have modified by previous higher order folio.
> > >> +		 */
> > >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >> +#endif
> > > 
> > > This seems too weird, why is it in the loop?  There is only one
> > > _nr_pages per folio.

Yeah, I don't really know what the motivation is for going via the folio
field, which needs the odd pointer math, versus just setting
page->memcg_data = 0 directly. That would work equally well and would have
avoided a lot of confusion.
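
To illustrate why the two writes are equivalent, here is a minimal userspace
sketch. The mock_page and mock_folio structs and all three helper functions
are hypothetical, heavily simplified stand-ins, not the kernel definitions;
the only property they are meant to share with the real layout is that the
folio's _nr_pages field lands on the memcg_data word of the second page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct page. */
struct mock_page {
	unsigned long flags;
	unsigned long words[5];		/* lru, mapping, index, private, ... */
	unsigned int mapcount;
	unsigned int refcount;
	unsigned long memcg_data;
};

/* Hypothetical stand-in for struct folio: overlays two consecutive pages. */
struct mock_folio {
	struct mock_page page;		/* first page */
	unsigned long flags_1;		/* from here on: overlays the second page */
	unsigned long usable_1[5];
	unsigned int mapcount_1;
	unsigned int refcount_1;
	unsigned int nr_pages;
};

/* In this mock layout, nr_pages shares storage with page[1].memcg_data. */
_Static_assert(offsetof(struct mock_folio, nr_pages) ==
	       sizeof(struct mock_page) + offsetof(struct mock_page, memcg_data),
	       "nr_pages must alias the second page's memcg_data");

/* Simulate an earlier large folio leaving nr_pages behind in page + 1. */
void set_stale_nr_pages(struct mock_page *first, unsigned int n)
{
	memcpy((unsigned char *)first + offsetof(struct mock_folio, nr_pages),
	       &n, sizeof(n));
}

/*
 * Mirrors ((struct folio *)(new_page - 1))->_nr_pages = 0. memcpy is used
 * instead of a real cast to stay within userspace aliasing rules. new_page
 * must not be the first element of its array in this sketch.
 */
void clear_via_folio_view(struct mock_page *new_page)
{
	unsigned char *base = (unsigned char *)(new_page - 1);
	unsigned int zero = 0;

	memcpy(base + offsetof(struct mock_folio, nr_pages), &zero, sizeof(zero));
}

/* The direct alternative: clear the field on the page itself. */
void clear_directly(struct mock_page *new_page)
{
	new_page->memcg_data = 0;
}
```

With a two-element mock_page array, calling set_stale_nr_pages() on element 0
makes element 1's memcg_data non-zero, and either clear function applied to
element 1 resets it to zero.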

> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> > and we can't know that at this point so we have to reset everything?
> 
> Er, did I miss something - who reads _nr_pages from a random tail
> page? Doesn't everything working with random tail pages read order,
> compute the head page, cast to folio and then access _nr_pages?
> 
> > Or maybe you mean that stray _nr_pages in some tail page from previous
> > lifetimes can't affect the current lifetime in a wrong way for something
> > looking at said page? I don't know immediately.
> 
> Yes, exactly.
> 
> Basically, what bytes exactly need to be set to what in tail pages for
> the system to work? Those should be set.
> 
> And if we want to have things set on free that's fine too, but there
> should be reasons for doing stuff, and this weird thing above makes
> zero sense.

You can't think of these as tail pages or head pages. They are just random
struct pages, possibly order-0 or PageHead or PageTail, with fields in a
"random" state based on what they were last used for.

All this function should be trying to do is initialising this random state to
something sane as defined by the core-mm for it to consume. Yes, some might
later end up being tail (or head) pages if order > 0 and prep_compound_page()
is called. But the point of this function and the loop is to initialise the
struct page as an order-0 page with "sane" fields to pass to core-mm or call
prep_compound_page() on.

This could, for example, just use memset(new_page, 0, sizeof(struct page)) and
then refill all the fields correctly (although Vlastimil pointed out that some
page flags need preserving). But a big part of the problem is that there is no
single definition (AFAIK) of what state a struct page should be in before
handing it to the core-mm via either vm_insert_page()/pages()/etc. or
migrate_vma_*(), or of what state the kernel leaves it in once freed.
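
The memset-and-refill idea can be sketched in userspace as follows. Everything
here is an assumption made for illustration: demo_page is a hypothetical
stand-in for struct page, DEMO_PRESERVED_FLAGS is an invented mask standing in
for whichever flag bits need to survive, and a refcount of 1 is just this
sketch's choice of "sane" initial state.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mask: flag bits assumed to need preserving across reinit. */
#define DEMO_PRESERVED_FLAGS 0xff000000UL

/* Hypothetical, heavily simplified stand-in for struct page. */
struct demo_page {
	unsigned long flags;
	unsigned long private_data;
	unsigned int refcount;
	unsigned int locked;
	unsigned long memcg_data;
};

/* Reset a page in an arbitrary stale state to a known order-0 state. */
void demo_reinit_page(struct demo_page *page)
{
	/* Save the bits that must survive before wiping everything. */
	unsigned long keep = page->flags & DEMO_PRESERVED_FLAGS;

	memset(page, 0, sizeof(*page));

	/* Refill only the fields that need a non-zero value. */
	page->flags = keep;
	page->refcount = 1;	/* assumption of this sketch */
}
```

The appeal of this shape is that any field a previous user left behind, such
as a stale memcg_data, is cleared without having to be named individually.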

I would like to see this addressed because it leads to all sorts of weirdness.
For example, vm_insert_page() and migrate_vma_*() both require the page
refcount to be 1 for no good reason: drivers usually have to drop it
immediately after the call, and they implicitly own the ZONE_DEVICE page
lifetimes anyway, so why make them hold a reference just to map the page? Yet
only migrate_vma_*() requires the page to be locked (so other ZONE_DEVICE
users just have to immediately unlock).

And I presume page->memcg_data must be set to zero, or Matthew wouldn't have
run into the problems that prompted him to reinit it. But I don't really know
what other requirements there are for setting page fields; they all sort of
come implicitly from the vm_insert_page/migrate_vma APIs.

 - Alistair

> Jason