Message-ID: <aWsdv6dX2RgqajFQ@lstrano-desk.jf.intel.com>
Date: Fri, 16 Jan 2026 21:27:27 -0800
From: Matthew Brost <matthew.brost@...el.com>
To: Balbir Singh <balbirs@...dia.com>
CC: Jason Gunthorpe <jgg@...dia.com>, Vlastimil Babka <vbabka@...e.cz>,
Francois Dugast <francois.dugast@...el.com>,
<intel-xe@...ts.freedesktop.org>, <dri-devel@...ts.freedesktop.org>, Zi Yan
<ziy@...dia.com>, Alistair Popple <apopple@...dia.com>, Madhavan Srinivasan
<maddy@...ux.ibm.com>, Nicholas Piggin <npiggin@...il.com>, Michael Ellerman
<mpe@...erman.id.au>, "Christophe Leroy (CS GROUP)" <chleroy@...nel.org>,
Felix Kuehling <Felix.Kuehling@....com>, Alex Deucher
<alexander.deucher@....com>, Christian König
<christian.koenig@....com>, David Airlie <airlied@...il.com>, Simona Vetter
<simona@...ll.ch>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>, "David
Hildenbrand" <david@...nel.org>, Oscar Salvador <osalvador@...e.de>, "Andrew
Morton" <akpm@...ux-foundation.org>, Leon Romanovsky <leon@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R . Howlett"
<Liam.Howlett@...cle.com>, Mike Rapoport <rppt@...nel.org>, "Suren
Baghdasaryan" <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
<linuxppc-dev@...ts.ozlabs.org>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <amd-gfx@...ts.freedesktop.org>,
<nouveau@...ts.freedesktop.org>, <linux-mm@...ck.org>,
<linux-cxl@...r.kernel.org>
Subject: Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device
private folios
On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> On 1/17/26 14:55, Matthew Brost wrote:
> > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> >>>> I suppose we could be getting say an order-9 folio that was previously used
> >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> >>>
> >>> Yes, this is a good example. At this point we have no idea what previous
> >>> allocation(s) order(s) were - we could have multiple places in the loop
> >>> where _nr_pages is populated, thus we have to clear this everywhere.
> >>
> >> Why? The fact you have to use such a crazy expression to even access
> >> _nr_pages strongly says nothing will read it as _nr_pages.
> >>
> >> Explain each thing:
> >>
> >> new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */
> >>
> >> OK, the tail page flags need to be set right, and prep_compound_page()
> >> called later depends on them being zero.
> >>
> >> ((struct folio *)(new_page - 1))->_nr_pages = 0;
> >>
> >> Can't see a reason, nothing reads _nr_pages from a random tail
> >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> >> memcg_data, which is also not supposed to be read from a tail page?
> >>
> >> new_folio->mapping = NULL;
> >>
> >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
> >>
> >> new_folio->pgmap = pgmap; /* Also clear compound head */
> >>
> >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> >>
> >> new_folio->share = 0; /* fsdax only, unused for device private */
> >>
> >> Not sure, certainly share isn't read from a tail page..
> >>
> >>>>> Why can't this use the normal helpers, like memmap_init_compound()?
> >>>>>
> >>>>> struct folio *new_folio = page
> >>>>>
> >>>>> /* First 4 tail pages are part of struct folio */
> >>>>> for (i = 4; i < (1UL << order); i++) {
> >>>>> prep_compound_tail(..)
> >>>>> }
> >>>>>
> >>>>> prep_compound_head(page, order)
> >>>>> new_folio->_nr_pages = 0
> >>>>>
> >>>>> ??
> >>>
> >>> I've beat this to death with Alistair, normal helpers do not work here.
> >>
> >> What do you mean? It already calls prep_compound_page()! The issue
> >> seems to be that prep_compound_page() makes assumptions about what
> >> values are in flags already?
> >>
> >> So how about move that page flags mask logic into
> >> prep_compound_tail()? I think that would help Vlastimil's
> >> concern. That function is already touching most of the cache line so
> >> an extra word shouldn't make a performance difference.
> >>
> >>> An order zero allocation could have _nr_pages set in its page,
> >>> new_folio->_nr_pages is page + 1 memory.
> >>
> >> An order zero allocation does not have _nr_pages because it is in page
> >> +1 memory that doesn't exist.
> >>
> >> An order zero allocation might have memcg_data in the same slot, does
> >> it need zeroing? If so why not add that to prep_compound_head() ?
> >>
> >> Also, prep_compound_head() handles order 0 too:
> >>
> >> if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> >> atomic_set(&folio->_pincount, 0);
> >> atomic_set(&folio->_entire_mapcount, -1);
> >> }
> >> if (order > 1)
> >> INIT_LIST_HEAD(&folio->_deferred_list);
> >>
> >> So some of the problem here looks to be not calling it:
> >>
> >> if (order)
> >> prep_compound_page(page, order);
> >>
> >> So, remove that if ? Also shouldn't it be moved above the
> >> set_page_count/lock_page ?
> >>
> >
> > I'm not addressing each comment, some might be valid, others are not.
> >
> > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > we touch here is extremely sensitive to failures - Intel is the primary
> > test vector for any modification to device pages from what I can tell.
> >
> > The fact is that large device pages do not really work without this
> > patch, or prior revs. I've spent a lot of time getting large device
> > pages stable, both here and in the initial series, and committing to
> > help in follow-on series that touch SVM-related things.
> >
>
> Matthew, I feel your frustration and appreciate your help.
> For the current state of 6.19, your changes work for me, I added a
> Reviewed-by to the patch. It affects a small number of drivers and makes
> them work for zone device folios. I am happy to maintain the changes
> sent out as a part of zone_device_page_init()
>
+1
> We can rework the details in a follow up series, there are many ideas
> and ways of doing this (Jason, Alistair, Zi have good ideas as well).
>
I agree we can rework this in a follow-up. The core MM is hard, and for
valid reasons, but we can all work together on cleaning it up.
Matt
> > I’m going to miss my merge window with this (RB’d) patch blocked for
> > large device pages. Expect my commitment to helping other vendors to
> > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > try again.
> >
> > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > consensus here as we shouldn't ship a non-functional kernel.
> >
> > Matt
> >
> >> Jason
>
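
For readers following the "order-9 folio that was previously used as two
order-8 folios" example from earlier in the thread, here is a minimal
user-space sketch in C of how a stale per-folio value can survive
reinitialization. It is not kernel code: struct page_model, MODEL_HEAD_BIT
and the model_* functions are hypothetical names invented for illustration,
and the two-field layout is an assumption that only mimics the idea that
_nr_pages lives in the struct page following the folio head. Whether such a
leftover value is ever read is exactly what the thread debates; the sketch
only shows that it persists unless every entry is cleared.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct page. NOT the real kernel layout. */
struct page_model {
	unsigned long flags;    /* low byte models head marker / order bits */
	unsigned long nr_pages; /* models the slot holding _nr_pages when this
				 * entry is the first tail of a large folio */
};

#define MODEL_HEAD_BIT 0x40UL /* hypothetical bit inside the low byte */

/* Model of preparing a compound allocation of 2^order entries at 'start'. */
static void model_prep_compound(struct page_model *pages, size_t start,
				unsigned int order)
{
	assert(pages != NULL);
	if (order == 0)
		return; /* order-0: no tail entry, nothing to write */
	pages[start].flags |= MODEL_HEAD_BIT;
	pages[start + 1].flags |= order;          /* order kept in first tail */
	pages[start + 1].nr_pages = 1UL << order; /* only this slot is written */
}

/*
 * Model of reinitializing a range for a new allocation. The low flag byte
 * is always masked; nr_pages is cleared on every entry only if clear_all
 * is non-zero.
 */
static void model_reinit(struct page_model *pages, size_t start,
			 unsigned int order, int clear_all)
{
	size_t nr = (size_t)1 << order;

	for (size_t i = 0; i < nr; i++) {
		pages[start + i].flags &= ~0xffUL;
		if (clear_all)
			pages[start + i].nr_pages = 0;
	}
	model_prep_compound(pages, start, order);
}

/* Count entries whose nr_pages slot does not match the new allocation. */
static size_t model_count_stale(const struct page_model *pages, size_t start,
				unsigned int order)
{
	size_t nr = (size_t)1 << order, stale = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long expect =
			(order > 0 && i == 1) ? (1UL << order) : 0;

		if (pages[start + i].nr_pages != expect)
			stale++;
	}
	return stale;
}
```

In this model, starting from a zeroed 512-entry array, preparing two order-8
ranges at indices 0 and 256 and then calling model_reinit() at index 0 with
order 9 leaves one stale slot (index 257) when clear_all is 0, and none when
clear_all is 1.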