Message-ID: <eb94d115-18a6-455b-b020-f18f372e283a@nvidia.com>
Date: Sat, 17 Jan 2026 15:42:16 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Matthew Brost <matthew.brost@...el.com>, Jason Gunthorpe <jgg@...dia.com>
Cc: Vlastimil Babka <vbabka@...e.cz>,
 Francois Dugast <francois.dugast@...el.com>, intel-xe@...ts.freedesktop.org,
 dri-devel@...ts.freedesktop.org, Zi Yan <ziy@...dia.com>,
 Alistair Popple <apopple@...dia.com>,
 Madhavan Srinivasan <maddy@...ux.ibm.com>, Nicholas Piggin
 <npiggin@...il.com>, Michael Ellerman <mpe@...erman.id.au>,
 "Christophe Leroy (CS GROUP)" <chleroy@...nel.org>,
 Felix Kuehling <Felix.Kuehling@....com>,
 Alex Deucher <alexander.deucher@....com>,
 Christian König <christian.koenig@....com>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>,
 David Hildenbrand <david@...nel.org>, Oscar Salvador <osalvador@...e.de>,
 Andrew Morton <akpm@...ux-foundation.org>, Leon Romanovsky
 <leon@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>, Mike Rapoport
 <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Michal Hocko <mhocko@...e.com>, linuxppc-dev@...ts.ozlabs.org,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 amd-gfx@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
 linux-mm@...ck.org, linux-cxl@...r.kernel.org
Subject: Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device
 private folios

On 1/17/26 14:55, Matthew Brost wrote:
> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>> I suppose we could be getting say an order-9 folio that was previously used
>>>> as two order-8 folios? And each of them had their _nr_pages in their head
>>>
>>> Yes, this is a good example. At this point we have no idea what the
>>> previous allocation(s) order(s) were - we could have multiple places in
>>> the loop where _nr_pages is populated, thus we have to clear this everywhere.
>>
>> Why? The fact you have to use such a crazy expression to even access
>> _nr_pages strongly says nothing will read it as _nr_pages.
>>
>> Explain each thing:
>>
>> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>
>> OK, the tail page flags need to be set right, and prep_compound_page()
>> called later depends on them being zero.
>>
>> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
>>
>> Can't see a reason, nothing reads _nr_pages from a random tail
>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> 		new_folio->mapping = NULL;
>>
>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
>>
>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> 		new_folio->share = 0;   /* fsdax only, unused for device private */
>>
>> Not sure, certainly share isn't read from a tail page..
>>
>>>>> Why can't this use the normal helpers, like memmap_init_compound()?
>>>>>
>>>>>  struct folio *new_folio = page
>>>>>
>>>>>  /* First 4 tail pages are part of struct folio */
>>>>>  for (i = 4; i < (1UL << order); i++) {
>>>>>      prep_compound_tail(..)
>>>>>  }
>>>>>
>>>>>  prep_compound_head(page, order)
>>>>>  new_folio->_nr_pages = 0
>>>>>
>>>>> ??
>>>
>>> I've beat this to death with Alistair, normal helpers do not work here.
>>
>> What do you mean? It already calls prep_compound_page()! The issue
>> seems to be that prep_compound_page() makes assumptions about what
>> values are in flags already?
>>
>> So how about move that page flags mask logic into
>> prep_compound_tail()? I think that would help Vlastimil's
>> concern. That function is already touching most of the cache line so
>> an extra word shouldn't make a performance difference.
>>
>>> An order zero allocation could have _nr_pages set in its page,
>>> new_folio->_nr_pages is page + 1 memory.
>>
>> An order zero allocation does not have _nr_pages because it is in page
>> +1 memory that doesn't exist.
>>
>> An order zero allocation might have memcg_data in the same slot, does
>> it need zeroing? If so why not add that to prep_compound_head() ?
>>
>> Also, prep_compound_head() handles order 0 too:
>>
>> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
>> 		atomic_set(&folio->_pincount, 0);
>> 		atomic_set(&folio->_entire_mapcount, -1);
>> 	}
>> 	if (order > 1)
>> 		INIT_LIST_HEAD(&folio->_deferred_list);
>>
>> So some of the problem here looks to be not calling it:
>>
>> 	if (order)
>> 		prep_compound_page(page, order);
>>
>> So, remove that if ? Also shouldn't it be moved above the
>> set_page_count/lock_page ?
>>
> 
> I'm not addressing each comment, some might be valid, others are not.
> 
> Ok, can I rework this in a follow-up - I will commit to that? Anything
> we touch here is extremely sensitive to failures - Intel is the primary
> test vector for any modification to device pages from what I can tell.
> 
> The fact is that large device pages do not really work without this
> patch, or prior revs. I’ve spent a lot of time getting large device
> pages stable, both here and in the initial series, and I am committing
> to help in follow-on series that touch SVM-related things.
> 

Matthew, I feel your frustration and appreciate your help.
For the current state of 6.19, your changes work for me, and I have added a
Reviewed-by to the patch. It affects a small number of drivers and makes
them work for zone device folios. I am happy to maintain the changes
sent out as part of zone_device_page_init().

We can rework the details in a follow-up series; there are many ideas
and ways of doing this (Jason, Alistair, and Zi have good ideas as well).

> I’m going to miss my merge window with this (RB’d) patch blocked for
> large device pages. Expect my commitment to helping other vendors to
> drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> try again.
> 
> Or perhaps we just revert large device pages in 6.19 if we can't get a
> consensus here as we shouldn't ship a non-functional kernel.
> 
> Matt
> 
>> Jason
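
For readers following the "order-9 folio that was previously used as two
order-8 folios" example quoted above, the stale-metadata concern can be
sketched with a small userspace model. This is only an illustration under
stated assumptions: struct toy_page, TOY_PG_HEAD and every toy_* function
are hypothetical names invented here, the layout is not the kernel's
struct page or struct folio, and toy_prep_compound() deliberately assumes
it is handed clean pages, which is the assumption the thread is debating.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PG_HEAD   (1UL << 0)  /* hypothetical "head page" flag bit */
#define TOY_MAX_PAGES 512         /* enough for one order-9 range */

/* Toy stand-in for per-page metadata; NOT the kernel's struct page. */
struct toy_page {
	unsigned long flags;
	unsigned long order;     /* meaningful on a head page only */
	unsigned long nr_pages;  /* meaningful on a head page only */
	size_t head_idx;         /* index of the owning head page */
};

/* Build a compound range, assuming the pages start out clean. */
static void toy_prep_compound(struct toy_page *pages, size_t start,
			      unsigned int order)
{
	size_t nr = (size_t)1 << order;

	assert(start + nr <= TOY_MAX_PAGES);
	for (size_t i = 1; i < nr; i++)
		pages[start + i].head_idx = start;  /* tails point at head */
	pages[start].flags |= TOY_PG_HEAD;
	pages[start].order = order;
	pages[start].nr_pages = nr;
	pages[start].head_idx = start;
}

/* Clear head-only metadata on every page in the range before reuse. */
static void toy_reinit(struct toy_page *pages, size_t start,
		       unsigned int order)
{
	size_t nr = (size_t)1 << order;

	for (size_t i = 0; i < nr; i++) {
		pages[start + i].flags &= ~TOY_PG_HEAD;
		pages[start + i].order = 0;
		pages[start + i].nr_pages = 0;
	}
}

/* Count tail pages that still look like head pages. */
static size_t toy_count_stale(const struct toy_page *pages, size_t start,
			      unsigned int order)
{
	size_t nr = (size_t)1 << order, stale = 0;

	for (size_t i = 1; i < nr; i++)
		if ((pages[start + i].flags & TOY_PG_HEAD) ||
		    pages[start + i].nr_pages != 0)
			stale++;
	return stale;
}

/* Two order-8 ranges are "freed" and then reused as one order-9 range. */
size_t toy_scenario(int reinit_first)
{
	static struct toy_page pages[TOY_MAX_PAGES];

	memset(pages, 0, sizeof(pages));
	toy_prep_compound(pages, 0, 8);
	toy_prep_compound(pages, 256, 8);
	/* Freeing is modelled as a no-op: old metadata is left behind. */
	if (reinit_first)
		toy_reinit(pages, 0, 9);
	toy_prep_compound(pages, 0, 9);
	return toy_count_stale(pages, 0, 9);
}
```

In this model, toy_scenario(0) leaves one stale head (the old order-8 head
at index 256) inside the new order-9 range, while toy_scenario(1) leaves
none. Whether the real fix belongs in a separate reinitialization step or
inside the prep helpers is exactly the open question in the thread; the
sketch does not settle it.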

