Message-ID: <44157147-c424-4cc0-9302-ccf42c648247@redhat.com>
Date: Tue, 5 Aug 2025 16:10:45 +0200
From: David Hildenbrand <david@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Alex Williamson <alex.williamson@...hat.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"lizhe.67@...edance.com" <lizhe.67@...edance.com>
Subject: Re: [GIT PULL] VFIO updates for v6.17-rc1
On 05.08.25 15:55, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote:
>
>>> David, there is another alternative to prevent this, simple though a
>>> bit wasteful, just allocate a bit bigger to ensure the allocation
>>> doesn't end on an exact PAGE_SIZE boundary?
>>
>> :/ In particular, doing that through the memblock in sparse_init_nid(), I am
>> not so sure that's a good idea.
>
> It would probably be some work to make larger allocations to avoid
> padding :\
>
>> I prefer Linus' proposal, as it avoids the one nth_page(), unless any other
>> approach can help us get rid of more nth_page() usage -- and I don't think
>> your proposal could, right?
>
> If the above were solved - so the struct page allocations could be
> larger than a section, arguably just the entire range being plugged,
> then I think you also solve the nth_page() problem too.
>
> Effectively the nth_page() problem is that we allocate the struct page
> arrays on an arbitrary section-by-section basis, and then the arch sets
> MAX_ORDER so that a folio can cross sections, effectively guaranteeing
> to virtually fragment the page *'s inside folios.
>
> Doing a giant vmalloc at the start so you could also cheaply add some
> padding would effectively also prevent the nth_page problem as we can
> reasonably say that no folio should extend past an entire memory
> region.
>
> Maybe there is some reason we can't do a giant vmalloc on these
> systems that also can't do SPARSE_VMMEMAP :\ But perhaps we could get
> up to MAX_ORDER at least? Or perhaps we could have those systems
> reduce MAX_ORDER?
>
> So, I think they are loosely linked problems.
There are some weird scenarios where you hotplug memory after boot
memory, and suddenly you can runtime-allocate a gigantic folio that
spans both ranges etc.
So while related, the corner cases are all a bit nasty, and just
forbidding folios to span a memory section on these problematic configs
(sparse !vmemmap) sounds interesting.
As Linus said, x86-64 and arm64 are already VMEMMAP-only. s390x allows
for gigantic folios, and VMEMMAP might still be configurable. Same for
ppc at least. Not sure about riscv and others, will have to dig.
That way we could just naturally make folio_page() and folio_page_idx()
simpler (and, IIRC, some GUP code as well, where we still have to use
nth_page()).
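Roughly (quoting the current definitions from memory, so take them with a
grain of salt), the !VMEMMAP special-casing could just go away:

	/* today: has to detour through the pfn on sparse !vmemmap */
	#define folio_page(folio, n)	nth_page(&(folio)->page, n)

	/* if no folio may span a section, plain pointer arithmetic
	 * would be valid on all configs */
	#define folio_page(folio, n)		(&(folio)->page + (n))
	#define folio_page_idx(folio, p)	((p) - &(folio)->page)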
>
> I suppose this is also a limitation with Linus's suggestion. It
> doesn't give the optimal answer for 1G pages on these older systems:
>
> 	for (size_t nr = 1; nr < nr_pages; nr++) {
> 		if (*pages++ != ++page)
> 			break;
> 	}
>
> Since that will exit at every section boundary.
Yes. If folios can no longer span a section in these configs, then we'd
be good if we stop when we cross a section. We'd still always cover the
full folio.
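A rough sketch of what I mean (untested, reusing the names from the
snippet above):

	unsigned long pfn = page_to_pfn(page);
	size_t nr;

	for (nr = 1; nr < nr_pages; nr++) {
		/* "++page" is only meaningful within one section on
		 * sparse !vmemmap, so stop before crossing over */
		if (IS_ALIGNED(pfn + nr, PAGES_PER_SECTION))
			break;
		if (*pages++ != ++page)
			break;
	}

The IS_ALIGNED check on the pfn is cheap and avoids ever touching a
memmap entry from the next section.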
>
> At least for scatterlist like cases the point of this function is just
> to speed things up. If it returns short the calling code should still
> be directly checking phys_addr contiguity anyhow.
Same for vfio I think.
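I.e., if the helper returns short, a caller that really needs contiguity
can verify it via the pfn, which works across sections; something like
(name made up):

	/* sketch: physical contiguity check that does not rely on
	 * memmap (struct page) virtual contiguity */
	static inline bool pages_pfn_contig(struct page *a, struct page *b)
	{
		return page_to_pfn(a) + 1 == page_to_pfn(b);
	}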
--
Cheers,
David / dhildenb