linux-kernel - Re: [PATCH] mm/gup: restore the ability to pin more than 2GB at a time

Message-ID: <a2b3f866-b7e8-43a9-a3e5-74d46032541c@nvidia.com>
Date: Wed, 30 Oct 2024 17:47:36 -0700
From: John Hubbard <jhubbard@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: David Hildenbrand <david@...hat.com>, Alistair Popple
 <apopple@...dia.com>, Christoph Hellwig <hch@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
 linux-stable@...r.kernel.org, Vivek Kasireddy <vivek.kasireddy@...el.com>,
 Dave Airlie <airlied@...hat.com>, Gerd Hoffmann <kraxel@...hat.com>,
 Matthew Wilcox <willy@...radead.org>, Peter Xu <peterx@...hat.com>,
 Arnd Bergmann <arnd@...db.de>, Daniel Vetter <daniel.vetter@...ll.ch>,
 Dongwon Kim <dongwon.kim@...el.com>, Hugh Dickins <hughd@...gle.com>,
 Junxiao Chang <junxiao.chang@...el.com>,
 Mike Kravetz <mike.kravetz@...cle.com>, Oscar Salvador <osalvador@...e.de>
Subject: Re: [PATCH] mm/gup: restore the ability to pin more than 2GB at a
 time

On 10/30/24 5:25 PM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2024 at 05:17:25PM -0700, John Hubbard wrote:
>> On 10/30/24 5:02 PM, Jason Gunthorpe wrote:
>>> On Wed, Oct 30, 2024 at 11:34:49AM -0700, John Hubbard wrote:
>>>
>>>>   From a very high level design perspective, it's not yet clear to me
>>>> that there is either a "preferred" or "not recommended" aspect to
>>>> pinning in batches vs. all at once here, as long as one stays
>>>> below the type (int, long, unsigned...) limits of the API. Batching
>>>> seems like what you do if the internal implementation is crippled
>>>> and unable to meet its API requirements. So the fact that many
>>>> callers do batching is sort of "tail wags dog".
>>>
>>> No.. all things need to do batching because nothing should be storing
>>> a linear struct page array that is so enormous. That is going to
>>> create vmemap pressure that is not desirable.
>>
>> Are we talking about the same allocation size here? It's not 2GB. It
>> is enough folio pointers to cover 2GB of memory, so 4MB.
> 
> Is 2GB a hard limit? I was expecting this was a range that had upper
> bounds of 100GB's like for rdma.. Then it is 400MB, and yeah, that is
> not great.
>

No, 2GB (the amount of memory being pinned, thus a 4MB array of folio
pointers) is just the point at which the allocation code typically switches
over from kmalloc to vmalloc (internal to kvmalloc)--for a freshly booted
machine, that is.

For some reason, I've had "a few GB" in mind as kind of a "likely as much as
people will request" limit, rather than 100's of GB, just from what I've seen.
However, I don't have much additional data about how user space (which does
the allocation requests, in the end) behaves, either. Maybe it is actually
quite rare to do such large allocation requests. Or maybe not.

But yes, if this went 10x+ higher, it would definitely be "too much".
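
To make the arithmetic behind those numbers concrete, here is a rough
sketch. It assumes 4 KiB base pages, 8-byte pointers, and the worst case of
one folio pointer per base page. The macro and function names are made up
for illustration and are not kernel APIs.

```c
#include <stdint.h>

/*
 * Assumptions for illustration only: 4 KiB base pages and 8-byte
 * pointers, as on a typical 64-bit configuration.
 */
#define ASSUMED_PAGE_SIZE 4096ULL
#define ASSUMED_PTR_SIZE  8ULL

/* Number of base pages needed to cover `bytes` of memory, rounded up. */
static uint64_t pages_to_cover(uint64_t bytes)
{
	return (bytes + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE;
}

/*
 * Size in bytes of an array holding one pointer per base page, i.e. the
 * worst case in which every folio is a single base page.
 */
static uint64_t pointer_array_bytes(uint64_t pinned_bytes)
{
	return pages_to_cover(pinned_bytes) * ASSUMED_PTR_SIZE;
}
```

Under those assumptions, pinning 2GB needs 524288 pointers, which is the 4MB
array mentioned above, and a 10x larger request needs a 40MB array.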


>> That high level guidance makes sense, but here we are attempting only
>> a 4MB physically contiguous allocation, and if larger than that, then
>> it goes to vmalloc() which is merely virtually contiguous.
> 
> AFAIK any contiguous allocation beyond 4K basically doesn't work
> reliably in a server environment due to fragmentation.
> 
> So you are always using the vmemap..
> 
>> I'm writing this because your adjectives make me suspect that you
>> are referring to a 2GB allocation. But this is orders of magnitude
>> smaller.
> 
> Even 4MB I would wonder about getting it split to PAGE_SIZE chunks
> instead of vmemmap, but I don't know what it is being used for.
> 

For a 64-bit system, I think we have quite a healthy chunk of vmalloc() space
(ignoring, with some effort, the multiple KASLR bugs that have been recently
messing that up), right? I mean, your points about keeping kernel
allocations small or at least reasonable are resonating with me, but it's
also true that the numbers are much bigger with 64-bit systems.


thanks,
-- 
John Hubbard