linux-kernel - Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness tracking changes


Message-ID: <18db10a0-bd40-4c6a-b099-236f4dcaf0cf@redhat.com>
Date: Fri, 14 Mar 2025 10:33:07 +0100
From: David Hildenbrand <david@...hat.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: "Shah, Amit" <Amit.Shah@....com>,
 "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
 "Roth, Michael" <Michael.Roth@....com>,
 "liam.merwick@...cle.com" <liam.merwick@...cle.com>,
 "seanjc@...gle.com" <seanjc@...gle.com>, "jroedel@...e.de"
 <jroedel@...e.de>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "Sampat, Pratik Rajesh" <PratikRajesh.Sampat@....com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "Lendacky, Thomas" <Thomas.Lendacky@....com>, "vbabka@...e.cz"
 <vbabka@...e.cz>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>,
 "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
 "Kalra, Ashish" <Ashish.Kalra@....com>,
 "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
 "vannapurve@...gle.com" <vannapurve@...gle.com>
Subject: Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness
 tracking changes

On 14.03.25 10:09, Yan Zhao wrote:
> On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
>> (split is possible if there are no unexpected folio references; private
>> pages cannot be GUP'ed, so it is feasible)
> ...
>>>> Note that I'm not quite sure about the "2MB" interface, should it be a
>>>> "PMD-size" interface?
>>>
>>> I think Mike and I touched upon this aspect too - and I may be
>>> misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
>>> in increments -- and then fitting in PMD sizes when we've had enough of
>>> those.  That is to say he didn't want to preclude it, or gate the PMD
>>> work on enabling all sizes first.
>>
>> Starting with 2M is reasonable for now. The real question is how we want to
>> deal with
> Hi David,
> 

Hi!

> I'm just trying to understand the background of in-place conversion.
> 
> Regarding the two issues you mentioned with THP and non-in-place-conversion,
> I have some questions (still based on starting with 2M):
> 
>> (a) Not being able to allocate a 2M folio reliably
> If we start by faulting in private pages from guest_memfd (not in page pool way)
> and shared pages anonymously, is it correct to say that this is only a concern
> when memory is under pressure?

Usually, fragmentation starts being a problem under memory pressure, and 
memory pressure can show up simply because the page cache makes use of as 
much memory as it wants.

As soon as we start allocating a 2 MB page for guest_memfd, to then 
split it up + free only some parts back to the buddy (on private->shared 
conversion), we create fragmentation that cannot get resolved as long as 
the remaining private pages are not freed. A new conversion from 
shared->private on the previously freed parts will allocate other 
unmovable pages (not the freed ones) and make fragmentation worse.

In-place conversion improves that quite a lot, because guest_memfd itself 
will not cause unmovable fragmentation. Of course, under memory 
pressure, when we cannot allocate a 2M page for guest_memfd, it's 
unavoidable. But then, we already had fragmentation (and did not really 
cause any new one).
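
To make the difference concrete, here is a minimal toy model of the two 
cases. It is only a sketch and not kernel code: memory is a list of 2M 
blocks of 512 4K pages, and the names (first_free, blocks_with_private) 
and the assumption that freed pages get reused by someone else before the 
shared->private conversion are made up for illustration. It counts how 
many 2M blocks end up holding at least one unmovable private page.

```python
PAGES_PER_2M = 512  # 4 KiB pages per 2 MiB block


def first_free(owner):
    """Return (block, page) of the first free 4K page in the toy memory."""
    for b, block in enumerate(owner):
        for p, who in enumerate(block):
            if who is None:
                return b, p
    raise MemoryError("toy memory exhausted")


def blocks_with_private(nr_blocks, nr_converted, in_place):
    """Count 2M blocks that hold at least one unmovable private page."""
    # owner[b][p] is None (free), "private" (guest_memfd, unmovable)
    # or "other" (a page that now belongs to someone else).
    owner = [[None] * PAGES_PER_2M for _ in range(nr_blocks)]

    # guest_memfd allocates one 2M folio: block 0 is fully private.
    owner[0] = ["private"] * PAGES_PER_2M

    if not in_place:
        # private->shared: split the folio and free the converted pages.
        # Assumption: somebody else reuses them in the meantime.
        for p in range(nr_converted):
            owner[0][p] = "other"
        # shared->private: new unmovable pages have to come from elsewhere.
        for _ in range(nr_converted):
            b, p = first_free(owner)
            owner[b][p] = "private"
    # In-place conversion: pages stay in the folio in both directions,
    # so nothing is freed to or allocated from the buddy.

    return sum(1 for block in owner if "private" in block)


print(blocks_with_private(4, 16, in_place=False))  # 2
print(blocks_with_private(4, 16, in_place=True))   # 1
```

With the same number of private pages in both runs, the non-in-place case 
leaves unmovable pages spread over two 2M blocks, the in-place case over 
one.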

We discussed in the upstream call that if guest_memfd (primarily) only 
allocates 2M pages and frees 2M pages, it will not cause fragmentation 
itself, which is pretty nice.

> 
>> (b) Partial discarding
> For shared pages, are page migration and folio split possible for shared THP?

I assume by "shared" you mean "not guest_memfd, but some other memory we 
use as an overlay" -- so no in-place conversion.

Yes, that should be possible as long as nothing else prevents 
migration/split (e.g., longterm pinning).

> 
> For private pages, as you pointed out earlier, if we can ensure there are no
> unexpected folio references for private memory, splitting a private huge folio
> should succeed. 

Yes, and maybe (hopefully) we'll reach a point where private parts will 
not have a refcount at all (initially, frozen refcount, discussed during 
the last upstream call).

> Are you concerned about the memory fragmentation after repeated
> partial conversions of private pages to and from shared?

Not only repeated, even just a single partial conversion. But of course, 
repeated partial conversions will make it worse (e.g., never getting a 
private huge page back when there was a partial conversion).

-- 
Cheers,

David / dhildenb