Message-ID: <Z9P0As/Wv/8PDBNN@yzhao56-desk.sh.intel.com>
Date: Fri, 14 Mar 2025 17:16:50 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Michael Roth <michael.roth@....com>
CC: Vishal Annapurve <vannapurve@...gle.com>, <kvm@...r.kernel.org>,
	<linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <jroedel@...e.de>, <thomas.lendacky@....com>,
	<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
	<amit.shah@....com>, <pratikrajesh.sampat@....com>, <ashish.kalra@....com>,
	<liam.merwick@...cle.com>, <david@...hat.com>, <ackerleytng@...gle.com>,
	<quic_eberman@...cinc.com>
Subject: Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness
 tracking changes

On Wed, Feb 19, 2025 at 07:09:57PM -0600, Michael Roth wrote:
> On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> > On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@....com> wrote:
> > >
> > > This patchset is also available at:
> > >
> > >   https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> > >
> > > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > > a snapshot of his patches[1] to provide tracking of whether or not
> > > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> > > before guest access:
> > >
> > >   d55475f23cea KVM: gmem: track preparedness a page at a time
> > >   64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> > >   17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> > >   e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> > >
> > >   [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
> > >
> > > This series addresses some of the pending review comments for those patches
> > > (feel free to squash/rework as-needed), and implements a first real user in
> > > the form of a reworked version of Sean's original 2MB THP support for gmem.
> > >
> > 
> > Looking at Fuad's work to add in-place memory conversion support via
> > [1], and Ackerley's future work to address hugetlb page support, can
> > the state tracking for preparedness be simplified as follows?
> > i) prepare guest memfd ranges the first time an offset with
> > mappability = GUEST is allocated, or the first time an allocated
> > offset has its mappability set to GUEST. Some scenarios that would
> > lead to guest memfd range preparation:
> >      - Create file with default mappability to host, fallocate, convert
> >      - Create file with default mappability to Guest, guest faults on
> > private memory
> 
> Yes, this seems like a compelling approach. One aspect that still
> remains is knowing *when* the preparation has been done, so that the
> next time a private page is accessed we can skip re-preparing it:
> either when re-faulting it into the guest (e.g. because it was
> originally mapped 2MB and then a sub-page got converted to shared, so
> the still-private pages need to be re-faulted in as 4K), or via some
> other path where KVM needs to grab the private PFN via
> kvm_gmem_get_pfn() but not actually read/write to it (I think the
> GHCB AP_CREATION path for bringing up APs might do this).
> 
> We could just keep re-checking the RMP table to see if the PFN was
> already set to private, but I think one of the design goals of the
> preparedness tracking was to have gmem itself be aware of this rather
> than farming it out to platform-specific data structures/tracking.
> 
> So as a proof of concept I've been experimenting with using Fuad's
> series ([1] in your response) and adding an additional GUEST_PREPARED
> state so that it can be tracked via the same mappability xarray (or
> whatever data structure we end up using for mappability-tracking).
> In that case GUEST becomes sort of a transient state that can be set
> in advance of actual allocation/fault-time.

Hi Michael,

We are currently working on enabling 2M huge pages on TDX.
We noticed this series and hope it could also work with TDX huge pages.

While disallowing <2M page conversion is not ideal for TDX either, we
think it would be great if we could start with 2M pages and
non-in-place conversion first. In that case, is the memory fragmentation
caused by partial discarding a problem for you [1]? Is page promotion a
must in your initial huge page support?
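
To make sure we are reading the GUEST_PREPARED proposal correctly, here
is a rough userspace sketch of the per-offset state tracking as we
understand it (all names are ours, and a plain array stands in for the
mappability xarray; this is not code from the actual patches):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-offset states: Fuad's HOST/GUEST mappability plus
 * the extra GUEST_PREPARED state discussed in this thread. */
enum gmem_state {
	GMEM_HOST,           /* shared, mappable by host */
	GMEM_GUEST,          /* private, prepare hook not yet run */
	GMEM_GUEST_PREPARED, /* private, prepare hook already done */
};

#define NR_OFFSETS 512 /* toy stand-in for the per-inode xarray */

struct gmem_file {
	enum gmem_state state[NR_OFFSETS];
	int nr_prepares; /* counts prepare-hook invocations */
};

/* Toy stand-in for kvm_arch_gmem_prepare(). */
static void arch_prepare(struct gmem_file *f, size_t idx)
{
	f->nr_prepares++;
}

/* Fault-time lookup: run the prepare hook only on the first private
 * access, then remember that via GUEST_PREPARED. */
static enum gmem_state gmem_get_pfn(struct gmem_file *f, size_t idx)
{
	if (f->state[idx] == GMEM_GUEST) {
		arch_prepare(f, idx);
		f->state[idx] = GMEM_GUEST_PREPARED;
	}
	return f->state[idx];
}

/* Conversion to shared drops preparedness for the offset. */
static void gmem_convert_to_shared(struct gmem_file *f, size_t idx)
{
	f->state[idx] = GMEM_HOST;
}

static void gmem_convert_to_guest(struct gmem_file *f, size_t idx)
{
	f->state[idx] = GMEM_GUEST;
}
```

i.e. GUEST is the transient "private but not yet prepared" state that
can be set in advance of allocation/fault-time, and a shared->private
round trip causes the prepare hook to run again on the next private
fault.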

Do you have any repo containing your latest POC?

Thanks
Yan

[1] https://lore.kernel.org/all/Z9PyLE%2FLCrSr2jCM@yzhao56-desk.sh.intel.com/
