Message-ID: <25e5dcc794435f1ae8afbead17eee460c1da9aae.camel@intel.com>
Date: Fri, 23 May 2025 23:40:25 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"Huang, Kai" <kai.huang@...el.com>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>, "Li,
 Zhiquan1" <zhiquan1.li@...el.com>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Yamahata,
 Isaku" <isaku.yamahata@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
	<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
	<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > So, you want to disallow huge pages for non-Linux TDs; then we have no need
> > > to support splitting in the fault path, right?
> > > 
> > > I'm OK if we don't care about non-Linux TDs for now.
> > > This can simplify the splitting code and we can add the support when there's a
> > > need.
> > 
> > We do need to care about non-Linux TDs functioning, but we don't need to
> > optimize for them at this point. We need to optimize for things that happen
> > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > order to work.
> > 
> > Yesterday Kirill and I were chatting offline about the newly defined
> > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > 1. Guest accepts at 2MB
> > 2. Guest releases at 2MB (no notice to VMM)
> > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > 
> > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > things at 4k.
> > 
> > For full coverage of the issue, can we discuss a little bit about what demote in
> > the fault path would look like?
> For demote in the fault path, it will take the mmu read lock.
> 
> So, the flow in the fault path is:
> 1. zap with mmu read lock.
>    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
>    if (ret <= 0)
>        return ret;
> 2. track with mmu read lock
>    ret = tdx_track(kvm, true);
>    if (ret)
>        return ret;
> 3. demote with mmu read lock
>    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
>    if (ret)
>        goto err;
> 4. return success or unzap as error fallback.
>    tdx_sept_unzap_private_spte(kvm, gfn, level);
> 
> Steps 1-3 will return -EBUSY on busy error (which will not happen very often,
> as we will introduce kvm_tdx->sept_lock; I can post the full lock analysis if
> necessary).
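[For illustration only, the four steps quoted above could be sketched as a
single helper. This is a stubbed, userspace C sketch; the tdx_* names mirror
the flow above, but the stubs, return-value convention, and run_demote()
helper are hypothetical, not the actual KVM code.]

```c
#include <assert.h>
#include <errno.h>

/* Stubbed stand-ins for the SEAMCALL wrappers in steps 1-4 above.
 * Return-value convention is simplified: 0 = success, -EBUSY = busy,
 * and a positive zap result means "zapped, proceed". */
static int zap_ret = 1, track_ret, demote_ret;
static int unzapped;				/* set when the step-4 fallback runs */

static int tdx_sept_zap_private_spte(void)	{ return zap_ret; }
static int tdx_track(void)			{ return track_ret; }
static int tdx_spte_demote_private_spte(void)	{ return demote_ret; }
static void tdx_sept_unzap_private_spte(void)	{ unzapped = 1; }

/* Demote under mmu read lock: zap -> track -> demote, with unzap as the
 * error fallback only when the demote step itself fails. */
static int fault_path_demote(void)
{
	int ret;

	ret = tdx_sept_zap_private_spte();	/* step 1: zap */
	if (ret <= 0)
		return ret;			/* busy, or nothing to do */

	ret = tdx_track();			/* step 2: track */
	if (ret)
		return ret;

	ret = tdx_spte_demote_private_spte();	/* step 3: demote */
	if (ret)
		goto err;

	return 0;				/* step 4: success */
err:
	tdx_sept_unzap_private_spte();		/* step 4: unzap fallback */
	return ret;
}

/* Test helper: run one scenario with the given stub results. */
static int run_demote(int zap, int track, int demote)
{
	zap_ret = zap;
	track_ret = track;
	demote_ret = demote;
	unzapped = 0;
	return fault_path_demote();
}
```

Note that, as in the quoted flow, a busy zap or track returns directly and
only a failed demote rolls back via unzap.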

True, it would not be taken very often. It's not a performance issue, but I
think we should not add a lock if we can avoid it at all. It creates a special
case for TDX in the TDP MMU. People would then have to keep in mind that two
mmu read lock threads could still contend.

[snip]
> 
> 
> > The current zapping operation that is involved
> > depends on mmu write lock. And I remember you had a POC that added essentially a
> > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> Right. The kvm_tdx->sept_lock is introduced as an rw lock. The write lock is
> held for a very short period, around tdh_mem_sept_remove(),
> tdh_mem_range_block() and tdh_mem_range_unblock().
> 
> The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> module.
> 
>   Resources          SHARED  users              EXCLUSIVE users 
> -----------------------------------------------------------------------
>  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
>                    tdh_mem_page_aug            tdh_mem_sept_remove
>                    tdh_mem_page_remove         tdh_mem_range_block
>                    tdh_mem_page_promote        tdh_mem_range_unblock
>                    tdh_mem_page_demote
> 
> > the fault path demote case could actually handle failure. So if we just returned
> > busy and didn't try to force the retry, we would just run the risk of
> > interfering with the TDX module's sept lock? Is that the only issue with a
> > design that would allow demote to fail in the fault path?
> The concern with supporting split in the fault path is mainly to avoid
> unnecessary splits, e.g., when two vCPUs try to accept at different levels.

We are just talking about keeping rare TDs functional here, right? Two cases
are:
 - TDs using PAGE.RELEASE
 - TDs using pending #VEs and accepting memory in strange patterns

Not maintaining huge pages there seems totally acceptable. The way I look at
this whole thing is that it is just an optimization, not a feature. Every
aspect has a complexity/performance tradeoff that we need to make a sensible
decision on. Maintaining huge page mappings in every possible case is not the
goal.

> 
> Besides that, we need to introduce 3 locks inside TDX:
> rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.

Huh?

> 
> To ensure the success of unzap (to restore the state), kicking vCPUs in the
> fault path is required, which is not ideal. But with the introduced lock and
> the proposed TDX module change to tdg_mem_page_accept() (as in the next
> comment), the chance of invoking unzap is very low.

Yes, it's probably not safe to expect the exact same demote call chain again.
The fault path could maybe learn to recover from the blocked state?

> 
> > Let's keep in mind that we could ask for TDX module changes to enable this path.
> We may need a TDX module change to let tdg_mem_page_accept() not take the
> lock on a non-ACCEPTable entry, to avoid contention with the guest and the
> potential TDX_HOST_PRIORITY_BUSY_TIMEOUT error.

Part of that is already in the works (accepting not-present entries). It seems
reasonable. But also, what about having the TDX module do the full demote
operation internally? The track part obviously happens outside of the TDX
module, but maybe the whole thing could be simplified.

> 
> > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > the complication is avoiding zero-step (sept lock), we should fix that instead
> > of designing around it further.
> Ok.
> 
> > > 

I'll respond to the error code half of this mail separately.
