linux-kernel - Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aK/vfyw5lyIZgdH7@yzhao56-desk.sh.intel.com>
Date: Thu, 28 Aug 2025 13:56:15 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "michael.roth@....com"
	<michael.roth@....com>, "Weiny, Ira" <ira.weiny@...el.com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold
 tdx_mem_page_record_premap_cnt() into its sole caller

On Thu, Aug 28, 2025 at 11:13:11AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote:
> > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > > TDH.MEM.PAGE.REMOVE.
> > 
> > Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> > underlying problem is unexpectedly removing pages from the initial, whether
> > the
> > VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
> > 
> > > As a result, the TD will still run with uninitialized memory.
> > 
> > No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> > "hole" that the guest might not expect, but that hole will trigger an EPT
> > violation and only get "filled" if the guest explicitly accepts an AUG'd page.
> 
> Ah, I just responded on another patch. I wonder if we can get rid of the premap
> cnt.

I think keeping it is safer.
See my explanation at [1].

[1] https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com.

> > 
> > Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> > nice with tdh_mem_page_add() failure necessitates both the
> > tdx_is_sept_zap_err_due_to_premap() craziness and the check in
> > tdx_td_finalize()
> > that all pending pages have been consumed.
> 
> Reasons that tdh_mem_page_add() could get BUSY:
> 1. If two vCPU's tried to tdh_mem_page_add() the same gpa at the same time  they
> could contend the SEPT entry lock
> 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e.
> tdh_mem_range_block()).
Hmm, two tdh_mem_page_add()s can't contend as they are protected by both
slot_lock and filemap lock.

With regard to the contention to tdh_mem_range_block(), please check my analysis
at the above [1].

tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes
KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU.

Please check more details at [2].

[2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/


> I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a
> possibility.
2 is possible only for paranoid zaps.
See "case 3. Unexpected zaps" in [1].


> > What reasonable use case is there for gracefully handling tdh_mem_page_add()
> > failure?
> > 
> > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> > case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> > tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> > pages mapped into the S-EPT are ADDed, then it can assert that there are no
> > pending pages when it completes (even if it "fails"), and similarly
> > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> > non-zero.
> 
> Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe
> even for a single call of it, until someone wants to parallelize the operation.
Hmm. I prefer returning -BUSY directly as invoking KVM_TDX_INIT_MEM_REGION 
before finishing initializing all vCPUs are uncommon.