Message-ID: <aK/7rgrUdC2cBiYd@yzhao56-desk.sh.intel.com>
Date: Thu, 28 Aug 2025 14:48:14 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve,
 Vishal" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "michael.roth@....com"
	<michael.roth@....com>, "Weiny, Ira" <ira.weiny@...el.com>
Subject: Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of
 a poor equivalent

On Thu, Aug 28, 2025 at 10:56:18AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> > S-EPT pages to ensure that the count can't go negative without KVM
> > noticing.  In theory, checking for '0' and then decrementing in a separate
> > operation could miss a 0=>-1 transition.  In practice, such a condition is
> > impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> > actually need to be an atomic (that wart will be addressed shortly).
> > 
> > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> > ensures the VM is dead, i.e. there's no point in trying to limp along.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> > ---
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> 
> This area has gone through a lot of designs. In the v19 era PAGE.ADD got
> performed deep inside the fault by stuffing the source page in the vCPU. Then we
> switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populate
> the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we
> switched to current code which does something like KVM_PRE_FAULT_MEMORY
> internally, then looks up what got faulted and does the PAGE.ADD. Then the
> version in this series which does it even more directly.
Right.
If we invoke PAGE.ADD directly in tdx_sept_set_private_spte() (similar to
PAGE.AUG), then we'd need some way to pass in the source page info.

So, rather than passing the source page around, we opted to record the count
of pages that are mapped in the M-EPT but still unmapped in the S-EPT, i.e.,

1. map a page in the M-EPT
2. increase nr_premapped
3. map the page in the S-EPT
4. decrease nr_premapped

If a page is zapped from the M-EPT before step 3, nr_premapped is decreased at
zap time. If step 3 still executes successfully after the M-EPT zap,
nr_premapped is decreased again; the imbalance caused by the double decrease
indicates a mismatch between the M-EPT and the S-EPT.
If step 3 never runs, or fails, the count stays balanced and that's fine.


> nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't
> like it, but it was needed because userspace could do unexpected things. Now it
> seems like its only purpose is to generate a KVM_BUG_ON() in
> tdx_sept_zap_private_spte(). I wonder if we could drop it all together and
> accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific
> error case.
> 
> Yan, am I missing something?
Hmm, I still think it's safer to keep nr_premapped so that any unexpected
code change is detected.
