linux-kernel - Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZuBQYvY6Ib4ZYBgx@google.com>
Date: Tue, 10 Sep 2024 06:57:54 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	Yan Y Zhao <yan.y.zhao@...el.com>, Yuan Yao <yuan.yao@...el.com>, 
	"nik.borisov@...e.com" <nik.borisov@...e.com>, "dmatlack@...gle.com" <dmatlack@...gle.com>, 
	Kai Huang <kai.huang@...el.com>, "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with
 operand SEPT

On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On 9/9/24 23:11, Sean Christopherson wrote:
> > In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> > its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> > will be no contention.  And if the TDX module's single-step protection spuriously
> > kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> > that it's mitigating single-step, e.g. so that KVM can forward the information
> > to userspace, then that's a TDX module problem to solve.
> 
> In principle I agree but we also need to be pragmatic.  Exiting to userspace
> may not be practical in all flows, for example.
> 
> First of all, we can add a spinlock around affected seamcalls.

No, because that defeates the purpose of having mmu_lock be a rwlock.

> This way we know that "busy" errors must come from the guest and have set
> HOST_PRIORITY.

We should be able to achieve that without a VM-wide spinlock.  My thought (from
v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
it set until the SEAMCALL completes.

> It is still kinda bad that guests can force the VMM to loop, but the VMM can
> always say enough is enough.  In other words, let's assume that a limit of
> 16 is probably appropriate but we can also increase the limit and crash the
> VM if things become ridiculous.
> 
> Something like this:
> 
> 	static u32 max = 16;
> 	int retry = 0;
> 	spin_lock(&kvm->arch.seamcall_lock);
> 	for (;;) {
> 		args_in = *in;
> 		ret = seamcall_ret(op, in);
> 		if (++retry == 1) {
> 			/* protected by the same seamcall_lock */
> 			kvm->stat.retried_seamcalls++;
> 		} else if (retry == READ_ONCE(max)) {
> 			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
> 			if (KVM_BUG_ON(kvm, retry == 1024)) {
> 				pr_err("Crashing due to lock contention in the TDX module\n");
> 				break;
> 			}
> 			cmpxchg(&max, retry, retry * 2);
> 		}
> 	}
> 	spin_unlock(&kvm->arch.seamcall_lock);
> 
> This way we can do some testing and figure out a useful limit.

2 :-)

One try that guarantees no other host task is accessing the S-EPT entry, and a
second try after blasting IPI to kick vCPUs to ensure no guest-side task has
locked the S-EPT entry.

My concern with an arbitrary retry loop is that we'll essentially propagate the
TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
etc...

> For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> not any of the MEM seamcalls.  For that one to be resolved, it should be
> enough to do take and release the mmu_lock back to back, which ensures that
> all pending critical sections have completed (that is,
> "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> loop.  Adding a vCPU stat for that one is a good idea, too.

As above and in my discussion with Rick, I would prefer to kick vCPUs to force
forward progress, especially for the zero-step case.  If KVM gets to the point
where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
kicks in, then it's time to kick and wait, not keep retrying blindly.

There is still risk of a hang, e.g. if a CPU fails to respond to the IPI, but
that's a possibility that always exists.  Kicking vCPUs allows KVM to know with
100% certainty that a SEAMCALL should succeed.

Hrm, the wrinkle is that if we want to guarantee success, the vCPU kick would
need to happen when the SPTE is frozen, to ensure some other host task doesn't
"steal" the lock.