Message-ID: <CABgObfayLGyWKERXkU+0gjeUg=Sp3r7GEQU=+13sUMpo36weWg@mail.gmail.com>
Date: Tue, 10 Sep 2024 17:16:13 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	Yan Y Zhao <yan.y.zhao@...el.com>, Yuan Yao <yuan.yao@...el.com>, 
	"nik.borisov@...e.com" <nik.borisov@...e.com>, "dmatlack@...gle.com" <dmatlack@...gle.com>, 
	Kai Huang <kai.huang@...el.com>, "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with
 operand SEPT

On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@...gle.com> wrote:
> On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> No, because that defeats the purpose of having mmu_lock be a rwlock.

But if this part of the TDX module is wrapped in a single big
try_lock, there is no difference between spinning on busy seamcalls
and doing spin_lock(&kvm->arch.seamcall_lock): both hit contention in
the same way.  With respect to FROZEN_SPTE...

> > This way we know that "busy" errors must come from the guest and have set
> > HOST_PRIORITY.
>
> We should be able to achieve that without a VM-wide spinlock.  My thought (from
> v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> it set until the SEAMCALL completes.

Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
which documents that the TDX module returns TDX_OPERAND_BUSY on a
CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
to prevent contention in the TDX module.
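For concreteness, a minimal user-space sketch of the per-SPTE-spinlock
idea: one reserved bit of the SPTE word is set with cmpxchg and kept set
until the SEAMCALL completes.  The bit position and helper names here are
made up for illustration, not KVM's actual FROZEN_SPTE definitions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit choice, not KVM's real FROZEN_SPTE encoding. */
#define SPTE_FROZEN_BIT (1ULL << 59)

static bool spte_try_freeze(_Atomic uint64_t *sptep)
{
	uint64_t old = atomic_load(sptep);

	do {
		if (old & SPTE_FROZEN_BIT)
			return false;	/* another host task owns the entry */
	} while (!atomic_compare_exchange_weak(sptep, &old,
					       old | SPTE_FROZEN_BIT));
	return true;
}

static void spte_thaw(_Atomic uint64_t *sptep)
{
	/* Clear the bit only after the SEAMCALL has completed. */
	atomic_fetch_and(sptep, ~SPTE_FROZEN_BIT);
}
```

A second spte_try_freeze() on the same entry fails until spte_thaw() runs,
which is exactly the "keep it set until the SEAMCALL completes" behavior.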

If we want to be a bit more optimistic, we can do something more
sophisticated, like only taking the lock after the first busy reply. But
the spinlock is the easiest way to completely remove host-induced
TDX_OPERAND_BUSY, leaving only the guest-induced ones to deal with.
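A sketch of that optimistic variant, with a pthread mutex standing in for
the hypothetical VM-wide seamcall_lock and a fake SEAMCALL (the
SEAMCALL_BUSY value and all names are placeholders, not the real ABI):

```c
#include <pthread.h>

#define SEAMCALL_BUSY 1	/* placeholder, not the real TDX status value */

static pthread_mutex_t seamcall_lock = PTHREAD_MUTEX_INITIALIZER;

static int sept_seamcall(int (*seamcall)(void))
{
	/* Optimistic first attempt, no serialization. */
	int ret = seamcall();

	if (ret != SEAMCALL_BUSY)
		return ret;

	/* Contended: fall back to the VM-wide lock for the retry. */
	pthread_mutex_lock(&seamcall_lock);
	ret = seamcall();
	pthread_mutex_unlock(&seamcall_lock);
	return ret;
}

/* Demo stand-in: returns BUSY on the first call, success afterwards. */
static int demo_calls;
static int demo_seamcall(void)
{
	return demo_calls++ == 0 ? SEAMCALL_BUSY : 0;
}
```

The uncontended path never touches the lock, so the rwlock's parallelism
is preserved except when the TDX module has already reported contention.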

> > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > always say enough is enough.  In other words, let's assume that a limit of
> > 16 is probably appropriate but we can also increase the limit and crash the
> > VM if things become ridiculous.
>
> 2 :-)
>
> One try that guarantees no other host task is accessing the S-EPT entry, and a
> second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> locked the S-EPT entry.

Fair enough. Though in principle it is possible to race: the vCPU can
re-run and re-issue a TDG call before KVM re-issues the TDH call.
So I would make it 5 or so just to be safe.
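The try-kick-retry sequence, with the cap bumped to cover that race,
could look roughly like this (the BUSY value, the cap of 5, and both
function-pointer stand-ins are illustrative placeholders):

```c
#define SEAMCALL_BUSY 1	/* placeholder, not the real TDX status value */
#define SEPT_MAX_RETRIES 5

static int sept_seamcall_with_kick(int (*seamcall)(void),
				   void (*kick_all_vcpus)(void))
{
	/* First try: succeeds if no other host task holds the entry. */
	int ret = seamcall();
	int i;

	if (ret != SEAMCALL_BUSY)
		return ret;

	/* IPI all vCPUs so no guest-side task keeps the entry locked. */
	kick_all_vcpus();

	/* A few more tries, in case a vCPU re-entered and re-issued a
	 * TDG call before we got here. */
	for (i = 0; i < SEPT_MAX_RETRIES - 1 && ret == SEAMCALL_BUSY; i++)
		ret = seamcall();

	return ret;
}

/* Demo stand-ins: BUSY for the first three attempts, then success. */
static int busy_left = 3;
static int demo_seamcall(void) { return busy_left-- > 0 ? SEAMCALL_BUSY : 0; }
static int kicks;
static void demo_kick(void) { kicks++; }
```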

> My concern with an arbitrary retry loop is that we'll essentially propagate the
> TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
> retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> etc...

How slow are the failed ones? The break-even number of retries is
essentially the cost of a successful seamcall divided by the cost of a
busy one.

If HOST_PRIORITY works, even a not-small-but-not-huge number of
retries would be better than the IPIs. IPIs are not cheap either.
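That break-even arithmetic, spelled out (the cycle counts in the usage
note are purely illustrative, not measured TDX numbers):

```c
/* Number of busy retries that collectively cost no more than one
 * successful seamcall. */
static unsigned int busy_retry_budget(unsigned int success_cost,
				      unsigned int busy_cost)
{
	return success_cost / busy_cost;
}
```

E.g. if a successful seamcall hypothetically costs ~2000 cycles and a
busy one ~400, about 5 retries are free relative to one extra successful
call.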

> > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > not any of the MEM seamcalls.  For that one to be resolved, it should be
enough to take and release the mmu_lock back to back, which ensures that
> > all pending critical sections have completed (that is,
> > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > loop.  Adding a vCPU stat for that one is a good idea, too.
>
> As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> forward progress, especially for the zero-step case.  If KVM gets to the point
> where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> kicks in, then it's time to kick and wait, not keep retrying blindly.

Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we
can do write_lock/write_unlock() in the vCPU entry path.

My issue is that, even if we could make it a bit better by looking at
the TDX module source code, we don't have enough information to make a
good choice.  For now we should start with something _easy_, even if
it may not be the greatest.

Paolo

