Message-ID: <aBOnzNCngyS_pQIW@google.com>
Date: Thu, 1 May 2025 09:56:44 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Borislav Petkov <bp@...en8.de>
Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>, Patrick Bellasi <derkling@...gle.com>, 
	Paolo Bonzini <pbonzini@...hat.com>, Josh Poimboeuf <jpoimboe@...hat.com>, 
	Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>, x86@...nel.org, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Patrick Bellasi <derkling@...bug.net>, 
	Brendan Jackman <jackmanb@...gle.com>, David Kaplan <David.Kaplan@....com>, 
	Michael Larabel <Michael@...haellarabel.com>
Subject: Re: x86/bugs: KVM: Add support for SRSO_MSR_FIX, back for moar

On Thu, May 01, 2025, Borislav Petkov wrote:
> On Wed, Apr 30, 2025 at 04:33:19PM -0700, Sean Christopherson wrote:
> > +static void svm_srso_add_remove_vm(int count)
> > +{
> > +	bool set;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE))
> > +		return;
> > +
> > +	guard(mutex)(&srso_lock);
> > +
> > +	set = !srso_nr_vms;
> > +	srso_nr_vms += count;
> > +
> > +	WARN_ON_ONCE(srso_nr_vms < 0);
> > +	if (!set && srso_nr_vms)
> > +		return;
> 
> So instead of doing this "by-foot", I would've used any of those
> atomic_inc_return() and atomic_dec_and_test() and act upon the value when it
> becomes 0 or !0 instead of passing 1 and -1. Because the count is kinda
> implicit...

Heh, I considered that, and even tried it this morning because I thought it wouldn't
be as tricky as I first feared, but it turns out, yeah, it's tricky.  The complication
is that KVM needs to ensure BP_SPEC_REDUCE=1 on all CPUs before any VM is created.

I thought it wouldn't be _that_ tricky once I realized the 1=>0 case doesn't require
ordering, e.g. running host code while other CPUs have BP_SPEC_REDUCE=1 is totally
fine, KVM just needs to ensure no guest code is executed with BP_SPEC_REDUCE=0.
But guarding against all the possible edge cases is comically difficult.

For giggles, I did get it working, but it's a rather absurd amount of complexity
for very little gain.  In practice, when srso_nr_vms >= 1, CPUs will hold the lock
for only a handful of cycles.  The latency of VM creation is somewhat important,
but it's certainly not that latency sensitive, e.g. KVM already serializes VM
creation on kvm_lock (which is a mutex) in order to add VMs to vm_list.

The only truly slow paths are the 0<=>1 transitions, and those *must* be slow to
get the ordering correct.

One thing I can/will do is change the mutex into a spinlock, so that on non-RT
systems there's zero chance of VM creation getting bogged down because a task
happens to get preempted at just the wrong time.  I was thinking it's illegal to
use on_each_cpu() while holding a spinlock, but that's fine; it's calling it with
IRQs disabled that's problematic.
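
Concretely, that conversion only touches the lock definition and the guard type,
something like this (a sketch, reusing the names from the snippet at the end of
this mail):

static DEFINE_SPINLOCK(srso_add_vm_lock);

	/* ...and in svm_srso_vm_init(), instead of guard(mutex)(...): */
	guard(spinlock)(&srso_add_vm_lock);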

I'll post a proper patch with the spinlock and CONFIG_CPU_MITIGATIONS #ifdef,
assuming testing comes back clean.

As for all the problematic scenarios...  If KVM increments the counter before
sending IPIs, then racing VM creation can result in the second VM skipping the
mutex and doing KVM_RUN with BP_SPEC_REDUCE=0.

  VMs     MSR     CPU-0   CPU-1
  -----------------------------
  0       0       CREATE
  0       0       lock()
  1       0       inc()
  1       0               CREATE
  2       0               inc()
  2       0               KVM_RUN :-(
  2       0       IPI
  2       1       WRMSR   WRMSR
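
The trace above corresponds to a counting scheme roughly like this (a hypothetical
sketch, not the real code; the callback and function names are made up, and the
mutex is the one from the snippet at the end of this mail):

static atomic_t srso_nr_vms;

static void srso_update_spec_reduce(void *ign)
{
	/* Set BP_SPEC_REDUCE iff at least one VM currently exists. */
	if (atomic_read(&srso_nr_vms))
		msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
	else
		msr_clear_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
}

static void srso_vm_init_broken(void)
{
	/* Count already non-zero => assume BP_SPEC_REDUCE is already set. */
	if (atomic_inc_not_zero(&srso_nr_vms))
		return;

	guard(mutex)(&srso_add_vm_lock);

	/*
	 * BROKEN: bumping the count before sending the IPIs lets a
	 * concurrent creator take the early return above and reach
	 * KVM_RUN before any CPU has BP_SPEC_REDUCE=1 (the ":-(" row).
	 */
	atomic_inc(&srso_nr_vms);
	on_each_cpu(srso_update_spec_reduce, NULL, 1);
}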

But simply incrementing the count after sending IPIs obviously doesn't work.

  VMs     MSR     CPU-0   CPU-1
  -----------------------------
  0       0       CREATE
  0       0       lock()
  0       0       IPI
  0       0       WRMSR   WRMSR
  1       0       inc()
  1       0       KVM_RUN :-(

And passing in set/clear (or using separate IPI handlers) doesn't handle the case
where the IPI from destroy arrives after the IPI from create.

  VMs     MSR     CPU-0   CPU-1
  -----------------------------
  1       1               DESTROY
  0       1       CREATE  dec()
  0       1       lock()
  0       1       IPI(1)
  0       1       WRMSR   WRMSR
  0       1               IPI(0)
  0       0       WRMSR   WRMSR
  1       0       inc()
  1       0       KVM_RUN :-(
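
For completeness, that variant amounts to a handler along these lines (a
hypothetical sketch; create would invoke it via on_each_cpu(..., (void *)1, 1)
and destroy via on_each_cpu(..., NULL, 0)):

static void srso_write_spec_reduce(void *set)
{
	/*
	 * BROKEN: the handler can't tell whether a clearing IPI was sent
	 * before or after the most recent setting IPI, so a stale IPI(0)
	 * from a destroy can undo a newer IPI(1) from a create, which is
	 * exactly the inversion in the trace above.
	 */
	if (set)
		msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
	else
		msr_clear_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
}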

Addressing that by adding a global flag to track that SRSO needs to be set
almost works, but there's still a race with a destroy IPI if the callback only
checks the "set" flag.

  VMs     MSR     CPU-0   CPU-1
  -----------------------------
  1       1               DESTROY
  0       1       CREATE  dec()
  0       1       lock()
  0       1       set=1
  0       1       IPI
  0       1       WRMSR   WRMSR
  1       0       inc()
  1       1       set=0
  1       1               IPI
  1       0       WRMSR   WRMSR
  1       0       KVM_RUN :-(

To address all of those, I ended up with the below.  It's not actually that much
code, but the amount of documentation needed to explain everything is ugly.

#ifdef CONFIG_CPU_MITIGATIONS
static DEFINE_MUTEX(srso_add_vm_lock);
static atomic_t srso_nr_vms;
static bool srso_set;

static void svm_toggle_srso_spec_reduce(void *ign)
{
	/*
	 * Read both srso_set and the count (and don't pass in set/clear!) so
	 * that BP_SPEC_REDUCE isn't incorrectly cleared if IPIs from destroying
	 * the last VM arrive after IPIs from creating the first VM (in the new
	 * "generation").
	 */
	if (READ_ONCE(srso_set) || atomic_read(&srso_nr_vms))
		msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
	else
		msr_clear_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
}

static void svm_srso_vm_init(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE))
		return;

	/*
	 * Acquire the mutex on a 0=>1 transition to ensure BP_SPEC_REDUCE is
	 * set before any VM is fully created.
	 */
	if (atomic_inc_not_zero(&srso_nr_vms))
		return;

	guard(mutex)(&srso_add_vm_lock);

	/*
	 * Re-check the count before sending IPIs, only the first task needs to
	 * toggle BP_SPEC_REDUCE, other tasks just need to wait.  For the 0=>1
	 * case, update the count *after* BP_SPEC_REDUCE is set on all CPUs to
	 * ensure creating multiple VMs concurrently doesn't result in a task
	 * skipping the mutex before BP_SPEC_REDUCE is set.
	 *
	 * Atomically increment the count in all cases as the mutex only guards
	 * 0=>1 transitions, e.g. another task can decrement the count if a VM
	 * was created (0=>1) *and* destroyed (1=>0) between observing a count
	 * of '0' and acquiring the mutex, and another task can increment the
	 * count if the count is already >= 1.
	 */
	if (!atomic_inc_not_zero(&srso_nr_vms)) {
		WRITE_ONCE(srso_set, true);
		on_each_cpu(svm_toggle_srso_spec_reduce, NULL, 1);
		atomic_inc(&srso_nr_vms);
		smp_mb__after_atomic();
		WRITE_ONCE(srso_set, false);
	}
}

static void svm_srso_vm_destroy(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE))
		return;

	/*
	 * If the last VM is being destroyed, clear BP_SPEC_REDUCE on all CPUs.
	 * Unlike the creation case, there is no need to wait on other CPUs as
	 * running code with BP_SPEC_REDUCE=1 is always safe, KVM just needs to
	 * ensure guest code never runs with BP_SPEC_REDUCE=0.
	 */
	if (atomic_dec_and_test(&srso_nr_vms))
		on_each_cpu(svm_toggle_srso_spec_reduce, NULL, 0);
}
#else
static void svm_srso_vm_init(void) { }
static void svm_srso_vm_destroy(void) { }
#endif /* CONFIG_CPU_MITIGATIONS */
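
For context, a sketch of how the helpers would presumably get wired up; the call
sites shown (svm_vm_init()/svm_vm_destroy()) are an assumption about the eventual
patch, and the real functions obviously do more than this:

static int svm_vm_init(struct kvm *kvm)
{
	/* Ensure BP_SPEC_REDUCE=1 everywhere before the VM becomes usable. */
	svm_srso_vm_init();
	return 0;
}

static void svm_vm_destroy(struct kvm *kvm)
{
	/* Clear BP_SPEC_REDUCE (without waiting) if this was the last VM. */
	svm_srso_vm_destroy();
}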
