Message-ID: <d58005fc50fcc1366b40f7ab5e68c94280307c53.camel@intel.com>
Date: Thu, 9 May 2024 12:10:56 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] KVM: Register cpuhp and syscore callbacks when
enabling hardware
On Tue, 2024-05-07 at 09:31 -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Sean Christopherson wrote:
> > Register KVM's cpuhp and syscore callbacks when enabling virtualization
> > in hardware instead of registering the callbacks during initialization,
> > and let the CPU up/down framework invoke the inner enable/disable
> > functions.  Registering the callbacks during initialization makes things
> > more complex than they need to be, as KVM needs to be very careful about
> > handling races between CPUs being onlined/offlined and hardware being
> > enabled/disabled.
> >
> > Intel TDX support will require KVM to enable virtualization during KVM
> > initialization, i.e. will add another wrinkle to things, at which point
> > sorting out the potential races with kvm_usage_count would become even
> > more complex.
> > +static int hardware_enable_all(void)
> > +{
> > +	int r;
> > +
> > +	guard(mutex)(&kvm_lock);
> > +
> > +	if (kvm_usage_count++)
> > +		return 0;
> > +
> > +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > +			      kvm_online_cpu, kvm_offline_cpu);
> > +	if (r)
> > +		return r;
>
> There's a lock ordering issue here. KVM currently takes kvm_lock inside
> cpu_hotplug_lock, but this code does the opposite. I need to take a closer look
> at the locking, as I'm not entirely certain that the existing ordering is correct
> or ideal.
>
Do you mean that currently (upstream) hardware_enable_all() takes
cpus_read_lock() first and then kvm_lock?

For this one I think cpus_read_lock() must be taken outside of kvm_lock,
because kvm_online_cpu() also takes kvm_lock.  Switching the order in
hardware_enable_all() can result in deadlock.
For example, when CPU 0 is doing hardware_enable_all(), CPU 1 tries to
bring up CPU 2 after CPU 0 has taken kvm_lock but before it reaches
cpus_read_lock():
cpu 0                      cpu 1                    cpu 2
(hardware_enable_all())    (online cpu 2)           (kvm_online_cpu())

1) mutex_lock(&kvm_lock);
                           2) cpus_write_lock();
                              bringup cpu 2
                                                    4) mutex_lock(&kvm_lock);
3) cpus_read_lock();       ...
   mutex_unlock(&kvm_lock);
                           5) cpus_write_unlock();
                           ...
                                                    6) mutex_unlock(&kvm_lock);
In this case, the cpus_read_lock() in step 3) waits for the
cpus_write_unlock() in step 5) to complete, which in turn waits for CPU 2
to complete kvm_online_cpu().  But kvm_online_cpu() on CPU 2 waits for
CPU 0 to release kvm_lock, so the three CPUs deadlock.
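
For reference, the current upstream kvm_online_cpu() looks roughly like
below (a simplified sketch, not the verbatim code):

static int kvm_online_cpu(unsigned int cpu)
{
	int ret = 0;

	/*
	 * This mutex_lock() is what makes the reversed ordering
	 * (kvm_lock -> cpus_read_lock()) deadlock-prone: the CPU doing
	 * hardware_enable_all() holds kvm_lock while waiting in
	 * cpus_read_lock(), and the hotplug path waits here for kvm_lock.
	 */
	mutex_lock(&kvm_lock);
	if (kvm_usage_count)
		ret = __hardware_enable_nolock();
	mutex_unlock(&kvm_lock);

	return ret;
}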
But with the code change in this patch, kvm_online_cpu() doesn't take
kvm_lock anymore, so to me it looks OK to take cpus_read_lock() inside
kvm_lock.
Btw, even in the current upstream code, IIUC the cpus_read_lock() isn't
absolutely necessary.  It was introduced to prevent hardware_enable_nolock()
from being run via the on_each_cpu() IPI on a new cpu before
kvm_online_cpu() is invoked for it.  But since hardware_enable_all() and
kvm_online_cpu() both grab kvm_lock, the hardware_enable_nolock() inside
kvm_online_cpu() will always wait for hardware_enable_all() to complete, so
the worst case is that hardware_enable_nolock() is called twice.  That is
fine, because the second call basically does nothing due to the
@hardware_enabled per-cpu variable.
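
I.e., something like below (again a sketch from memory, not the exact
upstream code):

static int __hardware_enable_nolock(void)
{
	/* A second call on the same CPU is a nop. */
	if (__this_cpu_read(hardware_enabled))
		return 0;

	if (kvm_arch_hardware_enable()) {
		pr_info("kvm: enabling virtualization on CPU%d failed\n",
			raw_smp_processor_id());
		return -EIO;
	}

	__this_cpu_write(hardware_enabled, true);
	return 0;
}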
> E.g. cpu_hotplug_lock is taken when updating static keys, static calls,
> etc., which makes taking cpu_hotplug_lock outside kvm_lock dicey, as flows that
> take kvm_lock then need to be very careful to never trigger seemingly innocuous
> updates.
>
> And this lockdep splat that I've now hit twice with the current implementation
> suggests that cpu_hotplug_lock => kvm_lock is already unsafe/broken (I need to
> re-decipher the splat; I _think_ mostly figured it out last week, but then forgot
> over the weekend).
I think if we remove the kvm_lock in kvm_online_cpu(), it's OK to hold
cpus_read_lock() (i.e., call cpuhp_setup_state()) inside kvm_lock.

If so, maybe we can just have a rule that cpus_read_lock() cannot be held
outside of kvm_lock.
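
With that rule, kvm_online_cpu() would reduce to something like below (my
understanding of the direction of this patch, not necessarily the final
code):

static int kvm_online_cpu(unsigned int cpu)
{
	/*
	 * The cpuhp callback is only registered while kvm_usage_count is
	 * elevated, so there is no need to check the count (or to take
	 * kvm_lock); just enable hardware on the local CPU.  Failing the
	 * online process here prevents running VMs from being scheduled
	 * onto a CPU without virtualization enabled.
	 */
	return __hardware_enable_nolock();
}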