Message-ID: <YvU+6fdkHaqQiKxp@google.com>
Date: Thu, 11 Aug 2022 17:39:53 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: "Huang, Kai" <kai.huang@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"Shahar, Sagi" <sagis@...gle.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
Will Deacon <will@...nel.org>
Subject: Re: [PATCH v8 003/103] KVM: Refactor CPU compatibility check on
module initialization
+Will (for arm crud)
On Thu, Aug 11, 2022, Huang, Kai wrote:
> First of all, I think the patch title can be improved. "refactor CPU
> compatibility check on module initialization" isn't the purpose of this patch.
> It is just a bonus. The title should reflect the main purpose (or behaviour) of
> this patch:
>
> KVM: Temporarily enable hardware on all cpus during module loading time
...
> > + /* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
> > + r = hardware_enable_all();
> > + if (r)
> > + goto out_free_2;
> > + /*
> > + * Arch specific initialization that requires to enable virtualization
> > + * feature. e.g. TDX module initialization requires VMXON on all
> > + * present CPUs.
> > + */
> > + kvm_arch_post_hardware_enable_setup(opaque);
> > + /*
> > + * Make hardware disabled after the KVM module initialization. KVM
> > + * enables hardware when the first KVM VM is created and disables
> > + * hardware when the last KVM VM is destroyed. When no KVM VM is
> > + * running, hardware is disabled. Keep that semantics.
> > + */
>
> Except the first sentence, the remaining sentences are more like changelog
> material. Perhaps just say something below to be more specific on the purpose:
>
> /*
> * Disable hardware on all cpus so that out-of-tree drivers which
> * also use hardware-assisted virtualization (such as virtualbox
> * kernel module) can still be loaded when KVM is loaded.
> */
>
> > + hardware_disable_all();
> >
> > r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> > kvm_starting_cpu, kvm_dying_cpu);
I've been poking at the "hardware enable" code this week for other reasons, and
have come to the conclusion that the current implementation is a mess.
x86 overloads "hardware enable" to do three different things:
1. actually enable hardware
2. snapshot per-CPU MSR value for user-return MSRs
3. handle unstable TSC _for existing VMs_ on suspend+resume and/or CPU hotplug
#2 and #3 have nothing to do with enabling hardware, kvm_arch_hardware_enable() just
so happens to be called in a superset of what is needed for dealing with unstable TSCs,
and AFAICT the user-return MSR snapshotting is simply a historical wart. The user-return MSRs
code is subtly very, very nasty, as it means that KVM snapshots MSRs from IRQ context,
e.g. if an out-of-tree module is running VMs, the IRQ can interrupt the _guest_ and
cause KVM to snapshot guest registers. VMX and SVM kinda sorta guard against this
by refusing to load if VMX/SVM are already enabled, but it's not foolproof.
Eww, and #3 is broken. If CPU (un)hotplug collides with kvm_destroy_vm() or
kvm_create_vm(), kvm_arch_hardware_enable() could explode due to vm_list being
modified while it's being walked.
Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
from going on/off-line when KVM is enabling hardware.
https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@intel.com
arm64 is also quite evil and circumvents KVM's hardware enabling logic to some extent.
kvm_arch_init() => init_subsystems() unconditionally enables hardware, and for pKVM
_leaves_ hardware enabled. And then hyp_init_cpu_pm_notifier() disables/enables
hardware across low power enter+exit, except if pKVM is enabled. The icing on
the cake is "disabling" hardware doesn't even do anything (AFAICT) if the kernel is
running at EL2 (which I think is nVHE + not-pKVM?).
PPC apparently didn't want to be left out of the party, and despite having a nop
for kvm_arch_hardware_disable(), it does its own "is KVM enabled" tracking (see
kvm_hv_vm_(de)activated()). At least PPC gets the cpus_read_(un)lock() stuff right...
MIPS doesn't appear to have any shenanigans, but kvm_vz_hardware_enable() appears
to be a "heavy" operation, i.e. ideally not something that should be done spuriously.
s390 and PPC are the only sane architectures and don't require explicit enabling
of virtualization.
At a glance, arm64 won't explode, but enabling hardware _twice_ during kvm_init()
is all kinds of gross.
Another wart that we can clean up is the cpus_hardware_enabled mask. I don't see
any reason KVM needs to use a global mask, a per-cpu variable a la kvm_arm_hardware_enabled
would do just fine.
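To make the per-CPU idea concrete, here is a rough userspace sketch, not kernel code. A plain array stands in for a per-CPU variable, and every sketch_* name is hypothetical. It only shows that a per-CPU flag is enough to make enable/disable idempotent on each CPU without a global mask.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8	/* hypothetical stand-in for the real CPU count */

/* Stand-in for a per-CPU bool: one slot per CPU, no global mask. */
static bool sketch_hw_enabled[SKETCH_NR_CPUS];

/* Counters exist only so the behaviour can be observed in a test. */
static int sketch_arch_enable_calls;
static int sketch_arch_disable_calls;

/* Hypothetical arch callback; returns 0 on success. */
static int sketch_arch_hardware_enable(unsigned int cpu)
{
	(void)cpu;
	sketch_arch_enable_calls++;
	return 0;
}

/* Hypothetical arch callback for the disable side. */
static void sketch_arch_hardware_disable(unsigned int cpu)
{
	(void)cpu;
	sketch_arch_disable_calls++;
}

/* Enable virtualization on @cpu unless this CPU already did so. */
int sketch_hardware_enable(unsigned int cpu)
{
	int r;

	assert(cpu < SKETCH_NR_CPUS);
	if (sketch_hw_enabled[cpu])
		return 0;

	r = sketch_arch_hardware_enable(cpu);
	if (!r)
		sketch_hw_enabled[cpu] = true;
	return r;
}

/* Disable virtualization on @cpu; a nop if it was never enabled. */
void sketch_hardware_disable(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	if (!sketch_hw_enabled[cpu])
		return;

	sketch_arch_hardware_disable(cpu);
	sketch_hw_enabled[cpu] = false;
}
```

In the kernel the array would presumably be a real per-CPU variable touched only from the CPU that owns it, which is what makes the flag safe without extra locking.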
OMG, and there's another bug lurking (I need to stop looking at this code). Commit
5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") added an error
path that can cause VM creation to fail _after_ it has been added to the list, but
doesn't unwind _any_ of the stuff done by kvm_arch_post_init_vm() and beyond.
Rather than trying to rework common KVM to fit all the architectures' random needs,
I think we should instead overhaul the entire mess. And we should do that ASAP
ahead of TDX, though obviously with an eye toward not sucking for TDX.
Not 100% thought out at this point, but I think we can do:
1. Have x86 snapshot per-CPU user-return MSRs on first use (trivial to do by adding
a flag to struct kvm_user_return_msrs, as user_return_msrs is already per-CPU).
2. Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock and
cpus_read_lock().
3. Provide arch hooks that are invoked for "power management" operations (including
CPU hotplug and host reboot, hence the quotes). Note, there's both a platform-
wide PM notifier and a per-CPU notifier...
4. Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
kvm_lock, and pass in kvm_usage_count.
5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.
or
5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
don't need to implement their own error handling and per-CPU flags.
I.e. give each architecture hooks to handle possible transition points, but otherwise
let arch code decide when and how to do hardware enabling/disabling.
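A rough userspace sketch of item 1 above (snapshot on first use, guarded by a flag), in case it helps. This is not the actual x86 code: the struct layout, the initialized field, and all sketch_* names are made up for illustration, and sketch_rdmsr() just returns a fake value where the kernel would read the MSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_MSRS 4	/* hypothetical number of user-return MSR slots */

/* Loosely modeled on a per-CPU user-return MSR cache; layout is assumed. */
struct sketch_user_return_msrs {
	bool initialized;		/* the proposed "first use" flag */
	uint64_t host[SKETCH_NR_MSRS];	/* host values captured at first use */
	uint64_t curr[SKETCH_NR_MSRS];	/* values currently loaded */
};

/* Counts fake MSR reads so a test can see that the snapshot runs once. */
static unsigned int sketch_rdmsr_calls;

/* Fake MSR read; the kernel would read the hardware register here. */
static uint64_t sketch_rdmsr(unsigned int slot)
{
	sketch_rdmsr_calls++;
	return 0x1000u + slot;
}

/* Snapshot host values lazily instead of from the hardware enable path. */
static void sketch_snapshot_once(struct sketch_user_return_msrs *msrs)
{
	unsigned int i;

	if (msrs->initialized)
		return;

	for (i = 0; i < SKETCH_NR_MSRS; i++)
		msrs->host[i] = msrs->curr[i] = sketch_rdmsr(i);
	msrs->initialized = true;
}

/* Load a new value into @slot, capturing host state on the first call. */
int sketch_set_user_return_msr(struct sketch_user_return_msrs *msrs,
			       unsigned int slot, uint64_t value)
{
	if (slot >= SKETCH_NR_MSRS)
		return -1;

	sketch_snapshot_once(msrs);
	msrs->curr[slot] = value;
	return 0;
}
```

The point of the flag is that the snapshot then happens in the context of the task that actually uses the MSRs, not from IRQ context during hardware enabling.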
I'm very tempted to vote for (5a); x86 is the only architecture that has an error path
in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
kvm_arm_hardware_enabled logic is probably going to be weird.
E.g. if we can get the back half of kvm_create_vm() to look like the below, then arch
code can enable hardware during kvm_arch_add_vm() if the existing count is zero
without generic KVM needing to worry about when hardware needs to be enabled and
disabled.
	r = kvm_arch_init_vm(kvm, type);
	if (r)
		goto out_err_no_arch_destroy_vm;

	r = kvm_init_mmu_notifier(kvm);
	if (r)
		goto out_err_no_mmu_notifier;

	/*
	 * When the fd passed to this ioctl() is opened it pins the module,
	 * but try_module_get() also prevents getting a reference if the module
	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
	 */
	if (!try_module_get(kvm_chardev_ops.owner)) {
		r = -ENODEV;
		goto out_err_no_put_module;
	}

	mutex_lock(&kvm_lock);
	cpus_read_lock();
	r = kvm_arch_add_vm(kvm, kvm_usage_count);
	if (r)
		goto out_final;
	kvm_usage_count++;
	list_add(&kvm->vm_list, &vm_list);
	cpus_read_unlock();
	mutex_unlock(&kvm_lock);

	preempt_notifier_inc();
	kvm_init_pm_notifier(kvm);

	return kvm;

out_final:
	/* kvm_arch_add_vm() failed with both locks held, drop them first. */
	cpus_read_unlock();
	mutex_unlock(&kvm_lock);
	module_put(kvm_chardev_ops.owner);
out_err_no_put_module:
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
	if (kvm->mmu_notifier.ops)
		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
	kvm_arch_destroy_vm(kvm);
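For the arch side, a rough userspace sketch of what a kvm_arch_add_vm()-style hook could do with the count it is handed: enable hardware only for the first VM, and leave the count untouched on failure. This is a model of the idea, not kernel code. All sketch_* names are hypothetical, the "remove" hook is an assumed counterpart that is not spelled out above, and locking (kvm_lock plus cpus_read_lock() in the snippet above) is left out entirely.

```c
#include <assert.h>

/* Stand-in for kvm_usage_count; callers would hold the locks shown above. */
static int sketch_usage_count;

/* Counters and a fault-injection knob, present only for observation. */
static int sketch_enable_calls;
static int sketch_disable_calls;
static int sketch_enable_error;	/* set non-zero to simulate a failure */

/* Hypothetical "enable on all CPUs" helper. */
static int sketch_enable_all(void)
{
	if (sketch_enable_error)
		return sketch_enable_error;
	sketch_enable_calls++;
	return 0;
}

/* Hypothetical "disable on all CPUs" helper. */
static void sketch_disable_all(void)
{
	sketch_disable_calls++;
}

/* Arch hook: enable hardware only if no VM existed before this one. */
static int sketch_arch_add_vm(int usage_count)
{
	return usage_count ? 0 : sketch_enable_all();
}

/* Assumed counterpart: disable hardware once the last VM is gone. */
static void sketch_arch_remove_vm(int usage_count)
{
	if (!usage_count)
		sketch_disable_all();
}

/* Generic side: bump the count only after the arch hook succeeded. */
int sketch_create_vm(void)
{
	int r = sketch_arch_add_vm(sketch_usage_count);

	if (r)
		return r;
	sketch_usage_count++;
	return 0;
}

/* Generic side: drop the count, then let arch code react to the new value. */
void sketch_destroy_vm(void)
{
	assert(sketch_usage_count > 0);
	sketch_usage_count--;
	sketch_arch_remove_vm(sketch_usage_count);
}
```

Generic code here never decides when hardware is enabled, it only reports the count, which is the split of responsibilities being argued for.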