Message-ID: <2549e6d3-7664-4d12-b84e-ec4a326dec60@suse.com>
Date: Thu, 7 Aug 2025 14:34:01 +0200
From: Vaishali Thakkar <vaishali.thakkar@...e.com>
To: Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org
Cc: roy.hopkins@...e.com, seanjc@...gle.com, thomas.lendacky@....com,
ashish.kalra@....com, michael.roth@....com, nsaenz@...zon.com,
anelkz@...zon.de, James.Bottomley@...senPartnership.com,
Jörg Rödel <joro@...tes.org>
Subject: Re: [RFC PATCH 00/29] KVM: VM planes

Adding Joerg's working email address.

On 4/1/25 6:10 PM, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.
>
> AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA planes
> are all examples of virtual privilege level concepts that are exclusive to
> guests. In all these specifications the hypervisor hosts multiple
> copies of a vCPU's register state (or at least of most of it) and provides
> hypercalls or instructions to switch between them.
>
> This is the first draft of the implementation according to the sketch that
> was prepared last year between Linux Plumbers and KVM Forum. The initial
> version of the API was posted last October, and the implementation only
> needed small changes.
>
> Attempts made in the past, mostly in the context of Hyper-V VTLs and SEV-SNP
> VMPLs, fell into two categories:
>
> - use a single vCPU file descriptor, and store multiple copies of the state
> in a single struct kvm_vcpu. This approach requires a lot of changes to
> provide multiple copies of affected fields, especially MMUs and APICs;
> and complex uAPI extensions to direct existing ioctls to a specific
> privilege level. While more or less workable for SEV-SNP VMPLs, that
> was only because the copies of the register state were hidden
> in the VMSA (KVM does not manage it); it showed all its problems when
> applied to Hyper-V VTLs.
>
> The main advantage was that KVM kept the knowledge of the relationship
> between vCPUs that have the same id but belong to different privilege
> levels. This is important in order to accelerate switches in-kernel.
>
> - use multiple VM and vCPU file descriptors, and handle the switch entirely
> in userspace. This got gnarly pretty fast for even more reasons than
> the previous case, for example because VMs could no longer share
> memslots, including dirty bitmaps and private/shared attributes (a
> substantial problem for SEV-SNP since VMPLs share their ASID).
>
> In contrast to the other case, the total lack of kernel-level sharing of
> register state, and the lack of any guarantee that vCPUs do not run in
> parallel, are what make this approach problematic for both kernel and
> userspace. An in-kernel implementation of privilege level switches goes
> from complicated to impossible, and userspace needs a lot of complexity
> as well to ensure that higher-privileged VTLs properly interrupt a
> lower-privileged one.
>
> This design sits squarely in the middle: it gives the initial set of
> VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
> whereas other privilege levels ("planes") instead only support a small
> part of the KVM API. In fact for the vm file descriptor it is only three
> ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI, KVM_SET_MEMORY_ATTRIBUTES.
> For vCPUs it is basically KVM_GET/SET_*.
>
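
Just to check that I read the fd model right: is the usage below roughly
the intended shape? The KVM_CREATE_PLANE/KVM_CAP_PLANES names are only my
placeholders (not taken from the patches), and error handling is omitted:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void plane_sketch(void)
    {
            int kvm   = open("/dev/kvm", O_RDWR);
            int vm    = ioctl(kvm, KVM_CREATE_VM, 0);      /* plane 0 */
            int vcpu0 = ioctl(vm, KVM_CREATE_VCPU, 0);

            /* placeholder name: create plane 1 for the same VM */
            int plane1 = ioctl(vm, KVM_CREATE_PLANE, 1);

            /* the plane fd only takes KVM_CHECK_EXTENSION,
             * KVM_SIGNAL_MSI and KVM_SET_MEMORY_ATTRIBUTES */
            ioctl(plane1, KVM_CHECK_EXTENSION, KVM_CAP_PLANES);

            /* vCPU 0 as seen from plane 1: state accessors such as
             * KVM_GET_REGS/KVM_SET_REGS only, no KVM_RUN, no memslots */
            int vcpu0_p1 = ioctl(plane1, KVM_CREATE_VCPU, 0);
            struct kvm_regs regs;

            ioctl(vcpu0_p1, KVM_GET_REGS, &regs);
    }
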
> Most notably, memslots and KVM_RUN are *not* included (the choice of
> which plane to run is done via vcpu->run), which solves a lot of
> the problems in both of the previous approaches. Compared to the
> multiple-file-descriptors solution, it gets for free the ability to
> avoid parallel execution of the same vCPUs in different privilege levels.
> Compared to having a single file descriptor, the churn is more limited, or
> at least can be attacked in small bites. For example in this series
> only per-plane interrupt controllers are switched to use the new struct
> kvm_plane in place of struct kvm, and that's more or less enough in
> the absence of complex interrupt delivery scenarios.
>
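
And on the KVM_RUN side, am I right that the flow is roughly as below,
with the plane selected through a new field in struct kvm_run? The "plane"
field name is again only my placeholder; KVM_EXIT_PLANE_EVENT is the exit
added in patch 25.

    /* vcpu0 is the plane-0 vCPU fd from the sketch above; mmap_size
     * comes from KVM_GET_VCPU_MMAP_SIZE as usual (mmap needs
     * <sys/mman.h>). */
    int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu0, 0);

    run->plane = 1;                     /* placeholder: enter plane 1 */
    ioctl(vcpu0, KVM_RUN, 0);

    if (run->exit_reason == KVM_EXIT_PLANE_EVENT) {
            /* one plane interrupted another; userspace decides which
             * plane to enter on the next KVM_RUN */
    }
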
> Changes to the userspace API are also relatively small; they boil down
> to the introduction of a single new kind of file descriptor and almost
> entirely fit in common code. Reviewing these VM-wide and architecture-
> independent changes should be the main purpose of this RFC, since
> there are still some things to fix:
>
> - I named some fields "plane" instead of "plane_id" because I expected no
> fields of type struct kvm_plane*, but in retrospect that wasn't a great
> idea.
>
> - online_vcpus counts across all planes but x86 code is still using it to
> deal with TSC synchronization. I will probably try to make kvmclock
> synchronization per-plane instead of per-VM.
>
Hi Paolo,

Is there still a plan to make kvmclock synchronization per-plane instead
of per-VM? Do you plan to handle it as part of this patchset, or do you
think it should be handled separately on top of it?

I'm asking because coconut-svsm needs a monotonic clock source that
adheres to wall-clock time, and we have been exploring several approaches
to achieve this. One of the ideas is to use kvmclock, provided it can
support a per-plane instance that remains synchronized across planes.
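
For context, the property we need is that the usual pvclock computation
gives a monotonic result no matter which plane does the read; roughly
(simplified, version/seqcount retry loop omitted):

    /* Simplified guest-side kvmclock read, following the pvclock ABI.
     * "ti" points at the pvclock_vcpu_time_info area KVM fills in;
     * the real code uses a 128-bit intermediate for the multiply. */
    u64 delta = rdtsc() - ti->tsc_timestamp;
    if (ti->tsc_shift >= 0)
            delta <<= ti->tsc_shift;
    else
            delta >>= -ti->tsc_shift;
    u64 ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);

So if every plane ends up with its own pvclock area, the system_time /
tsc_timestamp pairs KVM writes for the different planes need to stay
consistent for "ns" to be monotonic across plane switches.
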
Thanks.
> - we're going to need a struct kvm_vcpu_plane similar to what Roy had in
> https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@suse.com/
> (probably smaller though). Requests are per-plane for example, and I'm
> pretty sure any simplistic solution would have some corner cases where
> it's wrong; but it's a high churn change and I wanted to avoid that
> for this first posting.
>
> There's a handful of locking TODOs where things should be checked more
> carefully, but clearly identifying vCPU data that is not per-plane will
> also simplify locking, thanks to having a single vcpu->mutex for the
> whole plane. So I'm not particularly worried about that; the TDX saga
> hopefully has taught everyone to move in baby steps towards the intended
> direction.
>
> The handling of interrupt priorities is way more complicated than I
> anticipated, unfortunately; everything else seems to fall into place
> decently well---even taking into account the above incompleteness,
> which anyway should not be a blocker for any VTL or VMPL experiments.
> But do shout if anything makes you feel like I was too lazy, and/or you
> want to puke.
>
> Patches 1-2 are documentation and uAPI definitions.
>
> Patches 3-9 are the common code for VM planes, while patches 10-14
> are the common code for vCPU file descriptors on non-default planes.
>
> Patches 15-26 are the x86-specific code, which is organized as follows:
>
> - 15-20: convert APIC code to place its data in the new struct
> kvm_arch_plane instead of struct kvm_arch.
>
> - 21-24: everything else except the new userspace exit, KVM_EXIT_PLANE_EVENT
>
> - 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts another.
>
> - 26: finally make the capability available to userspace
>
> Patches 27-29 finally are the testcases. More are possible and planned,
> but these are enough to say that, despite the missing bits, what exists
> is not _completely_ broken. I also didn't want to write dozens of tests
> before committing to a selftests API.
>
> Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git
> branch planes-20250401. I plan to place it in kvm-coco-queue, for lack
> of a better place, as soon as TDX is merged into kvm/next and I test it
> with the usual battery of kvm-unit-tests and real world guests.
>
> Thanks,
>
> Paolo
>
> Paolo Bonzini (29):
> Documentation: kvm: introduce "VM plane" concept
> KVM: API definitions for plane userspace exit
> KVM: add plane info to structs
> KVM: introduce struct kvm_arch_plane
> KVM: add plane support to KVM_SIGNAL_MSI
> KVM: move mem_attr_array to kvm_plane
> KVM: do not use online_vcpus to test vCPU validity
> KVM: move vcpu_array to struct kvm_plane
> KVM: implement plane file descriptors ioctl and creation
> KVM: share statistics for same vCPU id on different planes
> KVM: anticipate allocation of dirty ring
> KVM: share dirty ring for same vCPU id on different planes
> KVM: implement vCPU creation for extra planes
> KVM: pass plane to kvm_arch_vcpu_create
> KVM: x86: pass vcpu to kvm_pv_send_ipi()
> KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
> KVM: x86: block creating irqchip if planes are active
> KVM: x86: track APICv inhibits per plane
> KVM: x86: move APIC map to kvm_arch_plane
> KVM: x86: add planes support for interrupt delivery
> KVM: x86: add infrastructure to share FPU across planes
> KVM: x86: implement initial plane support
> KVM: x86: extract kvm_post_set_cpuid
> KVM: x86: initialize CPUID for non-default planes
> KVM: x86: handle interrupt priorities for planes
> KVM: x86: enable up to 16 planes
> selftests: kvm: introduce basic test for VM planes
> selftests: kvm: add plane infrastructure
> selftests: kvm: add x86-specific plane test
>
> Documentation/virt/kvm/api.rst | 245 +++++++--
> Documentation/virt/kvm/locking.rst | 3 +
> Documentation/virt/kvm/vcpu-requests.rst | 7 +
> arch/arm64/include/asm/kvm_host.h | 5 +
> arch/arm64/kvm/arm.c | 4 +-
> arch/arm64/kvm/handle_exit.c | 6 +-
> arch/arm64/kvm/hyp/nvhe/gen-hyprel.c | 4 +-
> arch/arm64/kvm/mmio.c | 4 +-
> arch/loongarch/include/asm/kvm_host.h | 5 +
> arch/loongarch/kvm/exit.c | 8 +-
> arch/loongarch/kvm/vcpu.c | 4 +-
> arch/mips/include/asm/kvm_host.h | 5 +
> arch/mips/kvm/emulate.c | 2 +-
> arch/mips/kvm/mips.c | 32 +-
> arch/mips/kvm/vz.c | 18 +-
> arch/powerpc/include/asm/kvm_host.h | 5 +
> arch/powerpc/kvm/book3s.c | 2 +-
> arch/powerpc/kvm/book3s_hv.c | 46 +-
> arch/powerpc/kvm/book3s_hv_rm_xics.c | 8 +-
> arch/powerpc/kvm/book3s_pr.c | 22 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
> arch/powerpc/kvm/powerpc.c | 6 +-
> arch/powerpc/kvm/timing.h | 28 +-
> arch/riscv/include/asm/kvm_host.h | 5 +
> arch/riscv/kvm/vcpu.c | 4 +-
> arch/riscv/kvm/vcpu_exit.c | 10 +-
> arch/riscv/kvm/vcpu_insn.c | 16 +-
> arch/riscv/kvm/vcpu_sbi.c | 2 +-
> arch/riscv/kvm/vcpu_sbi_hsm.c | 2 +-
> arch/s390/include/asm/kvm_host.h | 5 +
> arch/s390/kvm/diag.c | 18 +-
> arch/s390/kvm/intercept.c | 20 +-
> arch/s390/kvm/interrupt.c | 48 +-
> arch/s390/kvm/kvm-s390.c | 10 +-
> arch/s390/kvm/priv.c | 60 +--
> arch/s390/kvm/sigp.c | 50 +-
> arch/s390/kvm/vsie.c | 2 +-
> arch/x86/include/asm/kvm_host.h | 46 +-
> arch/x86/kvm/cpuid.c | 57 +-
> arch/x86/kvm/cpuid.h | 2 +
> arch/x86/kvm/debugfs.c | 2 +-
> arch/x86/kvm/hyperv.c | 7 +-
> arch/x86/kvm/i8254.c | 7 +-
> arch/x86/kvm/ioapic.c | 4 +-
> arch/x86/kvm/irq_comm.c | 14 +-
> arch/x86/kvm/kvm_cache_regs.h | 4 +-
> arch/x86/kvm/lapic.c | 147 +++--
> arch/x86/kvm/mmu/mmu.c | 41 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/svm/sev.c | 4 +-
> arch/x86/kvm/svm/svm.c | 21 +-
> arch/x86/kvm/vmx/tdx.c | 8 +-
> arch/x86/kvm/vmx/vmx.c | 20 +-
> arch/x86/kvm/x86.c | 319 ++++++++---
> arch/x86/kvm/xen.c | 1 +
> include/linux/kvm_host.h | 130 +++--
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 28 +-
> tools/testing/selftests/kvm/Makefile.kvm | 2 +
> .../testing/selftests/kvm/include/kvm_util.h | 48 ++
> .../selftests/kvm/include/x86/processor.h | 1 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 65 ++-
> .../testing/selftests/kvm/lib/x86/processor.c | 15 +
> tools/testing/selftests/kvm/plane_test.c | 103 ++++
> tools/testing/selftests/kvm/x86/plane_test.c | 270 ++++++++++
> virt/kvm/dirty_ring.c | 5 +-
> virt/kvm/guest_memfd.c | 3 +-
> virt/kvm/irqchip.c | 5 +-
> virt/kvm/kvm_main.c | 500 ++++++++++++++----
> 69 files changed, 1991 insertions(+), 614 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/plane_test.c
> create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c
>