Message-ID: <2549e6d3-7664-4d12-b84e-ec4a326dec60@suse.com>
Date: Thu, 7 Aug 2025 14:34:01 +0200
From: Vaishali Thakkar <vaishali.thakkar@...e.com>
To: Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org
Cc: roy.hopkins@...e.com, seanjc@...gle.com, thomas.lendacky@....com,
ashish.kalra@....com, michael.roth@....com, nsaenz@...zon.com,
anelkz@...zon.de, James.Bottomley@...senPartnership.com,
Jörg Rödel <joro@...tes.org>
Subject: Re: [RFC PATCH 00/29] KVM: VM planes

Adding Joerg's working email address.

On 4/1/25 6:10 PM, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.
>
> AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA planes
> are all examples of virtual privilege level concepts that are exclusive to
> guests. In all these specifications the hypervisor hosts multiple
> copies of a vCPU's register state (or at least of most of it) and provides
> hypercalls or instructions to switch between them.
>
> This is the first draft of the implementation according to the sketch that
> was prepared last year between Linux Plumbers and KVM Forum. The initial
> version of the API was posted last October, and the implementation only
> needed small changes.
>
> Attempts made in the past, mostly in the context of Hyper-V VTLs and SEV-SNP
> VMPLs, fell into two categories:
>
> - use a single vCPU file descriptor, and store multiple copies of the state
> in a single struct kvm_vcpu. This approach requires a lot of changes to
> provide multiple copies of affected fields, especially MMUs and APICs;
> and complex uAPI extensions to direct existing ioctls to a specific
> privilege level. While more or less workable for SEV-SNP VMPLs, that
> was only because the copies of the register state were hidden
> in the VMSA (KVM does not manage it); it showed all its problems when
> applied to Hyper-V VTLs.
>
> The main advantage was that KVM kept the knowledge of the relationship
> between vCPUs that have the same id but belong to different privilege
> levels. This is important in order to accelerate switches in-kernel.
>
> - use multiple VM and vCPU file descriptors, and handle the switch entirely
> in userspace. This got gnarly pretty fast for even more reasons than
> the previous case, for example because VMs could no longer share
> memslots, including dirty bitmaps and private/shared attributes (a
> substantial problem for SEV-SNP since VMPLs share their ASID).
>
> In contrast to the other case, the total lack of kernel-level sharing of
> register state, and the lack of any guarantee that vCPUs do not run in
> parallel, are what make this approach problematic for both kernel and
> userspace. An in-kernel implementation of privilege level switches goes
> from complicated to impossible, and userspace needs a lot of complexity
> as well to ensure that higher-privileged VTLs properly interrupt a
> lower-privileged one.
>
> This design sits squarely in the middle: it gives the initial set of
> VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
> whereas other privilege levels ("planes") instead only support a small
> part of the KVM API. In fact for the vm file descriptor it is only three
> ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI, KVM_SET_MEMORY_ATTRIBUTES.
> For vCPUs it is basically KVM_GET/SET_*.
>
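
Just to check that I read the fd model right: is the usage below roughly
the intended shape? The KVM_CREATE_PLANE/KVM_CAP_PLANES names are only my
placeholders (not taken from the patches), and error handling is omitted:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void plane_sketch(void)
    {
            int kvm   = open("/dev/kvm", O_RDWR);
            int vm    = ioctl(kvm, KVM_CREATE_VM, 0);      /* plane 0 */
            int vcpu0 = ioctl(vm, KVM_CREATE_VCPU, 0);

            /* placeholder name: create plane 1 for the same VM */
            int plane1 = ioctl(vm, KVM_CREATE_PLANE, 1);

            /* the plane fd only takes KVM_CHECK_EXTENSION,
             * KVM_SIGNAL_MSI and KVM_SET_MEMORY_ATTRIBUTES */
            ioctl(plane1, KVM_CHECK_EXTENSION, KVM_CAP_PLANES);

            /* vCPU 0 as seen from plane 1: state accessors such as
             * KVM_GET_REGS/KVM_SET_REGS only, no KVM_RUN, no memslots */
            int vcpu0_p1 = ioctl(plane1, KVM_CREATE_VCPU, 0);
            struct kvm_regs regs;

            ioctl(vcpu0_p1, KVM_GET_REGS, &regs);
    }
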
> Most notably, memslots and KVM_RUN are *not* included (the choice of
> which plane to run is done via vcpu->run), which solves a lot of
> the problems in both of the previous approaches. Compared to the
> multiple-file-descriptors solution, it gets for free the ability to
> avoid parallel execution of the same vCPUs in different privilege levels.
> Compared to having a single file descriptor, the churn is more limited, or
> at least can be attacked in small bites. For example in this series
> only per-plane interrupt controllers are switched to use the new struct
> kvm_plane in place of struct kvm, and that's more or less enough in
> the absence of complex interrupt delivery scenarios.
>
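
And on the KVM_RUN side, am I right that the flow is roughly as below,
with the plane selected through a new field in struct kvm_run? The "plane"
field name is again only my placeholder; KVM_EXIT_PLANE_EVENT is the exit
added in patch 25.

    /* vcpu0 is the plane-0 vCPU fd from the sketch above; mmap_size
     * comes from KVM_GET_VCPU_MMAP_SIZE as usual (mmap needs
     * <sys/mman.h>). */
    int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu0, 0);

    run->plane = 1;                     /* placeholder: enter plane 1 */
    ioctl(vcpu0, KVM_RUN, 0);

    if (run->exit_reason == KVM_EXIT_PLANE_EVENT) {
            /* one plane interrupted another; userspace decides which
             * plane to enter on the next KVM_RUN */
    }
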
> Changes to the userspace API are also relatively small; they boil down
> to the introduction of a single new kind of file descriptor and almost
> entirely fit in common code. Reviewing these VM-wide and architecture-
> independent changes should be the main purpose of this RFC, since
> there are still some things to fix:
>
> - I named some fields "plane" instead of "plane_id" because I expected no
> fields of type struct kvm_plane*, but in retrospect that wasn't a great
> idea.
>
> - online_vcpus counts across all planes but x86 code is still using it to
> deal with TSC synchronization. I will probably try to make kvmclock
> synchronization per-plane instead of per-VM.
>
Hi Paolo,

Is there still a plan to make kvmclock synchronization per-plane instead
of per-VM? Do you plan to handle it as part of this patchset, or do you
think it should be handled separately on top of it?

I'm asking because coconut-svsm needs a monotonic clock source that
adheres to wall-clock time, and we have been exploring several approaches
to achieve this. One of the ideas is to use kvmclock, provided it can
support a per-plane instance that remains synchronized across planes.
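
For context, the property we need is that the usual pvclock computation
gives a monotonic result no matter which plane does the read; roughly
(simplified, version/seqcount retry loop omitted):

    /* Simplified guest-side kvmclock read, following the pvclock ABI.
     * "ti" points at the pvclock_vcpu_time_info area KVM fills in;
     * the real code uses a 128-bit intermediate for the multiply. */
    u64 delta = rdtsc() - ti->tsc_timestamp;
    if (ti->tsc_shift >= 0)
            delta <<= ti->tsc_shift;
    else
            delta >>= -ti->tsc_shift;
    u64 ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);

So if every plane ends up with its own pvclock area, the system_time /
tsc_timestamp pairs KVM writes for the different planes need to stay
consistent for "ns" to be monotonic across plane switches.
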
Thanks.
> - we're going to need a struct kvm_vcpu_plane similar to what Roy had in
> https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@suse.com/
> (probably smaller though). Requests are per-plane for example, and I'm
> pretty sure any simplistic solution would have some corner cases where
> it's wrong; but it's a high churn change and I wanted to avoid that
> for this first posting.
>
> There's a handful of locking TODOs where things should be checked more
> carefully, but clearly identifying vCPU data that is not per-plane will
> also simplify locking, thanks to having a single vcpu->mutex for the
> whole plane. So I'm not particularly worried about that; the TDX saga
> hopefully has taught everyone to move in baby steps towards the intended
> direction.
>
> The handling of interrupt priorities is way more complicated than I
> anticipated, unfortunately; everything else seems to fall into place
> decently well---even taking into account the above incompleteness,
> which anyway should not be a blocker for any VTL or VMPL experiments.
> But do shout if anything makes you feel like I was too lazy, and/or you
> want to puke.
>
> Patches 1-2 are documentation and uAPI definitions.
>
> Patches 3-9 are the common code for VM planes, while patches 10-14
> are the common code for vCPU file descriptors on non-default planes.
>
> Patches 15-26 are the x86-specific code, which is organized as follows:
>
> - 15-20: convert APIC code to place its data in the new struct
> kvm_arch_plane instead of struct kvm_arch.
>
> - 21-24: everything else except the new userspace exit, KVM_EXIT_PLANE_EVENT
>
> - 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts another.
>
> - 26: finally make the capability available to userspace
>
> Patches 27-29 finally are the testcases. More are possible and planned,
> but these are enough to say that, despite the missing bits, what exists
> is not _completely_ broken. I also didn't want to write dozens of tests
> before committing to a selftests API.
>
> Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git
> branch planes-20250401. I plan to place it in kvm-coco-queue, for lack
> of a better place, as soon as TDX is merged into kvm/next and I test it
> with the usual battery of kvm-unit-tests and real world guests.
>
> Thanks,
>
> Paolo
>
> Paolo Bonzini (29):
> Documentation: kvm: introduce "VM plane" concept
> KVM: API definitions for plane userspace exit
> KVM: add plane info to structs
> KVM: introduce struct kvm_arch_plane
> KVM: add plane support to KVM_SIGNAL_MSI
> KVM: move mem_attr_array to kvm_plane
> KVM: do not use online_vcpus to test vCPU validity
> KVM: move vcpu_array to struct kvm_plane
> KVM: implement plane file descriptors ioctl and creation
> KVM: share statistics for same vCPU id on different planes
> KVM: anticipate allocation of dirty ring
> KVM: share dirty ring for same vCPU id on different planes
> KVM: implement vCPU creation for extra planes
> KVM: pass plane to kvm_arch_vcpu_create
> KVM: x86: pass vcpu to kvm_pv_send_ipi()
> KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
> KVM: x86: block creating irqchip if planes are active
> KVM: x86: track APICv inhibits per plane
> KVM: x86: move APIC map to kvm_arch_plane
> KVM: x86: add planes support for interrupt delivery
> KVM: x86: add infrastructure to share FPU across planes
> KVM: x86: implement initial plane support
> KVM: x86: extract kvm_post_set_cpuid
> KVM: x86: initialize CPUID for non-default planes
> KVM: x86: handle interrupt priorities for planes
> KVM: x86: enable up to 16 planes
> selftests: kvm: introduce basic test for VM planes
> selftests: kvm: add plane infrastructure
> selftests: kvm: add x86-specific plane test
>
> Documentation/virt/kvm/api.rst | 245 +++++++--
> Documentation/virt/kvm/locking.rst | 3 +
> Documentation/virt/kvm/vcpu-requests.rst | 7 +
> arch/arm64/include/asm/kvm_host.h | 5 +
> arch/arm64/kvm/arm.c | 4 +-
> arch/arm64/kvm/handle_exit.c | 6 +-
> arch/arm64/kvm/hyp/nvhe/gen-hyprel.c | 4 +-
> arch/arm64/kvm/mmio.c | 4 +-
> arch/loongarch/include/asm/kvm_host.h | 5 +
> arch/loongarch/kvm/exit.c | 8 +-
> arch/loongarch/kvm/vcpu.c | 4 +-
> arch/mips/include/asm/kvm_host.h | 5 +
> arch/mips/kvm/emulate.c | 2 +-
> arch/mips/kvm/mips.c | 32 +-
> arch/mips/kvm/vz.c | 18 +-
> arch/powerpc/include/asm/kvm_host.h | 5 +
> arch/powerpc/kvm/book3s.c | 2 +-
> arch/powerpc/kvm/book3s_hv.c | 46 +-
> arch/powerpc/kvm/book3s_hv_rm_xics.c | 8 +-
> arch/powerpc/kvm/book3s_pr.c | 22 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
> arch/powerpc/kvm/powerpc.c | 6 +-
> arch/powerpc/kvm/timing.h | 28 +-
> arch/riscv/include/asm/kvm_host.h | 5 +
> arch/riscv/kvm/vcpu.c | 4 +-
> arch/riscv/kvm/vcpu_exit.c | 10 +-
> arch/riscv/kvm/vcpu_insn.c | 16 +-
> arch/riscv/kvm/vcpu_sbi.c | 2 +-
> arch/riscv/kvm/vcpu_sbi_hsm.c | 2 +-
> arch/s390/include/asm/kvm_host.h | 5 +
> arch/s390/kvm/diag.c | 18 +-
> arch/s390/kvm/intercept.c | 20 +-
> arch/s390/kvm/interrupt.c | 48 +-
> arch/s390/kvm/kvm-s390.c | 10 +-
> arch/s390/kvm/priv.c | 60 +--
> arch/s390/kvm/sigp.c | 50 +-
> arch/s390/kvm/vsie.c | 2 +-
> arch/x86/include/asm/kvm_host.h | 46 +-
> arch/x86/kvm/cpuid.c | 57 +-
> arch/x86/kvm/cpuid.h | 2 +
> arch/x86/kvm/debugfs.c | 2 +-
> arch/x86/kvm/hyperv.c | 7 +-
> arch/x86/kvm/i8254.c | 7 +-
> arch/x86/kvm/ioapic.c | 4 +-
> arch/x86/kvm/irq_comm.c | 14 +-
> arch/x86/kvm/kvm_cache_regs.h | 4 +-
> arch/x86/kvm/lapic.c | 147 +++--
> arch/x86/kvm/mmu/mmu.c | 41 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/svm/sev.c | 4 +-
> arch/x86/kvm/svm/svm.c | 21 +-
> arch/x86/kvm/vmx/tdx.c | 8 +-
> arch/x86/kvm/vmx/vmx.c | 20 +-
> arch/x86/kvm/x86.c | 319 ++++++++---
> arch/x86/kvm/xen.c | 1 +
> include/linux/kvm_host.h | 130 +++--
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 28 +-
> tools/testing/selftests/kvm/Makefile.kvm | 2 +
> .../testing/selftests/kvm/include/kvm_util.h | 48 ++
> .../selftests/kvm/include/x86/processor.h | 1 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 65 ++-
> .../testing/selftests/kvm/lib/x86/processor.c | 15 +
> tools/testing/selftests/kvm/plane_test.c | 103 ++++
> tools/testing/selftests/kvm/x86/plane_test.c | 270 ++++++++++
> virt/kvm/dirty_ring.c | 5 +-
> virt/kvm/guest_memfd.c | 3 +-
> virt/kvm/irqchip.c | 5 +-
> virt/kvm/kvm_main.c | 500 ++++++++++++++----
> 69 files changed, 1991 insertions(+), 614 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/plane_test.c
> create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c
>