[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <42386a37-9c7f-4f9a-a95f-15236ae29481@redhat.com>
Date: Tue, 8 Apr 2025 17:36:08 +0200
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>, Joerg Roedel <joro@...tes.org>,
David Woodhouse <dwmw2@...radead.org>, Lu Baolu <baolu.lu@...ux.intel.com>
Cc: kvm@...r.kernel.org, iommu@...ts.linux.dev, linux-kernel@...r.kernel.org,
Maxim Levitsky <mlevitsk@...hat.com>,
Joao Martins <joao.m.martins@...cle.com>, David Matlack <dmatlack@...gle.com>
Subject: Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
On 4/4/25 21:38, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
> general. This needs more testing on AMD with device posted IRQs.
>
> This applies on the small series that adds a enable_device_posted_irqs
> module param (the prep work for that is also prep work for this):
>
> https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com
>
> Fix a variety of bugs related to device posted IRQs, especially on the
> AMD side, and clean up KVM's implementation, which IMO is in the running
> for Most Convoluted Code in KVM.
>
> Stating the obvious, this series is comically large. I'm posting it as a
> single series, at least for the first round of reviews, to build the
> (mostly) full picture of the end goal (it's not the true end goal; there's
> still more cleanups that can be done). And because properly testing most
> of the code would be futile until almost the end of the series (so. many.
> bugs.).
>
> Batch #1 (patches 1-10) fixes bugs of varying severity.
I started reviewing these, I guess patches 1-7 could be queued for 6.15?
And maybe also patch 2 from
https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com/.
Paolo
> Batch #2 is mostly SVM specific:
>
> - Cleans up various warts and bugs in the IRTE tracking
> - Fixes AVIC to not reject large VMs (honor KVM's ABI)
> - Wire up AVIC to enable_ipiv to support disabling IPI virtualization while
> still utilizing device posted interrupts, and to workaround erratum #1235.
>
> Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
> of the logic to common x86; only the code that needs to communicate with the
> IOMMU is truly vendor specific.
>
> Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.
>
> Batch #5 adds WARNs and drops dead code after all the previous cleanups and
> fixes (I don't want to add the WARNs earlier; I don't any point in adding
> WARNs in code that's known to be broken).
>
> Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
> IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
> event.
>
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes. Long story short, getting upstream
> kernels into our full test environments is practically infeasible. And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.
>
> The device the selftest (see the last patch) uses is an internel test VF
> that's hosted on a smart NIC using non-production (test-only) firmware.
> Unfortunately, only some of our developer systems have the right NIC, and
> for unknown reasons I couldn't get the test firmware to install cleanly on
> Rome systems. I was able to get it functional on Milan (and Intel CPUs),
> but APIC virtualization is disabled on Milan. Thanks to KVM's force_avic
> I could test the KVM flows, but the IOMMU was having none of my attempts
> to force enable APIC virtualization against its will.
>
> Through hackery (see the penultimate patch), I was able to gain a decent
> amount of confidence in the IOMMU changes (and the interface between KVM
> and the IOMMU).
>
> For initial development of the series, I also cobbled together a "mock"
> IRQ bypass device, to allow testing in a VM.
>
> https://github.com/sean-jc/linux.git x86/mock_irqbypass_producer
>
> Note, the diffstat is misleading due to the last two DO NOT MERGE patches
> adding 1k+ LoC. Without those, this series removes ~80 LoC (substantially
> more if comments are ignored).
>
> 21 files changed, 577 insertions(+), 655 deletions(-)
>
> Maxim Levitsky (2):
> KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
> KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
>
> Sean Christopherson (65):
> KVM: SVM: Allocate IR data using atomic allocation
> KVM: x86: Reset IRTE to host control if *new* route isn't postable
> KVM: x86: Explicitly treat routing entry type changes as changes
> KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
> iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
> iommu/amd: WARN if KVM attempts to set vCPU affinity without posted
> intrrupts
> KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
> KVM: x86: Pass new routing entries and irqfd when updating IRTEs
> KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
> KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
> KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
> routing
> KVM: SVM: Drop pointless masking of default APIC base when setting
> V_APIC_BAR
> KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
> masks
> KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
> KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
> KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
> creation
> KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
> KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
> pages"
> KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
> KVM: VMX: Move enable_ipiv knob to common x86
> KVM: VMX: Suppress PI notifications whenever the vCPU is put
> KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
> IRQ blocking
> iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
> iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
> iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
> KVM: SVM: Get vCPU info for IRTE using new routing entry
> KVM: SVM: Stop walking list of routing table entries when updating
> IRTE
> KVM: VMX: Stop walking list of routing table entries when updating
> IRTE
> KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
> KVM: x86: Nullify irqfd->producer after updating IRTEs
> KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
> KVM: x86: Move posted interrupt tracepoint to common code
> KVM: SVM: Clean up return handling in avic_pi_update_irte()
> iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
> structs
> KVM: Don't WARN if updating IRQ bypass route fails
> KVM: Fold kvm_arch_irqfd_route_changed() into
> kvm_arch_update_irqfd_routing()
> KVM: x86: Track irq_bypass_vcpu in common x86 code
> KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
> targeted
> KVM: x86: Don't update IRTE entries when old and new routes were !MSI
> KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
> metadata
> KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
> iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
> iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
> iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
> iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
> is inhibited
> KVM: SVM: Don't check for assigned device(s) when updating affinity
> KVM: SVM: Don't check for assigned device(s) when activating AVIC
> KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
> KVM: SVM: Process all IRTEs on affinity change even if one update
> fails
> KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
> KVM: x86: Drop superfluous "has assigned device" check in
> kvm_pi_update_irte()
> KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
> KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
> local APIC
> KVM: SVM: WARN if ir_list is non-empty at vCPU free
> KVM: x86: Decouple device assignment from IRQ bypass
> KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
> bypass
> KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
> iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
> support
> KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
> KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
> KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
> iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
> KVM: SVM: Generate GA log IRQs only if the associated vCPUs is
> blocking
> *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
> *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test
>
> arch/x86/include/asm/irq_remapping.h | 17 +-
> arch/x86/include/asm/kvm-x86-ops.h | 2 +-
> arch/x86/include/asm/kvm_host.h | 20 +-
> arch/x86/include/asm/svm.h | 13 +-
> arch/x86/kvm/svm/avic.c | 707 ++++++++----------
> arch/x86/kvm/svm/svm.c | 6 +
> arch/x86/kvm/svm/svm.h | 24 +-
> arch/x86/kvm/trace.h | 19 +-
> arch/x86/kvm/vmx/capabilities.h | 1 -
> arch/x86/kvm/vmx/main.c | 2 +-
> arch/x86/kvm/vmx/posted_intr.c | 150 ++--
> arch/x86/kvm/vmx/posted_intr.h | 11 +-
> arch/x86/kvm/vmx/vmx.c | 2 -
> arch/x86/kvm/x86.c | 124 ++-
> drivers/iommu/amd/amd_iommu_types.h | 1 -
> drivers/iommu/amd/init.c | 8 +-
> drivers/iommu/amd/iommu.c | 171 +++--
> drivers/iommu/intel/irq_remapping.c | 10 +-
> include/linux/amd-iommu.h | 25 +-
> include/linux/kvm_host.h | 9 +-
> include/linux/kvm_irqfd.h | 4 +
> tools/testing/selftests/kvm/Makefile.kvm | 2 +
> .../selftests/kvm/include/vfio_pci_util.h | 149 ++++
> .../selftests/kvm/include/x86/processor.h | 21 +
> .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
> tools/testing/selftests/kvm/mercury_device.h | 118 +++
> tools/testing/selftests/kvm/vfio_irq_test.c | 429 +++++++++++
> virt/kvm/eventfd.c | 22 +-
> 28 files changed, 1610 insertions(+), 658 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
> create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
> create mode 100644 tools/testing/selftests/kvm/mercury_device.h
> create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c
>
>
> base-commit: 5f9f498ea14ffe15390aa46fb85375e7c901bce3
Powered by blists - more mailing lists