Message-ID: <njhjud3e6wbdftzr3ziyuh5bhyvc5ndt5qvmg7rlvh5isoop2l@f2uxctws2c7d>
Date: Mon, 22 Dec 2025 09:16:55 +0000
From: Ankit Soni <Ankit.Soni@....com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Marc Zyngier <maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>,
	Paolo Bonzini <pbonzini@...hat.com>, Joerg Roedel <joro@...tes.org>, "David
 Woodhouse" <dwmw2@...radead.org>, Lu Baolu <baolu.lu@...ux.intel.com>,
	<linux-arm-kernel@...ts.infradead.org>, <kvmarm@...ts.linux.dev>,
	<kvm@...r.kernel.org>, <iommu@...ts.linux.dev>,
	<linux-kernel@...r.kernel.org>, Sairaj Kodilkar <sarunkod@....com>, "Vasant
 Hegde" <vasant.hegde@....com>, Maxim Levitsky <mlevitsk@...hat.com>, "Joao
 Martins" <joao.m.martins@...cle.com>, Francesco Lavra
	<francescolavra.fl@...il.com>, David Matlack <dmatlack@...gle.com>, "Naveen
 Rao" <Naveen.Rao@....com>
Subject: Re: [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across
 IRTE updates in IOMMU

On Wed, Jun 11, 2025 at 03:45:41PM -0700, Sean Christopherson wrote:
> Now that svm_ir_list_add() isn't overloaded with all manner of weird
> things, fold it into avic_pi_update_irte(), and more importantly take
> ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
> that's shoved into the IRTE is fresh.  While preemption (and IRQs) is
> disabled on the task performing the IRTE update, thanks to irqfds.lock,
> that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
> is irrelevant.
> 
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++------------------------
>  1 file changed, 22 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index f1e9f0dd43e8..4747fb09aca4 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
>  
>  int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>  			unsigned int host_irq, uint32_t guest_irq,
>  			struct kvm_vcpu *vcpu, u32 vector)
> @@ -823,8 +797,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>  			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
>  			.vector = vector,
>  		};
> +		struct vcpu_svm *svm = to_svm(vcpu);
> +		u64 entry;
>  		int ret;
>  
> +		/*
> +		 * Prevent the vCPU from being scheduled out or migrated until
> +		 * the IRTE is updated and its metadata has been added to the
> +		 * list of IRQs being posted to the vCPU, to ensure the IRTE
> +		 * isn't programmed with stale pCPU/IsRunning information.
> +		 */
> +		guard(spinlock_irqsave)(&svm->ir_list_lock);
> +

Hi,

I’m seeing a lockdep warning about a possible circular locking dependency
involving svm->ir_list_lock and irq_desc_lock when using AMD SVM with AVIC
enabled and a VFIO passthrough device, on 6.19-rc2.

Environment
===========

  - Kernel: 6.19.0-rc2
  - QEMU: 10.1.94
  - CPU: AMD EPYC 9965
  - Modules involved: kvm_amd, kvm, vfio_pci, vfio, irqbypass, mlx5_core
  - Workload: QEMU guest with an mlx5 PCI device passed through.

Lockdep warning
===============

The warning is:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.19.0-rc2 #20 Tainted: G            E
  ------------------------------------------------------
  CPU 58/KVM/28597 is trying to acquire lock:
    ff12c47d4b1f34c0 (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

  but task is already holding lock:
    ff12c49b28552110 (&svm->ir_list_lock){....}-{2:2}, at: avic_pi_update_irte+0x147/0x270 [kvm_amd]

  which lock already depends on the new lock.
  the existing dependency chain (in reverse order) is:

    -> #3 (&svm->ir_list_lock){....}-{2:2}:
         _raw_spin_lock_irqsave+0x4e/0xb0
         __avic_vcpu_put+0x7a/0x150 [kvm_amd]
         avic_vcpu_put+0x50/0x70 [kvm_amd]
         svm_vcpu_put+0x38/0x70 [kvm_amd]
         kvm_arch_vcpu_put+0x21b/0x330 [kvm]
         kvm_sched_out+0x62/0x90 [kvm]
         __schedule+0x8d3/0x1d10
         __cond_resched+0x5c/0x80
         __mutex_lock+0x83/0x10b0
         mutex_lock_nested+0x1b/0x30
         kvm_hv_set_msr_common+0x199/0x12a0 [kvm]
         kvm_set_msr_common+0x468/0x1310 [kvm]
         svm_set_msr+0x645/0x730 [kvm_amd]
         __kvm_set_msr+0xa3/0x2f0 [kvm]
         kvm_set_msr_ignored_check+0x23/0x1b0 [kvm]
         do_set_msr+0x76/0xd0 [kvm]
         msr_io+0xbe/0x1c0 [kvm]
         kvm_arch_vcpu_ioctl+0x700/0x2090 [kvm]
         kvm_vcpu_ioctl+0x632/0xc60 [kvm]
         __x64_sys_ioctl+0xa5/0x100
         x64_sys_call+0x1243/0x26b0
         do_syscall_64+0x93/0x1470
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

    -> #2 (&rq->__lock){-.-.}-{2:2}:
         _raw_spin_lock_nested+0x32/0x80
         raw_spin_rq_lock_nested+0x22/0xa0
         task_rq_lock+0x5f/0x150
         cgroup_move_task+0x46/0x110
         css_set_move_task+0xe1/0x240
         cgroup_post_fork+0x98/0x2d0
         copy_process+0x1ea8/0x2330
         kernel_clone+0xa7/0x440
         user_mode_thread+0x63/0x90
         rest_init+0x28/0x200
         start_kernel+0xae0/0xcd0
         x86_64_start_reservations+0x18/0x30
         x86_64_start_kernel+0xfd/0x150
         common_startup_64+0x13e/0x141

    -> #1 (&p->pi_lock){-.-.}-{2:2}:
         _raw_spin_lock_irqsave+0x4e/0xb0
         try_to_wake_up+0x59/0xaa0
         wake_up_process+0x15/0x30
         irq_do_set_affinity+0x145/0x270
         irq_set_affinity_locked+0x172/0x250
         irq_set_affinity+0x47/0x80
         write_irq_affinity.isra.0+0xfe/0x120
         irq_affinity_proc_write+0x1d/0x30
         proc_reg_write+0x69/0xa0
         vfs_write+0x110/0x560
         ksys_write+0x77/0x100
         __x64_sys_write+0x19/0x30
         x64_sys_call+0x79/0x26b0
         do_syscall_64+0x93/0x1470
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

    -> #0 (&irq_desc_lock_class){-.-.}-{2:2}:
         __lock_acquire+0x1595/0x2640
         lock_acquire+0xc4/0x2c0
         _raw_spin_lock_irqsave+0x4e/0xb0
         __irq_get_desc_lock+0x58/0xa0
         irq_set_vcpu_affinity+0x4a/0x100
         avic_pi_update_irte+0x170/0x270 [kvm_amd]
         kvm_pi_update_irte+0xea/0x220 [kvm]
         kvm_arch_irq_bypass_add_producer+0x9b/0xb0 [kvm]
         __connect+0x5f/0x100 [irqbypass]
         irq_bypass_register_producer+0xe4/0xb90 [irqbypass]
         vfio_msi_set_vector_signal+0x1b0/0x330 [vfio_pci_core]
         vfio_msi_set_block+0x5a/0xd0 [vfio_pci_core]
         vfio_pci_set_msi_trigger+0x19e/0x260 [vfio_pci_core]
         vfio_pci_set_irqs_ioctl+0x46/0x140 [vfio_pci_core]
         vfio_pci_core_ioctl+0x6ea/0xc20 [vfio_pci_core]
         vfio_device_fops_unl_ioctl+0xb1/0x9d0 [vfio]
         __x64_sys_ioctl+0xa5/0x100
         x64_sys_call+0x1243/0x26b0
         do_syscall_64+0x93/0x1470
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  Chain exists of:
    &irq_desc_lock_class --> &rq->__lock --> &svm->ir_list_lock

  Possible unsafe locking scenario:

        CPU0                            CPU1
        ----                            ----
   lock(&svm->ir_list_lock);
                                      lock(&rq->__lock);
                                      lock(&svm->ir_list_lock);
   lock(&irq_desc_lock_class);

        *** DEADLOCK ***

At the point of the warning, the following locks are held:

  #0: &vdev->igate           (vfio_pci_core)
  #1: lock#10                (irqbypass)
  #2: &kvm->irqfds.lock      (kvm)
  #3: &svm->ir_list_lock     (kvm_amd)

and the stack backtrace has:

  __irq_get_desc_lock
  irq_set_vcpu_affinity
  avic_pi_update_irte               [kvm_amd]
  kvm_pi_update_irte                [kvm]
  kvm_arch_irq_bypass_add_producer  [kvm]
  __connect                         [irqbypass]
  irq_bypass_register_producer      [irqbypass]
  vfio_msi_set_vector_signal        [vfio_pci_core]
  vfio_pci_set_irqs_ioctl           [vfio_pci_core]
  vfio_pci_core_ioctl               [vfio_pci_core]
  vfio_device_fops_unl_ioctl        [vfio]
  __x64_sys_ioctl
  x64_sys_call
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

So lockdep already knows the chain:

  &irq_desc_lock_class -> &rq->__lock -> &svm->ir_list_lock

i.e. an irq-affinity write holds the irq_desc lock while waking a task,
and the resulting schedule can put a vCPU and take svm->ir_list_lock.
Meanwhile avic_pi_update_irte() now holds svm->ir_list_lock and then
takes the irq_desc lock via irq_set_vcpu_affinity(), which closes the
cycle and creates the potential ABBA inversion.

Reproduction
============

Host:

  - AMD EPYC + AVIC enabled
  - Kernel 6.19-rc2 with lockdep
  - VFIO passthrough of an mlx5 device (mlx5_core loaded)

Launch QEMU with the device passed through and AVIC enabled.

The warning triggers when enabling MSI/MSI-X for the passthrough device
from the guest, i.e. via VFIO ioctl on the host that goes through
irq_bypass and eventually calls avic_pi_update_irte().

I can reproduce this reliably on the first guest start with the VFIO
device assigned after each host reboot.

Questions
=========

  - Is this lockdep warning expected/benign in this code path, or does it
    indicate a real potential deadlock between svm->ir_list_lock and
    irq_desc_lock with AVIC + irq_bypass + VFIO?

  - If this is considered a real issue, is the expected direction to:
      * change the locking around avic_pi_update_irte()/svm->ir_list_lock, or
      * adjust how irq_bypass / VFIO interacts with vCPU affinity updates
        on AMD/AVIC, or
      * annotate the locking somehow if lockdep is over-reporting here?

I’m happy to:

  - Provide my full .config
  - Share the exact QEMU command line
  - Test any proposed patches or instrumentation

Thanks,
Ankit Soni
Ankit.Soni@....com

> -- 
> 2.50.0.rc1.591.g9c95f17f64-goog
> 
