Message-ID: <njhjud3e6wbdftzr3ziyuh5bhyvc5ndt5qvmg7rlvh5isoop2l@f2uxctws2c7d>
Date: Mon, 22 Dec 2025 09:16:55 +0000
From: Ankit Soni <Ankit.Soni@....com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Marc Zyngier <maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>,
Paolo Bonzini <pbonzini@...hat.com>, Joerg Roedel <joro@...tes.org>, "David
Woodhouse" <dwmw2@...radead.org>, Lu Baolu <baolu.lu@...ux.intel.com>,
<linux-arm-kernel@...ts.infradead.org>, <kvmarm@...ts.linux.dev>,
<kvm@...r.kernel.org>, <iommu@...ts.linux.dev>,
<linux-kernel@...r.kernel.org>, Sairaj Kodilkar <sarunkod@....com>, "Vasant
Hegde" <vasant.hegde@....com>, Maxim Levitsky <mlevitsk@...hat.com>, "Joao
Martins" <joao.m.martins@...cle.com>, Francesco Lavra
<francescolavra.fl@...il.com>, David Matlack <dmatlack@...gle.com>, "Naveen
Rao" <Naveen.Rao@....com>
Subject: Re: [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across
IRTE updates in IOMMU
On Wed, Jun 11, 2025 at 03:45:41PM -0700, Sean Christopherson wrote:
> Now that svm_ir_list_add() isn't overloaded with all manner of weird
> things, fold it into avic_pi_update_irte(), and more importantly take
> ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
> that's shoved into the IRTE is fresh. While preemption (and IRQs) is
> disabled on the task performing the IRTE update, thanks to irqfds.lock,
> that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
> is irrelevant.
>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++------------------------
> 1 file changed, 22 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index f1e9f0dd43e8..4747fb09aca4 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
>
> int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> unsigned int host_irq, uint32_t guest_irq,
> struct kvm_vcpu *vcpu, u32 vector)
> @@ -823,8 +797,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> .vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
> .vector = vector,
> };
> + struct vcpu_svm *svm = to_svm(vcpu);
> + u64 entry;
> int ret;
>
> + /*
> + * Prevent the vCPU from being scheduled out or migrated until
> + * the IRTE is updated and its metadata has been added to the
> + * list of IRQs being posted to the vCPU, to ensure the IRTE
> + * isn't programmed with stale pCPU/IsRunning information.
> + */
> + guard(spinlock_irqsave)(&svm->ir_list_lock);
> +
Hi,
I’m seeing a lockdep warning about a possible circular locking dependency
involving svm->ir_list_lock and irq_desc_lock when using AMD SVM with AVIC
enabled and a VFIO passthrough device, on 6.19-rc2.
Environment
===========
- Kernel: 6.19.0-rc2
- QEMU: 10.1.94
- CPU: AMD EPYC 9965
- Modules involved: kvm_amd, kvm, vfio_pci, vfio, irqbypass, mlx5_core
- Workload: QEMU guest with an mlx5 PCI device passed through.
Lockdep warning
===============
The warning is:
======================================================
WARNING: possible circular locking dependency detected
6.19.0-rc2 #20 Tainted: G E
------------------------------------------------------
CPU 58/KVM/28597 is trying to acquire lock:
ff12c47d4b1f34c0 (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0
but task is already holding lock:
ff12c49b28552110 (&svm->ir_list_lock){....}-{2:2}, at: avic_pi_update_irte+0x147/0x270 [kvm_amd]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&svm->ir_list_lock){....}-{2:2}:
_raw_spin_lock_irqsave+0x4e/0xb0
__avic_vcpu_put+0x7a/0x150 [kvm_amd]
avic_vcpu_put+0x50/0x70 [kvm_amd]
svm_vcpu_put+0x38/0x70 [kvm_amd]
kvm_arch_vcpu_put+0x21b/0x330 [kvm]
kvm_sched_out+0x62/0x90 [kvm]
__schedule+0x8d3/0x1d10
__cond_resched+0x5c/0x80
__mutex_lock+0x83/0x10b0
mutex_lock_nested+0x1b/0x30
kvm_hv_set_msr_common+0x199/0x12a0 [kvm]
kvm_set_msr_common+0x468/0x1310 [kvm]
svm_set_msr+0x645/0x730 [kvm_amd]
__kvm_set_msr+0xa3/0x2f0 [kvm]
kvm_set_msr_ignored_check+0x23/0x1b0 [kvm]
do_set_msr+0x76/0xd0 [kvm]
msr_io+0xbe/0x1c0 [kvm]
kvm_arch_vcpu_ioctl+0x700/0x2090 [kvm]
kvm_vcpu_ioctl+0x632/0xc60 [kvm]
__x64_sys_ioctl+0xa5/0x100
x64_sys_call+0x1243/0x26b0
do_syscall_64+0x93/0x1470
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #2 (&rq->__lock){-.-.}-{2:2}:
_raw_spin_lock_nested+0x32/0x80
raw_spin_rq_lock_nested+0x22/0xa0
task_rq_lock+0x5f/0x150
cgroup_move_task+0x46/0x110
css_set_move_task+0xe1/0x240
cgroup_post_fork+0x98/0x2d0
copy_process+0x1ea8/0x2330
kernel_clone+0xa7/0x440
user_mode_thread+0x63/0x90
rest_init+0x28/0x200
start_kernel+0xae0/0xcd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xfd/0x150
common_startup_64+0x13e/0x141
-> #1 (&p->pi_lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x4e/0xb0
try_to_wake_up+0x59/0xaa0
wake_up_process+0x15/0x30
irq_do_set_affinity+0x145/0x270
irq_set_affinity_locked+0x172/0x250
irq_set_affinity+0x47/0x80
write_irq_affinity.isra.0+0xfe/0x120
irq_affinity_proc_write+0x1d/0x30
proc_reg_write+0x69/0xa0
vfs_write+0x110/0x560
ksys_write+0x77/0x100
__x64_sys_write+0x19/0x30
x64_sys_call+0x79/0x26b0
do_syscall_64+0x93/0x1470
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&irq_desc_lock_class){-.-.}-{2:2}:
__lock_acquire+0x1595/0x2640
lock_acquire+0xc4/0x2c0
_raw_spin_lock_irqsave+0x4e/0xb0
__irq_get_desc_lock+0x58/0xa0
irq_set_vcpu_affinity+0x4a/0x100
avic_pi_update_irte+0x170/0x270 [kvm_amd]
kvm_pi_update_irte+0xea/0x220 [kvm]
kvm_arch_irq_bypass_add_producer+0x9b/0xb0 [kvm]
__connect+0x5f/0x100 [irqbypass]
irq_bypass_register_producer+0xe4/0xb90 [irqbypass]
vfio_msi_set_vector_signal+0x1b0/0x330 [vfio_pci_core]
vfio_msi_set_block+0x5a/0xd0 [vfio_pci_core]
vfio_pci_set_msi_trigger+0x19e/0x260 [vfio_pci_core]
vfio_pci_set_irqs_ioctl+0x46/0x140 [vfio_pci_core]
vfio_pci_core_ioctl+0x6ea/0xc20 [vfio_pci_core]
vfio_device_fops_unl_ioctl+0xb1/0x9d0 [vfio]
__x64_sys_ioctl+0xa5/0x100
x64_sys_call+0x1243/0x26b0
do_syscall_64+0x93/0x1470
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Chain exists of:
&irq_desc_lock_class --> &rq->__lock --> &svm->ir_list_lock
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&svm->ir_list_lock);
                               lock(&rq->__lock);
                               lock(&svm->ir_list_lock);
  lock(&irq_desc_lock_class);
*** DEADLOCK ***
At the point of the warning, the following locks are held:
#0: &vdev->igate (vfio_pci_core)
#1: lock#10 (irqbypass)
#2: &kvm->irqfds.lock (kvm)
#3: &svm->ir_list_lock (kvm_amd)
and the stack backtrace has:
__irq_get_desc_lock
irq_set_vcpu_affinity
avic_pi_update_irte [kvm_amd]
kvm_pi_update_irte [kvm]
kvm_arch_irq_bypass_add_producer [kvm]
__connect [irqbypass]
irq_bypass_register_producer [irqbypass]
vfio_msi_set_vector_signal [vfio_pci_core]
vfio_pci_set_irqs_ioctl [vfio_pci_core]
vfio_pci_core_ioctl [vfio_pci_core]
vfio_device_fops_unl_ioctl [vfio]
__x64_sys_ioctl
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
So lockdep already has the dependency chain
&irq_desc_lock_class -> &rq->__lock -> &svm->ir_list_lock
while avic_pi_update_irte() now holds svm->ir_list_lock and then takes
irq_desc_lock via irq_set_vcpu_affinity(), which closes the cycle and
creates the potential ABBA inversion.
Reproduction
============
Host:
- AMD EPYC + AVIC enabled
- Kernel 6.19-rc2 with lockdep
- VFIO passthrough of an mlx5 device (mlx5_core loaded)
Launch the guest with the passthrough device assigned and AVIC mode
enabled. The warning triggers when MSI/MSI-X is enabled for the
passthrough device from the guest, i.e. via the VFIO ioctl on the host
that goes through irq_bypass and eventually calls avic_pi_update_irte().
I can reproduce this reliably the first time the guest is started with
the VFIO device assigned after a host reboot (lockdep only reports once
per boot).
Questions
=========
- Is this lockdep warning expected/benign in this code path, or does it
indicate a real potential deadlock between svm->ir_list_lock and
irq_desc_lock with AVIC + irq_bypass + VFIO?
- If this is considered a real issue, is the expected direction to:
* change the locking around avic_pi_update_irte()/svm->ir_list_lock, or
* adjust how irq_bypass / VFIO interacts with vCPU affinity updates
on AMD/AVIC, or
* annotate the locking somehow if lockdep is over-reporting here?
I’m happy to:
- Provide my full .config
- Share the exact QEMU command line
- Test any proposed patches or instrumentation
Thanks,
Ankit Soni
Ankit.Soni@....com
> --
> 2.50.0.rc1.591.g9c95f17f64-goog
>