Message-ID: <20240904115942.7ff0cfcb.alex.williamson@redhat.com>
Date: Wed, 4 Sep 2024 11:59:42 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Philipp Stanner <pstanner@...hat.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, Krzysztof Wilczyński <kwilczynski@...nel.org>, Damien Le Moal
<dlemoal@...nel.org>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] PCI: Fix might_sleep() lockdep warning in pcim_intx()
On Wed, 4 Sep 2024 09:25:42 +0200
Philipp Stanner <pstanner@...hat.com> wrote:
> commit 25216afc9db5 ("PCI: Add managed pcim_intx()") moved the
> allocation step for pci_intx()'s device resource from
> pcim_enable_device() to pcim_intx(). As before, pcim_enable_device()
> sets pci_dev.is_managed to true, and it is never set to false again.
>
> Under some circumstances it can happen that drivers share a struct
> pci_dev. If one driver uses pcim_enable_device() and the other doesn't,
> this causes the other driver to run into managed pcim_intx(), which will
> try to allocate when called for the first time.
I don't think "share" is the correct way to describe this. The struct
pci_dev is never shared between drivers; it's more of a lifecycle
issue. The struct pci_dev persists for the life of the device, but
is_managed is relative to the driver that is currently bound to the
device. The issue is that the devres infrastructure doesn't clear this
flag when the managed driver that set it releases the struct pci_dev.
As we discussed in the other thread, this is a latent issue that has
existed for some time, but it's only when pcim_intx() starts allocating
memory, as introduced in the Fixes commit below, that the calling
requirements for pcim_intx() narrow and become incompatible with
existing drivers (and visible via lockdep).
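
To make that concrete, after the Fixes commit the plumbing looks
roughly like this (simplified sketch, not the literal code):

	void pci_intx(struct pci_dev *pdev, int enable)
	{
		...
		/*
		 * True once any driver has called pcim_enable_device()
		 * for this device; nothing clears it when that driver
		 * unbinds.
		 */
		if (pci_is_managed(pdev)) {
			/* Allocates a devres entry on the first call */
			pcim_intx(pdev, enable);
			return;
		}
		...
	}
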
> Allocations might sleep, so calling pci_intx() while holding spinlocks
> then becomes invalid, which causes lockdep warnings and could cause
> deadlocks.
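
Which is what the splat below shows. Modeled loosely on the vfio-pci
path in the trace, the problematic pattern is:

	spin_lock_irqsave(&vdev->irqlock, flags);
	pci_intx(pdev, 0);	/* -> pcim_intx()
				 * -> __devres_alloc_node(GFP_KERNEL),
				 * which may sleep under the spinlock */
	spin_unlock_irqrestore(&vdev->irqlock, flags);
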
>
> Have pcim_enable_device()'s release function, pcim_disable_device(), set
> pci_dev.is_managed to false so that subsequent drivers using the same
> struct pci_dev do not implicitly run into managed code.
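
In other words the release callback presumably ends up looking
something like this (sketch only, the exact body may differ):

	static void pcim_disable_device(void *pdev_raw)
	{
		struct pci_dev *pdev = pdev_raw;
		...
		/* Stop the next driver from inheriting managed mode */
		pdev->is_managed = false;
	}
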
>
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> Reported-by: Alex Williamson <alex.williamson@...hat.com>
> Closes: https://lore.kernel.org/all/20240903094431.63551744.alex.williamson@redhat.com/
> Signed-off-by: Philipp Stanner <pstanner@...hat.com>
> ---
> @Alex:
> Please have a look whether this fixes your issue.
Yes, this works for me.
Tested-by: Alex Williamson <alex.williamson@...hat.com>
> Might also be cool to include your lockdep warning in the commit
> message, if you can provide that.
See below. Thanks,
Alex
========================================================
WARNING: possible irq lock inversion dependency detected
6.11.0-rc6+ #59 Tainted: G W
--------------------------------------------------------
CPU 0/KVM/1537 just changed the state of lock:
ffffa0f0cff965f0 (&vdev->irqlock){-...}-{2:2}, at: vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
but this lock took another, HARDIRQ-unsafe lock in the past:
 (fs_reclaim){+.+.}-{0:0}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               local_irq_disable();
                               lock(&vdev->irqlock);
                               lock(fs_reclaim);
  <Interrupt>
    lock(&vdev->irqlock);
*** DEADLOCK ***
1 lock held by CPU 0/KVM/1537:
#0: ffffa0f0eaffc3b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x86/0x9a0 [kvm]
the shortest dependencies between 2nd lock and 1st lock:
-> (fs_reclaim){+.+.}-{0:0} {
HARDIRQ-ON-W at:
lock_acquire+0xba/0x2c0
fs_reclaim_acquire+0x99/0xf0
__kmalloc_node_noprof+0x93/0x470
alloc_cpumask_var_node+0x21/0x30
smp_prepare_cpus_common+0x90/0x140
native_smp_prepare_cpus+0xa/0xc0
kernel_init_freeable+0x23d/0x5a0
kernel_init+0x16/0x1d0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1a/0x30
SOFTIRQ-ON-W at:
lock_acquire+0xba/0x2c0
fs_reclaim_acquire+0x99/0xf0
__kmalloc_node_noprof+0x93/0x470
alloc_cpumask_var_node+0x21/0x30
smp_prepare_cpus_common+0x90/0x140
native_smp_prepare_cpus+0xa/0xc0
kernel_init_freeable+0x23d/0x5a0
kernel_init+0x16/0x1d0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1a/0x30
INITIAL USE at:
lock_acquire+0xba/0x2c0
fs_reclaim_acquire+0x99/0xf0
__kmalloc_node_noprof+0x93/0x470
alloc_cpumask_var_node+0x21/0x30
smp_prepare_cpus_common+0x90/0x140
native_smp_prepare_cpus+0xa/0xc0
kernel_init_freeable+0x23d/0x5a0
kernel_init+0x16/0x1d0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1a/0x30
}
... key at: [<ffffffffb2250300>] __fs_reclaim_map+0x0/0x28
... acquired at:
fs_reclaim_acquire+0x99/0xf0
__kmalloc_node_track_caller_noprof+0x8f/0x460
__devres_alloc_node+0x42/0x80
pcim_intx+0xb6/0xe0
pci_intx+0x67/0x80
__vfio_pci_intx_mask+0x70/0xe0 [vfio_pci_core]
vfio_pci_set_intx_mask+0x40/0x50 [vfio_pci_core]
vfio_pci_core_ioctl+0x74e/0xf50 [vfio_pci_core]
vfio_device_fops_unl_ioctl+0xa5/0x870 [vfio]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x8e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> (&vdev->irqlock){-...}-{2:2} {
IN-HARDIRQ-W at:
lock_acquire+0xba/0x2c0
_raw_spin_lock_irqsave+0x3b/0x60
vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
__handle_irq_event_percpu+0x81/0x260
handle_irq_event+0x34/0x70
handle_fasteoi_irq+0x91/0x210
__common_interrupt+0x6f/0x140
common_interrupt+0xb3/0xd0
asm_common_interrupt+0x22/0x40
vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel]
vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm]
kvm_vcpu_ioctl+0x239/0x9a0 [kvm]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x8e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
INITIAL USE at:
lock_acquire+0xba/0x2c0
_raw_spin_lock_irqsave+0x3b/0x60
__vfio_pci_intx_mask+0x30/0xe0 [vfio_pci_core]
vfio_pci_set_intx_mask+0x40/0x50 [vfio_pci_core]
vfio_pci_core_ioctl+0x74e/0xf50 [vfio_pci_core]
vfio_device_fops_unl_ioctl+0xa5/0x870 [vfio]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x8e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
}
... key at: [<ffffffffc0a68800>] __key.90+0x0/0x10 [vfio_pci_core]
... acquired at:
__lock_acquire+0x9b0/0x2130
lock_acquire+0xba/0x2c0
_raw_spin_lock_irqsave+0x3b/0x60
vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
__handle_irq_event_percpu+0x81/0x260
handle_irq_event+0x34/0x70
handle_fasteoi_irq+0x91/0x210
__common_interrupt+0x6f/0x140
common_interrupt+0xb3/0xd0
asm_common_interrupt+0x22/0x40
vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel]
vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm]
kvm_vcpu_ioctl+0x239/0x9a0 [kvm]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x8e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
stack backtrace:
CPU: 4 UID: 107 PID: 1537 Comm: CPU 0/KVM Tainted: G W 6.11.0-rc6+ #59
Tainted: [W]=WARN
Hardware name: HP HP ProDesk 600 G3 MT/829D, BIOS P02 Ver. 02.44 09/13/2022
Call Trace:
<IRQ>
dump_stack_lvl+0x62/0x90
mark_lock+0x3b7/0x960
__lock_acquire+0x9b0/0x2130
lock_acquire+0xba/0x2c0
? vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
_raw_spin_lock_irqsave+0x3b/0x60
? vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
vfio_intx_handler+0x21/0xd0 [vfio_pci_core]
__handle_irq_event_percpu+0x81/0x260
handle_irq_event+0x34/0x70
handle_fasteoi_irq+0x91/0x210
__common_interrupt+0x6f/0x140
common_interrupt+0xb3/0xd0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
RIP: 0010:vmx_do_interrupt_irqoff+0x10/0x20 [kvm_intel]
Code: ff ff ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 ff d7 <0f> 1f 00 48 89 ec 5d c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
RSP: 0018:ffffbc4d8231ba78 EFLAGS: 00000082
RAX: 0000000000000240 RBX: ffffa0f0eaffc300 RCX: 0000000080000000
RDX: ffffffff00000000 RSI: 0000000080000022 RDI: ffffffffb1000240
RBP: ffffbc4d8231ba78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0f0ea835000
R13: ffffa0f0e0d00000 R14: ffffa0f0eaffc300 R15: 0000000000000000
? irq_entries_start+0x10/0x660
vmx_handle_exit_irqoff+0xc3/0x220 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x12e9/0x1d20 [kvm]
? kvm_vcpu_ioctl+0x239/0x9a0 [kvm]
kvm_vcpu_ioctl+0x239/0x9a0 [kvm]
? find_held_lock+0x2b/0x80
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x8e/0x180
? lockdep_hardirqs_on+0x77/0x100
? do_syscall_64+0x9a/0x180
? vmx_vcpu_put+0xf6/0x270 [kvm_intel]
? kvm_arch_vcpu_put+0x191/0x270 [kvm]
? find_held_lock+0x2b/0x80
? kvm_vcpu_ioctl+0x197/0x9a0 [kvm]
? lock_release+0x132/0x290
? rcu_is_watching+0xd/0x40
? kfree+0x231/0x360
? __mutex_unlock_slowpath+0x2a/0x260
? kvm_vcpu_ioctl+0x1a7/0x9a0 [kvm]
? find_held_lock+0x2b/0x80
? user_return_notifier_unregister+0x3c/0x70
? do_syscall_64+0x9a/0x180
? lockdep_hardirqs_on+0x77/0x100
? do_syscall_64+0x9a/0x180
? syscall_exit_to_user_mode+0x1c2/0x2b0
? do_syscall_64+0x9a/0x180
? lockdep_hardirqs_on+0x77/0x100
? do_syscall_64+0x9a/0x180
? lockdep_hardirqs_on+0x77/0x100
? do_syscall_64+0x9a/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fd030fa13ed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007fd029bf6420 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000055b4692ec540 RCX: 00007fd030fa13ed
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000018
RBP: 00007fd029bf6470 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd029c006c0
R13: ffffffffffff7a90 R14: 0000000000000000 R15: 00007ffe459ab200
</TASK>