Message-ID: <7dtl5qum4mfgjosj2mkfqu5u5tu7p2roi2et3env4lhrccmiqi@asemffaeeflr>
Date: Thu, 6 Nov 2025 11:55:38 +0100
From: "Winiarski, Michal" <michal.winiarski@...el.com>
To: "Tian, Kevin" <kevin.tian@...el.com>
CC: Alex Williamson <alex@...zbot.org>, "De Marchi, Lucas"
<lucas.demarchi@...el.com>, Thomas Hellström
<thomas.hellstrom@...ux.intel.com>, "Vivi, Rodrigo" <rodrigo.vivi@...el.com>,
Jason Gunthorpe <jgg@...pe.ca>, Yishai Hadas <yishaih@...dia.com>, "Shameer
Kolothum" <skolothumtho@...dia.com>, "intel-xe@...ts.freedesktop.org"
<intel-xe@...ts.freedesktop.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Brost, Matthew" <matthew.brost@...el.com>, "Wajdeczko, Michal"
<Michal.Wajdeczko@...el.com>, "dri-devel@...ts.freedesktop.org"
<dri-devel@...ts.freedesktop.org>, Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>, Tvrtko Ursulin
<tursulin@...ulin.net>, David Airlie <airlied@...il.com>, Simona Vetter
<simona@...ll.ch>, "Laguna, Lukasz" <lukasz.laguna@...el.com>, "Christoph
Hellwig" <hch@...radead.org>
Subject: Re: [PATCH v4 28/28] vfio/xe: Add device specific vfio_pci driver
variant for Intel graphics
On Thu, Nov 06, 2025 at 09:20:36AM +0100, Tian, Kevin wrote:
> > From: Winiarski, Michal <michal.winiarski@...el.com>
> > Sent: Wednesday, November 5, 2025 11:10 PM
> >
> > In addition to generic VFIO PCI functionality, the driver implements
> > VFIO migration uAPI, allowing userspace to enable migration for Intel
> > Graphics SR-IOV Virtual Functions.
> > The driver binds to VF device, and uses API exposed by Xe driver bound
> > to PF device to control VF device state and transfer the migration data.
>
> "The driver binds to VF device and uses API exposed by Xe driver to
> transfer the VF migration data under the control of PF device."
Ok.
>
> > +config XE_VFIO_PCI
> > + tristate "VFIO support for Intel Graphics"
> > + depends on DRM_XE
> > + select VFIO_PCI_CORE
> > + help
> > + This option enables vendor-specific VFIO driver for Intel Graphics.
> > + In addition to generic VFIO PCI functionality, it implements VFIO
> > + migration uAPI allowing userspace to enable migration for
> > + Intel Graphics SR-IOV Virtual Functions supported by the Xe driver.
>
> hmm another "vendor-specific"...
Ooops. I'll switch to "device specific driver variant" naming here as
well.
>
> > +struct xe_vfio_pci_core_device {
> > + struct vfio_pci_core_device core_device;
> > + struct xe_device *xe;
> > + /* VF number used by PF, Xe HW/FW components use vfid indexing
> > starting from 1 */
>
> Having both PF and Xe HW/FW is a bit noisy. Could be:
>
> /* PF internal control uses vfid index starting from 1 */
Ok.
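For reference, the off-by-one then only shows up once at probe time,
roughly like this (illustrative sketch; pci_iov_vf_id() is the 0-based
PCI core helper, the rest of the probe plumbing is omitted):

	/* PF internal control uses vfid index starting from 1 */
	xe_vdev->vfid = pci_iov_vf_id(pdev) + 1;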
>
> > +
> > +static void xe_vfio_pci_state_mutex_lock(struct xe_vfio_pci_core_device
> > *xe_vdev)
> > +{
> > + mutex_lock(&xe_vdev->state_mutex);
> > +}
> > +
> > +/*
> > + * This function is called in all state_mutex unlock cases to
> > + * handle a 'deferred_reset' if one exists.
> > + */
> > +static void xe_vfio_pci_state_mutex_unlock(struct xe_vfio_pci_core_device
> > *xe_vdev)
> > +{
> > +again:
> > + spin_lock(&xe_vdev->reset_lock);
> > + if (xe_vdev->deferred_reset) {
> > + xe_vdev->deferred_reset = false;
> > + spin_unlock(&xe_vdev->reset_lock);
> > + xe_vfio_pci_reset(xe_vdev);
> > + goto again;
> > + }
> > + mutex_unlock(&xe_vdev->state_mutex);
> > + spin_unlock(&xe_vdev->reset_lock);
> > +}
>
> this deferred_reset logic is an mlx-unique thing. See:
>
> https://lore.kernel.org/kvm/20240220132459.GM13330@nvidia.com/
Interesting, that doesn't match my observations.
[] ======================================================
[] WARNING: possible circular locking dependency detected
[] 6.18.0-rc3-xe+ #90 Tainted: G S U
[] ------------------------------------------------------
[] qemu-system-x86/4375 is trying to acquire lock:
[] ff1100015af3ec30 (&migf->lock){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] but task is already holding lock:
[] ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
[] which lock already depends on the new lock.
[] the existing dependency chain (in reverse order) is:
[] -> #3 (&xe_vdev->state_mutex){+.+.}-{3:3}:
[] __mutex_lock+0xba/0x1110
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_reset_done+0x2b/0xc0 [xe_vfio_pci]
[] pci_try_reset_function+0xd7/0x150
[] vfio_pci_core_ioctl+0x7f1/0xf20 [vfio_pci_core]
[] vfio_device_fops_unl_ioctl+0x163/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #2 (&vdev->memory_lock){++++}-{3:3}:
[] down_read+0x41/0x180
[] vfio_pci_mmap_huge_fault+0xec/0x310 [vfio_pci_core]
[] handle_mm_fault+0x8aa/0x13b0
[] fixup_user_fault+0x124/0x280
[] vaddr_get_pfns+0x1d2/0x420 [vfio_iommu_type1]
[] vfio_pin_pages_remote+0x173/0x610 [vfio_iommu_type1]
[] vfio_pin_map_dma+0xfd/0x340 [vfio_iommu_type1]
[] vfio_iommu_type1_ioctl+0xfdf/0x1290 [vfio_iommu_type1]
[] vfio_fops_unl_ioctl+0x106/0x340 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #1 (&mm->mmap_lock){++++}-{3:3}:
[] __might_fault+0x56/0x90
[] _copy_to_user+0x23/0x70
[] xe_sriov_migration_data_read+0x17b/0x3f0 [xe]
[] xe_sriov_vfio_data_read+0x44/0x60 [xe]
[] xe_vfio_pci_save_read+0x55/0x80 [xe_vfio_pci]
[] vfs_read+0xc0/0x300
[] ksys_read+0x79/0xf0
[] __x64_sys_read+0x1b/0x30
[] x64_sys_call+0xcc4/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #0 (&migf->lock){+.+.}-{3:3}:
[] __lock_acquire+0x1aff/0x3450
[] lock_acquire+0xde/0x280
[] __mutex_lock+0xba/0x1110
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
[] vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] other info that might help us debug this:
[] Chain exists of:
[]   &migf->lock --> &vdev->memory_lock --> &xe_vdev->state_mutex
[] Possible unsafe locking scenario:
[] CPU0 CPU1
[] ---- ----
[] lock(&xe_vdev->state_mutex);
[] lock(&vdev->memory_lock);
[] lock(&xe_vdev->state_mutex);
[] lock(&migf->lock);
[] *** DEADLOCK ***
[] 1 lock held by qemu-system-x86/4375:
[] #0: ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
[] stack backtrace:
[] CPU: 18 UID: 0 PID: 4375 Comm: qemu-system-x86 Tainted: G S U 6.18.0-rc3-xe+ #90 PREEMPT(voluntary)
[] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
[] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS SE5C6200.86B.0027.P18.2206090856 06/09/2022
[] Call Trace:
[] <TASK>
[] __dump_stack+0x19/0x30
[] dump_stack_lvl+0x66/0x90
[] dump_stack+0x10/0x14
[] print_circular_bug+0x2fd/0x310
[] check_noncircular+0x139/0x160
[] __lock_acquire+0x1aff/0x3450
[] ? vprintk_emit+0x3ec/0x560
[] ? dev_vprintk_emit+0x162/0x1c0
[] lock_acquire+0xde/0x280
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] __mutex_lock+0xba/0x1110
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
[] vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] ? __lock_acquire+0x73f/0x3450
[] ? __lock_acquire+0x73f/0x3450
[] ? __lock_acquire+0x73f/0x3450
[] ? lock_release+0x10b/0x340
[] ? wp_page_reuse+0x82/0x100
[] ? lock_release+0x10b/0x340
[] ? wp_page_reuse+0xcc/0x100
[] ? lock_acquire+0xde/0x280
[] ? count_memcg_event_mm+0x20/0x170
[] ? lock_is_held_type+0x8f/0x140
[] ? lock_release+0x10b/0x340
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x114/0x170
[] ? handle_mm_fault+0x1300/0x13b0
[] ? handle_mm_fault+0x3c/0x13b0
[] ? lock_vma_under_rcu+0x8c/0x230
[] ? lock_release+0x10b/0x340
[] ? exc_page_fault+0x77/0xf0
[] ? irqentry_exit_to_user_mode+0x100/0x130
[] ? irqentry_exit+0x31/0x80
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] RIP: 0033:0x70dff032eb1d
[] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[] RSP: 002b:00007ffcc0367ff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[] RAX: ffffffffffffffda RBX: 00005748e046d600 RCX: 000070dff032eb1d
[] RDX: 00007ffcc0368080 RSI: 0000000000003b75 RDI: 000000000000001d
[] RBP: 00007ffcc0368040 R08: 00000005748df663 R09: 0000000000000007
[] R10: 00005748df663060 R11: 0000000000000246 R12: 0000000000000001
[] R13: 0000000000000000 R14: 00005748e055f0b0 R15: 00007ffcc0368080
[] </TASK>
In short:
0: set_device_state
   xe_vdev->state_mutex : migf->lock
1: data_read
   migf->lock : mm->mmap_lock
2: vfio_pin_dma
   mm->mmap_lock : vdev->memory_lock
3: vfio_pci_ioctl_reset
   vdev->memory_lock : xe_vdev->state_mutex
In other words:
set_device_state takes xe_vdev->state_mutex and blocks on migf->lock,
data_read takes migf->lock and blocks on mm->mmap_lock,
vfio_pin_dma takes mm->mmap_lock and blocks on vdev->memory_lock, and
reset takes vdev->memory_lock and blocks on xe_vdev->state_mutex.
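Concretely, edge #0 comes from the state machine doing something like
this (sketch only - the call sites are from the splat above; the field
names saving_migf and disabled are illustrative, mirroring mlx5):

static void xe_vfio_pci_disable_fd(struct xe_vfio_pci_core_device *xe_vdev)
{
	struct xe_vfio_pci_migration_file *migf = xe_vdev->saving_migf;

	lockdep_assert_held(&xe_vdev->state_mutex);

	if (!migf)
		return;

	/* state_mutex -> migf->lock: edge #0 of the cycle */
	mutex_lock(&migf->lock);
	migf->disabled = true;
	mutex_unlock(&migf->lock);
}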
copy_to_user()/copy_from_user() doesn't have to be called under
state_mutex; it just has to be done under migf->lock.
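I.e. the read side only ever looks like this (sketch; the function
names are from the splat above, the exact signatures and the
migration-file fields are approximations):

static ssize_t xe_vfio_pci_save_read(struct file *filp, char __user *buf,
				     size_t len, loff_t *pos)
{
	struct xe_vfio_pci_migration_file *migf = filp->private_data;
	ssize_t ret;

	/*
	 * Only migf->lock is held across the user copy; state_mutex is
	 * not, so faulting buf in (mm->mmap_lock) under migf->lock is
	 * the only dependency this path creates (edge #1).
	 */
	mutex_lock(&migf->lock);
	ret = xe_sriov_vfio_data_read(migf->xe, migf->vfid, buf, len);
	mutex_unlock(&migf->lock);

	return ret;
}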
The deferred reset trick exists because migf->lock needs to be taken
under state_mutex as part of the reset_done callback, which completes
the chain and triggers the lockdep splat.
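For completeness, this is what the trick looks like on the reset_done
side (sketch modeled on the mlx5 flow linked above; the drvdata lookup
follows the struct layout quoted earlier):

static void xe_vfio_pci_reset_done(struct pci_dev *pdev)
{
	struct vfio_pci_core_device *core = dev_get_drvdata(&pdev->dev);
	struct xe_vfio_pci_core_device *xe_vdev =
		container_of(core, struct xe_vfio_pci_core_device,
			     core_device);

	/*
	 * reset_done can be reached with vdev->memory_lock held, so
	 * taking state_mutex unconditionally here would close the
	 * cycle. Record the reset and, if another context holds
	 * state_mutex, let it replay the reset in
	 * xe_vfio_pci_state_mutex_unlock().
	 */
	spin_lock(&xe_vdev->reset_lock);
	xe_vdev->deferred_reset = true;
	if (!mutex_trylock(&xe_vdev->state_mutex)) {
		spin_unlock(&xe_vdev->reset_lock);
		return;
	}
	spin_unlock(&xe_vdev->reset_lock);
	xe_vfio_pci_state_mutex_unlock(xe_vdev);
}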
To me, this looks like something generic that will impact any device
specific driver variant.
What am I missing?
I wonder if drivers that don't implement the deferred reset trick were
ever executed with lockdep enabled.
(BTW: looking at it in more depth again - I do need to revisit the
disable_fd flow on the xe-vfio-pci side, so expect small changes on
that front in the next revision.)
Thanks,
-Michał