Message-ID: <7dtl5qum4mfgjosj2mkfqu5u5tu7p2roi2et3env4lhrccmiqi@asemffaeeflr>
Date: Thu, 6 Nov 2025 11:55:38 +0100
From: "Winiarski, Michal" <michal.winiarski@...el.com>
To: "Tian, Kevin" <kevin.tian@...el.com>
CC: Alex Williamson <alex@...zbot.org>, "De Marchi, Lucas"
<lucas.demarchi@...el.com>, Thomas Hellström
<thomas.hellstrom@...ux.intel.com>, "Vivi, Rodrigo" <rodrigo.vivi@...el.com>,
Jason Gunthorpe <jgg@...pe.ca>, Yishai Hadas <yishaih@...dia.com>, "Shameer
Kolothum" <skolothumtho@...dia.com>, "intel-xe@...ts.freedesktop.org"
<intel-xe@...ts.freedesktop.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Brost, Matthew" <matthew.brost@...el.com>, "Wajdeczko, Michal"
<Michal.Wajdeczko@...el.com>, "dri-devel@...ts.freedesktop.org"
<dri-devel@...ts.freedesktop.org>, Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>, Tvrtko Ursulin
<tursulin@...ulin.net>, David Airlie <airlied@...il.com>, Simona Vetter
<simona@...ll.ch>, "Laguna, Lukasz" <lukasz.laguna@...el.com>, "Christoph
Hellwig" <hch@...radead.org>
Subject: Re: [PATCH v4 28/28] vfio/xe: Add device specific vfio_pci driver
variant for Intel graphics
On Thu, Nov 06, 2025 at 09:20:36AM +0100, Tian, Kevin wrote:
> > From: Winiarski, Michal <michal.winiarski@...el.com>
> > Sent: Wednesday, November 5, 2025 11:10 PM
> >
> > In addition to generic VFIO PCI functionality, the driver implements
> > VFIO migration uAPI, allowing userspace to enable migration for Intel
> > Graphics SR-IOV Virtual Functions.
> > The driver binds to VF device, and uses API exposed by Xe driver bound
> > to PF device to control VF device state and transfer the migration data.
>
> "The driver binds to VF device and uses API exposed by Xe driver to
> transfer the VF migration data under the control of PF device."
Ok.
>
> > +config XE_VFIO_PCI
> > + tristate "VFIO support for Intel Graphics"
> > + depends on DRM_XE
> > + select VFIO_PCI_CORE
> > + help
> > + This option enables vendor-specific VFIO driver for Intel Graphics.
> > + In addition to generic VFIO PCI functionality, it implements VFIO
> > + migration uAPI allowing userspace to enable migration for
> > + Intel Graphics SR-IOV Virtual Functions supported by the Xe driver.
>
> hmm another "vendor-specific"...
Ooops. I'll switch to "device specific driver variant" naming here as
well.
>
> > +struct xe_vfio_pci_core_device {
> > + struct vfio_pci_core_device core_device;
> > + struct xe_device *xe;
> > + /* VF number used by PF, Xe HW/FW components use vfid indexing
> > starting from 1 */
>
> Having both PF and Xe HW/FW is a bit noisy. Could be:
>
> /* PF internal control uses vfid index starting from 1 */
Ok.
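For reference, the off-by-one then only shows up once at probe time,
roughly like this (illustrative sketch; pci_iov_vf_id() is the 0-based
PCI core helper, the rest of the probe plumbing is omitted):

	/* PF internal control uses vfid index starting from 1 */
	xe_vdev->vfid = pci_iov_vf_id(pdev) + 1;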
>
> > +
> > +static void xe_vfio_pci_state_mutex_lock(struct xe_vfio_pci_core_device
> > *xe_vdev)
> > +{
> > + mutex_lock(&xe_vdev->state_mutex);
> > +}
> > +
> > +/*
> > + * This function is called in all state_mutex unlock cases to
> > + * handle a 'deferred_reset' if one exists.
> > + */
> > +static void xe_vfio_pci_state_mutex_unlock(struct xe_vfio_pci_core_device
> > *xe_vdev)
> > +{
> > +again:
> > + spin_lock(&xe_vdev->reset_lock);
> > + if (xe_vdev->deferred_reset) {
> > + xe_vdev->deferred_reset = false;
> > + spin_unlock(&xe_vdev->reset_lock);
> > + xe_vfio_pci_reset(xe_vdev);
> > + goto again;
> > + }
> > + mutex_unlock(&xe_vdev->state_mutex);
> > + spin_unlock(&xe_vdev->reset_lock);
> > +}
>
> this deferred_reset logic is an mlx-unique thing. See:
>
> https://lore.kernel.org/kvm/20240220132459.GM13330@nvidia.com/
Interesting, that doesn't match my observations.
[] ======================================================
[] WARNING: possible circular locking dependency detected
[] 6.18.0-rc3-xe+ #90 Tainted: G S U
[] ------------------------------------------------------
[] qemu-system-x86/4375 is trying to acquire lock:
[] ff1100015af3ec30 (&migf->lock){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] but task is already holding lock:
[] ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
[] which lock already depends on the new lock.
[] the existing dependency chain (in reverse order) is:
[] -> #3 (&xe_vdev->state_mutex){+.+.}-{3:3}:
[] __mutex_lock+0xba/0x1110
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_reset_done+0x2b/0xc0 [xe_vfio_pci]
[] pci_try_reset_function+0xd7/0x150
[] vfio_pci_core_ioctl+0x7f1/0xf20 [vfio_pci_core]
[] vfio_device_fops_unl_ioctl+0x163/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #2 (&vdev->memory_lock){++++}-{3:3}:
[] down_read+0x41/0x180
[] vfio_pci_mmap_huge_fault+0xec/0x310 [vfio_pci_core]
[] handle_mm_fault+0x8aa/0x13b0
[] fixup_user_fault+0x124/0x280
[] vaddr_get_pfns+0x1d2/0x420 [vfio_iommu_type1]
[] vfio_pin_pages_remote+0x173/0x610 [vfio_iommu_type1]
[] vfio_pin_map_dma+0xfd/0x340 [vfio_iommu_type1]
[] vfio_iommu_type1_ioctl+0xfdf/0x1290 [vfio_iommu_type1]
[] vfio_fops_unl_ioctl+0x106/0x340 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #1 (&mm->mmap_lock){++++}-{3:3}:
[] __might_fault+0x56/0x90
[] _copy_to_user+0x23/0x70
[] xe_sriov_migration_data_read+0x17b/0x3f0 [xe]
[] xe_sriov_vfio_data_read+0x44/0x60 [xe]
[] xe_vfio_pci_save_read+0x55/0x80 [xe_vfio_pci]
[] vfs_read+0xc0/0x300
[] ksys_read+0x79/0xf0
[] __x64_sys_read+0x1b/0x30
[] x64_sys_call+0xcc4/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] -> #0 (&migf->lock){+.+.}-{3:3}:
[] __lock_acquire+0x1aff/0x3450
[] lock_acquire+0xde/0x280
[] __mutex_lock+0xba/0x1110
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
[] vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] other info that might help us debug this:
[] Chain exists of:
[]   &migf->lock --> &vdev->memory_lock --> &xe_vdev->state_mutex
[] Possible unsafe locking scenario:
[] CPU0 CPU1
[] ---- ----
[] lock(&xe_vdev->state_mutex);
[] lock(&vdev->memory_lock);
[] lock(&xe_vdev->state_mutex);
[] lock(&migf->lock);
[] *** DEADLOCK ***
[] 1 lock held by qemu-system-x86/4375:
[] #0: ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at: xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
[] stack backtrace:
[] CPU: 18 UID: 0 PID: 4375 Comm: qemu-system-x86 Tainted: G S U 6.18.0-rc3-xe+ #90 PREEMPT(voluntary)
[] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
[] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS SE5C6200.86B.0027.P18.2206090856 06/09/2022
[] Call Trace:
[] <TASK>
[] __dump_stack+0x19/0x30
[] dump_stack_lvl+0x66/0x90
[] dump_stack+0x10/0x14
[] print_circular_bug+0x2fd/0x310
[] check_noncircular+0x139/0x160
[] __lock_acquire+0x1aff/0x3450
[] ? vprintk_emit+0x3ec/0x560
[] ? dev_vprintk_emit+0x162/0x1c0
[] lock_acquire+0xde/0x280
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] __mutex_lock+0xba/0x1110
[] ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] mutex_lock_nested+0x1b/0x30
[] xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
[] vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
[] vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
[] __se_sys_ioctl+0x71/0xc0
[] ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] __x64_sys_ioctl+0x1d/0x30
[] x64_sys_call+0x6ac/0xe50
[] do_syscall_64+0xa1/0x560
[] ? __lock_acquire+0x73f/0x3450
[] ? __lock_acquire+0x73f/0x3450
[] ? __lock_acquire+0x73f/0x3450
[] ? lock_release+0x10b/0x340
[] ? wp_page_reuse+0x82/0x100
[] ? lock_release+0x10b/0x340
[] ? wp_page_reuse+0xcc/0x100
[] ? lock_acquire+0xde/0x280
[] ? count_memcg_event_mm+0x20/0x170
[] ? lock_is_held_type+0x8f/0x140
[] ? lock_release+0x10b/0x340
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x20/0x170
[] ? count_memcg_event_mm+0x114/0x170
[] ? handle_mm_fault+0x1300/0x13b0
[] ? handle_mm_fault+0x3c/0x13b0
[] ? lock_vma_under_rcu+0x8c/0x230
[] ? lock_release+0x10b/0x340
[] ? exc_page_fault+0x77/0xf0
[] ? irqentry_exit_to_user_mode+0x100/0x130
[] ? irqentry_exit+0x31/0x80
[] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[] RIP: 0033:0x70dff032eb1d
[] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[] RSP: 002b:00007ffcc0367ff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[] RAX: ffffffffffffffda RBX: 00005748e046d600 RCX: 000070dff032eb1d
[] RDX: 00007ffcc0368080 RSI: 0000000000003b75 RDI: 000000000000001d
[] RBP: 00007ffcc0368040 R08: 00000005748df663 R09: 0000000000000007
[] R10: 00005748df663060 R11: 0000000000000246 R12: 0000000000000001
[] R13: 0000000000000000 R14: 00005748e055f0b0 R15: 00007ffcc0368080
[] </TASK>
In short:
0: set_device_state
   xe_vdev->state_mutex : migf->lock
1: data_read
   migf->lock : mm->mmap_lock
2: vfio_pin_dma
   mm->mmap_lock : vdev->memory_lock
3: vfio_pci_ioctl_reset
   vdev->memory_lock : xe_vdev->state_mutex
In other words:
set_device_state takes xe_vdev->state_mutex and blocks on migf->lock,
data_read takes migf->lock and blocks on mm->mmap_lock,
vfio_pin_dma takes mm->mmap_lock and blocks on vdev->memory_lock, and
reset takes vdev->memory_lock and blocks on xe_vdev->state_mutex.
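Concretely, edge #0 comes from the state machine doing something like
this (sketch only - the call sites are from the splat above; the field
names saving_migf and disabled are illustrative, mirroring mlx5):

static void xe_vfio_pci_disable_fd(struct xe_vfio_pci_core_device *xe_vdev)
{
	struct xe_vfio_pci_migration_file *migf = xe_vdev->saving_migf;

	lockdep_assert_held(&xe_vdev->state_mutex);

	if (!migf)
		return;

	/* state_mutex -> migf->lock: edge #0 of the cycle */
	mutex_lock(&migf->lock);
	migf->disabled = true;
	mutex_unlock(&migf->lock);
}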
copy_to_user()/copy_from_user() doesn't have to be called under
state_mutex; it just has to be done under migf->lock.
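I.e. the read side only ever looks like this (sketch; the function
names are from the splat above, the exact signatures and the
migration-file fields are approximations):

static ssize_t xe_vfio_pci_save_read(struct file *filp, char __user *buf,
				     size_t len, loff_t *pos)
{
	struct xe_vfio_pci_migration_file *migf = filp->private_data;
	ssize_t ret;

	/*
	 * Only migf->lock is held across the user copy; state_mutex is
	 * not, so faulting buf in (mm->mmap_lock) under migf->lock is
	 * the only dependency this path creates (edge #1).
	 */
	mutex_lock(&migf->lock);
	ret = xe_sriov_vfio_data_read(migf->xe, migf->vfid, buf, len);
	mutex_unlock(&migf->lock);

	return ret;
}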
The deferred reset trick exists because migf->lock needs to be taken
under state_mutex as part of the reset_done callback, which completes
the chain and triggers the lockdep splat.
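For completeness, this is what the trick looks like on the reset_done
side (sketch modeled on the mlx5 flow linked above; the drvdata lookup
follows the struct layout quoted earlier):

static void xe_vfio_pci_reset_done(struct pci_dev *pdev)
{
	struct vfio_pci_core_device *core = dev_get_drvdata(&pdev->dev);
	struct xe_vfio_pci_core_device *xe_vdev =
		container_of(core, struct xe_vfio_pci_core_device,
			     core_device);

	/*
	 * reset_done can be reached with vdev->memory_lock held, so
	 * taking state_mutex unconditionally here would close the
	 * cycle. Record the reset and, if another context holds
	 * state_mutex, let it replay the reset in
	 * xe_vfio_pci_state_mutex_unlock().
	 */
	spin_lock(&xe_vdev->reset_lock);
	xe_vdev->deferred_reset = true;
	if (!mutex_trylock(&xe_vdev->state_mutex)) {
		spin_unlock(&xe_vdev->reset_lock);
		return;
	}
	spin_unlock(&xe_vdev->reset_lock);
	xe_vfio_pci_state_mutex_unlock(xe_vdev);
}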
To me, this looks like something generic that will impact any device
specific driver variant.
What am I missing?
I wonder if drivers that don't implement the deferred reset trick were
ever executed with lockdep enabled.
(BTW: looking at it in more depth again - I do need to revisit the
disable_fd flow on the xe-vfio-pci side, so expect small changes on
that front in the next revision.)
Thanks,
-Michał