Message-ID: <BN9PR11MB52768763573DF22AB978C8228CC3A@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Fri, 7 Nov 2025 03:10:33 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "Winiarski, Michal" <michal.winiarski@...el.com>
CC: Alex Williamson <alex@...zbot.org>, "De Marchi, Lucas"
	<lucas.demarchi@...el.com>, Thomas Hellström
	<thomas.hellstrom@...ux.intel.com>, "Vivi, Rodrigo" <rodrigo.vivi@...el.com>,
	Jason Gunthorpe <jgg@...pe.ca>, Yishai Hadas <yishaih@...dia.com>, "Shameer
 Kolothum" <skolothumtho@...dia.com>, "intel-xe@...ts.freedesktop.org"
	<intel-xe@...ts.freedesktop.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"Brost, Matthew" <matthew.brost@...el.com>, "Wajdeczko, Michal"
	<Michal.Wajdeczko@...el.com>, "dri-devel@...ts.freedesktop.org"
	<dri-devel@...ts.freedesktop.org>, Jani Nikula <jani.nikula@...ux.intel.com>,
	Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>, Tvrtko Ursulin
	<tursulin@...ulin.net>, David Airlie <airlied@...il.com>, Simona Vetter
	<simona@...ll.ch>, "Laguna, Lukasz" <lukasz.laguna@...el.com>, "Christoph
 Hellwig" <hch@...radead.org>, "Cabiddu, Giovanni"
	<giovanni.cabiddu@...el.com>, Brett Creeley <brett.creeley@....com>
Subject: RE: [PATCH v4 28/28] vfio/xe: Add device specific vfio_pci driver
 variant for Intel graphics

> From: Winiarski, Michal <michal.winiarski@...el.com>
> Sent: Thursday, November 6, 2025 6:56 PM
> 
> On Thu, Nov 06, 2025 at 09:20:36AM +0100, Tian, Kevin wrote:
> >
> > this deferred_reset logic is a mlx unique thing. See:
> >
> > https://lore.kernel.org/kvm/20240220132459.GM13330@nvidia.com/
> 
> Interesting, that doesn't match my observations.
> 
> [] ======================================================
> [] WARNING: possible circular locking dependency detected
> [] 6.18.0-rc3-xe+ #90 Tainted: G S   U
> [] ------------------------------------------------------
> [] qemu-system-x86/4375 is trying to acquire lock:
> [] ff1100015af3ec30 (&migf->lock){+.+.}-{3:3}, at:
> xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []
>    but task is already holding lock:
> [] ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at:
> xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
> []
>    which lock already depends on the new lock.
> 
> []
>    the existing dependency chain (in reverse order) is:
> []
>    -> #3 (&xe_vdev->state_mutex){+.+.}-{3:3}:
> []        __mutex_lock+0xba/0x1110
> []        mutex_lock_nested+0x1b/0x30
> []        xe_vfio_pci_reset_done+0x2b/0xc0 [xe_vfio_pci]
> []        pci_try_reset_function+0xd7/0x150
> []        vfio_pci_core_ioctl+0x7f1/0xf20 [vfio_pci_core]
> []        vfio_device_fops_unl_ioctl+0x163/0x310 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
>    -> #2 (&vdev->memory_lock){++++}-{3:3}:
> []        down_read+0x41/0x180
> []        vfio_pci_mmap_huge_fault+0xec/0x310 [vfio_pci_core]
> []        handle_mm_fault+0x8aa/0x13b0
> []        fixup_user_fault+0x124/0x280
> []        vaddr_get_pfns+0x1d2/0x420 [vfio_iommu_type1]
> []        vfio_pin_pages_remote+0x173/0x610 [vfio_iommu_type1]
> []        vfio_pin_map_dma+0xfd/0x340 [vfio_iommu_type1]
> []        vfio_iommu_type1_ioctl+0xfdf/0x1290 [vfio_iommu_type1]
> []        vfio_fops_unl_ioctl+0x106/0x340 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
>    -> #1 (&mm->mmap_lock){++++}-{3:3}:
> []        __might_fault+0x56/0x90
> []        _copy_to_user+0x23/0x70
> []        xe_sriov_migration_data_read+0x17b/0x3f0 [xe]
> []        xe_sriov_vfio_data_read+0x44/0x60 [xe]
> []        xe_vfio_pci_save_read+0x55/0x80 [xe_vfio_pci]
> []        vfs_read+0xc0/0x300
> []        ksys_read+0x79/0xf0
> []        __x64_sys_read+0x1b/0x30
> []        x64_sys_call+0xcc4/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
>    -> #0 (&migf->lock){+.+.}-{3:3}:
> []        __lock_acquire+0x1aff/0x3450
> []        lock_acquire+0xde/0x280
> []        __mutex_lock+0xba/0x1110
> []        mutex_lock_nested+0x1b/0x30
> []        xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []        vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
> []        vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
> []        __se_sys_ioctl+0x71/0xc0
> []        __x64_sys_ioctl+0x1d/0x30
> []        x64_sys_call+0x6ac/0xe50
> []        do_syscall_64+0xa1/0x560
> []        entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []
>    other info that might help us debug this:
> 
> [] Chain exists of:
>      &migf->lock --> &vdev->memory_lock --> &xe_vdev->state_mutex
> 
> []  Possible unsafe locking scenario:
> 
> []        CPU0                    CPU1
> []        ----                    ----
> []   lock(&xe_vdev->state_mutex);
> []                                lock(&vdev->memory_lock);
> []                                lock(&xe_vdev->state_mutex);
> []   lock(&migf->lock);
> []
>     *** DEADLOCK ***
> 
> [] 1 lock held by qemu-system-x86/4375:
> []  #0: ff1100014c541a58 (&xe_vdev->state_mutex){+.+.}-{3:3}, at:
> xe_vfio_pci_set_device_state+0x43/0x440 [xe_vfio_pci]
> []
>    stack backtrace:
> [] CPU: 18 UID: 0 PID: 4375 Comm: qemu-system-x86 Tainted: G S   U
> 6.18.0-rc3-xe+ #90 PREEMPT(voluntary)
> [] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
> [] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS
> SE5C6200.86B.0027.P18.2206090856 06/09/2022
> [] Call Trace:
> []  <TASK>
> []  __dump_stack+0x19/0x30
> []  dump_stack_lvl+0x66/0x90
> []  dump_stack+0x10/0x14
> []  print_circular_bug+0x2fd/0x310
> []  check_noncircular+0x139/0x160
> []  __lock_acquire+0x1aff/0x3450
> []  ? vprintk_emit+0x3ec/0x560
> []  ? dev_vprintk_emit+0x162/0x1c0
> []  lock_acquire+0xde/0x280
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  __mutex_lock+0xba/0x1110
> []  ? xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  mutex_lock_nested+0x1b/0x30
> []  xe_vfio_pci_set_device_state+0x22b/0x440 [xe_vfio_pci]
> []  vfio_ioctl_device_feature_mig_device_state+0x9c/0x1b0 [vfio]
> []  vfio_device_fops_unl_ioctl+0x289/0x310 [vfio]
> []  __se_sys_ioctl+0x71/0xc0
> []  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
> []  __x64_sys_ioctl+0x1d/0x30
> []  x64_sys_call+0x6ac/0xe50
> []  do_syscall_64+0xa1/0x560
> []  ? __lock_acquire+0x73f/0x3450
> []  ? __lock_acquire+0x73f/0x3450
> []  ? __lock_acquire+0x73f/0x3450
> []  ? lock_release+0x10b/0x340
> []  ? wp_page_reuse+0x82/0x100
> []  ? lock_release+0x10b/0x340
> []  ? wp_page_reuse+0xcc/0x100
> []  ? lock_acquire+0xde/0x280
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? lock_is_held_type+0x8f/0x140
> []  ? lock_release+0x10b/0x340
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x20/0x170
> []  ? count_memcg_event_mm+0x114/0x170
> []  ? handle_mm_fault+0x1300/0x13b0
> []  ? handle_mm_fault+0x3c/0x13b0
> []  ? lock_vma_under_rcu+0x8c/0x230
> []  ? lock_release+0x10b/0x340
> []  ? exc_page_fault+0x77/0xf0
> []  ? irqentry_exit_to_user_mode+0x100/0x130
> []  ? irqentry_exit+0x31/0x80
> []  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [] RIP: 0033:0x70dff032eb1d
> [] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00
> 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff
> 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [] RSP: 002b:00007ffcc0367ff0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [] RAX: ffffffffffffffda RBX: 00005748e046d600 RCX: 000070dff032eb1d
> [] RDX: 00007ffcc0368080 RSI: 0000000000003b75 RDI: 000000000000001d
> [] RBP: 00007ffcc0368040 R08: 00000005748df663 R09: 0000000000000007
> [] R10: 00005748df663060 R11: 0000000000000246 R12: 0000000000000001
> [] R13: 0000000000000000 R14: 00005748e055f0b0 R15: 00007ffcc0368080
> []  </TASK>
> 
> In short:
> 
> 0: set_device_state
> xe_vdev->state_mutex : migf->lock
> 1: data_read
> migf->lock : mm->mmap_lock
> 2: vfio_pin_dma
> mm->mmap_lock : vdev->memory_lock
> 3: vfio_pci_ioctl_reset
> vdev->memory_lock : xe_vdev->state_mutex

Oh, that's a good spot!

The previous deadlock required 3 parties, due to copy_from/to_user()
being called under state_mutex:

0: set_device_state
vdev->state_mutex : mm->mmap_lock
2: vfio_pin_dma
mm->mmap_lock : vdev->memory_lock
3: vfio_pci_ioctl_reset
vdev->memory_lock : vdev->state_mutex

Now with migf->lock and the additional data_read path it becomes
a 4-party game, and it looks like a common pattern.
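
For context, a minimal sketch (illustrative names only, not the actual
xe or mlx5 code) of where the migf->lock -> mm->mmap_lock link comes
from: the migration read handler calls copy_to_user() while holding
migf->lock, and a fault on the user buffer takes mmap_lock underneath:

/* Illustrative migration file; not the real xe_vfio_pci structure. */
struct example_migf {
	struct mutex lock;
	void *data;
	size_t total;
};

static ssize_t example_save_read(struct file *filp, char __user *buf,
				 size_t len, loff_t *pos)
{
	struct example_migf *migf = filp->private_data;
	ssize_t done;

	mutex_lock(&migf->lock);
	/* bounds checks omitted for brevity */

	/*
	 * copy_to_user() may fault on the user buffer and take
	 * mm->mmap_lock, so lockdep records migf->lock -> mmap_lock.
	 * Together with mmap_lock -> memory_lock (mmap fault path) and
	 * memory_lock -> state_mutex (reset path), taking migf->lock
	 * under state_mutex closes the cycle.
	 */
	done = copy_to_user(buf, migf->data + *pos, len) ? -EFAULT : len;
	if (done > 0)
		*pos += done;
	mutex_unlock(&migf->lock);

	return done;
}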

> 
> In other words:
> set_device_state takes xe_vdev->state_mutex and blocks on migf->lock,
> data_read takes migf->lock and blocks on mm->mmap_lock
> vfio_pin_dma takes mm->mmap_lock and blocks on vdev->memory_lock
> reset takes vdev->memory_lock and blocks on xe_vdev->state_mutex
> 
> copy_to_user/copy_from_user doesn't have to be called under state_mutex,
> it just needs to be taken under migf->lock.
> The deferred reset trick exists because migf->lock needs to be taken
> under state_mutex as part of reset_done callback, which completes the
> chain and triggers the lockdep splat.

This chain doesn't even reach migf->lock in the reset path; it's already
triggered when acquiring state_mutex.

> 
> To me, it looks like something generic, that will have impact on any
> device specific driver variant.
> What am I missing?
> 
> I wonder if drivers that don't implement the deferred reset trick were
> ever executed with lockdep enabled.
> 

@Jason, @Yishai, @Shameer, @Giovanni, @Brett:

Sounds like the right thing is to pull the deferred reset trick back into
every driver. Anything overlooked?
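
For reference, a rough sketch of that trick as I read it from mlx5vf
(field and helper names below are illustrative, not a drop-in for
xe_vfio_pci): reset_done() never sleeps on state_mutex; it sets a flag
under a spinlock and only applies the reset if it can grab state_mutex
uncontended, otherwise the current state_mutex holder replays the reset
in the common unlock helper.

struct example_vdev {
	struct mutex state_mutex;
	spinlock_t reset_lock;	/* protects deferred_reset */
	bool deferred_reset;
};

/* Every path that took state_mutex drops it through this helper. */
static void example_state_mutex_unlock(struct example_vdev *vdev)
{
again:
	spin_lock(&vdev->reset_lock);
	if (vdev->deferred_reset) {
		vdev->deferred_reset = false;
		spin_unlock(&vdev->reset_lock);
		example_reset_migration_state(vdev);	/* illustrative */
		goto again;
	}
	mutex_unlock(&vdev->state_mutex);
	spin_unlock(&vdev->reset_lock);
}

static void example_reset_done(struct pci_dev *pdev)
{
	struct example_vdev *vdev = example_from_pdev(pdev);	/* illustrative */

	/*
	 * May be called with vdev->memory_lock (and other locks) held,
	 * so don't sleep on state_mutex here. Mark the reset as
	 * pending; whoever holds state_mutex will apply it on unlock.
	 */
	spin_lock(&vdev->reset_lock);
	vdev->deferred_reset = true;
	if (!mutex_trylock(&vdev->state_mutex)) {
		spin_unlock(&vdev->reset_lock);
		return;
	}
	spin_unlock(&vdev->reset_lock);
	example_state_mutex_unlock(vdev);
}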
