[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bad26563-33b1-47d2-a62e-723bbb77690b@lunn.ch>
Date: Mon, 8 Apr 2024 22:07:26 +0200
From: Andrew Lunn <andrew@...n.ch>
To: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>,
kurt@...utronix.de
Cc: netdev@...r.kernel.org, linux-leds@...r.kernel.org,
regressions@...ts.linux.dev
Subject: Re: 6.9-rc2: Deadlock on unbinding network device from a driver
(regression)
Hi Kurt
Heiner just fixed a similar problem with Realtek driver. You might
need the same fix here.
Andrew
On Mon, Apr 08, 2024 at 09:22:05PM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
>
> After updating to 6.9-rc2 I can no longer unbind device from the igc
> driver. "echo" into "unbind" file hangs, and via sysrq "w" I get this
> call trace:
>
> [ 84.553112] Call Trace:
> [ 84.553118] <TASK>
> [ 84.553123] __schedule+0x23b/0x5c0
> [ 84.553134] schedule+0x27/0xa0
> [ 84.553142] schedule_preempt_disabled+0x15/0x30
> [ 84.553152] __mutex_lock.constprop.0+0x34c/0x6a0
> [ 84.553165] unregister_netdevice_notifier+0x25/0xc0
> [ 84.553178] netdev_trig_deactivate+0x1e/0x60 [ledtrig_netdev]
> [ 84.553195] led_trigger_set+0x105/0x340
> [ 84.553206] led_classdev_unregister+0x4a/0x110
> [ 84.553219] release_nodes+0x3d/0xb0
> [ 84.553229] devres_release_all+0x8c/0xc0
> [ 84.553238] device_del+0x27a/0x3f0
> [ 84.553248] unregister_netdevice_many_notify+0x46a/0x6a0
> [ 84.553260] unregister_netdevice_queue+0xf0/0x130
> [ 84.553271] unregister_netdev+0x1c/0x30
> [ 84.553280] igc_remove+0xe3/0x1d0 [igc]
> [ 84.553298] pci_device_remove+0x3f/0xb0
> [ 84.553308] device_release_driver_internal+0x19f/0x200
> [ 84.553320] unbind_store+0xa1/0xb0
> [ 84.553329] kernfs_fop_write_iter+0x11f/0x200
> [ 84.553341] vfs_write+0x293/0x460
> [ 84.553351] ksys_write+0x6f/0xf0
> [ 84.553360] do_syscall_64+0x87/0x170
> [ 84.553368] ? syscall_exit_work+0xf3/0x120
> [ 84.553378] ? syscall_exit_to_user_mode+0x69/0x220
> [ 84.553389] ? do_syscall_64+0x96/0x170
> [ 84.553397] ? do_syscall_64+0x96/0x170
> [ 84.553404] ? do_syscall_64+0x96/0x170
> [ 84.553412] ? do_syscall_64+0x96/0x170
> [ 84.553420] ? __irq_exit_rcu+0x4b/0xb0
> [ 84.553429] entry_SYSCALL_64_after_hwframe+0x71/0x79
> [ 84.553439] RIP: 0033:0x7b46ae7c5ee4
> [ 84.553446] RSP: 002b:00007ffe580c2dd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> [ 84.553460] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007b46ae7c5ee4
> [ 84.553474] RDX: 000000000000000d RSI: 00006458ac50b4b0 RDI: 0000000000000001
> [ 84.553487] RBP: 00007ffe580c2e00 R08: 0000000000000073 R09: 0000000000000001
> [ 84.553500] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000d
> [ 84.553514] R13: 00006458ac50b4b0 R14: 00007b46ae8965c0 R15: 00007b46ae893f20
> [ 84.553528] </TASK>
>
> It worked fine on 6.8.4.
>
> Similar issue happens on few other systems, including one with Realtek
> RTL8111/8168/8411 device, so it may be not specific to the igc driver
> but some common API (LED trigger?). The issue does not affect a system
> with e1000e driver.
>
> Lockdep says:
>
> [ 18.589322] ======================================================
> [ 18.589329] WARNING: possible circular locking dependency detected
> [ 18.589335] 6.9.0-rc2-1.qubes.fc32.x86_64 #378 Not tainted
> [ 18.589340] ------------------------------------------------------
> [ 18.589347] prepare-suspend/1145 is trying to acquire lock:
> [ 18.589352] ffff897494bc37b8 (&led_cdev->trigger_lock){+.+.}-{3:3}, at: led_classdev_unregister+0x32/0x110
> [ 18.589367]
> [ 18.589367] but task is already holding lock:
> [ 18.589373] ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
> [ 18.589384]
> [ 18.589384] which lock already depends on the new lock.
> [ 18.589384]
> [ 18.589391]
> [ 18.589391] the existing dependency chain (in reverse order) is:
> [ 18.589399]
> [ 18.589399] -> #1 (rtnl_mutex){+.+.}-{3:3}:
> [ 18.589407] __mutex_lock+0xb2/0xbd0
> [ 18.589413] set_device_name+0x2d/0x140 [ledtrig_netdev]
> [ 18.589423] netdev_trig_activate+0x1a6/0x220 [ledtrig_netdev]
> [ 18.589432] led_trigger_set+0x20f/0x340
> [ 18.589438] led_trigger_register+0x16d/0x1a0
> [ 18.589443] do_one_initcall+0x6f/0x3d0
> [ 18.589451] do_init_module+0x60/0x240
> [ 18.589459] init_module_from_file+0x86/0xc0
> [ 18.589465] idempotent_init_module+0x126/0x2c0
> [ 18.589471] __x64_sys_finit_module+0x5a/0xb0
> [ 18.589477] do_syscall_64+0x96/0x190
> [ 18.589482] entry_SYSCALL_64_after_hwframe+0x71/0x79
> [ 18.589490]
> [ 18.589490] -> #0 (&led_cdev->trigger_lock){+.+.}-{3:3}:
> [ 18.589498] __lock_acquire+0x13e7/0x2180
> [ 18.589505] lock_acquire+0xd5/0x2f0
> [ 18.589510] down_write+0x2a/0xc0
> [ 18.589515] led_classdev_unregister+0x32/0x110
> [ 18.589522] devres_release_all+0xb5/0x110
> [ 18.589530] device_del+0x275/0x3f0
> [ 18.589535] unregister_netdevice_many_notify+0x5ba/0x870
> [ 18.589543] unregister_netdevice_queue+0xf3/0x130
> [ 18.589549] unregister_netdev+0x18/0x20
> [ 18.589555] igc_remove+0xe1/0x1c0 [igc]
> [ 18.589566] pci_device_remove+0x3b/0xb0
> [ 18.589574] device_release_driver_internal+0x1a5/0x210
> [ 18.589581] unbind_store+0x9d/0xb0
> [ 18.589587] kernfs_fop_write_iter+0x15b/0x210
> [ 18.589595] vfs_write+0x2bd/0x560
> [ 18.589601] ksys_write+0x71/0xf0
> [ 18.589608] do_syscall_64+0x96/0x190
> [ 18.589614] entry_SYSCALL_64_after_hwframe+0x71/0x79
> [ 18.589620]
> [ 18.589620] other info that might help us debug this:
> [ 18.589620]
> [ 18.589628] Possible unsafe locking scenario:
> [ 18.589628]
> [ 18.589635] CPU0 CPU1
> [ 18.589640] ---- ----
> [ 18.589645] lock(rtnl_mutex);
> [ 18.589650] lock(&led_cdev->trigger_lock);
> [ 18.589657] lock(rtnl_mutex);
> [ 18.589664] lock(&led_cdev->trigger_lock);
> [ 18.589670]
> [ 18.589670] *** DEADLOCK ***
> [ 18.589670]
> [ 18.589676] 4 locks held by prepare-suspend/1145:
> [ 18.589682] #0: ffff8974873a7420 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x71/0xf0
> [ 18.589693] #1: ffff897495886288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x114/0x210[ 18.589704] #2: ffff8974820991b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x39/0x210
> [ 18.589715] #3: ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
> [ 18.589726]
> [ 18.589726] stack backtrace:
> [ 18.589731] CPU: 1 PID: 1145 Comm: prepare-suspend Not tainted 6.9.0-rc2-1.qubes.fc32.x86_64 #378
> [ 18.589741] Hardware name: Xen HVM domU, BIOS 4.17.3 03/12/2024
> [ 18.589748] Call Trace:
> [ 18.589752] <TASK>
> [ 18.589755] dump_stack_lvl+0x73/0xb0
> [ 18.589761] check_noncircular+0x148/0x160
> [ 18.589766] ? stack_trace_save+0x4a/0x70
> [ 18.589773] __lock_acquire+0x13e7/0x2180
> [ 18.589780] lock_acquire+0xd5/0x2f0
> [ 18.589786] ? led_classdev_unregister+0x32/0x110
> [ 18.589793] down_write+0x2a/0xc0
> [ 18.589798] ? led_classdev_unregister+0x32/0x110
> [ 18.589804] led_classdev_unregister+0x32/0x110
> [ 18.589811] devres_release_all+0xb5/0x110
> [ 18.589816] device_del+0x275/0x3f0
> [ 18.589821] unregister_netdevice_many_notify+0x5ba/0x870
> [ 18.589829] unregister_netdevice_queue+0xf3/0x130
> [ 18.589835] unregister_netdev+0x18/0x20
> [ 18.589840] igc_remove+0xe1/0x1c0 [igc]
> [ 18.589850] pci_device_remove+0x3b/0xb0
> [ 18.589855] device_release_driver_internal+0x1a5/0x210
> [ 18.589861] unbind_store+0x9d/0xb0
> [ 18.589867] kernfs_fop_write_iter+0x15b/0x210
> [ 18.589874] vfs_write+0x2bd/0x560
> [ 18.589880] ksys_write+0x71/0xf0
> [ 18.589886] do_syscall_64+0x96/0x190
> [ 18.589891] ? find_held_lock+0x2b/0x80
> [ 18.589896] ? lock_release+0x143/0x2c0
> [ 18.589902] ? do_user_addr_fault+0x354/0x8a0
> [ 18.589909] ? exc_page_fault+0x126/0x260
> [ 18.589916] entry_SYSCALL_64_after_hwframe+0x71/0x79
> [ 18.589922] RIP: 0033:0x76426194fee4
> [ 18.589927] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 85 74 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
> [ 18.589946] RSP: 002b:00007ffe69a0ca98 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> [ 18.589955] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 000076426194fee4
> [ 18.589963] RDX: 000000000000000d RSI: 000058ae60024480 RDI: 0000000000000001
> [ 18.589971] RBP: 00007ffe69a0cac0 R08: 0000000000000000 R09: 0000000000000001
> [ 18.589979] R10: 0000000000000004 R11: 0000000000000202 R12: 000000000000000d
> [ 18.589987] R13: 000058ae60024480 R14: 0000764261a205c0 R15: 0000764261a1df20
> [ 18.589997] </TASK>
>
>
> This is happening in a HVM domain on Xen, with PCI passthrough of
> relevant devices, but I don't think it's related to the issue.
>
> There is some more details on
> https://github.com/QubesOS/qubes-issues/issues/9096.
>
>
> #regzbot introduced: v6.8.4..v6.9-rc2
>
> --
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
Powered by blists - more mailing lists