Message-ID: <6817efe1-f2c2-4686-bdf1-fca11f066e3a@arm.com>
Date: Fri, 25 Jul 2025 19:12:50 +0100
From: Robin Murphy <robin.murphy@....com>
To: Diederik de Haas <didi.debian@...ow.org>, Lee Jones <lee@...nel.org>,
Pavel Machek <pavel@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Cc: linux-leds@...r.kernel.org, netdev@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-rockchip@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: BUG: Circular locking dependency on netdev led trigger on NanoPi R5S
On 2025-07-25 6:48 pm, Diederik de Haas wrote:
> Hi,
>
> I have a FriendlyELEC NanoPi R5S (with rk3568 SoC) and in commit
> 1631cbdb8089 ("arm64: dts: rockchip: Improve LED config for NanoPi R5S")
>
> I tried to improve its LED configuration and that included
> ``linux,default-trigger = "netdev"``
>
> Problem: sometimes I got a 'hung task' error which resulted in the WAN
> port not coming up (that's the only one I use), and logging in via
> serial also didn't work, so pulling the plug was the only remedy.
>
> Robin Murphy quickly identified that it likely had to do with led
> triggers and removing those netdev triggers made the problem go away[1].
> To find out what actually caused it, I built a kernel with PROVE_LOCKING
> and PRINTK_CALLER enabled, which, after adding a patch that fixed an
> OOPS [2], showed the underlying problem:
For the record, I think the actual deadlock condition Diederik's system
hits in practice is a shorter cycle, wherein immediately after acquiring
pernet_ops_rwsem, thread #0 then tries to take rtnl_mutex, which forms a
straight inversion against thread #2 (which holds rtnl_mutex from
devinet_ioctl()).
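
To make the ordering problem easier to see, here is a rough userspace sketch (a toy model only, not how lockdep itself is implemented). It takes the four "held lock, then wanted lock" edges as I read them from dependency entries #0 to #3 in the report quoted below, and walks the resulting graph looking for a cycle. The function names build_graph() and find_cycle() are made up for illustration.

```python
# Toy lock-order graph. Each edge is (lock already held, lock then acquired),
# read off entries #0..#3 of the quoted lockdep report. Illustration only.
EDGES = [
    ("&led_cdev->trigger_lock", "pernet_ops_rwsem"),    # #0: modprobe, netdev_trig_activate()
    ("pernet_ops_rwsem", "rtnl_mutex"),                 # #1: register_netdevice_notifier()
    ("rtnl_mutex", "triggers_list_lock"),               # #2: devinet_ioctl() -> led_trigger_register()
    ("triggers_list_lock", "&led_cdev->trigger_lock"),  # #3: led_trigger_set_default()
]


def build_graph(edges):
    """Turn a list of (held, wanted) pairs into an adjacency dict."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])
    return graph


def find_cycle(graph):
    """Return one cycle as a list of lock names (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cycle = find_cycle(build_graph(EDGES))
print(" --> ".join(cycle) if cycle else "no cycle")
```

Dropping any one of the four edges (for example, not registering the netdevice notifier while trigger_lock is held) leaves the graph acyclic in this model, which is the property any real fix would need to restore.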
Thanks,
Robin.
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.16-rc7+unreleased-arm64-cknow #1 Not tainted
> ------------------------------------------------------
> modprobe/936 is trying to acquire lock:
> ffffc943e0edc3b0 (pernet_ops_rwsem){++++}-{4:4}, at: register_netdevice_notifier+0x38/0x148
>
> but task is already holding lock:
> ffff0001f2762248 (&led_cdev->trigger_lock){+.+.}-{4:4}, at: led_trigger_register+0x14c/0x1e0
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&led_cdev->trigger_lock){+.+.}-{4:4}:
> lock_acquire+0x1cc/0x348
> down_write+0x40/0xd8
> led_trigger_set_default+0x5c/0x170
> led_classdev_register_ext+0x340/0x488
> __sdhci_add_host+0x190/0x368 [sdhci]
> dwcmshc_probe+0x2b8/0x6b0 [sdhci_of_dwcmshc]
> platform_probe+0x70/0xe8
> really_probe+0xc8/0x3a0
> __driver_probe_device+0x84/0x160
> driver_probe_device+0x44/0x128
> __device_attach_driver+0xc4/0x170
> bus_for_each_drv+0x90/0xf8
> __device_attach_async_helper+0xc0/0x120
> async_run_entry_fn+0x40/0x180
> process_one_work+0x23c/0x640
> worker_thread+0x1b4/0x360
> kthread+0x150/0x250
> ret_from_fork+0x10/0x20
>
> -> #2 (triggers_list_lock){++++}-{4:4}:
> lock_acquire+0x1cc/0x348
> down_write+0x40/0xd8
> led_trigger_register+0x58/0x1e0
> phy_led_triggers_register+0xf4/0x258 [libphy]
> phy_attach_direct+0x328/0x3a8 [libphy]
> phylink_fwnode_phy_connect+0xb0/0x138 [phylink]
> __stmmac_open+0xec/0x520 [stmmac]
> stmmac_open+0x4c/0xe8 [stmmac]
> __dev_open+0x13c/0x310
> __dev_change_flags+0x1d4/0x260
> netif_change_flags+0x2c/0x80
> dev_change_flags+0x90/0xd0
> devinet_ioctl+0x55c/0x730
> inet_ioctl+0x1e4/0x200
> sock_do_ioctl+0x6c/0x140
> sock_ioctl+0x328/0x3c0
> __arm64_sys_ioctl+0xb4/0x118
> invoke_syscall+0x6c/0x100
> el0_svc_common.constprop.0+0x48/0xf0
> do_el0_svc+0x24/0x38
> el0_svc+0x54/0x1e0
> el0t_64_sync_handler+0x10c/0x140
> el0t_64_sync+0x198/0x1a0
>
> -> #1 (rtnl_mutex){+.+.}-{4:4}:
> lock_acquire+0x1cc/0x348
> __mutex_lock+0xac/0x590
> mutex_lock_nested+0x2c/0x40
> rtnl_lock+0x24/0x38
> register_netdevice_notifier+0x40/0x148
> rtnetlink_init+0x40/0x68
> netlink_proto_init+0x120/0x158
> do_one_initcall+0x88/0x3b8
> kernel_init_freeable+0x2d0/0x340
> kernel_init+0x28/0x160
> ret_from_fork+0x10/0x20
>
> -> #0 (pernet_ops_rwsem){++++}-{4:4}:
> check_prev_add+0x114/0xcb8
> __lock_acquire+0x12e8/0x15f0
> lock_acquire+0x1cc/0x348
> down_write+0x40/0xd8
> register_netdevice_notifier+0x38/0x148
> netdev_trig_activate+0x18c/0x1e8 [ledtrig_netdev]
> led_trigger_set+0x1d4/0x328
> led_trigger_register+0x194/0x1e0
> netdev_led_trigger_init+0x20/0xff8 [ledtrig_netdev]
> do_one_initcall+0x88/0x3b8
> do_init_module+0x5c/0x270
> load_module+0x1ed8/0x2608
> init_module_from_file+0x94/0x100
> idempotent_init_module+0x1e8/0x2f0
> __arm64_sys_finit_module+0x70/0xe8
> invoke_syscall+0x6c/0x100
> el0_svc_common.constprop.0+0x48/0xf0
> do_el0_svc+0x24/0x38
> el0_svc+0x54/0x1e0
> el0t_64_sync_handler+0x10c/0x140
> el0t_64_sync+0x198/0x1a0
>
> other info that might help us debug this:
>
> Chain exists of:
> pernet_ops_rwsem --> triggers_list_lock --> &led_cdev->trigger_lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&led_cdev->trigger_lock);
> lock(triggers_list_lock);
> lock(&led_cdev->trigger_lock);
> lock(pernet_ops_rwsem);
>
> *** DEADLOCK ***
>
> 2 locks held by modprobe/936:
> #0: ffffc943e0d2baa8 (leds_list_lock){++++}-{4:4}, at: led_trigger_register+0x10c/0x1e0
> #1: ffff0001f2762248 (&led_cdev->trigger_lock){+.+.}-{4:4}, at: led_trigger_register+0x14c/0x1e0
>
> stack backtrace:
> CPU: 0 UID: 0 PID: 936 Comm: modprobe Not tainted 6.16-rc7+unreleased-arm64-cknow #1 PREEMPTLAZY Debian 6.16~rc7-2~exp1
> Hardware name: FriendlyElec NanoPi R5S (DT)
> Call trace:
> show_stack+0x34/0xa0 (C)
> dump_stack_lvl+0x70/0x98
> dump_stack+0x18/0x24
> print_circular_bug+0x230/0x280
> check_noncircular+0x174/0x188
> check_prev_add+0x114/0xcb8
> __lock_acquire+0x12e8/0x15f0
> lock_acquire+0x1cc/0x348
> down_write+0x40/0xd8
> register_netdevice_notifier+0x38/0x148
> netdev_trig_activate+0x18c/0x1e8 [ledtrig_netdev]
> led_trigger_set+0x1d4/0x328
> led_trigger_register+0x194/0x1e0
> netdev_led_trigger_init+0x20/0xff8 [ledtrig_netdev]
> do_one_initcall+0x88/0x3b8
> do_init_module+0x5c/0x270
> load_module+0x1ed8/0x2608
> init_module_from_file+0x94/0x100
> idempotent_init_module+0x1e8/0x2f0
> __arm64_sys_finit_module+0x70/0xe8
> invoke_syscall+0x6c/0x100
> el0_svc_common.constprop.0+0x48/0xf0
> do_el0_svc+0x24/0x38
> el0_svc+0x54/0x1e0
> el0t_64_sync_handler+0x10c/0x140
> el0t_64_sync+0x198/0x1a0
> leds-gpio gpio-leds: bus: 'platform': really_probe: bound device to driver leds-gpio
>
> Full serial log can be found at [3]. It is quite verbose, and the boot
> took way longer than normal because the following was added to cmdline:
> ``dyndbg="file dd.c func really_probe +p" maxcpus=1``
>
> Feel free to ask for additional info and/or to run tests.
>
> Cheers,
> Diederik
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git/commit/?h=arm/fixes&id=912b1f2a796ec73530a709b11821cb0c249fb23e
> [2] https://lore.kernel.org/linux-rockchip/f81b88df-9959-4968-a60a-b7efd3d5ea24@arm.com/
> [3] https://paste.sr.ht/~diederik/142e92bfb29bbb58bca18a74cdffc5e0ba79081c