Message-ID: <475ee9ae8cdca5ce86b708fe0ade7c9d@manjaro.org>
Date: Wed, 30 Jul 2025 21:50:25 +0200
From: Dragan Simic <dsimic@...jaro.org>
To: Robin Murphy <robin.murphy@....com>
Cc: Diederik de Haas <didi.debian@...ow.org>, Lee Jones <lee@...nel.org>,
Pavel Machek <pavel@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>, "David
S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub
Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
linux-leds@...r.kernel.org, netdev@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-rockchip@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: BUG: Circular locking dependency on netdev led trigger on NanoPi
R5S
Hello Robin and Diederik,

On 2025-07-25 20:12, Robin Murphy wrote:
> On 2025-07-25 6:48 pm, Diederik de Haas wrote:
>> I have a FriendlyELEC NanoPi R5S (with rk3568 SoC) and in commit
>> 1631cbdb8089 ("arm64: dts: rockchip: Improve LED config for NanoPi
>> R5S")
>>
>> I tried to improve its LED configuration and that included
>> ``linux,default-trigger = "netdev"``
>>
>> Problem: sometimes I got a 'hung task' error, which resulted in the
>> WAN port not coming up (that's the only one I use). Logging in via
>> serial didn't work either, so pulling the plug was the only remedy.
>>
>> Robin Murphy quickly identified that it likely had to do with led
>> triggers and removing those netdev triggers made the problem go
>> away[1].
>> To find out what actually caused it, I built a kernel with
>> PROVE_LOCKING and PRINTK_CALLER enabled, which, after adding a patch
>> that fixed an OOPS [2], showed the underlying problem:
>
> For the record, I think the actual deadlock condition Diederik's
> system hits in practice is a shorter cycle, wherein immediately after
> acquiring pernet_ops_rwsem, thread #0 then tries to take rtnl_mutex,
> which forms a straight inversion against thread #2 (which holds
> rtnl_mutex from devinet_ioctl()).

Thanks for the bug report and for the additional insights!

I've spent some time digging through the LED subsystem, which I'm
already somewhat familiar with, and I think I've narrowed down the
root cause of this deadlock.

I'll send a preliminary patch soon, once I've made sure that the root
cause is identified correctly, and I hope Diederik will be willing
to test it. If so, and if the patch turns out to be the cure, I'll
prepare and submit a proper patch, of course.

>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.16-rc7+unreleased-arm64-cknow #1 Not tainted
>> ------------------------------------------------------
>> modprobe/936 is trying to acquire lock:
>> ffffc943e0edc3b0 (pernet_ops_rwsem){++++}-{4:4}, at:
>> register_netdevice_notifier+0x38/0x148
>>
>> but task is already holding lock:
>> ffff0001f2762248 (&led_cdev->trigger_lock){+.+.}-{4:4}, at:
>> led_trigger_register+0x14c/0x1e0
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #3 (&led_cdev->trigger_lock){+.+.}-{4:4}:
>> lock_acquire+0x1cc/0x348
>> down_write+0x40/0xd8
>> led_trigger_set_default+0x5c/0x170
>> led_classdev_register_ext+0x340/0x488
>> __sdhci_add_host+0x190/0x368 [sdhci]
>> dwcmshc_probe+0x2b8/0x6b0 [sdhci_of_dwcmshc]
>> platform_probe+0x70/0xe8
>> really_probe+0xc8/0x3a0
>> __driver_probe_device+0x84/0x160
>> driver_probe_device+0x44/0x128
>> __device_attach_driver+0xc4/0x170
>> bus_for_each_drv+0x90/0xf8
>> __device_attach_async_helper+0xc0/0x120
>> async_run_entry_fn+0x40/0x180
>> process_one_work+0x23c/0x640
>> worker_thread+0x1b4/0x360
>> kthread+0x150/0x250
>> ret_from_fork+0x10/0x20
>>
>> -> #2 (triggers_list_lock){++++}-{4:4}:
>> lock_acquire+0x1cc/0x348
>> down_write+0x40/0xd8
>> led_trigger_register+0x58/0x1e0
>> phy_led_triggers_register+0xf4/0x258 [libphy]
>> phy_attach_direct+0x328/0x3a8 [libphy]
>> phylink_fwnode_phy_connect+0xb0/0x138 [phylink]
>> __stmmac_open+0xec/0x520 [stmmac]
>> stmmac_open+0x4c/0xe8 [stmmac]
>> __dev_open+0x13c/0x310
>> __dev_change_flags+0x1d4/0x260
>> netif_change_flags+0x2c/0x80
>> dev_change_flags+0x90/0xd0
>> devinet_ioctl+0x55c/0x730
>> inet_ioctl+0x1e4/0x200
>> sock_do_ioctl+0x6c/0x140
>> sock_ioctl+0x328/0x3c0
>> __arm64_sys_ioctl+0xb4/0x118
>> invoke_syscall+0x6c/0x100
>> el0_svc_common.constprop.0+0x48/0xf0
>> do_el0_svc+0x24/0x38
>> el0_svc+0x54/0x1e0
>> el0t_64_sync_handler+0x10c/0x140
>> el0t_64_sync+0x198/0x1a0
>>
>> -> #1 (rtnl_mutex){+.+.}-{4:4}:
>> lock_acquire+0x1cc/0x348
>> __mutex_lock+0xac/0x590
>> mutex_lock_nested+0x2c/0x40
>> rtnl_lock+0x24/0x38
>> register_netdevice_notifier+0x40/0x148
>> rtnetlink_init+0x40/0x68
>> netlink_proto_init+0x120/0x158
>> do_one_initcall+0x88/0x3b8
>> kernel_init_freeable+0x2d0/0x340
>> kernel_init+0x28/0x160
>> ret_from_fork+0x10/0x20
>>
>> -> #0 (pernet_ops_rwsem){++++}-{4:4}:
>> check_prev_add+0x114/0xcb8
>> __lock_acquire+0x12e8/0x15f0
>> lock_acquire+0x1cc/0x348
>> down_write+0x40/0xd8
>> register_netdevice_notifier+0x38/0x148
>> netdev_trig_activate+0x18c/0x1e8 [ledtrig_netdev]
>> led_trigger_set+0x1d4/0x328
>> led_trigger_register+0x194/0x1e0
>> netdev_led_trigger_init+0x20/0xff8 [ledtrig_netdev]
>> do_one_initcall+0x88/0x3b8
>> do_init_module+0x5c/0x270
>> load_module+0x1ed8/0x2608
>> init_module_from_file+0x94/0x100
>> idempotent_init_module+0x1e8/0x2f0
>> __arm64_sys_finit_module+0x70/0xe8
>> invoke_syscall+0x6c/0x100
>> el0_svc_common.constprop.0+0x48/0xf0
>> do_el0_svc+0x24/0x38
>> el0_svc+0x54/0x1e0
>> el0t_64_sync_handler+0x10c/0x140
>> el0t_64_sync+0x198/0x1a0
>>
>> other info that might help us debug this:
>>
>> Chain exists of:
>> pernet_ops_rwsem --> triggers_list_lock -->
>> &led_cdev->trigger_lock
>>
>> Possible unsafe locking scenario:
>>
>> CPU0 CPU1
>> ---- ----
>> lock(&led_cdev->trigger_lock);
>> lock(triggers_list_lock);
>> lock(&led_cdev->trigger_lock);
>> lock(pernet_ops_rwsem);
>>
>> *** DEADLOCK ***
>>
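
For anyone who finds the reverse-ordered chain hard to follow, here is
a minimal Python sketch that rebuilds the dependency edges listed in
the report and walks them to show the loop. The edges are read off
chain entries #3 to #0 as "lock already held -> lock acquired next".
The names DEPENDENCIES and find_cycle() are made up for illustration,
and this is only a toy model, not how lockdep is actually implemented.

```python
# Edges taken from the lockdep report: "held lock -> lock acquired next".
# DEPENDENCIES and find_cycle() are hypothetical names for illustration;
# this is a toy model and not how lockdep itself is implemented.
DEPENDENCIES = {
    "triggers_list_lock": ["trigger_lock"],   # chain entry #3
    "rtnl_mutex": ["triggers_list_lock"],     # chain entry #2
    "pernet_ops_rwsem": ["rtnl_mutex"],       # chain entry #1
    "trigger_lock": ["pernet_ops_rwsem"],     # chain entry #0 (the new edge)
}


def find_cycle(graph, start):
    """Return the locks forming a cycle through start, or None."""
    path = []
    visited = set()

    def visit(lock):
        path.append(lock)
        for nxt in graph.get(lock, []):
            if nxt == start:
                return True          # walked back to where we began
            if nxt not in visited:
                visited.add(nxt)
                if visit(nxt):
                    return True
        path.pop()                   # dead end, backtrack
        return False

    if visit(start):
        return path + [start]
    return None


cycle = find_cycle(DEPENDENCIES, "trigger_lock")
print(" --> ".join(cycle) if cycle else "no cycle")
```

Dropping the "trigger_lock" entry from the table, which corresponds to
the acquisition that modprobe attempts in the report, leaves the
remaining edges without a loop.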
>> 2 locks held by modprobe/936:
>> #0: ffffc943e0d2baa8 (leds_list_lock){++++}-{4:4}, at:
>> led_trigger_register+0x10c/0x1e0
>> #1: ffff0001f2762248 (&led_cdev->trigger_lock){+.+.}-{4:4}, at:
>> led_trigger_register+0x14c/0x1e0
>>
>> stack backtrace:
>> CPU: 0 UID: 0 PID: 936 Comm: modprobe Not tainted
>> 6.16-rc7+unreleased-arm64-cknow #1 PREEMPTLAZY Debian 6.16~rc7-2~exp1
>> Hardware name: FriendlyElec NanoPi R5S (DT)
>> Call trace:
>> show_stack+0x34/0xa0 (C)
>> dump_stack_lvl+0x70/0x98
>> dump_stack+0x18/0x24
>> print_circular_bug+0x230/0x280
>> check_noncircular+0x174/0x188
>> check_prev_add+0x114/0xcb8
>> __lock_acquire+0x12e8/0x15f0
>> lock_acquire+0x1cc/0x348
>> down_write+0x40/0xd8
>> register_netdevice_notifier+0x38/0x148
>> netdev_trig_activate+0x18c/0x1e8 [ledtrig_netdev]
>> led_trigger_set+0x1d4/0x328
>> led_trigger_register+0x194/0x1e0
>> netdev_led_trigger_init+0x20/0xff8 [ledtrig_netdev]
>> do_one_initcall+0x88/0x3b8
>> do_init_module+0x5c/0x270
>> load_module+0x1ed8/0x2608
>> init_module_from_file+0x94/0x100
>> idempotent_init_module+0x1e8/0x2f0
>> __arm64_sys_finit_module+0x70/0xe8
>> invoke_syscall+0x6c/0x100
>> el0_svc_common.constprop.0+0x48/0xf0
>> do_el0_svc+0x24/0x38
>> el0_svc+0x54/0x1e0
>> el0t_64_sync_handler+0x10c/0x140
>> el0t_64_sync+0x198/0x1a0
>> leds-gpio gpio-leds: bus: 'platform': really_probe: bound device
>> to driver leds-gpio
>>
>> The full serial log can be found at [3]. It is quite verbose, and the
>> boot took way longer than normal, as the following was added to the
>> cmdline:
>> ``dyndbg="file dd.c func really_probe +p" maxcpus=1``
>>
>> Feel free to ask for additional info and/or to run tests.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git/commit/?h=arm/fixes&id=912b1f2a796ec73530a709b11821cb0c249fb23e
>> [2]
>> https://lore.kernel.org/linux-rockchip/f81b88df-9959-4968-a60a-b7efd3d5ea24@arm.com/
>> [3]
>> https://paste.sr.ht/~diederik/142e92bfb29bbb58bca18a74cdffc5e0ba79081c