linux-kernel - Re: [Bug] "possible deadlock in rtnl


Message-ID: <c9b62eaa-e05e-4958-bbf5-73b1e3c46b33@intel.com>
Date: Thu, 22 May 2025 16:05:05 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: John <john.cs.hey@...il.com>, "David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, "Paolo
 Abeni" <pabeni@...hat.com>, Joe Damato <jdamato@...tly.com>
CC: Simon Horman <horms@...nel.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [Bug] "possible deadlock in rtnl_newlink" in Linux kernel v6.13



On 5/21/2025 5:52 PM, John wrote:
> Dear Linux Kernel Maintainers,
> 
> I hope this message finds you well.
> 
> I am writing to report a potential vulnerability I encountered during
> testing of the Linux Kernel version v6.13.
> 
> Git Commit: ffd294d346d185b70e28b1a28abe367bbfe53c04 (tag: v6.13)
> 
> Bug Location: rtnl_newlink+0x86c/0x1dd0 net/core/rtnetlink.c:4011
> 
> Bug report: https://hastebin.com/share/ajavibofik.bash
> 
> Complete log: https://hastebin.com/share/derufumuxu.perl
> 
> Entire kernel config:  https://hastebin.com/share/lovayaqidu.ini
> 
> Root Cause Analysis:
> The deadlock warning is caused by a circular locking dependency
> between two subsystems:
> 
> Path A (CPU 0):
> Holds rtnl_mutex in rtnl_newlink() →
> Then calls e1000_close() →
> Triggers e1000_down_and_stop() →
> Calls __cancel_work_sync() →
> Tries to flush adapter->reset_task (→ needs work_completion lock)
> 
> Path B (CPU 1):
> Holds work_completion lock while running e1000_reset_task() →
> Then calls e1000_down() →
> Which tries to acquire rtnl_mutex
>
> These two execution paths result in a circular dependency:
> 

I guess this implies you can't call cancel_work_sync() while holding the
RTNL lock? Hmm. Or maybe it's because calling e1000_down() from
e1000_reset_task() is the problem.
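
To make the ordering problem in the two quoted paths concrete, here is a
small userspace sketch of the kind of check a lock-order validator does.
It is only a toy model, not how lockdep is implemented: the enum names,
record_acquire(), reachable() and reset_graph() are all made up for
illustration, and "work_completion" is treated as an ordinary lock class.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { RTNL_MUTEX, WORK_COMPLETION, NR_CLASSES };

/* edge[a][b] is true once some path took b while already holding a. */
static bool edge[NR_CLASSES][NR_CLASSES];

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(enum lock_class from, enum lock_class to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NR_CLASSES; next++)
        if (edge[from][next] && !seen[next] && reachable(next, to, seen))
            return true;
    return false;
}

/*
 * Record "acquire 'want' while holding 'held'".
 * Returns true if the new dependency closes a cycle, i.e. some earlier
 * path already established the opposite order (a possible deadlock).
 */
static bool record_acquire(enum lock_class held, enum lock_class want)
{
    bool seen[NR_CLASSES] = { false };

    assert(held < NR_CLASSES && want < NR_CLASSES);
    bool cycle = reachable(want, held, seen);
    edge[held][want] = true;
    return cycle;
}

/* Forget all recorded dependencies. */
static void reset_graph(void)
{
    memset(edge, 0, sizeof(edge));
}
```

Feeding it Path A, record_acquire(RTNL_MUTEX, WORK_COMPLETION), records
the first dependency and reports no cycle. Feeding it Path B afterwards,
record_acquire(WORK_COMPLETION, RTNL_MUTEX), reports a cycle, which is
the same inversion the splat complains about. Either order on its own
would be fine; it is the combination that is the problem.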

> CPU 0: rtnl_mutex → work_completion
> CPU 1: work_completion → rtnl_mutex
> 
> This violates lock ordering and can lead to a deadlock under contention.
> This bug represents a classic case of lock inversion between the
> networking core (rtnl_mutex) and a device driver (e1000 workqueue
> reset).
> It is a design-level concurrency flaw that can lead to deadlocks under
> stress or fuzzing workloads.
> 
> At present, I have not yet obtained a minimal reproducer for this
> issue. However, I am actively working on reproducing it, and I will
> promptly share any additional findings or a working reproducer as soon
> as it becomes available.
> 

This is likely a regression introduced by commit e400c7444d84 ("e1000:
Hold RTNL when e1000_down can be called").

@Joe, thoughts?


> Thank you very much for your time and attention to this matter. I
> truly appreciate the efforts of the Linux kernel community.
> 
> Best regards,
> John
>