lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sat, 19 Aug 2023 15:50:48 -0700
From: Ziqi Zhao <astrajoan@...oo.com>
To: Nikolay Aleksandrov <razor@...ckwall.org>
Cc: arnd@...db.de, bridge@...ts.linux-foundation.org, davem@...emloft.net,
	edumazet@...gle.com, f.fainelli@...il.com, ivan.orlov0322@...il.com,
	keescook@...omium.org, kuba@...nel.org, hkallweit1@...il.com,
	mudongliangabcd@...il.com, nikolay@...dia.com, pabeni@...hat.com,
	roopa@...dia.com, skhan@...uxfoundation.org,
	syzbot+881d65229ca4f9ae8c84@...kaller.appspotmail.com,
	vladimir.oltean@....com, linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org, syzkaller-bugs@...glegroups.com
Subject: Re: [PATCH] net: bridge: Fix refcnt issues in dev_ioctl

On Sat, Aug 19, 2023 at 12:25:15PM +0300, Nikolay Aleksandrov wrote:
Hi Nik,

Thank you so much for reviewing the patch and getting back to me!

> IIRC there was no bug, it was a false-positive. The reference is held a bit
> longer but then released, so the device is deleted later.

> If you reproduced it, is the device later removed or is it really stuck?

I ran the reproducer again without the patch and it seems you are
correct. It was trying to create a very short-lived bridge, then delete
it immediately in the next call. The device in question "wpan4" never
showed up when I polled with `ip link` in the VM, so I'd say it did not
get stuck.

> How would it leak a reference, could you elaborate?
> The reference is always taken and always released after the call.

This was where I got a bit confused too. The system had a timeout of
140 seconds for the unregister_netdevice check. If the bridge in
question was created and deleted repeatedly, the warning would indeed
not be an actual reference leak. But how could its reference show up
after 140 seconds if the bridge's creation and deletion were all within
a couple of milliseconds?

So I let the system run for a bit longer with the reproducer, and after
~200 seconds, the kernel crashed and complained that some tasks had
been waiting for too long (more than 143 seconds) trying to get hold of
the br_ioctl_mutex. This was also quite strange to me, since on the
surface it definitely looked like a deadlock, but the strict locking
order as I described previously should prevent any deadlocks from
happening.

Anyways, I decided to test switching up the lock order, since both the
refcnt warning and the task stall seemed closely related to the above
mentioned locks. When I ran the reproducer again after the patch, both
the warning and the stall issue went away. So I guess the patch is
still relevant in preventing bugs in some extreme cases -- although the
scenario created by the reproducer would probably never happen in real
usages?

Please let me know whether you have any thoughts on how the above
issues were triggered, and what other information I could gather to
further demystify this bug. Thank you again for your help!

Best regards,
Ziqi

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ