Message-ID: <20230823093846.7wzrhnqdk2wyqud2@Astras-Ubuntu>
Date: Wed, 23 Aug 2023 02:38:46 -0700
From: Ziqi Zhao <astrajoan@...oo.com>
To: Nikolay Aleksandrov <razor@...ckwall.org>
Cc: arnd@...db.de, bridge@...ts.linux-foundation.org, davem@...emloft.net,
	edumazet@...gle.com, f.fainelli@...il.com, ivan.orlov0322@...il.com,
	keescook@...omium.org, kuba@...nel.org, hkallweit1@...il.com,
	mudongliangabcd@...il.com, nikolay@...dia.com, pabeni@...hat.com,
	roopa@...dia.com, skhan@...uxfoundation.org,
	syzbot+881d65229ca4f9ae8c84@...kaller.appspotmail.com,
	vladimir.oltean@....com, linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org, syzkaller-bugs@...glegroups.com
Subject: Re: [PATCH] net: bridge: Fix refcnt issues in dev_ioctl

On Tue, Aug 22, 2023 at 01:40:45PM +0300, Nikolay Aleksandrov wrote:

> Thank you for testing, but we really need to understand what is going on and
> why the device isn't getting deleted for so long. Currently I don't have the
> time to debug it properly (I'll be able to next week at the earliest). We
> can't apply the patch based only on tests without understanding the
> underlying issue. I'd look into what
> the reproducer is doing exactly and also check the system state while the
> deadlock has happened. Also you can list the currently held locks (if
> CONFIG_LOCKDEP is enabled) via magic sysrq + d for example. See which
> process is holding them, what are their priorities and so on.
> Try to build some theory of how a deadlock might happen and then go
> about proving it. Does the 8021q module have the same problem? It uses
> similar code to set its hook.

Hi Nik,

Thank you so much for the instructions! I was able to obtain a decoded
stacktrace showing the reproducer behavior in my QEMU VM running kernel
6.5-rc4, in case that would give us more context for pinpointing the
problem. Here's a link to the output:

https://pastecat.io/?p=IlKZlflN9j2Z2mspjKe7

Basically, after running the reproducer (line 1854) for about 180
seconds, the unregister_netdevice warning appeared (line 1856).
About 50 seconds later, the kernel detected that some tasks had been
stalled for more than 143 seconds (line 1866), so it panicked on the
blocked tasks (line 2116). Before the panic, we did get to see all the
locks held in the system (line 2068), and the dump showed that many
processes created by the reproducer were contending for br_ioctl_mutex.
I'm now starting to wonder whether this is really a deadlock, or simply
some tasks being unable to grab the lock because so many processes
are trying to acquire it.
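
To put a number on the contention, the lock dump could be summarized
per lock name. Below is a minimal sketch, assuming the usual lockdep
dump layout ("N lock(s) held by comm/pid:" followed by
" #i: <addr> (<name>){...}" lines). The function name and the sample
text are made up for illustration and are not taken from the log above.

```python
import re
from collections import Counter

# Assumed layout of a lockdep "locks held" dump; real logs may differ.
TASK_RE = re.compile(r"\d+ locks? held by .+/\d+:")
LOCK_RE = re.compile(r"#\d+:\s+\S+\s+\((.+?)\)\{")


def count_tasks_per_lock(log_text):
    """Count how many tasks list each lock name in the dump."""
    counts = Counter()
    seen = None  # lock names already counted for the current task
    for line in log_text.splitlines():
        if TASK_RE.search(line):
            seen = set()  # a new task section starts here
            continue
        match = LOCK_RE.search(line)
        if match and seen is not None:
            name = match.group(1)
            if name not in seen:
                seen.add(name)
                counts[name] += 1
    return counts


# Hypothetical sample, shaped like a lockdep dump (not real output).
SAMPLE = """\
Showing all locks held in the system:
1 lock held by khungtaskd/28:
 #0: ffffffff8c9a5e40 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x0/0x0
2 locks held by repro/5001:
 #0: ffffffff8e0b1a28 (br_ioctl_mutex){+.+.}-{3:3}, at: br_ioctl_call+0x0/0x0
 #1: ffffffff8e0c2b68 (rtnl_mutex){+.+.}-{3:3}, at: br_ioctl_stub+0x0/0x0
1 lock held by repro/5002:
 #0: ffffffff8e0b1a28 (br_ioctl_mutex){+.+.}-{3:3}, at: br_ioctl_call+0x0/0x0
"""

if __name__ == "__main__":
    for name, tasks in count_tasks_per_lock(SAMPLE).most_common():
        print(f"{tasks:4d}  {name}")
```

A count like this only shows how many tasks are involved with a lock;
it does not by itself tell a deadlock apart from heavy contention, since
the dump may list a lock for a task that is still waiting on it.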

Let me know what you think about the situation shown in the above log,
and let's keep in touch for any future debugging. Thank you again for
guiding me through the problem!

Best regards,
Ziqi
