Message-ID: <20240912191306.0cf81ce3@kernel.org>
Date: Thu, 12 Sep 2024 19:13:06 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Mitchell Augustin <mitchell.augustin@...onical.com>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
 <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Jiri Pirko
 <jiri@...nulli.us>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 Lorenzo Bianconi <lorenzo@...nel.org>, Daniel Borkmann
 <daniel@...earbox.net>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, Jacob Martin <jacob.martin@...onical.com>,
 dann frazier <dann.frazier@...onical.com>
Subject: Re: Namespaced network devices not cleaned up properly after
 execution of pmtu.sh kernel selftest

On Wed, 11 Sep 2024 17:20:29 -0500 Mitchell Augustin wrote:
> We recently identified a bug still impacting upstream, triggered
> occasionally by one of the kernel selftests (net/pmtu.sh) that
> sometimes causes the following behavior:
> * One of this test's namespaced network devices does not get properly
> cleaned up when the namespace is destroyed, evidenced by
> `unregister_netdevice: waiting for veth_A-R1 to become free. Usage
> count = 5` appearing in the dmesg output repeatedly
> * Once we start to see the above `unregister_netdevice` message, an
> un-cancelable hang will occur on subsequent attempts to run `modprobe
> ip6_vti` or `rmmod ip6_vti`

Thanks for the report! We have seen it in our CI as well; it happens
maybe once a day. But as you say, on x86 it is quite hard to reproduce,
and nothing obvious stood out as a culprit.

> However, I can easily reproduce the issue on an Nvidia Grace/Hopper
> machine (and other platforms with modern CPUs) with the performance
> governor set by doing the following:
> * Install/boot any affected kernel
> * Clone the kernel tree just to get an older version of the test cases
> without subtle timing changes that mask the issue (such as
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/tree/?h=Ubuntu-6.8.0-39.39)
> * cd tools/testing/selftests/net
> * while true; do sudo ./pmtu.sh pmtu_ipv6_ipv6_exception; done
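
An untested transcription of that into a single loop which stops as soon
as the leak shows up, so the machine is left in the broken state for
inspection instead of hanging later on modprobe/rmmod (the dmesg pattern
is taken from the report above):

  cd tools/testing/selftests/net
  while true; do
      sudo ./pmtu.sh pmtu_ipv6_ipv6_exception
      # Stop once the leaked-refcount message from the report shows up.
      if sudo dmesg | grep -q 'unregister_netdevice: waiting for veth_A-R1'; then
          echo "refcount leak reproduced"
          break
      fi
  done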

That's exciting! Would you be able to try to cut down the test itself
(it's quite long and has a ton of sub-cases) and figure out which
sub-cases trigger this? Maybe with an even quicker repro we can bisect,
or someone will correctly guess the fix?
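
If it helps, an (untested) sweep along those lines, reusing the loop
above. The only real sub-case name below is the one you already
isolated; the rest would need to be filled in from pmtu.sh, and once
the leak happens the machine needs a reboot before trying the next
candidate, since the message keeps repeating:

  cd tools/testing/selftests/net
  for t in pmtu_ipv6_ipv6_exception; do  # placeholder list: add the other sub-case names
      for i in $(seq 1 50); do
          sudo ./pmtu.sh "$t"
          # Stop at the first sub-case that leaves a leaked refcount behind.
          if sudo dmesg | grep -q 'unregister_netdevice: waiting for'; then
              echo "sub-case $t leaked a refcount after run $i"
              exit 1
          fi
      done
  done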

Somewhat tangentially, but if you'd be willing, I wouldn't mind if you
were to send patches to break this test up upstream, too. It takes
1h23m to run with various debug kernel options enabled. If we split
it into multiple smaller tests, each running 10 or 20 minutes, we
could spawn multiple VMs and get the results faster.
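
Just to sketch what I mean (the script name and the slice of sub-cases
are made up; each slice would only run a subset of cases via the
single-case invocation pmtu.sh already supports):

  #!/bin/sh
  # pmtu_exceptions.sh -- hypothetical wrapper that runs only one slice
  # of the pmtu.sh sub-cases, so CI can run the slices in parallel VMs
  # instead of one long job.
  cd "$(dirname "$0")"
  for t in pmtu_ipv6_ipv6_exception; do  # placeholder slice
      ./pmtu.sh "$t" || exit 1
  done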
