Message-ID: <CAHTA-uZvLg4aW7hMXMxkVwar7a3vL+yR=YOznW3Yresaq3Yd+A@mail.gmail.com>
Date: Fri, 13 Sep 2024 08:45:22 -0500
From: Mitchell Augustin <mitchell.augustin@...onical.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Jiri Pirko <jiri@...nulli.us>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Lorenzo Bianconi <lorenzo@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Jacob Martin <jacob.martin@...onical.com>, dann frazier <dann.frazier@...onical.com>
Subject: Re: Namespaced network devices not cleaned up properly after
execution of pmtu.sh kernel selftest
Hi Jakub,
Running `./pmtu.sh pmtu_ipv6_ipv6_exception` manually triggers only the
pmtu_ipv6_ipv6_exception sub-case, which takes about a second on my
machines, so you shouldn't need to run the entirety of pmtu.sh to hit
the bug. It won't trigger on the first attempt, but in my experience,
running it in that while loop triggers it reliably in under a minute.
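For anyone who wants the loop to stop by itself once the bug shows up, here is a rough sketch of how the reproducer could be wrapped. It is not part of pmtu.sh or of any posted patch. The helper names `has_netdev_leak` and `repro_loop`, the `run` argument, and the default limit of 100 iterations are all made up for illustration, and it assumes it is run as root from tools/testing/selftests/net.

```shell
#!/bin/sh
# Sketch only: loop one pmtu.sh sub-case until the kernel log shows the
# "unregister_netdevice: waiting for ..." message from the report.
# Assumption: run as root from tools/testing/selftests/net.

# Succeeds if the text on stdin contains the leak message.
has_netdev_leak() {
    grep -q 'unregister_netdevice: waiting for'
}

# Run the sub-case up to $1 times (hypothetical default: 100).
# Stale messages already in the kernel log would count as a hit,
# so start from a clean log if that matters.
repro_loop() {
    max="${1:-100}"
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # The sub-case's own exit status is ignored; only dmesg is checked.
        ./pmtu.sh pmtu_ipv6_ipv6_exception
        if dmesg | has_netdev_leak; then
            echo "leak message seen after $i run(s)"
            return 0
        fi
    done
    echo "no leak message after $max run(s)"
    return 1
}

# Only start looping when explicitly asked, e.g.: sh repro.sh run 200
if [ "${1:-}" = "run" ]; then
    repro_loop "${2:-100}"
fi
```

Saved under a hypothetical name such as repro.sh, it would be started with `sudo sh repro.sh run 200`. Checking dmesg after every iteration is a design choice that keeps the run count accurate, at the cost of rereading the log each time.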
> Somewhat tangentially but if you'd be willing I wouldn't mind if you
> were to send patches to break this test up upstream, too. It takes
> 1h23m to run with various debug kernel options enabled. If we split
> it into multiple smaller tests each running 10min or 20min we can
> then spawn multiple VMs and get the results faster.
This logical division of tests already exists in pmtu.sh if you pass a
sub-test name in as the first parameter like above, but if you think
there would be value in separating them out further or into different
files not all in pmtu.sh, I would be happy to help with that. Just let
me know.
Regardless, I will go ahead and work on a new regression test that
executes just our quick reproducer for this specific bug and will send
it to this list.
Thanks,
Mitchell Augustin
On Thu, Sep 12, 2024 at 9:13 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 11 Sep 2024 17:20:29 -0500 Mitchell Augustin wrote:
> > We recently identified a bug still impacting upstream, triggered
> > occasionally by one of the kernel selftests (net/pmtu.sh) that
> > sometimes causes the following behavior:
> > * One of this test's namespaced network devices does not get properly
> > cleaned up when the namespace is destroyed, evidenced by
> > `unregister_netdevice: waiting for veth_A-R1 to become free. Usage
> > count = 5` appearing in the dmesg output repeatedly
> > * Once we start to see the above `unregister_netdevice` message, an
> > un-cancelable hang will occur on subsequent attempts to run `modprobe
> > ip6_vti` or `rmmod ip6_vti`
>
> Thanks for the report! We have seen it in our CI as well, it happens
> maybe once a day. But as you say, on x86 it is quite hard to reproduce,
> and nothing obvious stood out as a culprit.
>
> > However, I can easily reproduce the issue on an Nvidia Grace/Hopper
> > machine (and other platforms with modern CPUs) with the performance
> > governor set by doing the following:
> > * Install/boot any affected kernel
> > * Clone the kernel tree just to get an older version of the test cases
> > without subtle timing changes that mask the issue (such as
> > https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/tree/?h=Ubuntu-6.8.0-39.39)
> > * cd tools/testing/selftests/net
> > * while true; do sudo ./pmtu.sh pmtu_ipv6_ipv6_exception; done
>
> That's exciting! Would you be able to try to cut down the test itself
> (it is quite long and has a ton of sub-cases) and figure out which sub-cases
> trigger this? And maybe with an even quicker repro we'll bisect or
> someone will correctly guess the fix?
>
> Somewhat tangentially but if you'd be willing I wouldn't mind if you
> were to send patches to break this test up upstream, too. It takes
> 1h23m to run with various debug kernel options enabled. If we split
> it into multiple smaller tests each running 10min or 20min we can
> then spawn multiple VMs and get the results faster.
--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering