lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 17 Nov 2022 15:58:23 +0000
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     linux-kernel@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>,
        David Ahern <dsahern@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Jakub Kicinski <kuba@...nel.org>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Jiri Pirko <jiri@...nulli.us>, netdev@...r.kernel.org,
        Paolo Abeni <pabeni@...hat.com>, rcu@...r.kernel.org,
        rostedt@...dmis.org, paulmck@...nel.org, fweisbec@...il.com,
        jiejiang@...gle.com
Subject: Re: [PATCH rcu/dev 3/3] net: Use call_rcu_flush() for dst_destroy_rcu

Hello Eric,

On Wed, Nov 16, 2022 at 07:44:41PM -0800, Eric Dumazet wrote:
> On Wed, Nov 16, 2022 at 7:16 PM Joel Fernandes (Google)
> <joel@...lfernandes.org> wrote:
> >
> > In a networking test on ChromeOS, we find that using the new CONFIG_RCU_LAZY
> > causes a networking test to fail in the teardown phase.
> >
> > The failure happens during: ip netns del <name>
> 
> And ? What happens then next ?

The test is doing the 'ip netns del <name>' and then polling for the
disappearance of a network interface name for upto 5 seconds. I believe it is
using netlink to get a table of interfaces. That polling is timing out.

Here is some more details from the test's owner (copy pasting from another
bug report):
In the cleanup, we remove the netns, and thus will cause the veth pair being
removed automatically, so we use a poll to check that if the veth in the root
netns still exists to know whether the cleanup is done.

Here is a public link to the code that is failing (its in golang):
https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/network/virtualnet/env/env.go;drc=6c2841d6cc3eadd23e07912ec331943ee33d7de8;l=161

Here is a public link to the line of code in the actual test leading up to the above
path (this is the test that is run:
network.RoutingFallthrough.ipv4_only_primary) :
https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/bundles/cros/network/routing_fallthrough.go;drc=8fbf2c53960bc8917a6a01fda5405cad7c17201e;l=52

> > Using ftrace, I found the callbacks it was queuing which this series fixes. Use
> > call_rcu_flush() to revert to the old behavior. With that, the test passes.
> 
> What is this test about ? What barrier was used to make it not flaky ?

I provided the links above, let me know if you have any questions.

> Was it depending on some undocumented RCU behavior ?

This is a new RCU feature posted here for significant power-savings on
battery-powered devices:
https://lore.kernel.org/rcu/20221017140726.GG5600@paulmck-ThinkPad-P17-Gen-1/T/#m7a54809b8903b41538850194d67eb34f203c752a

There is also an LPC presentation about the same, I can dig the link if you
are interested.

> Maybe adding a sysctl to force the flush would be better for functional tests ?
> 
> I would rather change the test(s), than adding call_rcu_flush(),
> adding merge conflicts to future backports.

I am not too sure about that, I think a user might expect the network
interface to disappear from the networking tables quickly enough without
dealing with barriers or kernel iternals. However, I added the authors of the
test to this email in the hopes he can provide is point of views as well.

The general approach we are taking with this sort of thing is to use
call_rcu_flush() which is basically the same as call_rcu() for systems with
CALL_RCU_LAZY=n. You can see some examples of that in the patch series link
above. Just to note, CALL_RCU_LAZY depends on CONFIG_RCU_NOCB_CPU so its only
Android and ChromeOS that are using it. I am adding Jie to share any input,
he is from the networking team and knows this test well.

thanks,

 - Joel

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ