Message-Id: <20190327033344.GW4102@linux.ibm.com>
Date: Tue, 26 Mar 2019 20:33:44 -0700
From: "Paul E. McKenney" <paulmck@...ux.ibm.com>
To: Dmitry Safonov <dima@...sta.com>
Cc: David Ahern <dsahern@...il.com>, linux-kernel@...r.kernel.org,
Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
Alexey Kuznetsov <kuznet@....inr.ac.ru>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
Ido Schimmel <idosch@...lanox.com>, netdev@...r.kernel.org
Subject: Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
On Tue, Mar 26, 2019 at 11:14:43PM +0000, Dmitry Safonov wrote:
> On 3/26/19 3:39 PM, David Ahern wrote:
> > On 3/26/19 9:30 AM, Dmitry Safonov wrote:
> >> The fib trie has a hard-coded sync_pages limit that forces a
> >> synchronize_rcu() call. The limit is 128 pages, i.e. 512KB in the
> >> common case of 4KB pages.
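For reference, the hard-coded check lives in tnode_free() in
net/ipv4/fib_trie.c; the snippet below is a paraphrase of the pre-patch
logic rather than the exact kernel source:

/* Paraphrase (not verbatim): each freed tnode is queued for
 * RCU-deferred freeing and its size added to tnode_free_size; once the
 * total crosses the hard-coded threshold, the writer blocks in
 * synchronize_rcu() (still holding rtnl_lock) so the backlog of dirty
 * memory cannot grow without bound.
 */
static size_t tnode_free_size;
static const int sync_pages = 128;	/* 128 * 4KB pages = 512KB */

static void tnode_free(struct key_vector *tn)
{
	tnode_free_size += TNODE_SIZE(tn_info(tn)->full_children);
	node_free(tn);			/* call_rcu()-based deferred free */

	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
		tnode_free_size = 0;
		synchronize_rcu();	/* throttle: wait out a grace period */
	}
}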
> >>
> >> Unfortunately, at Arista we have use cases with full-view software
> >> forwarding. At the scale of 100K and more routes, even on 2-core boxes,
> >> the hard-coded limit really starts to hurt: the lockup detector notices
> >> that rtnl_lock is held for seconds.
> >> The first cause was the previously broken MAX_WORK, which didn't limit
> >> the pending balancing work. While fixing that, I noticed that the
> >> bottleneck is actually the number of synchronize_rcu() calls.
> >>
> >> I've tried to fix it with a patch that decrements the number of pending
> >> tnodes in an RCU callback, but it didn't affect performance much.
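One way such an approach might look; this is a purely hypothetical
sketch, not the actual patch, and the names are invented. The counter
has to become atomic because the RCU callback runs in softirq context,
outside rtnl_lock:

#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static atomic_long_t tnode_free_pending;

struct tnode_rcu {			/* hypothetical container */
	struct rcu_head rcu;
	size_t size;
	/* node payload would follow here */
};

static void tnode_free_cb(struct rcu_head *head)
{
	struct tnode_rcu *tn = container_of(head, struct tnode_rcu, rcu);

	/* Subtract only once the grace period has actually elapsed. */
	atomic_long_sub(tn->size, &tnode_free_pending);
	kfree(tn);
}

static void tnode_queue_free(struct tnode_rcu *tn)
{
	atomic_long_add(tn->size, &tnode_free_pending);
	call_rcu(&tn->rcu, tnode_free_cb);
}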
> >>
> >> One possible way to "fix" it is to provide another sysctl to control
> >> sync_pages, but in my view that's nasty - it exposes yet another
> >> implementation detail to user space.
> >
> > well, that was accepted last week. ;-)
> >
> > commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> > Author: David Ahern <dsahern@...il.com>
> > Date: Wed Mar 20 09:18:59 2019 -0700
> >
> > ipv4: Allow amount of dirty memory from fib resizing to be controllable
> >
> >
> > Can you see how that change (should backport easily) affects your test
> > case? From my perspective 16MB was the sweet spot.
>
> FWIW, I would like to +Cc Paul here.
>
> TL;DR: David and I are looking into ways to improve the hard-coded limit
> tnode_free_size in net/ipv4/fib_trie.c: currently it's way too low
> (512KB). David created a patch that provides a sysctl to control the
> limit, which would solve the problem for both of us. In parallel, I
> thought that exposing this to userspace is not much fun and added a
> shrinker together with synchronize_rcu(). I'm not at all sure that the
> latter is actually a sane solution..
> Is there any guarantee that memory to be freed by call_rcu() will get
> freed under OOM conditions? Might there be a chance that we don't need
> any limit here at all?
Yes, unless whatever is causing the OOM is also stalling a CPU or task
that RCU is waiting on. The extreme case is of course when the OOM is
itself caused by RCU waiting on a stalled CPU or task. Either way, the
stalled CPU or task is a bug in its own right.
So, in the absence of bugs, yes, the memory that was passed to call_rcu()
should be freed within a reasonable length of time, even under OOM
conditions.
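For illustration, here is a generic (non-fib_trie) example of the two
idioms being compared: synchronize_rcu() blocks the updater until a
grace period elapses, while call_rcu() returns immediately and the
memory is reclaimed some time after the grace period ends. Names here
are made up for the example:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

static DEFINE_SPINLOCK(foo_lock);
static struct foo __rcu *foo_slot;

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_replace(struct foo *newp)
{
	struct foo *old;

	spin_lock(&foo_lock);
	old = rcu_dereference_protected(foo_slot,
					lockdep_is_held(&foo_lock));
	rcu_assign_pointer(foo_slot, newp);
	spin_unlock(&foo_lock);

	if (!old)
		return;

	/* Blocking variant: synchronize_rcu(); kfree(old); */

	/* Deferred variant: returns at once, "old" is freed after the
	 * grace period - this is the case the OOM question is about.
	 */
	call_rcu(&old->rcu, foo_free_rcu);
}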
> Worth mentioning that I'm not arguing against David's patch - as I noted,
> it would (will) solve the problem for both of us - but, with good
> intentions, I'm wondering whether we can do something here other than add
> a new sysctl knob.
An intermediate position would be to pick a reasonably high default so
that the sysctl knob would almost never need to be adjusted.
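A hypothetical sketch of such a knob, following the usual ctl_table
pattern; the default, bounds, and table name here are illustrative, not
taken from David's actual commit. The resulting file would show up under
/proc/sys/net/ipv4/ and could be tuned in the rare cases that need it:

#include <linux/sysctl.h>

static unsigned int fib_sync_mem = 16 << 20;		/* 16MB default */
static unsigned int fib_sync_mem_min = 64 << 10;	/* 64KB floor */
static unsigned int fib_sync_mem_max = 64 << 20;	/* 64MB ceiling */

static struct ctl_table fib_sync_table[] = {
	{
		.procname	= "fib_sync_mem",
		.data		= &fib_sync_mem,
		.maxlen		= sizeof(fib_sync_mem),
		.mode		= 0644,
		.proc_handler	= proc_douintvec_minmax,
		.extra1		= &fib_sync_mem_min,
		.extra2		= &fib_sync_mem_max,
	},
	{ }
};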
RCU used to detect OOM conditions and work harder to finish the grace
period in those cases, but this was abandoned because it was found not
to make a significant difference in production. Which might support the
position that memory passed to call_rcu() gets freed reasonably quickly
even under OOM conditions.
Thanx, Paul