linux-kernel - Re: [RFC 4/4] net/ipv4/fib: Don't synchronise

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d77c86d7-23da-c301-4443-08e9988ac801@arista.com>
Date:   Tue, 26 Mar 2019 23:14:43 +0000
From:   Dmitry Safonov <dima@...sta.com>
To:     David Ahern <dsahern@...il.com>, linux-kernel@...r.kernel.org,
        "Paul E. McKenney" <paulmck@...ux.ibm.com>
Cc:     Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Ido Schimmel <idosch@...lanox.com>, netdev@...r.kernel.org
Subject: Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb

On 3/26/19 3:39 PM, David Ahern wrote:
> On 3/26/19 9:30 AM, Dmitry Safonov wrote:
>> Fib trie has a hard-coded sync_pages limit to call synchronise_rcu().
>> The limit is 128 pages or 512Kb (considering common case with 4Kb
>> pages).
>>
>> Unfortunately, at Arista we have use-scenarios with full view software
>> forwarding. At the scale of 100K and more routes even on 2 core boxes
>> the hard-coded limit starts actively shooting in the leg: lockup
>> detector notices that rtnl_lock is held for seconds.
>> First reason is previously broken MAX_WORK, that didn't limit pending
>> balancing work. While fixing it, I've noticed that the bottle-neck is
>> actually in the number of synchronise_rcu() calls.
>>
>> I've tried to fix it with a patch to decrement number of tnodes in rcu
>> callback, but it hasn't much affected performance.
>>
>> One possible way to "fix" it - provide another sysctl to control
>> sync_pages, but in my POV it's nasty - exposing another realisation
>> detail into user-space.
> 
> well, that was accepted last week. ;-)
> 
> commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> Author: David Ahern <dsahern@...il.com>
> Date:   Wed Mar 20 09:18:59 2019 -0700
> 
>     ipv4: Allow amount of dirty memory from fib resizing to be controllable
> 
> 
> Can you see how that change (should backport easily) affects your test
> case? From my perspective 16MB was the sweet spot.

FWIW, I would like to +Cc Paul here.

TLDR; we're looking with David into ways to improve a hardcoded limit
tnode_free_size at net/ipv4/fib_trie.c: currently it's way too low
(512Kb). David created a patch to provide sysctl that controls the limit
and it would solve a problem for both of us. In parallel, I thought that
exposing this to userspace is not much fun and added a shrinker with
synchronize_rcu(). I'm not any sure that the latter is actually a sane
solution..
Is there any guarantee that memory to-be freed by call_rcu() will get
freed in OOM conditions? Might there be a chance that we don't need any
limit here at all?

Worth to mention that I don't argue David's patch as I pointed that it
would (will) solve the problem for us both, but with good intentions
wondering if we can do something here rather a new sysctl knob.

Thanks,
          Dmitry