lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 26 Mar 2019 15:30:21 +0000
From:   Dmitry Safonov <dima@...sta.com>
To:     linux-kernel@...r.kernel.org
Cc:     Dmitry Safonov <dima@...sta.com>,
        Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        David Ahern <dsahern@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Ido Schimmel <idosch@...lanox.com>, netdev@...r.kernel.org,
        linux-doc@...r.kernel.org, Jonathan Corbet <corbet@....net>
Subject: [RFC 0/4] net/fib: Speed up trie rebalancing for full view

During moving from 3.18 stable kernel to 4.9 on the switches, rebasing
local -specific patches and stuff, it was found that BGP benchmarks for
full view have started to hit the soft lockup detector by holding rtnl
mutex for a couple of seconds on routes changes.

I've found that the hard-coded MAX_WORK doesn't limit the amount of
pending rebalancing work anymore. So that any route adding/removal may
cause massive rebalancing of the trie. And making the hit even worse,
tnodes are freed by RCU with a call to synchronise_rcu() every 512Kb.
That is way too small and even on 2-cores switches is painfully
noticable.

To address those problems, I've introduced sysctl knob to limit the
amount of rebalancing work (by default unlimited as is de facto now).
I've moved synchronise_rcu() into a new shrinker to actually release
memory in OOM.. I believe non-visible to userspace shrinker is better
than a new sysctl knob for the limit (or any hard-coded value).
Though, not sure how sane the result is.
So, I send it as RFC, having qualms that it's not ready for inclusion
as-is yet.

I've looked further into the origin of problems here and was thinking if
it make sense to do the following (rough plan):
1. Introduce a new flag RTNL_FLAG_DUMP_UNLOCKED to dump fib trie without
   rtnl lock. I'm a bit out of context, so probably I miss some obvious
   reasons why lock needs to be held at this point.
2. Add a new fib_lock mutex for updating a trie. I'm not really sure
   that we can always release rtnl for updates, so probably there should
   be a strict locking order: rtnl_lock (when needed) => fib_lock.
3. Correct current documentation, that mentions fib_lock as rwsem.
4. ??

I did some experiments on the plan above, but decided to send this RFC
to get opinions of people who understand more. Maybe, my plan is
nonsense and it's not worth invest amount of time it requires.

I've also looked into changes between v3.18...v4.9, and found the
following patches set: https://www.spinics.net/lists/netdev/msg309586.html
While it has impressive results on lookups, it seems to be the reason of
the regression on the big scale: one of the patches has 5 times penalty
to remove a route on a big scale, another adds 10 times penalty by
calling synchronise_rcu() more frequently (I've marked them by Fixes tag
in the patches).

[I was very lavish on Cc list, please ping me in private if you don't
want to be copied on the next version]

Cc: Alexander Duyck <alexander.h.duyck@...ux.intel.com>
Cc: Alexey Kuznetsov <kuznet@....inr.ac.ru>
Cc: David Ahern <dsahern@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>
Cc: Eric Dumazet <edumazet@...gle.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>
Cc: Ido Schimmel <idosch@...lanox.com>
Cc: netdev@...r.kernel.org

Thanks,
        Dmitry

Dmitry Safonov (4):
  net/ipv4/fib: Remove run-time check in tnode_alloc()
  net/fib: Provide fib_balance_budget sysctl
  net/fib: Check budget before should_{inflate,halve}()
  net/ipv4/fib: Don't synchronise_rcu() every 512Kb

 Documentation/networking/ip-sysctl.txt |   6 ++
 include/net/ip.h                       |   1 +
 net/ipv4/fib_trie.c                    | 121 +++++++++++++++----------
 net/ipv4/sysctl_net_ipv4.c             |   7 ++
 4 files changed, 87 insertions(+), 48 deletions(-)

-- 
2.21.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ