Message-ID: <20200927153552.GA471334@fedic>
Date: Sun, 27 Sep 2020 17:35:52 +0200
From: Baptiste Jonglez <baptiste@...sofnetworks.org>
To: David Ahern <dsahern@...il.com>
Cc: Alarig Le Lay <alarig@...rdarmor.fr>, netdev@...r.kernel.org,
jack@...ilfillan.uk, Vincent Bernat <bernat@...ian.org>,
Oliver <bird-o@...net.de>
Subject: Re: IPv6 regression introduced by commit
3b6761d18bc11f2af2a6fc494e9026d39593f22c
Hi,
We are seeing the same issue; more information below.
On 07-03-20, David Ahern wrote:
> On 3/5/20 1:17 AM, Alarig Le Lay wrote:
> > Hi,
> >
> > On the bird users ML, we discussed a bug we’re facing when having a
> > full table: from time to time all the IPv6 traffic is dropped (and all
> > neighbors are invalidated), after a while it comes back again, then wait
> > a few minutes and it’s dropped again, and so on.
>
> Kernel version?
We are seeing the issue with 4.19 (Debian stable) and 5.4 (Debian
stable backports from a few months ago). Others have reported still
seeing the issue with 5.7:
http://trubka.network.cz/pipermail/bird-users/2020-September/014877.html
http://trubka.network.cz/pipermail/bird-users/2020-September/014881.html
Interestingly, the issue manifests itself in several different ways:
1) failing IPv6 neighbours, which is what Alarig reported. We are seeing
   this on a full-view BGP router with a rather low amount of IPv6
   traffic (around 10-20 Mbps)
2) high jitter when forwarding IPv6 traffic: this was in the original
   report from Basil and also here: http://trubka.network.cz/pipermail/bird-users/2020-September/014877.html
3) system lockup: the system becomes unresponsive, with messages like:
     watchdog: BUG: soft lockup - CPU#X stuck for XXs!
   and messages about transmit timeouts from the NIC driver.
   This happened to us on a router that has a BGP full view and
   handles around 50-100 Mbps of IPv6 traffic, which probably means
   lots of route lookups. It happened with both 4.19 and 5.4. On the
   other hand, kernel 4.9 runs fine on that exact same router (we are
   running Debian buster with the old kernel from Debian stretch).
When we can't use an older kernel, our current workaround is the
following sysctl config:
net.ipv6.route.gc_thresh = 100000
net.ipv6.route.max_size = 400000
From my understanding, this works because it basically disables the gc
in most cases.
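
To check whether a given router already runs with these values, here is a
minimal Python sketch. It assumes the two sysctls are exposed under
/proc/sys/net/ipv6/route/ (the usual mapping of sysctl names to procfs
paths); the helper names are made up for illustration.

```python
from pathlib import Path

# Values taken from the workaround quoted above.
WORKAROUND = {"gc_thresh": 100000, "max_size": 400000}

# Assumed procfs location of the net.ipv6.route.* sysctls.
BASE = "/proc/sys/net/ipv6/route"


def read_route_sysctls(base=BASE):
    """Return the current integer value of each workaround sysctl."""
    return {
        name: int((Path(base) / name).read_text().split()[0])
        for name in WORKAROUND
    }


def below_workaround(current):
    """Return the workaround values for sysctls currently set lower."""
    return {
        name: wanted
        for name, wanted in WORKAROUND.items()
        if current.get(name, 0) < wanted
    }


if __name__ == "__main__" and Path(BASE).is_dir():
    current = read_route_sysctls()
    print("current:", current)
    print("below workaround:", below_workaround(current))
```

An empty "below workaround" result means both values are at least as high
as the ones above.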
However, the "fib_rt_alloc" field from /proc/net/rt6_stats (6th field)
is steadily increasing: after 2 days of uptime it's at 67k. At some
point it will hit the gc threshold; we'll see what happens then.
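
A rough way to watch how far that counter is from the threshold is
sketched below. It assumes that /proc/net/rt6_stats is a single line of
whitespace-separated hexadecimal fields (worth checking on your kernel) and
that the 6th field is the one to compare against gc_thresh, as described
above. The function names and the sample line are made up for
illustration.

```python
def parse_rt6_stats(line):
    """Split one rt6_stats line into integers (fields assumed hexadecimal)."""
    return [int(field, 16) for field in line.split()]


def gc_headroom(line, gc_thresh=100000, field_index=5):
    """Entries left before the 6th field (index 5) reaches gc_thresh."""
    return gc_thresh - parse_rt6_stats(line)[field_index]


if __name__ == "__main__":
    # Made-up sample line; on a router, read /proc/net/rt6_stats instead.
    sample = "0001 0002 0003 0004 0005 105b8 0007"
    print(gc_headroom(sample))
```

A negative result would mean the counter has already passed the threshold.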
I am also trying to reproduce the issue locally.
Thanks,
Baptiste