linux-kernel - Re: [PATCH v1] neighbour: Don't let neigh_forced

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=UxS9qxYNdd+kqtW3VRSK=0H9ZPgW=CeSEjfbJXut+PaQ@mail.gmail.com>
Date:   Fri, 1 Dec 2023 09:16:14 -0800
From:   Doug Anderson <dianders@...omium.org>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Judy Hsiao <judyhsiao@...omium.org>,
        David Ahern <dsahern@...nel.org>,
        Simon Horman <horms@...nel.org>,
        Brian Haley <haleyb.dev@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Joel Granados <joel.granados@...il.com>,
        Julian Anastasov <ja@....bg>,
        Leon Romanovsky <leon@...nel.org>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, linux-kernel@...r.kernel.org,
        netdev@...r.kernel.org
Subject: Re: [PATCH v1] neighbour: Don't let neigh_forced_gc() disable
 preemption for long

Hi,

On Fri, Dec 1, 2023 at 7:58 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Fri, Dec 1, 2023 at 4:16 PM Doug Anderson <dianders@...omium.org> wrote:
> >
> > Hi,
> >
> > On Fri, Dec 1, 2023 at 1:10 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Fri, Dec 1, 2023 at 9:39 AM Judy Hsiao <judyhsiao@...omium.org> wrote:
> > > >
> > > > We are seeing cases where neigh_cleanup_and_release() is called by
> > > > neigh_forced_gc() many times in a row with preemption turned off.
> > > > When running on a low powered CPU at a low CPU frequency, this has
> > > > been measured to keep preemption off for ~10 ms. That's not great on a
> > > > system with HZ=1000 which expects tasks to be able to schedule in
> > > > with ~1ms latency.
> > >
> > > This will not work in general, because this code runs with BH blocked.
> > >
> > > jiffies will stay untouched for many more ms on systems with only one CPU.
> > >
> > > I would rather not rely on jiffies here but ktime_get_ns() [1]
> > >
> > > Also if we break the loop based on time, we might be unable to purge
> > > the last elements in gc_list.
> > > We might need to use a second list to make sure to cycle over all
> > > elements eventually.
> > >
> > >
> > > [1]
> > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > > index df81c1f0a57047e176b7c7e4809d2dae59ba6be5..e2340e6b07735db8cf6e75d23ef09bb4b0db53b4
> > > 100644
> > > --- a/net/core/neighbour.c
> > > +++ b/net/core/neighbour.c
> > > @@ -253,9 +253,11 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> > >  {
> > >         int max_clean = atomic_read(&tbl->gc_entries) -
> > >                         READ_ONCE(tbl->gc_thresh2);
> > > +       u64 tmax = ktime_get_ns() + NSEC_PER_MSEC;
> >
> > It might be nice to make the above timeout based on jiffies. On a
> > HZ=100 system it's probably OK to keep preemption disabled for 10 ms
> > but on a HZ=1000 system you'd want 1 ms. ...so maybe you'd want to use
> > jiffies_to_nsecs(1)?
>
> I do not think so. 10ms would be awfully long.
>
> We have nsec based time service, why downgrading to jiffies resolution ???

Well, the whole issue is that we're disabling preemption, right?
Unless I'm mistaken, on a 1000 HZ system then a task's timeslice is
1ms and on a 100 HZ system then a task's timeslice is 10ms. When we
disable preemption then the problem is that we can keep going past the
end of our timeslice. This is bad for whatever task the system is
trying to schedule instead of us since it will be blocked waiting for
us to re-enable preemption.

So essentially the problem here is really tied to jiffies resolution,
right? Specifically, if jiffies is 100 Hz then it's actually
inefficient to timeout every 1 ms--I think it would be better to use
up our whole timeslice.


> > One worry might be that we disabled preemption _right before_ we were
> > supposed to be scheduled out. In that case we'll end up blocking some
> > other task for another full timeslice, but maybe there's not a lot we
> > can do there?
>
> Can you tell us in which scenario this gc_list can be so big, other
> than fuzzers ?

The place we hit this wasn't actually with fuzzers but with normal
usage in our labs. The only case where it was a really big problem was
when neigh_forced_gc() was scheduled on a "little" CPU (in a
big.LITTLE system) and that little CPU happened to be running at the
lowest CPU frequency. Specifically Judy was testing on sc7180-trogdor
and the lowest CPU Frequency of the "little" CPUs was 300 MHz. Since
the littles are less powerful than the bigs, this is roughly the
equivalent processing power of a big core running at 120 MHz.

FWIW, we are apparently no longer seeing the bad latency after
<https://crrev.com/c/4914309>, which does this:

# Increase kernel neighbor table size.
echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
echo 8192 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
echo 8192 > /proc/sys/net/ipv6/neigh/default/gc_thresh3

However, I still believe that we should land something like Judy's
patch because, no matter what kernel tunings we have, the kernel
shouldn't be disabling preemption for so long.

I will also note that, while I don't know the code at all, someone on
our networking team commented this: High CPU usage / latency on
neigh_cleanup_and_release is expected naturally because of our
relatively-noisy lab environment: there are often hundreds if not
thousands of devices + virtual devices in a single L2 network.

-Doug