lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iLuY3dg6tpJCosGVOVR2FbT09DGXZpfb+zuW_BnPVf32g@mail.gmail.com>
Date: Fri, 1 Dec 2023 18:35:34 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Doug Anderson <dianders@...omium.org>
Cc: Judy Hsiao <judyhsiao@...omium.org>, David Ahern <dsahern@...nel.org>, 
	Simon Horman <horms@...nel.org>, Brian Haley <haleyb.dev@...il.com>, 
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Joel Granados <joel.granados@...il.com>, Julian Anastasov <ja@....bg>, Leon Romanovsky <leon@...nel.org>, 
	Luis Chamberlain <mcgrof@...nel.org>, Paolo Abeni <pabeni@...hat.com>, linux-kernel@...r.kernel.org, 
	netdev@...r.kernel.org
Subject: Re: [PATCH v1] neighbour: Don't let neigh_forced_gc() disable
 preemption for long

On Fri, Dec 1, 2023 at 6:16 PM Doug Anderson <dianders@...omium.org> wrote:
>
> Hi,
>
> On Fri, Dec 1, 2023 at 7:58 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Fri, Dec 1, 2023 at 4:16 PM Doug Anderson <dianders@...omium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Dec 1, 2023 at 1:10 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Fri, Dec 1, 2023 at 9:39 AM Judy Hsiao <judyhsiao@...omium.org> wrote:
> > > > >
> > > > > We are seeing cases where neigh_cleanup_and_release() is called by
> > > > > neigh_forced_gc() many times in a row with preemption turned off.
> > > > > When running on a low powered CPU at a low CPU frequency, this has
> > > > > been measured to keep preemption off for ~10 ms. That's not great on a
> > > > > system with HZ=1000 which expects tasks to be able to schedule in
> > > > > with ~1ms latency.
> > > >
> > > > This will not work in general, because this code runs with BH blocked.
> > > >
> > > > jiffies will stay untouched for many more ms on systems with only one CPU.
> > > >
> > > > I would rather not rely on jiffies here but ktime_get_ns() [1]
> > > >
> > > > Also if we break the loop based on time, we might be unable to purge
> > > > the last elements in gc_list.
> > > > We might need to use a second list to make sure to cycle over all
> > > > elements eventually.
> > > >
> > > >
> > > > [1]
> > > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > > > index df81c1f0a57047e176b7c7e4809d2dae59ba6be5..e2340e6b07735db8cf6e75d23ef09bb4b0db53b4
> > > > 100644
> > > > --- a/net/core/neighbour.c
> > > > +++ b/net/core/neighbour.c
> > > > @@ -253,9 +253,11 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> > > >  {
> > > >         int max_clean = atomic_read(&tbl->gc_entries) -
> > > >                         READ_ONCE(tbl->gc_thresh2);
> > > > +       u64 tmax = ktime_get_ns() + NSEC_PER_MSEC;
> > >
> > > It might be nice to make the above timeout based on jiffies. On a
> > > HZ=100 system it's probably OK to keep preemption disabled for 10 ms
> > > but on a HZ=1000 system you'd want 1 ms. ...so maybe you'd want to use
> > > jiffies_to_nsecs(1)?
> >
> > I do not think so. 10ms would be awfully long.
> >
> > We have nsec based time service, why downgrading to jiffies resolution ???
>
> Well, the whole issue is that we're disabling preemption, right?
> Unless I'm mistaken, on a 1000 HZ system then a task's timeslice is
> 1ms and on a 100 HZ system then a task's timeslice is 10ms. When we
> disable preemption then the problem is that we can keep going past the
> end of our timeslice. This is bad for whatever task the system is
> trying to schedule instead of us since it will be blocked waiting for
> us to re-enable preemption.
>
> So essentially the problem here is really tied to jiffies resolution,
> right? Specifically, if jiffies is 100 Hz then it's actually
> inefficient to timeout every 1 ms--I think it would be better to use
> up our whole timeslice.

It is not because a kernel is built with HZ=100 that each thread has
to consume cpu times in 10ms slices.

Process scheduler does not use jiffies at all, but high resolution time service.

Keep in mind this code is run from soft-interrupt, not a dedicated processus.

>
>
> > > One worry might be that we disabled preemption _right before_ we were
> > > supposed to be scheduled out. In that case we'll end up blocking some
> > > other task for another full timeslice, but maybe there's not a lot we
> > > can do there?
> >
> > Can you tell us in which scenario this gc_list can be so big, other
> > than fuzzers ?
>
> The place we hit this wasn't actually with fuzzers but with normal
> usage in our labs. The only case where it was a really big problem was
> when neigh_forced_gc() was scheduled on a "little" CPU (in a
> big.LITTLE system) and that little CPU happened to be running at the
> lowest CPU frequency. Specifically Judy was testing on sc7180-trogdor
> and the lowest CPU Frequency of the "little" CPUs was 300 MHz. Since
> the littles are less powerful than the bigs, this is roughly the
> equivalent processing power of a big core running at 120 MHz.
>
> FWIW, we are apparently no longer seeing the bad latency after
> <https://crrev.com/c/4914309>, which does this:
>
> # Increase kernel neighbor table size.
> echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
> echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
> echo 8192 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
> echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
> echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
> echo 8192 > /proc/sys/net/ipv6/neigh/default/gc_thresh3
>
> However, I still believe that we should land something like Judy's
> patch because, no matter what kernel tunings we have, the kernel
> shouldn't be disabling preemption for so long.

Sure, and I suggested a refinement, because as I said jiffies can
stick to a value.

Not sure why a refinement given by a network maintainer is not an option ?

I must be missing something.

>
> I will also note that, while I don't know the code at all, someone on
> our networking team commented this: High CPU usage / latency on
> neigh_cleanup_and_release is expected naturally because of our
> relatively-noisy lab environment: there are often hundreds if not
> thousands of devices + virtual devices in a single L2 network.

Well, I know this code a bit, trust me.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ