linux-kernel - Re: [RFC PATCH net-next 1/2] net: Use SMP threads for backlog NAPI.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230920155754.KzYGXMWh@linutronix.de>
Date:   Wed, 20 Sep 2023 17:57:54 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Wander Lairson Costa <wander@...hat.com>
Subject: Re: [RFC PATCH net-next 1/2] net: Use SMP threads for backlog NAPI.

On 2023-08-23 15:35:41 [+0200], Paolo Abeni wrote:
> On Mon, 2023-08-14 at 11:35 +0200, Sebastian Andrzej Siewior wrote:
> > @@ -4781,7 +4733,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
> >  		 * We can use non atomic operation since we own the queue lock
> >  		 */
> >  		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
> > -			napi_schedule_rps(sd);
> > +			__napi_schedule_irqoff(&sd->backlog);
> >  		goto enqueue;
> >  	}
> >  	reason = SKB_DROP_REASON_CPU_BACKLOG;
> 
> I *think* that the above could be quite dangerous when cpu ==
> smp_processor_id() - that is, with plain veth usage.
> 
> Currently, each packet runs into the rx path just after
> enqueue_to_backlog()/tx completes.
> 
> With this patch there will be a burst effect, where the backlog thread
> will run after a few (several) packets will be enqueued, when the
> process scheduler will decide - note that the current CPU is already
> hosting a running process, the tx thread.
> 
> The above can cause packet drops (due to limited buffering) or very
> high latency (due to long burst), even in non overload situation, quite
> hard to debug.
> 
> I think the above needs to be an opt-in, but I guess that even RT
> deployments doing some packet forwarding will not be happy with this
> on.

I've been looking at this again and have been thinking what you said
here. I think part of the problem is that we lack a policy/ mechanism
when a DoS is happening and what to do.

Before commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its
job"") when a lot of network packets are processed then processing is
moved to ksoftirqd and continues based on how the scheduler schedules
the SCHED_OTHER ksoftirqd task. This avoids lock-ups of the system and
it can do something else in between. Any interrupt will not continue the
outstanding softirq backlog but wait for ksoftirqd. So it basically
avoids the networking overload. It throttles the throughput if needed.

This isn't the case after that commit. Now, the CPU can be stuck with
processing networking packets if the packets come in fast enough. Even
if ksoftirqd is woken up, the next interrupt (say the timer) will
continue with at least one round.
By using NAPI-threads it is possible to give the control back to the
scheduler which can throttle the NAPI processing in favour of other
threads that ask for CPU. As you pointed out, waking the thread does not
guarantee that it will immediately do the NAPI work. It can be delayed
based on current load on the system.

This could be influenced by assigning the NAPI-thread a SCHED_FIFO
priority. Based on the priority it could be ensured that the thread
starts right away or "later" if something else is more important.
However, this opens the DoS window again: The scheduler will put the
NAPI thread on CPU as long as it asks for it with no throttling.

If we could somehow define a DoS condition once we are overwhelmed with
packets, then we could act on it and throttle it. This in turn would
allow a SCHED_FIFO priority without the fear of a lockup if the system
is flooded with packets.

> Cheers,
> 
> Paolo

Sebastian