linux-kernel - Re: [RFC PATCH 2/2] softirq: Drop the warning from do_softirq_post_smp_call

Message-ID: <CAO3-PbpbrK6FAACw5TQyBxJ6jgO7_bhLFuPVAziUE+40_o_GnA@mail.gmail.com>
Date:   Wed, 16 Aug 2023 10:15:40 -0500
From:   Yan Zhai <yan@...udflare.com>
To:     Jesper Dangaard Brouer <hawk@...nel.org>
Cc:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Wander Lairson Costa <wander@...hat.com>,
        linux-kernel@...r.kernel.org,
        kernel-team <kernel-team@...udflare.com>
Subject: Re: [RFC PATCH 2/2] softirq: Drop the warning from do_softirq_post_smp_call_flush().

On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
>
>
> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
> >
> >
> > On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> >> This is an undesired situation and it has been attempted to avoid the
> >> situation in which ksoftirqd becomes scheduled. This changed since
> >> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> >> and now a threaded interrupt handler will handle soft interrupts at its
> >> end even if ksoftirqd is pending. That means that they will be processed
> >> in the context in which they were raised.
> >
> > $ git describe --contains d15121be74856
> > v6.5-rc1~232^2~4
> >
> > That revert basically removes the "overload" protection that was added
> > to cope with DDoS situations in Aug 2016 (Cc. Cloudflare).  As described
> > in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> > ksoftirqd do its job") in UDP overload situations when UDP socket
> > receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
> > doesn't process packets (because softirq steals CPU/sched time from UDP
> > pid).  Warning Cloudflare (Cc) as this might affect their production
> > use-cases, and I recommend getting involved to evaluate the effect of
> > these changes.
> >
>
> I did some testing on net-next (with commit d15121be74856 ("Revert
> "softirq: Let ksoftirqd do its job"") using UDP pktgen + udp_sink.
>
> And I observe the old overload issue occurring again, where the userspace
> process (udp_sink) processes very few packets when running on the *same*
> CPU as the NAPI-RX/IRQ processing.  The perf report "comm" clearly shows
> that NAPI runs in the context of the "udp_sink" process, stealing its
> sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
> below).
> What happens is that NAPI takes 64 packets and queues them to the
> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
> packet from the socket queue, and on exit (__local_bh_enable_ip) runs
> softirq, which starts NAPI (to again process 64 packets... repeat).
>
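To make the 64-to-1 loop described above concrete, here is a toy model
in Python. It is not kernel code and it is not calibrated against the
measurements in this thread: the receive-queue capacity (rcvbuf_pkts)
and the function name are assumptions picked for illustration. Only the
NAPI budget of 64 and the single read per wakeup come from the
description above.

```python
def simulate(cycles, napi_budget=64, reads_per_wakeup=1, rcvbuf_pkts=256):
    """Toy model: each cycle NAPI enqueues a batch, then the reader runs once.

    rcvbuf_pkts is an assumed socket queue capacity in packets.
    Returns (delivered_to_userspace, dropped_on_full_queue).
    """
    queued = delivered = dropped = 0
    for _ in range(cycles):
        # Softirq part: NAPI offers a full budget of packets to the socket.
        room = rcvbuf_pkts - queued
        accepted = min(napi_budget, room)
        queued += accepted
        dropped += napi_budget - accepted  # queue full: counted as a drop

        # Process part: the reader gets to consume only a few packets
        # before softirq processing runs again in its context.
        taken = min(reads_per_wakeup, queued)
        queued -= taken
        delivered += taken
    return delivered, dropped


delivered, dropped = simulate(10_000)
print(f"delivered={delivered} dropped={dropped}")
```

Once the queue fills, the model delivers one packet per cycle and drops
the rest of each batch, which is the same shape as the UdpRcvbufErrors
growth in the same-CPU test. With reads_per_wakeup=64 the reader keeps
up and nothing is dropped.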
I think there are two scenarios to consider:
1. Actual DoS scenario. In this case, we would drop DoS packets
through XDP, which might actually relieve the stress. According to
Marek's blog, XDP can indeed drop 10M pps [1], so it might not steal too
much time. This is also something I would like to validate again, since
I cannot tell whether those tests were performed before or after the
reverted commit.
2. Legit elephant flows (so they should not just be dropped). This one
is closer to what you tested above, and it is a much harder issue
since the packets are legit and should not be dropped early at XDP.
Letting the scheduler move affected processes away seems to be the
non-optimal but straightforward answer for now. However, I suspect this
would cause an overload issue for setups programmed with RFS or ARFS,
since flows would "follow" the processes. Those probably have to force
threaded NAPI for tuning.
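
For reference, threaded NAPI is a per-device switch in sysfs
(/sys/class/net/<dev>/threaded, available on kernels 5.12 and later).
A minimal sketch of flipping it follows. The helper name and the
sysfs_root parameter are made up for illustration (the latter only so
the function can be tried against a scratch directory), and writing to
the real file needs root.

```python
from pathlib import Path


def set_threaded_napi(dev, enable=True, sysfs_root="/sys/class/net"):
    """Write 1/0 to <sysfs_root>/<dev>/threaded and return the value read back.

    sysfs_root is a hypothetical parameter for testing; on a real system
    the default path is the one exposed by the kernel.
    """
    path = Path(sysfs_root) / dev / "threaded"
    path.write_text("1" if enable else "0")
    # Read back so the caller can confirm the kernel accepted the setting.
    return path.read_text().strip()
```

Usage would be something like set_threaded_napi("eth0"), where "eth0"
is a placeholder device name. Whether this actually helps in the RFS or
ARFS case is untested here.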

[1] https://blog.cloudflare.com/how-to-drop-10-million-packets/

>
> > I do realize/acknowledge that the reverted patch caused other latency
> > issues, given it was a "big-hammer" approach affecting other softirq
> > processing (as can be seen by e.g. the watchdog fixes patches).
> > Thus, the revert makes sense, but how to regain the "overload"
> > protection such that RX networking cannot starve processes reading from
> > the socket? (is this what Sebastian's patchset does?)
> >
>
> I'm no expert in the sched / softirq area of the kernel, but I'm willing
> to help out testing different solutions that can regain the "overload"
> protection, e.g. avoid packet processing that "falls-off-an-edge" (and
> thus opens the kernel to being DDoS'ed easily).
> Is this what Sebastian's patchset does?
>
>
> >
> > Thread link for people Cc'ed:
> > https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
>
> --Jesper
> (some testlab results below)
>
> [udp_sink]
> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
>
>
> When udp_sink runs on the same CPU as NAPI/softirq
>   - UdpInDatagrams: 2,948 packets/sec
>
> $ nstat -n && sleep 1 && nstat
> #kernel
> IpInReceives                    2831056            0.0
> IpInDelivers                    2831053            0.0
> UdpInDatagrams                  2948               0.0
> UdpInErrors                     2828118            0.0
> UdpRcvbufErrors                 2828118            0.0
> IpExtInOctets                   130206496          0.0
> IpExtInNoECTPkts                2830576            0.0
>
> When udp_sink runs on a different CPU than NAPI-RX
>   - UdpInDatagrams: 1,722,307 pps
>
> $ nstat -n && sleep 1 && nstat
> #kernel
> IpInReceives                    2318560            0.0
> IpInDelivers                    2318562            0.0
> UdpInDatagrams                  1722307            0.0
> UdpInErrors                     596280             0.0
> UdpRcvbufErrors                 596280             0.0
> IpExtInOctets                   106634256          0.0
> IpExtInNoECTPkts                2318136            0.0
>
>
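
For a quick read of the two nstat samples quoted above, the share of
delivered IP packets that actually reached the UDP socket can be worked
out directly from the counters. This is plain arithmetic on the quoted
numbers; the dictionary and function names are only for illustration.

```python
# Counters copied from the two nstat samples quoted above (1 second each).
same_cpu = {"IpInDelivers": 2831053, "UdpInDatagrams": 2948,
            "UdpRcvbufErrors": 2828118}
other_cpu = {"IpInDelivers": 2318562, "UdpInDatagrams": 1722307,
             "UdpRcvbufErrors": 596280}


def delivered_fraction(sample):
    """Fraction of delivered IP packets that were queued to the UDP socket."""
    return sample["UdpInDatagrams"] / sample["IpInDelivers"]


for name, sample in (("same CPU", same_cpu), ("other CPU", other_cpu)):
    print(f"{name}: {delivered_fraction(sample) * 100:.2f}% reached the socket")

# How much more the reader receives when it is not sharing a CPU with NAPI.
speedup = other_cpu["UdpInDatagrams"] / same_cpu["UdpInDatagrams"]
print(f"ratio: {speedup:.0f}x")
```

That works out to roughly 0.1% of packets reaching the socket on the
same CPU versus roughly 74% on a different CPU, with nearly all of the
remainder showing up as UdpRcvbufErrors.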


-- 

Yan