Message-ID: <22d992aa-2b65-0de3-b88c-fd216ae0218e@redhat.com>
Date:   Wed, 16 Aug 2023 23:02:34 +0200
From:   Jesper Dangaard Brouer <jbrouer@...hat.com>
To:     Yan Zhai <yan@...udflare.com>,
        Jesper Dangaard Brouer <hawk@...nel.org>
Cc:     brouer@...hat.com,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Wander Lairson Costa <wander@...hat.com>,
        linux-kernel@...r.kernel.org,
        kernel-team <kernel-team@...udflare.com>
Subject: Re: [RFC PATCH 2/2] softirq: Drop the warning from
 do_softirq_post_smp_call_flush().



On 16/08/2023 17.15, Yan Zhai wrote:
> On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>>
>> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>>>> This is an undesired situation and it has been attempted to avoid the
>>>> situation in which ksoftirqd becomes scheduled. This changed since
>>>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>>>> and now a threaded interrupt handler will handle soft interrupts at its
>>>> end even if ksoftirqd is pending. That means that they will be processed
>>>> in the context in which they were raised.
>>>
>>> $ git describe --contains d15121be74856
>>> v6.5-rc1~232^2~4
>>>
>>> That revert basically removes the "overload" protection that was added
>>> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare).  As described
>>> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
>>> ksoftirqd do its job") in UDP overload situations when UDP socket
>>> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
>>> doesn't process packets (because softirq steals CPU/sched time from UDP
>>> pid).  Warning Cloudflare (Cc) as this might affect their production
>>> use-cases, and I recommend getting involved to evaluate the effect of
>>> these changes.
>>>
>>
>> I did some testing on net-next (with commit d15121be74856 ("Revert
>> "softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
>>
>> And I observe the old overload issue occurring again, where the
>> userspace process (udp_sink) processes very few packets when running on
>> the *same* CPU as the NAPI-RX/IRQ processing.  The perf report "comm"
>> clearly shows that NAPI runs in the context of the "udp_sink" process,
>> stealing its sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps,
>> see details below).
>> What happens is that NAPI takes 64 packets and queues them to the
>> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
>> packet from the socket queue and on exit (__local_bh_enable_ip) runs
>> softirq, which starts NAPI (to again process 64 packets... repeat).
>>
> I think there are two scenarios to consider:
>
> 1. Actual DoS scenario. In this case, we would drop DoS packets
> through XDP, which might actually relieve the stress. According to
> Marek's blog XDP can indeed drop 10M pps [1] so it might not steal too
> much time. This is also something I would like to validate again since

Yes, using XDP to drop packets will/should relieve the stress, as it
can basically discard some of the 64 packets processed by NAPI vs the 1
packet received by userspace (which re-triggers NAPI), giving a better
balance.

> I cannot tell if those tests were performed before or after the
> reverted commit.

Marek's tests will likely contain the patch 4cd13c21b207 ("softirq: Let
ksoftirqd do its job"), as the blog is from 2018 and the patch from 2016,
but that shouldn't matter much.


> 2. Legit elephant flows (so they should not just be dropped). This one
> is closer to what you tested above, and it is a much harder issue
> since the packets are legit and should not be dropped early at XDP.
> Letting the scheduler move affected processes away seems to be the
> non-optimal but straightforward answer for now. However, I suspect this
> would impose an overload issue for those programmed with RFS or ARFS,
> since flows would "follow" the processes. They probably have to force
> threaded NAPI for tuning.
>

True, this is the case I don't know how to solve.

For UDP packets it is NOT optimal to let the process "follow"/run on the
NAPI-RX CPU. For TCP traffic it is faster to run on the same CPU, which
could be related to the GRO effect, or simply that tcp_recvmsg gets a
stream of data (before it invokes __local_bh_enable_ip causing do_softirq).

I have also tested with netperf UDP packets[2] in a scenario that
doesn't cause "overload" and where the CPU has idle cycles.  When
UDP-netserver is running on the same CPU as NAPI, I see approx 38%
(82020/216362) UdpRcvbufErrors [3] (and with separate CPUs 2.8%).  Sure,
I could increase the buffer size, but the point is that NAPI can enqueue
64 packets while the UDP receiver dequeues 1 packet.

This reminded me that the kernel has a recvmmsg (extra "m") syscall for
multiple packets.  I tested this (as udp_sink has support), but no
luck. This is because internally in the kernel (do_recvmmsg) it is just a
loop over ___sys_recvmsg/__skb_recv_udp, which has a BH-spinlock per
packet whose unlock invokes __local_bh_enable_ip/do_softirq.  I guess
we/netdev could fix recvmmsg() to bulk-dequeue from the socket queue (the
BH-socket unlock is what triggers __local_bh_enable_ip/do_softirq) and
then have a solution for UDP(?).
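
The 64-enqueue vs 1-dequeue imbalance, and why a bulk-dequeue might
help, can be illustrated with a toy model. This is only a sketch, not a
measurement and not kernel code: the function name simulate(), the
receive-buffer size of 128 packets and the strict "one NAPI poll per
receiver wakeup" alternation are all assumptions made for illustration.

```python
def simulate(cycles, napi_budget=64, dequeue_batch=1, rcvbuf_pkts=128):
    """Toy model of a UDP receiver sharing a CPU with NAPI.

    Assumption: every cycle is one NAPI poll that tries to enqueue
    `napi_budget` packets into a bounded socket queue, followed by one
    receiver wakeup that dequeues up to `dequeue_batch` packets before
    softirq processing runs again.
    """
    queued = delivered = dropped = 0
    for _ in range(cycles):
        # NAPI poll: enqueue what fits, count the rest as rcvbuf drops.
        accepted = min(napi_budget, rcvbuf_pkts - queued)
        queued += accepted
        dropped += napi_budget - accepted
        # Receiver wakeup: dequeue a batch, then NAPI gets to run again.
        taken = min(dequeue_batch, queued)
        queued -= taken
        delivered += taken
    return delivered, dropped


if __name__ == "__main__":
    for batch in (1, 8, 64):
        delivered, dropped = simulate(1000, dequeue_batch=batch)
        print(f"batch={batch:2d} delivered={delivered} dropped={dropped}")
```

In this model the delivered share is bounded by dequeue_batch/napi_budget
once the queue is full, which is the intuition behind a bulk-dequeue. The
real ratios depend on scheduling, buffer sizes and packet rates, so the
model does not reproduce the measured numbers.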


[2] netperf -H 198.18.1.1 -D1 -l 1200 -t UDP_STREAM -T 0,0 -- -m 1472 -N -n

[3]
$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    216362             0.0
IpInDelivers                    216354             0.0
UdpInDatagrams                  134356             0.0
UdpInErrors                     82020              0.0
UdpRcvbufErrors                 82020              0.0
IpExtInOctets                   324600000          0.0
IpExtInNoECTPkts                216400             0.0
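
The drop percentage quoted above can be recomputed from the nstat
counters in [3]. A minimal sketch follows; the helper names parse_nstat()
and rcvbuf_drop_ratio() are hypothetical, and the ratio is taken as
UdpRcvbufErrors/IpInReceives to match the 82020/216362 figure in the text.

```python
def parse_nstat(text):
    """Parse `nstat` output lines of the form 'Name value rate' into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the '#kernel' header and blank or incomplete lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def rcvbuf_drop_ratio(counters):
    """Share of received IP packets dropped on a full UDP receive buffer."""
    return counters["UdpRcvbufErrors"] / counters["IpInReceives"]


SAMPLE = """\
#kernel
IpInReceives                    216362             0.0
IpInDelivers                    216354             0.0
UdpInDatagrams                  134356             0.0
UdpInErrors                     82020              0.0
UdpRcvbufErrors                 82020              0.0
"""

if __name__ == "__main__":
    print(f"{rcvbuf_drop_ratio(parse_nstat(SAMPLE)):.1%}")  # prints 37.9%
```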


> [1] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> 
>>
>>> I do realize/acknowledge that the reverted patch caused other latency
>>> issues, given it was a "big-hammer" approach affecting other softirq
>>> processing (as can be seen by e.g. the watchdog fixes patches).
>>> Thus, the revert makes sense, but how to regain the "overload"
>>> protection such that RX networking cannot starve processes reading from
>>> the socket? (is this what Sebastian's patchset does?)
>>>
>>
>> I'm no expert in the sched / softirq area of the kernel, but I'm
>> willing to help out testing different solutions that can regain the
>> "overload" protection, e.g. avoid packet processing that
>> "falls-off-an-edge" (and thus opens the kernel to be DDoS'ed easily).
>> Is this what Sebastian's patchset does?
>>
>>
>>>
>>> Thread link for people Cc'ed:
>>> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
>>
>> --Jesper
>> (some testlab results below)
>>
>> [udp_sink]
>> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
>>
>>
>> When udp_sink runs on the same CPU as NAPI/softirq
>>    - UdpInDatagrams: 2,948 packets/sec
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2831056            0.0
>> IpInDelivers                    2831053            0.0
>> UdpInDatagrams                  2948               0.0
>> UdpInErrors                     2828118            0.0
>> UdpRcvbufErrors                 2828118            0.0
>> IpExtInOctets                   130206496          0.0
>> IpExtInNoECTPkts                2830576            0.0
>>
>> When udp_sink runs on another CPU than NAPI-RX.
>>    - UdpInDatagrams: 1,722,307 pps
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2318560            0.0
>> IpInDelivers                    2318562            0.0
>> UdpInDatagrams                  1722307            0.0
>> UdpInErrors                     596280             0.0
>> UdpRcvbufErrors                 596280             0.0
>> IpExtInOctets                   106634256          0.0
>> IpExtInNoECTPkts                2318136            0.0
>>
>>
> 
> 
