Date:   Thu, 1 Sep 2016 12:38:49 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Rick Jones <rick.jones2@....com>
Cc:     brouer@...hat.com, Eric Dumazet <eric.dumazet@...il.com>,
        Peter Zijlstra <peterz@...radead.org>,
        David Miller <davem@...emloft.net>,
        Rik van Riel <riel@...hat.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Hannes Frederic Sowa <hannes@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        netdev <netdev@...r.kernel.org>, Jonathan Corbet <corbet@....net>
Subject: Re: [PATCH] softirq: let ksoftirqd do its job


On Wed, 31 Aug 2016 16:29:56 -0700 Rick Jones <rick.jones2@....com> wrote:
> On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> > On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:  
> >> With regard to drops, are both of you sure you're using the same socket
> >> buffer sizes?  
> >
> > Does it really matter ?  
> 
> At least at points in the past I have seen different drop counts at the 
> SO_RCVBUF based on using (sometimes much) larger sizes.  The hypothesis 
> I was operating under at the time was that this dealt with those 
> situations where the netserver was held-off from running for "a little 
> while" from time to time.  It didn't change things for a sustained 
> overload situation though.

Yes, Rick, your hypothesis corresponds to my measurements.  The
userspace program is held-off from running for "a little while" from
time to time.  I've measured this with perf sched record/latency.  It
is sort of a natural scheduler characteristic.
 The userspace UDP socket program consumes/needs more cycles to do its
job than kernel ksoftirqd does. Thus the UDP-prog uses up its sched
time-slice, and periodically ksoftirqd gets scheduled multiple times in
a row, because the UDP-prog has no credits left.
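
A rough way to watch this split live (just an illustration, not part of
the measurements above; mpstat comes from the sysstat package, and CPU 0
is only an assumption -- use the CPU your RX queue and UDP-prog run on):

 # show %usr (UDP-prog) vs %soft (ksoftirqd/NAPI) on CPU 0, once per second
 mpstat -P 0 1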

WARNING: Do not increase the socket queue size to paper over this issue;
it is the WRONG solution, it will give horrible latency issues.

With the above warning in place, I can tell you: yes, you are also right
that increasing the socket buffer size can be used to mitigate/hide the
packet drops.  You can even increase the socket size so much that the
drop problem "goes away".  The queue simply needs to be deep enough to
absorb the worst/maximum time the UDP-prog was scheduled out.  The hidden
effect that makes this work (so it does not contradict queueing theory)
is that it also slows down/costs more cycles for ksoftirqd/NAPI, as it
costs more to enqueue than to drop packets on a full queue.

You can measure the sched "Maximum delay" using:
 sudo perf sched record -C 0 sleep 10
 sudo perf sched latency
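
If the latency table gets long, it can simply be grepped for the
receiver; "udp_sink" here is just a placeholder for whatever your
UDP-prog binary is actually called:

 sudo perf sched latency | grep udp_sink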

On my setup I measured a "Maximum delay" of approx 9 ms.  Given I can
see an incoming packet rate of 2.4Mpps (880Kpps reach UDP-prog), and
knowing the network stack accounts skb->truesize (approx 2048 bytes on
this driver), I can calculate that I need an approx 45MBytes buffer
((2.4*10^6)*(9/1000)*2048 = 44.2MB).
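
The same back-of-envelope calculation in shell, for reference (bc is
assumed to be installed; the numbers are the ones from above):

 # rate(pps) * max-sched-delay(s) * truesize(bytes) = queue bytes needed
 echo '2.4*10^6 * 0.009 * 2048' | bc
 # prints 44236800.000, i.e. approx 44.2MB -- rounded up to 50MBytes below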

The PPS measurement comes from:

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    2335926            0.0
 IpInDelivers                    2335925            0.0
 UdpInDatagrams                  880086             0.0
 UdpInErrors                     1455850            0.0
 UdpRcvbufErrors                 1455850            0.0
 IpExtInOctets                   107453056          0.0
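
Just to spell out what those counters say: of the ~2.33M packets/sec
delivered to UDP, ~1.46M hit a full receive buffer.  A quick way to get
the drop ratio (again assuming bc is installed):

 # UdpRcvbufErrors / IpInDelivers, as a percentage
 echo 'scale=2; 1455850 / 2335925 * 100' | bc
 # prints 62.00, i.e. ~62% dropped at the socket queue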

Changing the queue size to 50MBytes:
 sysctl -w net/core/rmem_max=$((50*1024*1024)) ;\
 sysctl -w net.core.rmem_default=$((50*1024*1024))
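
Small caveat in case someone copies this: rmem_default is only picked up
when a socket is created, so the UDP-prog must be restarted after the
change.  The effective values can be verified with:

 sysctl net.core.rmem_max net.core.rmem_default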

The new result looks "nice", with no drops, and 1.42Mpps delivered to
the UDP-prog, but in reality it is not nice for latency...

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1425013            0.0
 IpInDelivers                    1425017            0.0
 UdpInDatagrams                  1432139            0.0
 IpExtInOctets                   65539328           0.0
 IpExtInNoECTPkts                1424771            0.0

Tracking of queue size, max, min and average:

 while (true); do netstat -uan | grep '0.0.0.0:9'; sleep 0.3; done |
  awk 'BEGIN {max=0;min=0xffffffff;sum=0;n=0} \
   {if ($2 > max) max=$2;
    if ($2 < min) min=$2;
    n++; sum+=$2;
    printf "%s Recv-Q: %d max: %d min: %d ave: %.3f\n",$1,$2,max,min,sum/n;}';
 Result:
  udp Recv-Q: 23624832 max: 47058176 min: 4352 ave: 25092687.698

I see a max queue of 47MBytes, and worse, an average standing queue of
25MBytes, which is really bad for the latency seen by the application
(see the rough estimate below).  And having this much outstanding memory
is also bad for CPU cache effects and stresses the memory allocator.
 I'm actually using this huge queue "misconfig" to stress the page
allocator and my page_pool implementation into worst-case situations ;-)
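
To put that 25MBytes standing queue into latency terms (my own rough
estimate, using the same ~2048 bytes truesize assumption as above, since
Recv-Q is accounted in skb->truesize):

 # standing queue bytes / (delivery rate pps * truesize) = queueing delay
 echo 'scale=6; 25*10^6 / (1.42*10^6 * 2048)' | bc
 # prints .008596, i.e. roughly 8.6 ms of added latency for every packet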

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
