Message-ID: <1374533052.4990.89.camel@edumazet-glaptop>
Date:	Mon, 22 Jul 2013 15:44:12 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Rick Jones <rick.jones2@...com>
Cc:	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Yuchung Cheng <ycheng@...gle.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Michael Kerrisk <mtk.manpages@...il.com>
Subject: Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option

Hi Rick

> Netperf is perhaps a "best case" for this as it has no think time and 
> will not itself build up a queue of data internally.
> 
> The 18% increase in service demand is troubling.

It's not troubling at such a high speed. (Note also that I had better
throughput in my (single) test.)

The process scheduler cost is abysmal (or, more exactly, the cost of the
cpu entering idle mode, I presume).

Adding a context switch for every TSO packet is obviously not something
you want when pumping 20Gbps on a single tcp socket. I guess a real
application would not use 16KB send()s either.

I chose extreme parameters to show that the patch had an acceptable impact.
(128KB is only 2 TSO packets.)

The main targets of this patch are servers handling hundreds to millions
of sockets, or any machine with RAM constraints. This would also permit
better autotuning in the future. Our current 4MB limit is a bit small in
some cases.

Allowing the socket write queue to queue more bytes is better for
throughput/cpu cycles, as long as you have enough RAM.
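
For reference, a minimal sketch of how an application could opt in per
socket once the patch is in. The fallback #define uses the value from the
kernel patch (25) for userspace headers that do not define the option yet;
the system-wide default is the tcp_notsent_lowat sysctl used in the tests
below.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdio.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* not yet in older userspace headers */
#endif

/* Limit the amount of unsent data the kernel accepts in the write queue
 * before the socket stops being writable (blocking send() sleeps,
 * poll() stops reporting POLLOUT). */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
		       &bytes, sizeof(bytes)) < 0) {
		perror("setsockopt(TCP_NOTSENT_LOWAT)");
		return -1;
	}
	return 0;
}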


> 
> It would be good to hit that with the confidence intervals (e.g. -i 30,3 
> and perhaps -i 99,<something other than the default of 5>) or do many 
> separate runs to get an idea of the variation.  Presumably remote 
> service demand is not of interest, so for the confidence intervals bit 
> you might drop the -C and keep only the -c, in which case netperf will 
> not be trying to hit the confidence interval for remote CPU utilization 
> along with local CPU and throughput.
> 

Well, I am sure a lot of netperf tests can be done, thanks for the
input! I am removing the -C ;)

The -i30,3 runs are usually very very very slow :(

> Why are there more context switches with the lowat set to 128KB?  Is the 
> SO_SNDBUF growth in the first case the reason? Otherwise I would have 
> thought that netperf would have been context switching back and forth 
> at "socket full" just as often as at "128KB."  You might then also 
> compare before and after with a fixed socket buffer size.

It seems normal to me to get one context switch per TSO packet, instead
of _no_ context switches when the cpu is so busy that it never has to put
the netperf thread to sleep: softirq handling removes packets from the
write queue at the same speed as the application can add new ones ;)
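
The wakeup pattern may be easier to see from the application side. Here is
a rough sketch of a sender loop (assuming a non-blocking socket; this is
not the netperf code itself): whenever the not-yet-sent bytes exceed the
limit the thread sleeps in poll(), and softirq freeing write-queue space
wakes it up again, which is where the extra context switches come from.

#include <poll.h>
#include <unistd.h>
#include <errno.h>

/* Send len bytes, sleeping whenever the write queue is above the
 * notsent limit.  Each poll() sleep/wakeup is one context switch. */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = write(fd, buf + off, len - off);

		if (n > 0) {
			off += n;
			continue;
		}
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
			/* Write queue above the limit: wait until softirq
			 * has transmitted enough to make the socket
			 * writable again. */
			struct pollfd pfd = { .fd = fd, .events = POLLOUT };

			if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
				return -1;
			continue;
		}
		if (n < 0 && errno == EINTR)
			continue;
		return -1;	/* write error or unexpected return of 0 */
	}
	return (ssize_t)off;
}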

> 
> Anything interesting happen when the send size is larger than the lowat?

Let's see ;)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                              %     Method %      Method                          
3056328     6291456     262144  20.00   16311.69   10^6bits/s  2.97  S      -1.00  U      0.359   -1.000  usec/KB  

 Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':

      89301.211847 task-clock                #    0.446 CPUs utilized          
           349,509 context-switches          #    0.004 M/sec                  
               179 CPU-migrations            #    0.002 K/sec                  
               453 page-faults               #    0.005 K/sec                  
   242,819,453,514 cycles                    #    2.719 GHz                     [81.82%]
   199,273,454,019 stalled-cycles-frontend   #   82.07% frontend cycles idle    [84.27%]
    50,268,984,648 stalled-cycles-backend    #   20.70% backend  cycles idle    [67.76%]
    53,781,450,212 instructions              #    0.22  insns per cycle        
                                             #    3.71  stalled cycles per insn [83.77%]
     8,738,372,177 branches                  #   97.853 M/sec                   [82.99%]
       119,158,960 branch-misses             #    1.36% of all branches         [83.17%]

     200.032331409 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                              %     Method %      Method                          
1862520     6291456     262144  20.00   17464.08   10^6bits/s  3.98  S      -1.00  U      0.448   -1.000  usec/KB  

 Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':

     111290.768845 task-clock                #    0.556 CPUs utilized          
         2,818,205 context-switches          #    0.025 M/sec                  
               201 CPU-migrations            #    0.002 K/sec                  
               453 page-faults               #    0.004 K/sec                  
   297,763,550,604 cycles                    #    2.676 GHz                     [83.35%]
   246,839,427,685 stalled-cycles-frontend   #   82.90% frontend cycles idle    [83.25%]
    75,450,669,370 stalled-cycles-backend    #   25.34% backend  cycles idle    [66.69%]
    63,464,955,178 instructions              #    0.21  insns per cycle        
                                             #    3.89  stalled cycles per insn [83.38%]
    10,564,139,626 branches                  #   94.924 M/sec                   [83.39%]
       248,015,797 branch-misses             #    2.35% of all branches         [83.32%]

     200.028775802 seconds time elapsed


2,818,205 context switches over ~200 seconds, i.e. about 14,091 context
switches per second...

Interesting how it actually increases throughput!



