Message-ID: <CANn89iL0gYoR1k5vccAcsBe=y+dr_MZNJZ0P4Q9J9-V1+HXGQA@mail.gmail.com>
Date:   Wed, 19 Oct 2016 07:10:44 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     kernel test robot <xiaolong.ye@...el.com>
Cc:     "David S. Miller" <davem@...emloft.net>,
        Yuchung Cheng <ycheng@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>, lkp@...org
Subject: Re: [lkp] [tcp] 3613b3dbd1: netpipe.bigger_5M_Mbps.avg -5.5% regression

On Wed, Oct 19, 2016 at 3:54 AM, kernel test robot
<xiaolong.ye@...el.com> wrote:
>
> FYI, we noticed a -5.5% regression of netpipe.bigger_5M_Mbps.avg due to commit:
>
> commit 3613b3dbd1ade9a6a626dae1f608c57638eb5e8a ("tcp: prepare skbs for better sack shifting")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: netpipe
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 6G
> with following parameters:
>
>         test: tcp
>         cluster: cs-localhost
>
> NetPIPE is a protocol independent performance tool that encapsulates the best of ttcp and netperf and visually represents the network performance under a variety of conditions.
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
>         git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
>
> =========================================================================================
> cluster/compiler/kconfig/rootfs/tbox_group/test/testcase:
>   cs-localhost/gcc-6/x86_64-rhel-7.2/debian-x86_64-2016-08-31.cgz/vm-lkp-a08/tcp/netpipe
>
> commit:
>   e812bd905a (" wireless-drivers-next patches for 4.9")
>   3613b3dbd1 ("tcp: prepare skbs for better sack shifting")
>
> e812bd905a5cf00f 3613b3dbd1ade9a6a626dae1f6
> ---------------- --------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>          %stddev     %change         %stddev
>              \          |                \
>       2427 ±  0%      -5.5%       2294 ±  0%  netpipe.bigger_5M_Mbps.avg
>       1764 ±  1%      +5.0%       1853 ±  0%  netpipe.time.elapsed_time
>       1764 ±  1%      +5.0%       1853 ±  0%  netpipe.time.elapsed_time.max
>      84.75 ±  0%      -2.1%      83.00 ±  0%  netpipe.time.percent_of_cpu_this_job_got
>       1497 ±  1%      +3.2%       1546 ±  0%  netpipe.time.system_time
>    1378335 ±  1%     +28.6%    1772455 ±  1%  netpipe.time.voluntary_context_switches
>      14140 ±  3%      +6.6%      15079 ±  2%  interrupts.CAL:Function_call_interrupts
>     393.25 ±  2%     -29.1%     279.00 ±  2%  slabinfo.kmalloc-2048.active_objs
>     412.00 ±  1%     -32.3%     279.00 ±  2%  slabinfo.kmalloc-2048.num_objs
>       4069 ±  0%     +17.5%       4779 ±  0%  vmstat.system.cs
>       2961 ±  0%      +5.3%       3118 ±  0%  vmstat.system.in
>     229439 ± 94%   +6926.4%   16121408 ±172%  latency_stats.avg.max
>     141971 ±160%  +11239.9%   16099418 ±172%  latency_stats.avg.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>      90183 ±153%     -55.4%      40178 ± 84%  latency_stats.avg.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>     839.25 ± 21%   +1114.0%      10188 ±125%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
>     414419 ± 86%   +7661.1%   32163484 ±172%  latency_stats.max.max
>     257867 ±166%  +12362.7%   32137192 ±172%  latency_stats.max.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>      81137 ±155%     -94.7%       4265 ± 45%  latency_stats.max.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>      90183 ±153%     -55.4%      40178 ± 84%  latency_stats.max.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>     839.25 ± 21%   +1114.0%      10188 ±125%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
>       6232 ± 87%    -100.0%       0.00 ± -1%  latency_stats.sum.call_rwsem_down_read_failed.__do_page_fault.trace_do_page_fault.do_async_page_fault.async_page_fault.skb_copy_datagram_iter.tcp_rcv_established.tcp_v4_do_rcv.tcp_prequeue_process.tcp_recvmsg.inet_recvmsg.sock_recvmsg
>       1107 ± 62%    +964.9%      11788 ± 55%  latency_stats.sum.call_rwsem_down_read_failed.proc_pid_cmdline_read.__vfs_read.vfs_read.SyS_read.do_syscall_64.return_from_SYSCALL_64
>     276091 ±166%  +11542.5%   32143926 ±172%  latency_stats.sum.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>      97392 ±128%     -82.9%      16660 ± 29%  latency_stats.sum.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>      90183 ±153%     -55.4%      40178 ± 84%  latency_stats.sum.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>     839.25 ± 21%   +1114.0%      10188 ±125%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
>       3561 ± 11%     +12.9%       4020 ±  6%  sched_debug.cpu.curr->pid.avg
>       9783 ±  6%     +11.8%      10942 ±  3%  sched_debug.cpu.curr->pid.max
>       4222 ±  7%     +11.3%       4701 ±  3%  sched_debug.cpu.curr->pid.stddev
>    3988122 ± 13%     +24.4%    4962088 ±  4%  sched_debug.cpu.max_idle_balance_cost.max
>    1330791 ± 13%     +30.8%    1740923 ±  4%  sched_debug.cpu.max_idle_balance_cost.stddev
>     576379 ±  1%      +9.7%     632485 ±  3%  sched_debug.cpu.nr_load_updates.max
>     987546 ±  1%     +20.0%    1184637 ±  1%  sched_debug.cpu.nr_switches.avg
>    1074939 ±  2%     +23.9%    1332366 ±  3%  sched_debug.cpu.nr_switches.max
>      67934 ± 31%    +110.9%     143278 ±  5%  sched_debug.cpu.nr_switches.stddev
>     -15.31 ±-32%     -41.6%      -8.93 ±-57%  sched_debug.cpu.nr_uninterruptible.min
>     990916 ±  1%     +20.2%    1190867 ±  1%  sched_debug.cpu.sched_count.avg
>    1079929 ±  2%     +24.2%    1341786 ±  3%  sched_debug.cpu.sched_count.max
>      69632 ± 31%    +111.3%     147114 ±  5%  sched_debug.cpu.sched_count.stddev
>     386683 ±  1%     +25.2%     483992 ±  2%  sched_debug.cpu.sched_goidle.avg
>     424674 ±  1%     +29.4%     549482 ±  1%  sched_debug.cpu.sched_goidle.max
>      33344 ± 23%    +107.5%      69183 ±  9%  sched_debug.cpu.sched_goidle.stddev
>     513012 ±  1%     +19.1%     610887 ±  1%  sched_debug.cpu.ttwu_count.avg
>     570351 ±  5%     +24.2%     708599 ±  5%  sched_debug.cpu.ttwu_count.max
>      40059 ± 42%    +112.3%      85064 ± 14%  sched_debug.cpu.ttwu_count.stddev
>
>
>
>
> Thanks,
> Xiaolong

Thanks for the report.

I could not reproduce your results, assuming the workload is essentially netperf over loopback.

The problem with synthetic tests like this is that a small change can
cascade into secondary effects.

In your stats I can see a high number of context switches, which a
normal netperf run does not exhibit.
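For reference, a quick way to watch the context-switch rate while the benchmark
runs (just a sketch using standard tools; the vmstat.system.cs figure above
reflects the same counter):

    vmstat 1 10                                  # per-second snapshot; the "cs" column is context switches/s
    perf stat -a -e context-switches sleep 10    # total context switches across all CPUs over a 10s window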

Also, maybe the problem is that you are running in a VM; perhaps some
NUMA pinning issue, or something similar...
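If NUMA placement of the VM is the suspect, one way to rule it out is to pin
the whole guest to a single host node before rerunning; a sketch, assuming
numactl is available on the host and reusing the qemu command line from the
report:

    numactl --hardware                      # inspect the host node/CPU layout
    numactl --cpunodebind=0 --membind=0 \
        qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 6G ...   # keep vCPUs and guest memory on node 0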


Here is a small hack to enable/disable my patch at runtime, without having to
reboot, and thus with less variability: an odd net.core.netdev_max_backlog
value makes linear_payload_sz() always return the 2048-byte linear size (the
pre-patch behaviour), while an even value keeps the patched behaviour (a
linear payload only for the first skb).

edumazet@...ihi:/ssd/git/net-next$ git diff net/ipv4/tcp.c
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3251fe71f39f..55afe7b16016 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1035,7 +1035,7 @@ EXPORT_SYMBOL(tcp_sendpage);
  */
 static int linear_payload_sz(bool first_skb)
 {
-       if (first_skb)
+       if (first_skb || (netdev_max_backlog & 1))
                return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
        return 0;
 }

Then:

lpk51:~# echo 1000 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      45854.74   5.90     5.90     0.337   0.337

 Performance counter stats for 'system wide':

      20005.989547      task-clock (msec)         #    2.000 CPUs utilized
         1,568,042      context-switches          #    0.078 M/sec
                 2      cpu-migrations            #    0.000 K/sec
               121      page-faults               #    0.006 K/sec
    69,260,921,315      cycles                    #    3.462 GHz                      [83.33%]
    49,197,585,858      stalled-cycles-frontend   #   71.03% frontend cycles idle     [83.34%]
    30,069,741,080      stalled-cycles-backend    #   43.42% backend  cycles idle     [66.67%]
    32,341,104,986      instructions              #    0.47  insns per cycle
                                                  #    1.52  stalled cycles per insn  [83.34%]
     5,887,445,456      branches                  #  294.284 M/sec                    [83.34%]
        65,533,193      branch-misses             #    1.11% of all branches          [83.33%]

      10.003057093 seconds time elapsed

lpk51:~# echo 1001 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      45267.63   5.94     5.94     0.344   0.344

 Performance counter stats for 'system wide':

      20006.713062      task-clock (msec)         #    2.000 CPUs utilized
         1,466,531      context-switches          #    0.073 M/sec
                 2      cpu-migrations            #    0.000 K/sec
               120      page-faults               #    0.006 K/sec
    69,743,911,198      cycles                    #    3.486 GHz                      [83.33%]
    49,509,413,005      stalled-cycles-frontend   #   70.99% frontend cycles idle     [83.34%]
    28,186,805,983      stalled-cycles-backend    #   40.41% backend  cycles idle     [66.67%]
    31,662,971,822      instructions              #    0.45  insns per cycle
                                                  #    1.56  stalled cycles per insn  [83.34%]
     5,751,313,010      branches                  #  287.469 M/sec                    [83.34%]
        69,213,717      branch-misses             #    1.20% of all branches          [83.33%]

      10.002924757 seconds time elapsed
