Date:   Wed, 9 May 2018 12:02:34 -0700
From:   Ben Greear <greearb@...delatech.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: Performance regression between 4.13 and 4.14

On 05/09/2018 11:48 AM, Eric Dumazet wrote:
>
>
> On 05/09/2018 11:43 AM, Ben Greear wrote:
>> On 05/08/2018 10:10 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 05/08/2018 09:44 AM, Ben Greear wrote:
>>>> Hello,
>>>>
>>>> I am trying to track down a performance regression that appears to be between 4.13
>>>> and 4.14.
>>>>
>>>> I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  4.13 can do
>>>> right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps.
>>>>
>>>> I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx
>>>> on 4.14 and about 4.4Gbps on 4.13.
>>>>
>>>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits
>>>> crash or do not bring up networking, which makes the bisect difficult.
>>>>
>>>> Looking at perf top, it would appear that some lock is probably to blame.
>>>
>>>
>>> perf record -a -g -e cycles:pp sleep 5
>>> perf report
>>>
>>> Then you'll be able to tell us which lock (or call graph) is killing your perf.
>>>
>>
>> I seem to be chasing multiple issues.  For 4.13, at least part of my problem was that LOCKDEP was enabled
>> during my bisect, though it does NOT appear to be enabled in 4.16.  I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
>> in 4.16, or something like that?  My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:
>>
>> [greearb@...-dt3 linux-4.16.x64]$ grep LOCKDEP .config
>> CONFIG_LOCKDEP_SUPPORT=y
>>
>>
>> For 4.16, I am disabling RETPOLINE.  Are there any other such things I need
>> to disable to keep from taking a performance hit from the spectre-related bug
>> fixes?  At this point, I do not care about the security implications.
>>
>> [greearb@...-dt3 linux-4.16.x64]$ grep RETPO .config
>> # CONFIG_RETPOLINE is not set
>>
>>
>> Thanks,
>> Ben
>>
>
> No idea really, you mention a 4.13 -> 4.14 regression and then jump to 4.16 :/

I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.

4.13 works, but only when I use a .config I originally built for 4.13, not the 4.16 .config
that I ended up using during the bisect (make oldconfig, accepting all defaults).  I originally
configured 4.16 with a .config that had lockdep enabled, then manually tried to disable it
through 'make xconfig'.  I think that must have left "CONFIG_LOCKDEP=y" in the .config, which
perhaps screws up older builds during the bisect?
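For what it's worth, something like the following per bisect step would keep a stale
CONFIG_LOCKDEP=y from leaking between builds.  This is only a sketch: the sample .config
is illustrative, and in a real tree you would follow the edit with 'make olddefconfig'
so Kconfig re-resolves dependencies.

```shell
# Sketch: knock the lockdep-related options out of a .config before each
# bisect build, so a stale CONFIG_LOCKDEP=y cannot leak between versions.
# The sample .config below is illustrative only.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
CONFIG_LOCKDEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_RETPOLINE=y
EOF
for opt in LOCKDEP PROVE_LOCKING DEBUG_LOCK_ALLOC LOCK_STAT; do
    sed -i "s/^CONFIG_${opt}=y$/# CONFIG_${opt} is not set/" "$cfg"
done
grep LOCKDEP "$cfg"    # -> "# CONFIG_LOCKDEP is not set"
rm -f "$cfg"
```

(CONFIG_LOCKDEP is selected by the PROVE_LOCKING / DEBUG_LOCK_ALLOC / LOCK_STAT family,
so disabling those is what actually keeps it off once Kconfig re-resolves dependencies.)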

> Before doing a (painful) bisection, the perf output would immediately tell you if
> something is really wrong with your .config.

I didn't realize lockdep might be an issue at the time, but here is a 'bad' run from
a 4.13+ kernel (with pktgen hacks).  I guess lockdep is why this runs slowly, but I see no obvious
proof of that in the output:

4.13+, patched pktgen, 6Gbps throughput, on commit 906dde0f355bd97c080c215811ae7db1137c4af8

Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
   Children      Self  Command          Shared Object                Symbol
+   87.97%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] ret_from_fork
+   87.97%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] kthread
+   86.89%     5.42%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_thread_worker
+   33.75%     0.18%  kpktgend_1       [kernel.kallsyms]            [k] getnstimeofday64
+   32.77%     4.47%  kpktgend_1       [kernel.kallsyms]            [k] __getnstimeofday64
+   24.60%    10.91%  kpktgend_1       [kernel.kallsyms]            [k] lock_acquire
+   23.59%     0.03%  kpktgend_1       [kernel.kallsyms]            [k] __do_softirq
+   23.55%     0.07%  kpktgend_1       [kernel.kallsyms]            [k] net_rx_action
+   22.29%     0.47%  kpktgend_1       [kernel.kallsyms]            [k] getRelativeCurNs
+   21.33%     1.71%  kpktgend_1       [kernel.kallsyms]            [k] ixgbe_poll
+   15.79%     0.02%  kpktgend_1       [kernel.kallsyms]            [k] ret_from_intr
+   15.78%     0.01%  kpktgend_1       [kernel.kallsyms]            [k] do_IRQ
+   15.34%     0.01%  kpktgend_1       [kernel.kallsyms]            [k] irq_exit
+   13.95%    10.00%  kpktgend_1       [kernel.kallsyms]            [k] ip_send_check
+   13.80%    13.80%  kpktgend_1       [kernel.kallsyms]            [k] __lock_acquire.isra.31
+   12.98%     0.53%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_finalize_skb
+   12.31%     0.20%  kpktgend_1       [kernel.kallsyms]            [k] timestamp_skb.isra.24
+   11.68%     0.13%  kpktgend_1       [kernel.kallsyms]            [k] napi_gro_receive
+   11.36%     0.25%  kpktgend_1       [kernel.kallsyms]            [k] netif_receive_skb_internal
+   10.93%     0.00%  swapper          [kernel.kallsyms]            [k] verify_cpu
+   10.93%     0.00%  swapper          [kernel.kallsyms]            [k] cpu_startup_entry
+   10.92%     0.02%  swapper          [kernel.kallsyms]            [k] do_idle
+   10.71%     0.00%  swapper          [kernel.kallsyms]            [k] cpuidle_enter
+   10.71%     0.00%  swapper          [kernel.kallsyms]            [k] call_cpuidle
+   10.66%     0.06%  swapper          [kernel.kallsyms]            [k] cpuidle_enter_state
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] x86_64_start_kernel
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] x86_64_start_reservations
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] start_kernel
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] rest_init
+    9.25%     0.00%  swapper          [kernel.kallsyms]            [k] ret_from_intr
+    9.25%     0.00%  swapper          [kernel.kallsyms]            [k] do_IRQ
+    9.24%     0.30%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_alloc_skb
+    9.14%     0.00%  swapper          [kernel.kallsyms]            [k] irq_exit
+    8.89%     0.00%  swapper          [kernel.kallsyms]            [k] __do_softirq
+    8.82%     0.01%  swapper          [kernel.kallsyms]            [k] net_rx_action
+    8.80%     0.47%  kpktgend_1       [kernel.kallsyms]            [k] __local_bh_enable_ip
+    8.67%     6.87%  kpktgend_1       [kernel.kallsyms]            [k] lock_release
+    8.57%     1.10%  kpktgend_1       [kernel.kallsyms]            [k] __netdev_alloc_skb
+    8.28%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] do_softirq.part.17
+    8.28%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] do_softirq_own_stack
+    8.01%     0.74%  swapper          [kernel.kallsyms]            [k] ixgbe_poll
+    6.85%     2.28%  kpktgend_1       [kernel.kallsyms]            [k] __build_skb
+    4.89%     0.07%  kpktgend_1       [kernel.kallsyms]            [k] __netif_receive_skb
+    4.87%     0.75%  kpktgend_1       [kernel.kallsyms]            [k] __netif_receive_skb_core
+    4.66%     0.16%  kpktgend_1       [kernel.kallsyms]            [k] _raw_spin_lock
+    4.37%     0.06%  swapper          [kernel.kallsyms]            [k] napi_gro_receive
+    4.23%     0.10%  swapper          [kernel.kallsyms]            [k] netif_receive_skb_internal



Here is a 'bad' run of 4.16.0 plus my full out-of-tree patches, including my normal pktgen changes
(the pktgen code should be nearly identical between the run above and the run here):

Samples: 26K of event 'cycles:pp', Event count (approx.): 19888249841
   Children      Self  Command          Shared Object                 Symbol
+   74.24%     5.38%  kpktgend_1       [kernel.kallsyms]             [k] pktgen_thread_worker
+   73.89%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] ret_from_fork
+   73.89%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] kthread
+   27.31%     0.25%  kpktgend_1       [kernel.kallsyms]             [k] getnstimeofday64
+   26.15%     4.24%  kpktgend_1       [kernel.kallsyms]             [k] __getnstimeofday64
+   22.64%     0.02%  kpktgend_1       [kernel.kallsyms]             [k] __softirqentry_text_s
+   22.62%     0.06%  kpktgend_1       [kernel.kallsyms]             [k] net_rx_action
+   22.59%     2.04%  kpktgend_1       [kernel.kallsyms]             [k] ixgbe_poll
+   21.49%     7.92%  kpktgend_1       [kernel.kallsyms]             [k] lock_acquire
+   17.78%    17.78%  kpktgend_1       [kernel.kallsyms]             [k] __lock_acquire.isra.3
+   15.95%     0.39%  kpktgend_1       [kernel.kallsyms]             [k] getRelativeCurNs
+   14.49%     0.02%  kpktgend_1       [kernel.kallsyms]             [k] ret_from_intr
+   14.48%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_IRQ
+   14.35%     0.01%  kpktgend_1       [kernel.kallsyms]             [k] irq_exit
+   13.46%     9.90%  kpktgend_1       [kernel.kallsyms]             [k] ip_send_check
+   11.36%     0.18%  kpktgend_1       [kernel.kallsyms]             [k] napi_gro_receive
+   11.25%     0.17%  kpktgend_1       [kernel.kallsyms]             [k] netif_receive_skb_int
+   10.37%     0.16%  kpktgend_1       [kernel.kallsyms]             [k] timestamp_skb.isra.25
+    8.85%     0.45%  kpktgend_1       [kernel.kallsyms]             [k] __local_bh_enable_ip
+    8.61%     0.02%  swapper          [kernel.kallsyms]             [k] do_idle
+    8.60%     0.00%  swapper          [kernel.kallsyms]             [k] secondary_startup_64
+    8.60%     0.00%  swapper          [kernel.kallsyms]             [k] cpu_startup_entry
+    8.37%     0.05%  swapper          [kernel.kallsyms]             [k] cpuidle_enter_state
+    8.35%     7.57%  kpktgend_1       [kernel.kallsyms]             [k] lock_release
+    8.32%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_softirq.part.16
+    8.31%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_softirq_own_stack
+    7.16%     0.36%  kpktgend_1       [kernel.kallsyms]             [k] pktgen_alloc_skb
+    7.09%     1.14%  kpktgend_1       [kernel.kallsyms]             [k] __netdev_alloc_skb
+    6.87%     0.00%  swapper          [kernel.kallsyms]             [k] start_kernel
+    6.55%     0.00%  swapper          [kernel.kallsyms]             [k] ret_from_intr
+    6.55%     0.01%  swapper          [kernel.kallsyms]             [k] do_IRQ
+    6.51%     0.01%  swapper          [kernel.kallsyms]             [k] irq_exit
+    6.34%     0.02%  swapper          [kernel.kallsyms]             [k] __softirqentry_text_s
+    6.25%     0.01%  swapper          [kernel.kallsyms]             [k] net_rx_action
+    6.06%     0.45%  swapper          [kernel.kallsyms]             [k] ixgbe_poll
+    5.35%     2.11%  kpktgend_1       [kernel.kallsyms]             [k] __build_skb


I tried testing with stock kernels and user-space UDP traffic, but results varied widely
from run to run, so that was useless for bisecting.

I have irqbalance disabled on this system so that I can pin the pktgen threads and ixgbe IRQs
for efficiency, so maybe that is why user-space UDP traffic performs so unreliably.  My hacked pktgen
reliably performs well on my good 4.13 kernel and reliably badly on the later kernels I have compiled
so far, so I decided it was the only useful way to bisect this problem.
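(For reference, the pinning itself is just the usual smp_affinity mask dance; here is a
sketch of computing the hex mask for a given CPU.  The IRQ number below is made up.)

```shell
# Compute the hex mask that /proc/irq/<N>/smp_affinity expects for one CPU.
cpu=1
mask=$(printf '%x' $((1 << cpu)))
echo "$mask"    # -> 2 (bit 1 set, i.e. CPU 1)
# On the live system this is applied roughly as (IRQ 123 is made up):
#   echo "$mask" > /proc/irq/123/smp_affinity
#   taskset -pc "$cpu" "$(pgrep kpktgend_1)"
```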

I will now try 4.14 and higher again with lockdep definitely disabled.

Thanks,
Ben


-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
