Date:   Tue, 9 Mar 2021 15:56:29 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     kernel test robot <oliver.sang@...el.com>
Cc:     Jann Horn <jannh@...gle.com>, David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        kernel test robot <lkp@...el.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Feng Tang <feng.tang@...el.com>, zhengjun.xing@...el.com
Subject: Re: [mm, slub] 8ff60eb052: stress-ng.rawpkt.ops_per_sec -47.9% regression

Jann,
 it looks like that change of yours had a rather big negative impact
on this load.

On Sun, Feb 28, 2021 at 11:49 PM kernel test robot
<oliver.sang@...el.com> wrote:
>
> FYI, we noticed a -47.9% regression of stress-ng.rawpkt.ops_per_sec due to commit:

Looking at the profile, nothing really obvious stands out, although
some of the numbers imply more polling and less waiting, i.e.:

Lots less context switching:

>      12810 ± 17%     -63.6%       4658 ±  8%  sched_debug.cpu.nr_switches.avg

Less time spent sending packets:

>      68.52            -5.0       63.55        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.packet_snd.sock_sendmsg.__sys_sendto.__x64_sys_sendto
>      69.00            -4.9       64.14        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe

and quite a lot more time spent in what looks like the receive path,
which allocates the packets:

>       5.47 ± 12%      +4.5        9.94 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core.__netif_receive_skb_one_core
>       5.46 ± 12%      +4.5        9.92 ±  2%  perf-profile.calltrace.cycles-pp.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit.__dev_queue_xmit
>       5.37 ± 12%      +4.5        9.88 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit
>       5.18 ± 13%      +4.5        9.71 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core
>       5.08 ± 13%      +4.6        9.66 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit
>      29.57 ±  2%      +5.5       35.12        perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      29.48 ±  2%      +5.6       35.08        perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      26.62 ±  2%      +7.1       33.67        perf-profile.calltrace.cycles-pp.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       8.83 ± 15%      +8.8       17.66 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc
>       8.58 ± 15%      +8.9       17.48 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc
>       9.46 ± 14%      +9.0       18.44 ±  2%  perf-profile.calltrace.cycles-pp.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone
>      10.19 ± 13%      +9.1       19.28 ±  2%  perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv
>      14.31 ±  6%     +14.0       28.27 ±  3%  perf-profile.calltrace.cycles-pp.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
>       9.50 ± 11%     +16.7       26.21 ±  4%  perf-profile.calltrace.cycles-pp.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
>       6.09 ± 18%     +17.8       23.88 ±  5%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free
>       6.37 ± 18%     +17.9       24.23 ±  5%  perf-profile.calltrace.cycles-pp._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram
>       6.92 ± 17%     +18.0       24.94 ±  5%  perf-profile.calltrace.cycles-pp.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg
>       7.03 ± 16%     +18.0       25.06 ±  5%  perf-profile.calltrace.cycles-pp.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom
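
To ground what those receive-path entries correspond to: a packet socket gets a
clone of every matching skb, so packet_rcv() does the skb_clone()/kmem_cache_alloc()
seen above (reached both from the transmit tap, dev_queue_xmit_nit, and from the
receive path, __netif_receive_skb_core), and each recvfrom() ends up in
packet_recvmsg()/skb_free_datagram(), which does the matching kmem_cache_free().
The snippet below is only a hypothetical, minimal userspace receiver to show which
syscall drives that free path; it is not stress-ng's rawpkt code (which, per the
profile, also does the sendto() side), and it needs CAP_NET_RAW to run.

/* Hypothetical illustration, not stress-ng code: every captured packet was
 * cloned for this socket by packet_rcv() (kmem_cache_alloc), and each
 * recvfrom() releases that clone via skb_free_datagram() (kmem_cache_free). */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
	char buf[2048];
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0) {
		perror("socket(AF_PACKET)");
		return 1;
	}
	for (int i = 0; i < 10; i++) {
		ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);

		if (n < 0) {
			perror("recvfrom");
			break;
		}
		printf("got %zd bytes\n", n);
	}
	close(fd);
	return 0;
}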

.. and I think the reason is here:

>      26.26 ±  6%     +23.8       50.07 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock

Look at that +23.8 for _raw_spin_lock, and it really shows up here too:

>      25.86 ±  6%     +23.4       49.22 ±  3%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

I think what is going on is that your change caused the "contention on
the freelist" case to now loop - possibly several times, and
expensively with atomic operations - while you are holding the
'n->list_lock' spinlock in get_partial_node().

End result: contention on the freelist now becomes *huge* contention
on that list_lock instead.

Before, it would release the list lock, and generally then (maybe) try
again. Or more likely just get another page and avoid the contention.
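
To make that concrete, here is a minimal userspace model of the two behaviors.
This is not the kernel's get_partial_node(); every name in it (the lock, the
entry array, try_acquire()) is invented for illustration, and it only models the
degenerate case where every per-entry cmpxchg fails. The point is simply that the
old "break on failure" behavior does one attempt while holding the lock, whereas
"keep scanning" does one attempt per entry on the partial list, so the time the
list_lock is held (and the contention every other CPU sees on it) scales with the
list length.

/* Toy model of the behavior change, NOT kernel code.  All names invented. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NENTRIES 64	/* models the node's partial list length */

static atomic_flag node_lock = ATOMIC_FLAG_INIT;	/* models n->list_lock */
static atomic_int  entry_free[NENTRIES];		/* 1 = freelist still available */

static void lock(void)   { while (atomic_flag_test_and_set(&node_lock)) ; }
static void unlock(void) { atomic_flag_clear(&node_lock); }

/* Models the per-slab cmpxchg that can fail under contention. */
static bool try_acquire(int i)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&entry_free[i], &expected, 0);
}

/* Walk the "partial list" under the node lock.  keep_scanning == false is the
 * old behavior (bail out on the first failed acquire and let the caller grab a
 * fresh page); keep_scanning == true is the new behavior (retry every entry
 * while still holding the lock). */
static int get_partial(bool keep_scanning, int *attempts)
{
	int got = -1;

	*attempts = 0;
	lock();
	for (int i = 0; i < NENTRIES; i++) {
		(*attempts)++;
		if (try_acquire(i)) {
			got = i;
			break;
		}
		if (!keep_scanning)
			break;		/* old: give up quickly, lock held briefly */
		/* new: keep looping, lock stays held */
	}
	unlock();
	return got;
}

int main(void)
{
	int attempts;

	/* Model full contention: every entry's freelist is already taken. */
	for (int i = 0; i < NENTRIES; i++)
		atomic_init(&entry_free[i], 0);

	get_partial(false, &attempts);
	printf("old behavior: %2d acquire attempts under the lock\n", attempts);
	get_partial(true, &attempts);
	printf("new behavior: %2d acquire attempts under the lock\n", attempts);
	return 0;
}

This prints 1 vs. 64 attempts; in the real code each failed attempt is itself a
contended cmpxchg, which is presumably why _raw_spin_lock and the queued-spinlock
slowpath dominate the profile above.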

So when you wrote:

    However, the current code accidentally stops looking at the partial list
    completely in that case.  Especially on kernels without CONFIG_NUMA set,
    this means that get_partial() fails and new_slab_objects() falls back to
    new_slab(), allocating new pages.  This could lead to an unnecessary
    increase in memory fragmentation.

it really looks like this might well have been very intentional
indeed. Or at least very beneficial for _some_ loads.

Comments?

                    Linus
