linux-kernel - Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Message-ID: <20230614043529.GA1942@ziqianlu-dell>
Date:   Wed, 14 Jun 2023 12:35:29 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     David Vernet <void@...ifault.com>,
        Peter Zijlstra <peterz@...radead.org>
CC:     <linux-kernel@...r.kernel.org>, <mingo@...hat.com>,
        <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
        <rostedt@...dmis.org>, <dietmar.eggemann@....com>,
        <bsegall@...gle.com>, <mgorman@...e.de>, <bristot@...hat.com>,
        <vschneid@...hat.com>, <joshdon@...gle.com>,
        <roman.gushchin@...ux.dev>, <tj@...nel.org>, <kernel-team@...a.com>
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> 
> Still gotta read it properly, however:
> 
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> 
> Could you please also benchmark on something Intel that has these stupid
> large LLCs ?
> 
> Because the last time I tried something like this, it came apart real
> quick. And AMD has these relatively small 8-core LLCs.

I tested on an Intel(R) Xeon(R) Platinum 8358 machine, which has 2 sockets;
each socket has a single LLC shared by 32 cores/64 threads.

When running netperf with nr_threads=128 and runtime=60:

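"
nr_threads=128	# number of concurrent netperf clients
runtime=60	# duration of each netperf run, in seconds

netserver -4

for i in $(seq $nr_threads); do
	netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
done

wait
"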

The lock contention due to the per-LLC swqueue->lock is quite serious:

    83.39%    83.33%  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath                                  -      -            
            |          
            |--42.86%--__libc_sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --42.81%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet_sendmsg
            |                     udp_sendmsg
            |                     udp_send_skb
            |                     ip_send_skb
            |                     ip_output
            |                     ip_finish_output
            |                     __ip_finish_output
            |                     ip_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     |          
            |                      --42.81%--net_rx_action
            |                                __napi_poll
            |                                process_backlog
            |                                __netif_receive_skb
            |                                __netif_receive_skb_one_core
            |                                ip_rcv
            |                                ip_local_deliver
            |                                ip_local_deliver_finish
            |                                ip_protocol_deliver_rcu
            |                                udp_rcv
            |                                __udp4_lib_rcv
            |                                udp_unicast_rcv_skb
            |                                udp_queue_rcv_skb
            |                                udp_queue_rcv_one_skb
            |                                __udp_enqueue_schedule_skb
            |                                sock_def_readable
            |                                __wake_up_sync_key
            |                                __wake_up_common_lock
            |                                |          
            |                                 --42.81%--__wake_up_common
            |                                           receiver_wake_function
            |                                           autoremove_wake_function
            |                                           default_wake_function
            |                                           try_to_wake_up
            |                                           ttwu_do_activate
            |                                           enqueue_task
            |                                           enqueue_task_fair
            |                                           _raw_spin_lock_irqsave
            |                                           |          
            |                                            --42.81%--native_queued_spin_lock_slowpath
            |          
            |--20.39%--0
            |          __libc_recvfrom
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          __x64_sys_recvfrom
            |          __sys_recvfrom
            |          sock_recvmsg
            |          inet_recvmsg
            |          udp_recvmsg
            |          __skb_recv_udp
            |          __skb_wait_for_more_packets
            |          schedule_timeout
            |          schedule
            |          __schedule
            |          pick_next_task_fair
            |          |          
            |           --20.39%--swqueue_remove_task
            |                     _raw_spin_lock_irqsave
            |                     |          
            |                      --20.39%--native_queued_spin_lock_slowpath
            |          
             --20.07%--__libc_recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet_recvmsg
                       udp_recvmsg
                       __skb_recv_udp
                       __skb_wait_for_more_packets
                       schedule_timeout
                       schedule
                       __schedule
                       |          
                        --20.06%--pick_next_task_fair
                                  swqueue_remove_task
                                  _raw_spin_lock_irqsave
                                  |          
                                   --20.06%--native_queued_spin_lock_slowpath

I suppose that is because there are too many CPUs in a single LLC on
this machine: when all of these CPUs try to queue/pull tasks in this
per-LLC shared wakequeue at the same time, they all serialize on the
one lock and it just doesn't scale well.
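
To make the shape of the problem concrete, here is a minimal user-space
sketch of one queue guarded by one spinlock. It is only a model under my
own assumptions, not the code from the patch: every name (model_swqueue,
model_swqueue_push, model_swqueue_remove, model_swqueue_pull, nr_locked)
is made up for illustration, and a C11 atomic_flag stands in for
swqueue->lock. The point is simply that the wakeup side and the
pick-next-task side both have to take the same lock, which matches the
two groups of callers in the profile above.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a task; only the list linkage matters here. */
struct model_task {
	struct model_task *prev, *next;	/* self-linked when not queued */
};

/* Hypothetical model of a per-LLC shared queue with a single lock. */
struct model_swqueue {
	atomic_flag lock;		/* stands in for swqueue->lock */
	struct model_task head;		/* sentinel of a circular list */
	unsigned long nr_locked;	/* number of lock acquisitions */
};

void model_lock(struct model_swqueue *q)
{
	/* Every CPU sharing the queue spins here when the lock is held. */
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;
	q->nr_locked++;
}

void model_unlock(struct model_swqueue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

void model_task_init(struct model_task *t)
{
	t->prev = t->next = t;
}

void model_swqueue_init(struct model_swqueue *q)
{
	atomic_flag_clear(&q->lock);
	q->head.prev = q->head.next = &q->head;
	q->nr_locked = 0;
}

/* Wakeup side: append the woken task to the tail of the shared queue. */
void model_swqueue_push(struct model_swqueue *q, struct model_task *t)
{
	model_lock(q);
	t->prev = q->head.prev;
	t->next = &q->head;
	q->head.prev->next = t;
	q->head.prev = t;
	model_unlock(q);
}

/* Pick side: the task is about to run, so take it off the shared queue. */
void model_swqueue_remove(struct model_swqueue *q, struct model_task *t)
{
	model_lock(q);
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->prev = t->next = t;	/* no-op unlink if it was not queued */
	model_unlock(q);
}

/* Pull side: take the oldest queued task, or NULL if the queue is empty. */
struct model_task *model_swqueue_pull(struct model_swqueue *q)
{
	struct model_task *t = NULL;

	model_lock(q);
	if (q->head.next != &q->head) {
		t = q->head.next;
		t->prev->next = t->next;
		t->next->prev = t->prev;
		t->prev = t->next = t;
	}
	model_unlock(q);
	return t;
}
```

In this model every push, remove and pull costs one lock acquisition no
matter which CPU does it, so with 64 CPUs behind one LLC and a workload
that wakes and sleeps as often as UDP_RR, all of them end up spinning in
model_lock().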