lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230615044917.GA109334@ziqianlu-dell>
Date:   Thu, 15 Jun 2023 12:49:17 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     David Vernet <void@...ifault.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        <linux-kernel@...r.kernel.org>, <mingo@...hat.com>,
        <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
        <rostedt@...dmis.org>, <dietmar.eggemann@....com>,
        <bsegall@...gle.com>, <mgorman@...e.de>, <bristot@...hat.com>,
        <vschneid@...hat.com>, <joshdon@...gle.com>,
        <roman.gushchin@...ux.dev>, <tj@...nel.org>, <kernel-team@...a.com>
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Hi David,

On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote:
> Hi Aaron,
> 
> Thanks for taking a look and running some benchmarks. I tried to
> reproduce your results on a 26 core / 52 thread Cooperlake host, and a
> 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
> contention on the per-swqueue spinlock you were saw on your Ice Lake.
> 
> I ran the following netperf benchmarks:
> 
> - netperf -l 60 -n $(nproc) -6
> - netperf -l 60 -n $(nproc) -6 -t UDP_RR

Just to confirm: did you run multiple of the above or just one?

> 
> Here are the results I'm seeing from taking the average of running those
> benchmarks three times on each host (all results are from the "Throughput"
> column of the below table).
> 
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput <<<
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 
> 26 / 52 x 1 Cooperlake
> ----------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 34103.76
> SWQUEUE:    34707.46
> Delta: +1.77%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 57695.29
> SWQUEUE: 54827.36
> Delta: -4.97%
> 
> There's clearly a statistically significant decline in performance for
> UDP_RR here, but surprisingly, I don't see any contention on swqueue
> locks when I profile with perf. Rather, we seem to be contending on the
> rq lock (presumably in swqueue_pick_next_task() which is unexpected (to
> me, at least):

I don't see rq lock contention though and since you have seen some
contention here, I suppose you should have started multiple instances of
netperf client?

> 
>      7.81%  netperf  [kernel.vmlinux]                                           [k] queued_spin_lock_slowpath
>             |          
>              --7.81%--queued_spin_lock_slowpath
>                        |          
>                         --7.55%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
>      4.97%  netperf  [kernel.vmlinux]                                           [k] _raw_spin_lock_irqsave
>             |          
>              --4.96%--_raw_spin_lock_irqsave
>                        |          
>                        |--1.13%--prepare_to_wait_exclusive
>                        |          __skb_wait_for_more_packets
>                        |          __skb_recv_udp
>                        |          udpv6_recvmsg
>                        |          inet6_recvmsg
>                        |          __x64_sys_recvfrom
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_recvfrom
>                        |          0x7f923fdfacf0
>                        |          0
>                        |          
>                        |--0.92%--default_wake_function
>                        |          autoremove_wake_function
>                        |          __wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.81%--__wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.73%--enqueue_task_fair
>                        |          |          
>                        |           --0.72%--default_wake_function
>                        |                     autoremove_wake_function
>                        |                     __wake_up_sync_key
>                        |                     sock_def_readable
>                        |                     __udp_enqueue_schedule_skb
>                        |                     udpv6_queue_rcv_one_skb
>                        |                     __udp6_lib_rcv
>                        |                     ip6_input
>                        |                     ipv6_rcv
>                        |                     process_backlog
>                        |                     net_rx_action
>                        |                     __softirqentry_text_start
>                        |                     __local_bh_enable_ip
>                        |                     ip6_output
>                        |                     ip6_local_out
>                        |                     ip6_send_skb
>                        |                     udp_v6_send_skb
>                        |                     udpv6_sendmsg
>                        |                     __sys_sendto
>                        |                     __x64_sys_sendto
>                        |                     do_syscall_64
>                        |                     entry_SYSCALL_64
>                        |                     __libc_sendto
>                        |          
>                         --0.68%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
> 
> The profile without swqueue doesn't show any contention on the rq lock,
> but does show us spending a good amount of time in the scheduler:
> 
>      4.03%  netperf  [kernel.vmlinux]                                           [k] default_wake_function
>             |          
>              --3.98%--default_wake_function
>                        autoremove_wake_function
>                        __wake_up_sync_key
>                        sock_def_readable
>                        __udp_enqueue_schedule_skb
>                        udpv6_queue_rcv_one_skb
>                        __udp6_lib_rcv
>                        ip6_input
>                        ipv6_rcv
>                        process_backlog
>                        net_rx_action
>                        __softirqentry_text_start
>                        __local_bh_enable_ip
>                        ip6_output
>                        ip6_local_out
>                        ip6_send_skb
>                        udp_v6_send_skb
>                        udpv6_sendmsg
>                        __sys_sendto
>                        __x64_sys_sendto
>                        do_syscall_64
>                        entry_SYSCALL_64
>                        __libc_sendto
>      3.70%  netperf  [kernel.vmlinux]                                           [k] enqueue_entity
>             |          
>              --3.66%--enqueue_entity
>                        enqueue_task_fair
>                        |          
>                         --3.66%--default_wake_function
>                                   autoremove_wake_function
>                                   __wake_up_sync_key
>                                   sock_def_readable
>                                   __udp_enqueue_schedule_skb
>                                   udpv6_queue_rcv_one_skb
>                                   __udp6_lib_rcv
>                                   ip6_input
>                                   ipv6_rcv
>                                   process_backlog
>                                   net_rx_action
>                                   __softirqentry_text_start
>                                   __local_bh_enable_ip
>                                   ip6_output
>                                   ip6_local_out
>                                   ip6_send_skb
>                                   udp_v6_send_skb
>                                   udpv6_sendmsg
>                                   __sys_sendto
>                                   __x64_sys_sendto
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_sendto
> 
> There's clearly a measurable impact on performance here due to swqueue
> (negative for UDP_RR, positive for the default workload), but this looks
> quite different than what you were observing.

Yes.

> 
> 20 / 40 x 2 Skylake
> -------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 57437.45
> SWQUEUE: 58801.11
> Delta: +2.37%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 75932.28
> SWQUEUE: 75232.77
> Delta: -.92%
> 
> Same observation here. I didn't collect a profile, but the trend seems
> consistent, and there's clearly a tradeoff. Despite the small drop in
> perf for UDP_RR, it seems quite a bit less drastic than what would be
> expected with the contention you showed in your profile.

Indeed.

> 
> 7950X (8 cores / 16 threads per CCX, 2 CCXs)
> --------------------------------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 77615.08
> SWQUEUE: 77465.73
> Delta: -.19%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 230258.75
> SWQUEUE: 230280.34
> Delta: ~0%
> 
> I'd call this essentially a no-op.
> 
 
> With all that said, I have a few thoughts.
> 
> Firstly, would you mind please sharing your .config? It's possible that
> the hosts I'm testing on just don't have big enough LLCs to observe the
> contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
> is smaller than a single socket of your 32/64 ICL host. On the other
> hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to

Agreed. Your 26cores/52threads isn't that smaller than mine and I had
expected to see something similar.

> see if there's a .config difference here that's causing the contention.

Attached my .config.
I also pushed the branch I used for testing to github just in case you
want to take a look: https://github.com/aaronlu/linux.git swqueue

> Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
> also saw a performance hit with swqueue for UDP_RR suggests to me that
> the issue with UDP_RR is not the scalability of the per-LLC swqueue
> spinlock.

I've tested again today and I still saw serious contention on
swqueue->lock with your cmdline. I did it this way:
"
$ netserver                                                                                      
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC

$ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done
"
And the profile is about the same as my last posting:

    83.23%    83.18%  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath            -      -            
            |          
            |--40.23%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --40.22%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --40.22%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                |          
            |                                 --40.22%--_raw_spin_lock_irqsave
            |                                           |          
            |                                            --40.21%--native_queued_spin_lock_slowpath
            |          
            |--38.59%--recvfrom
            |          |          
            |           --38.59%--entry_SYSCALL_64
            |                     do_syscall_64
            |                     __x64_sys_recvfrom
            |                     __sys_recvfrom
            |                     sock_recvmsg
            |                     inet6_recvmsg
            |                     udpv6_recvmsg
            |                     __skb_recv_udp
            |                     __skb_wait_for_more_packets
            |                     schedule_timeout
            |                     schedule
            |                     __schedule
            |                     |          
            |                      --38.59%--pick_next_task_fair
            |                                |          
            |                                 --38.59%--swqueue_remove_task
            |                                           |          
            |                                            --38.59%--_raw_spin_lock_irqsave
            |                                                      |          
            |                                                       --38.58%--native_queued_spin_lock_slowpath
            |          
            |--2.25%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --2.25%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --2.25%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                _raw_spin_lock_irqsave
            |                                |          
            |                                 --2.25%--native_queued_spin_lock_slowpath
            |          
             --2.10%--recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet6_recvmsg
                       udpv6_recvmsg
                       __skb_recv_udp
                       __skb_wait_for_more_packets
                       schedule_timeout
                       schedule
                       __schedule
                       |          
                        --2.09%--pick_next_task_fair
                                  swqueue_remove_task
                                  _raw_spin_lock_irqsave
                                  |          
                                   --2.09%--native_queued_spin_lock_slowpath

During the test, vmstat showed lines like this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
185 0      0 128555784  47348 1791244    0    0     0     0  32387 4380484  1 98  0  0  0

When swqueue is disabled, vmstat showed lines like this and there is no
lock contention:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
116 0      0 128977136  58324 1338760    0    0     0    16 599677 9734395  5 72 23  0  0

The runnable tasks are a lot more when swqueue is in use and sys% also
increased, perhaps due to the lock contention.

I'll see if I can find a smaller machine and give it a run there too.

Thanks,
Aaron

Download attachment "config-6.4.0-rc5-00237-g06b8769b15ae.gz" of type "application/gzip" (67409 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ