linux-kernel - Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Message-ID: <20230615232605.GB2915572@maniforge>
Date:   Thu, 15 Jun 2023 18:26:05 -0500
From:   David Vernet <void@...ifault.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        rostedt@...dmis.org, dietmar.eggemann@....com, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
        joshdon@...gle.com, roman.gushchin@...ux.dev, tj@...nel.org,
        kernel-team@...a.com
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 15, 2023 at 03:31:53PM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> > I'll see if I can find a smaller machine and give it a run there too.
> 
> Found a Skylake with 18 cores/36 threads on each socket/LLC and with
> netperf, the contention is still serious.
> 
> "
> $ netserver
> $ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
> $ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
> "
> 
>         53.61%    53.61%  [kernel.vmlinux]            [k] native_queued_spin_lock_slowpath            -      -            
>             |          
>             |--27.93%--sendto
>             |          entry_SYSCALL_64
>             |          do_syscall_64
>             |          |          
>             |           --27.93%--__x64_sys_sendto
>             |                     __sys_sendto
>             |                     sock_sendmsg
>             |                     inet6_sendmsg
>             |                     udpv6_sendmsg
>             |                     udp_v6_send_skb
>             |                     ip6_send_skb
>             |                     ip6_local_out
>             |                     ip6_output
>             |                     ip6_finish_output
>             |                     ip6_finish_output2
>             |                     __dev_queue_xmit
>             |                     __local_bh_enable_ip
>             |                     do_softirq.part.0
>             |                     __do_softirq
>             |                     net_rx_action
>             |                     __napi_poll
>             |                     process_backlog
>             |                     __netif_receive_skb
>             |                     __netif_receive_skb_one_core
>             |                     ipv6_rcv
>             |                     ip6_input
>             |                     ip6_input_finish
>             |                     ip6_protocol_deliver_rcu
>             |                     udpv6_rcv
>             |                     __udp6_lib_rcv
>             |                     udp6_unicast_rcv_skb
>             |                     udpv6_queue_rcv_skb
>             |                     udpv6_queue_rcv_one_skb
>             |                     __udp_enqueue_schedule_skb
>             |                     sock_def_readable
>             |                     __wake_up_sync_key
>             |                     __wake_up_common_lock
>             |                     |          
>             |                      --27.85%--__wake_up_common
>             |                                receiver_wake_function
>             |                                autoremove_wake_function
>             |                                default_wake_function
>             |                                try_to_wake_up
>             |                                |          
>             |                                 --27.85%--ttwu_do_activate
>             |                                           enqueue_task
>             |                                           enqueue_task_fair
>             |                                           |          
>             |                                            --27.85%--_raw_spin_lock_irqsave
>             |                                                      |          
>             |                                                       --27.85%--native_queued_spin_lock_slowpath
>             |          
>              --25.67%--recvfrom
>                        entry_SYSCALL_64
>                        do_syscall_64
>                        __x64_sys_recvfrom
>                        __sys_recvfrom
>                        sock_recvmsg
>                        inet6_recvmsg
>                        udpv6_recvmsg
>                        __skb_recv_udp
>                        |          
>                         --25.67%--__skb_wait_for_more_packets
>                                   schedule_timeout
>                                   schedule
>                                   __schedule
>                                   |          
>                                    --25.66%--pick_next_task_fair
>                                              |          
>                                               --25.65%--swqueue_remove_task
>                                                         |          
>                                                          --25.65%--_raw_spin_lock_irqsave
>                                                                    |          
>                                                                     --25.65%--native_queued_spin_lock_slowpath
> 
> I didn't aggregate the throughput (Trans. Rate per sec) from all these
> clients, but a glance at the results showed that the throughput of
> each client dropped from 4xxxx (NO_SWQUEUE) to 2xxxx (SWQUEUE).
> 
> Thanks,
> Aaron

Ok, it seems the issue was that I wasn't creating enough netperf
clients. I assumed that -n $(nproc) was sufficient. With more clients, I
was able to repro the contention on my 26 core / 52 thread Skylake
client as well:


    41.01%  netperf          [kernel.vmlinux]                                                 [k] queued_spin_lock_slowpath
            |          
             --41.01%--queued_spin_lock_slowpath
                       |          
                        --40.63%--_raw_spin_lock_irqsave
                                  |          
                                  |--21.18%--enqueue_task_fair
                                  |          |          
                                  |           --21.09%--default_wake_function
                                  |                     |          
                                  |                      --21.09%--autoremove_wake_function
                                  |                                |          
                                  |                                 --21.09%--__wake_up_sync_key
                                  |                                           sock_def_readable
                                  |                                           __udp_enqueue_schedule_skb
                                  |                                           udpv6_queue_rcv_one_skb
                                  |                                           __udp6_lib_rcv
                                  |                                           ip6_input
                                  |                                           ipv6_rcv
                                  |                                           process_backlog
                                  |                                           net_rx_action
                                  |                                           |          
                                  |                                            --21.09%--__softirqentry_text_start
                                  |                                                      __local_bh_enable_ip
                                  |                                                      ip6_output
                                  |                                                      ip6_local_out
                                  |                                                      ip6_send_skb
                                  |                                                      udp_v6_send_skb
                                  |                                                      udpv6_sendmsg
                                  |                                                      __sys_sendto
                                  |                                                      __x64_sys_sendto
                                  |                                                      do_syscall_64
                                  |                                                      entry_SYSCALL_64
                                  |          
                                   --19.44%--swqueue_remove_task
                                             |          
                                              --19.42%--pick_next_task_fair
                                                        |          
                                                         --19.42%--schedule
                                                                   |          
                                                                    --19.21%--schedule_timeout
                                                                              __skb_wait_for_more_packets
                                                                              __skb_recv_udp
                                                                              udpv6_recvmsg
                                                                              inet6_recvmsg
                                                                              __x64_sys_recvfrom
                                                                              do_syscall_64
                                                                              entry_SYSCALL_64
    40.87%  netserver        [kernel.vmlinux]                                                 [k] queued_spin_lock_slowpath
            |          
             --40.87%--queued_spin_lock_slowpath
                       |          
                        --40.51%--_raw_spin_lock_irqsave
                                  |          
                                  |--21.03%--enqueue_task_fair
                                  |          |          
                                  |           --20.94%--default_wake_function
                                  |                     |          
                                  |                      --20.94%--autoremove_wake_function
                                  |                                |          
                                  |                                 --20.94%--__wake_up_sync_key
                                  |                                           sock_def_readable
                                  |                                           __udp_enqueue_schedule_skb
                                  |                                           udpv6_queue_rcv_one_skb
                                  |                                           __udp6_lib_rcv
                                  |                                           ip6_input
                                  |                                           ipv6_rcv
                                  |                                           process_backlog
                                  |                                           net_rx_action
                                  |                                           |          
                                  |                                            --20.94%--__softirqentry_text_start
                                  |                                                      __local_bh_enable_ip
                                  |                                                      ip6_output
                                  |                                                      ip6_local_out
                                  |                                                      ip6_send_skb
                                  |                                                      udp_v6_send_skb
                                  |                                                      udpv6_sendmsg
                                  |                                                      __sys_sendto
                                  |                                                      __x64_sys_sendto
                                  |                                                      do_syscall_64
                                  |                                                      entry_SYSCALL_64
                                  |          
                                   --19.48%--swqueue_remove_task
                                             |          
                                              --19.47%--pick_next_task_fair
                                                        schedule
                                                        |          
                                                         --19.38%--schedule_timeout
                                                                   __skb_wait_for_more_packets
                                                                   __skb_recv_udp
                                                                   udpv6_recvmsg
                                                                   inet6_recvmsg
                                                                   __x64_sys_recvfrom
                                                                   do_syscall_64
                                                                   entry_SYSCALL_64

Thanks for the help in getting the repro on my end.

So yes, there is certainly a scalability concern to bear in mind for
swqueue on LLCs with a lot of cores. If you have a lot of tasks that
block and wake in quick succession, e.g. on futexes in a tight loop, I
expect a similar issue would be observed.
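
To make the contention pattern in the traces above concrete, here is a
minimal userspace sketch of a queue shared by every CPU in an LLC. It is
not the code from the patch: the names (struct shared_queue, sq_push,
sq_remove, sq_pop) are hypothetical, and a pthread mutex stands in for
the kernel spinlock. The point is only that the wakeup path and the
remove path both serialize on the same single lock, which is where the
time under _raw_spin_lock_irqsave shows up in the profiles.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a task that can sit on the shared queue. */
struct task {
    struct task *prev, *next;   /* both NULL while the task is not queued */
    int id;
};

/* Hypothetical model of one shared queue per LLC. */
struct shared_queue {
    pthread_mutex_t lock;       /* the single lock every CPU in the LLC takes */
    struct task head;           /* sentinel of a circular doubly linked list */
};

void sq_init(struct shared_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head.prev = q->head.next = &q->head;
}

/* Wakeup side: append the woken task at the tail, under the shared lock. */
void sq_push(struct shared_queue *q, struct task *t)
{
    pthread_mutex_lock(&q->lock);
    t->prev = q->head.prev;
    t->next = &q->head;
    q->head.prev->next = t;
    q->head.prev = t;
    pthread_mutex_unlock(&q->lock);
}

/*
 * Pick side: unlink a specific task if it is still queued. Returns 1 if
 * the task was removed and 0 if it was not on the queue. The lock is
 * taken even when there is nothing to remove.
 */
int sq_remove(struct shared_queue *q, struct task *t)
{
    int removed = 0;

    pthread_mutex_lock(&q->lock);
    if (t->next != NULL) {
        t->prev->next = t->next;
        t->next->prev = t->prev;
        t->prev = t->next = NULL;
        removed = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return removed;
}

/* Idle side: take the oldest queued task, or NULL if the queue is empty. */
struct task *sq_pop(struct shared_queue *q)
{
    struct task *t;

    pthread_mutex_lock(&q->lock);
    t = q->head.next;
    if (t == &q->head) {
        t = NULL;
    } else {
        t->prev->next = t->next;
        t->next->prev = t->prev;
        t->prev = t->next = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return t;
}
```

In this model every wakeup and every pick on any CPU of the LLC goes
through q->lock, so the number of contenders grows with the number of
cores that share the queue.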

On the other hand, the issue did not occur on my 7950X. I also wasn't
able to repro the contention on the Skylake if I ran with the default
netperf workload rather than UDP_RR (even with the additional clients).
I didn't bother to take the mean of all of the throughput results
between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.

So swqueue isn't ideal for every configuration, but I'll echo my
sentiment from [0] that this shouldn't on its own necessarily preclude
it from being merged given that it does help a large class of
configurations and workloads, and it's disabled by default.

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

Thanks,
David