Message-ID: <20230412114240.GA155547@ziqianlu-desk2>
Date:   Wed, 12 Apr 2023 19:42:40 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        <linux-kernel@...r.kernel.org>, Olivier Dion <odion@...icios.com>,
        <michael.christie@...cle.com>
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
 mm_cid

On Wed, Apr 12, 2023 at 11:10:43AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2023 at 09:12:21PM +0800, Aaron Lu wrote:
> 
> > Forget about this "v4 is better than v2 and v3" part; my later test
> > showed the contention can also rise to around 18% for v4.
> 
> So while I can reproduce the initial regression on a HSW-EX system
> (4*18*2) and get lovely things like:
> 
>   34.47%--schedule_hrtimeout_range_clock
>           schedule
>           |
>           --34.42%--__schedule
>                     |
>                     |--31.86%--_raw_spin_lock
>                     |          |
>                     |           --31.65%--native_queued_spin_lock_slowpath
>                     |
>                     --0.72%--dequeue_task_fair
>                              |
>                              --0.60%--dequeue_entity
> 
> On a --threads=144 run, it is completely gone when I use v4:
> 
>   6.92%--__schedule
>          |
>          |--2.16%--dequeue_task_fair
>          |          |
>          |           --1.69%--dequeue_entity
>          |                     |
>          |                     |--0.61%--update_load_avg
>          |                     |
>          |                      --0.54%--update_curr
>          |
>          |--1.30%--pick_next_task_fair
>          |          |
>          |           --0.54%--set_next_entity
>          |
>          |--0.77%--psi_task_switch
>          |
>          --0.69%--switch_mm_irqs_off
> 
> 
> :-(

Hmm... I also tested on a 2 sockets/64 cores/128 CPUs Icelake: with
vanilla v6.3-rc6 the contention is about 20%-48%, and after applying
v4 the contention is gone.

But it's still there on a 2 sockets/112 cores/224 CPUs Sapphire Rapids
(SPR) with v4 (and v2, v3)...:

    18.38%     1.24%  [kernel.vmlinux]                           [k] __schedule
            |
            |--17.14%--__schedule
            |          |
            |          |--10.63%--mm_cid_get
            |          |          |
            |          |           --10.22%--_raw_spin_lock
            |          |                     |
            |          |                      --10.07%--native_queued_spin_lock_slowpath
            |          |
            |          |--3.43%--dequeue_task
            |          |          |
            |          |           --3.39%--dequeue_task_fair
            |          |                     |
            |          |                     |--2.60%--dequeue_entity
            |          |                     |          |
            |          |                     |          |--1.22%--update_cfs_group
            |          |                     |          |
            |          |                     |           --1.05%--update_load_avg
            |          |                     |
            |          |                      --0.63%--update_cfs_group
            |          |
            |          |--0.68%--switch_mm_irqs_off
            |          |
            |          |--0.60%--finish_task_switch.isra.0
            |          |
            |           --0.50%--psi_task_switch
            |
             --0.53%--0x55a8385c088b

It's much better than the initial 70% contention on SPR, of course.
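
For reference, the pattern these profiles show is every context switch
taking a lock shared by all threads of the process, so time spent in
the slowpath grows with the number of CPUs contending it (128 CPUs on
Icelake vs 224 on SPR). Here is a minimal userspace sketch of that
pattern. It is not the kernel code: the file name, the pthread
spinlock and the thread/loop counts are all arbitrary choices of mine,
just to illustrate why the same lock taken on every "switch" scales
badly with CPU count:

/*
 * cid-contend.c: userspace analogue of the contention pattern above.
 * Every simulated "context switch" takes one lock shared by all
 * threads of the process.
 *
 * Build: gcc -O2 -pthread cid-contend.c -o cid-contend
 * Run:   ./cid-contend <nthreads>   (defaults to the online CPU count)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_spinlock_t cid_lock;	/* stand-in for the shared per-mm lock */
static unsigned long cid_bitmap;	/* stand-in for the cid bitmap */

static void *worker(void *arg)
{
	long iters = (long)arg;

	for (long i = 0; i < iters; i++) {
		/* "schedule in": grab a cid under the shared lock */
		pthread_spin_lock(&cid_lock);
		cid_bitmap |= 1UL;		/* token work under the lock */
		pthread_spin_unlock(&cid_lock);

		/* "run": a little work outside the lock */
		for (volatile int j = 0; j < 64; j++)
			;

		/* "schedule out": put the cid back under the same lock */
		pthread_spin_lock(&cid_lock);
		cid_bitmap &= ~1UL;
		pthread_spin_unlock(&cid_lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) :
			(int)sysconf(_SC_NPROCESSORS_ONLN);
	long iters = 1000000;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	pthread_spin_init(&cid_lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, (void *)iters);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	printf("%d threads done\n", nthreads);
	return 0;
}

Running it under perf record -g with the thread count stepped up
towards the CPU count, the share of cycles in the spinlock should
climb the same way the mm_cid_get numbers above do; that's an
expectation from the pattern, not something I measured.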

BTW, I found hackbench can also show this problem on both Icelake and SPR.

With v4, on SPR:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.

    40.89%     7.71%  [kernel.vmlinux]            [k] __schedule
            |
            |--33.19%--__schedule
            |          |
            |          |--22.25%--mm_cid_get
            |          |          |
            |          |           --18.78%--_raw_spin_lock
            |          |                     |
            |          |                      --18.46%--native_queued_spin_lock_slowpath
            |          |
            |          |--7.46%--finish_task_switch.isra.0
            |          |          |
            |          |           --0.52%--asm_sysvec_call_function_single
            |          |                     sysvec_call_function_single
            |          |
            |          |--0.95%--dequeue_task
            |          |          |
            |          |           --0.93%--dequeue_task_fair
            |          |                     |
            |          |                      --0.76%--dequeue_entity
            |          |
            |           --0.75%--debug_smp_processor_id
            |


With v4, on Icelake:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.

    25.83%     4.11%  [kernel.kallsyms]  [k] __schedule
            |
            |--21.72%--__schedule
            |          |
            |          |--11.68%--mm_cid_get
            |          |          |
            |          |           --9.36%--_raw_spin_lock
            |          |                     |
            |          |                      --9.09%--native_queued_spin_lock_slowpath
            |          |
            |          |--3.80%--finish_task_switch.isra.0
            |          |          |
            |          |           --0.70%--asm_sysvec_call_function_single
            |          |                     |
            |          |                      --0.69%--sysvec_call_function_single
            |          |
            |          |--2.58%--dequeue_task
            |          |          |
            |          |           --2.53%--dequeue_task_fair

I *guess* you might be able to see some contention with hackbench on
that HSW-EX system with v4.
