linux-kernel - Re: [tip: sched/core] sched: Fix performance regression introduced by mm

Message-ID: <794520d1-8cfa-0b81-a8d6-2c2bf4b55eb9@efficios.com>
Date:   Fri, 14 Jul 2023 10:55:32 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        Aaron Lu <aaron.lu@...el.com>, x86@...nel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by
 mm_cid

On 7/14/23 02:02, Swapnil Sapkal wrote:
> Hello Mathieu,
> 
> On 6/22/2023 12:21 AM, Mathieu Desnoyers wrote:
>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>> Hello Mathieu,
>>>
>> [...]
>>>>
>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>
>>>> Please try with this additional patch applied:
>>>>
>>>> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@efficios.com/
>>>
>>> Thanks for the suggestion. I tried out with the patch you suggested.
>>> I am seeing improvement in hackbench numbers with mm_count padding.
>>> But this is not matching with what we achieved through reverting the
>>> new mm_cid patch.
>>>
>>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>>
>>> Threads:
>>>
>>> Test:           Base (v6.4-rc1)     Base + new_mmcid_reverted   Base + mm_count_padding
>>>  1-groups:       5.23 (0.00 pct)     4.61 (11.85 pct)            5.11 (2.29 pct)
>>>  2-groups:       4.99 (0.00 pct)     4.72 (5.41 pct)             5.00 (-0.20 pct)
>>>  4-groups:       5.96 (0.00 pct)     4.87 (18.28 pct)            5.86 (1.67 pct)
>>>  8-groups:       6.58 (0.00 pct)     5.44 (17.32 pct)            6.20 (5.77 pct)
>>> 16-groups:      11.48 (0.00 pct)     8.07 (29.70 pct)           10.68 (6.96 pct)
>>>
>>> Processes:
>>>
>>> Test:           Base (v6.4-rc1)     Base + new_mmcid_reverted   Base + mm_count_padding
>>>  1-groups:       5.19 (0.00 pct)     4.90 (5.58 pct)             5.19 (0.00 pct)
>>>  2-groups:       5.44 (0.00 pct)     5.39 (0.91 pct)             5.39 (0.91 pct)
>>>  4-groups:       5.69 (0.00 pct)     5.64 (0.87 pct)             5.64 (0.87 pct)
>>>  8-groups:       6.08 (0.00 pct)     6.01 (1.15 pct)             6.04 (0.65 pct)
>>> 16-groups:      10.87 (0.00 pct)    10.83 (0.36 pct)            10.93 (-0.55 pct)
>>>
>>> The ibs profile shows that function __switch_to_asm() is coming at top
>>> in baseline run and is not seen with mm_count padding patch. Will be
>>> attaching full ibs profile data for all the 3 runs:
>>>
>>> # Base (v6.4-rc1)
>>> Threads:
>>> Total time: 11.486 [sec]
>>>
>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>
>>> # Base + mm_count_padding
>>> Threads:
>>> Total time: 11.384 [sec]
>>>
>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
>>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
>>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>>
>>> # Base + reverted_new_mm_cid
>>> Threads:
>>> Total time: 7.847 [sec]
>>>
>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>
>>> So with the reverted new_mm_cid patch, we are seeing a lot of time
>>> being spent in native_queued_spin_lock_slowpath and yet, hackbench
>>> finishes faster.
>>>
>>> I will keep digging into this. Please let me know if you have any
>>> pointers for me.
>>
>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>
> Sorry for the delay in response. My system was busy running some
> workloads. I tried running hackbench disabling apparmor, looks like
> apparmor is not the culprit here.
> Below are the results with apparmor disabled:
> 
> Test:           Base                Base + Reverted_new_mmcid   Base + Apparmor_disabled
>  1-groups:       2.81 (0.00 pct)     2.79 (0.71 pct)             2.79 (0.71 pct)
>  2-groups:       3.25 (0.00 pct)     3.25 (0.00 pct)             3.20 (1.53 pct)
>  4-groups:       3.44 (0.00 pct)     3.28 (4.65 pct)             3.43 (0.29 pct)
>  8-groups:       3.52 (0.00 pct)     3.42 (2.84 pct)             3.53 (-0.28 pct)
> 16-groups:       5.65 (0.00 pct)     4.52 (20.00 pct)            5.67 (-0.35 pct)
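
For reference, the percentages in the table above appear to be the relative
improvement over the Base column. A minimal sketch, assuming the values are
hackbench elapsed times where lower is better; `improvement_pct` is a
hypothetical helper name, not something taken from the benchmark scripts:

```python
def improvement_pct(base_time, test_time):
    """Percent improvement over baseline; positive means the test kernel ran faster."""
    return (base_time - test_time) / base_time * 100.0


# 16-groups row of the table above: Base 5.65, Base + Reverted_new_mmcid 4.52
print(f"{improvement_pct(5.65, 4.52):.2f}")  # prints 20.00
```

A negative result, as in the 8-groups AppArmor-disabled column, means the
test kernel was slower than the baseline.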

Can you provide the kernel config file associated with this
test? I would also need to see ibs profiles showing the
functions using the most CPU, especially spinlocks and their
callers.

My working hypothesis is that adding the rseq-mm-cid spinlock
in the scheduler improves the performance of your benchmark because
it lessens the contention on _another_ lock somewhere else.
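
One way to compare spinlock cost between the quoted profiles is to sum a
symbol's overhead over every command that reports it. A minimal sketch,
assuming perf-report-style lines with five whitespace-separated fields;
`symbol_share` is a hypothetical helper, not part of perf, and lines that
were wrapped by a mail client must be rejoined first or they are skipped:

```python
from collections import defaultdict


def symbol_share(report_text):
    """Sum the overhead percentage per symbol across all commands."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        fields = line.split()
        # Expected fields: overhead%, command, object, [k] or [.], symbol
        if len(fields) != 5 or not fields[0].endswith("%"):
            continue
        totals[fields[4]] += float(fields[0].rstrip("%"))
    return dict(totals)


# Top three entries of the "Base + reverted_new_mm_cid" profile quoted above
reverted = """
   12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
"""
shares = symbol_share(reverted)
print(f"{shares['native_queued_spin_lock_slowpath']:.2f}")  # prints 21.00
```

This only totals overhead per symbol. It does not identify which lock is
contended, which is why the callers are still needed.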

Note that we've just received a brand new 2-socket,
96 cores/socket AMD machine at EfficiOS. We bought it to
increase our coverage of scalability testing. With it I should
be able to reproduce those regressions on my end, which should
facilitate the investigation.

Thanks!

Mathieu


> 
> Thanks,
> Swapnil
> 
>> I notice that apparmor_file_permission appears near the top of your
>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>> which could possibly explain the top hits for
>> native_queued_spin_lock_slowpath. My current suspicion is that
>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>> changed the contention pattern on the apparmor lock enough to
>> speed things up by pure accident.
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>>>
>>>>
>>>> This patch has recently been merged into the mm tree.
>>>>
>>>> Thanks,
>>>>
>>>> Mathieu
>>>>
>>> -- 
>>> Thanks and Regards,
>>> Swapnil
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com