Message-ID: <ba3870b6-97f2-729b-df6b-f30e4c2001d4@amd.com>
Date: Tue, 18 Jul 2023 11:31:58 +0530
From: Swapnil Sapkal <Swapnil.Sapkal@....com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
Aaron Lu <aaron.lu@...el.com>, x86@...nel.org,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by
mm_cid

Hello Mathieu,
On 7/14/2023 8:25 PM, Mathieu Desnoyers wrote:
> On 7/14/23 02:02, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> On 6/22/2023 12:21 AM, Mathieu Desnoyers wrote:
>>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>>> Hello Mathieu,
>>>>
>>> [...]
>>>>>
>>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>>
>>>>> Please try with this additional patch applied:
>>>>>
>>>>> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@efficios.com/
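>>>>>
>>>>> For context, a minimal sketch of the technique that patch uses: keep the
>>>>> heavily-updated reference count on its own cache line, so that
>>>>> mmgrab()/mmdrop() on one CPU stop invalidating the line that holds
>>>>> read-mostly fields used by other CPUs. The fields around mm_count here
>>>>> are illustrative, not the actual mm_struct layout:
>>>>>
>>>>> #include <linux/atomic.h>
>>>>> #include <linux/cache.h>
>>>>>
>>>>> struct mm_sketch {
>>>>> 	/* read-mostly fields stay packed together ... */
>>>>> 	unsigned long flags;
>>>>>
>>>>> 	/*
>>>>> 	 * Written on every mmgrab()/mmdrop(); aligning it to a cache
>>>>> 	 * line boundary keeps those atomic updates from bouncing the
>>>>> 	 * line that holds the read-mostly fields above.
>>>>> 	 */
>>>>> 	atomic_t mm_count ____cacheline_aligned_in_smp;
>>>>>
>>>>> 	/* the next field starts on a fresh cache line again */
>>>>> 	void *other_state ____cacheline_aligned_in_smp;
>>>>> };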
>>>>
>>>> Thanks for the suggestion. I tried out the patch you suggested and I am seeing
>>>> an improvement in the hackbench numbers with the mm_count padding, but it does
>>>> not match what we achieved by reverting the new mm_cid patch.
>>>>
>>>> Below are the results on a 1-socket 4th Generation EPYC processor (1 x 96C/192T):
>>>>
>>>> Threads:
>>>>
>>>> Test:         Base (v6.4-rc1)    Base + new_mmcid_reverted    Base + mm_count_padding
>>>> 1-groups:     5.23 (0.00 pct)    4.61 (11.85 pct)             5.11 (2.29 pct)
>>>> 2-groups:     4.99 (0.00 pct)    4.72 (5.41 pct)              5.00 (-0.20 pct)
>>>> 4-groups:     5.96 (0.00 pct)    4.87 (18.28 pct)             5.86 (1.67 pct)
>>>> 8-groups:     6.58 (0.00 pct)    5.44 (17.32 pct)             6.20 (5.77 pct)
>>>> 16-groups:   11.48 (0.00 pct)    8.07 (29.70 pct)            10.68 (6.96 pct)
>>>>
>>>> Processes:
>>>>
>>>> Test:         Base (v6.4-rc1)    Base + new_mmcid_reverted    Base + mm_count_padding
>>>> 1-groups:     5.19 (0.00 pct)    4.90 (5.58 pct)              5.19 (0.00 pct)
>>>> 2-groups:     5.44 (0.00 pct)    5.39 (0.91 pct)              5.39 (0.91 pct)
>>>> 4-groups:     5.69 (0.00 pct)    5.64 (0.87 pct)              5.64 (0.87 pct)
>>>> 8-groups:     6.08 (0.00 pct)    6.01 (1.15 pct)              6.04 (0.65 pct)
>>>> 16-groups:   10.87 (0.00 pct)   10.83 (0.36 pct)             10.93 (-0.55 pct)
>>>>
>>>> The IBS profile shows __switch_to_asm() at the top in the baseline run,
>>>> while it does not appear with the mm_count padding patch. The full IBS
>>>> profile data for all three runs follows:
>>>>
>>>> # Base (v6.4-rc1)
>>>> Threads:
>>>> Total time: 11.486 [sec]
>>>>
>>>> 5.15% sched-messaging [kernel.vmlinux] [k] __switch_to_asm
>>>> 4.31% sched-messaging [kernel.vmlinux] [k] copyout
>>>> 4.29% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>> 4.22% sched-messaging [kernel.vmlinux] [k] copyin
>>>> 3.92% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
>>>> 2.91% sched-messaging [kernel.vmlinux] [k] __schedule
>>>> 2.34% swapper [kernel.vmlinux] [k] __switch_to_asm
>>>> 2.10% sched-messaging [kernel.vmlinux] [k] prepare_to_wait_event
>>>> 2.10% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
>>>> 2.07% sched-messaging [kernel.vmlinux] [k] finish_task_switch.isra.0
>>>> 2.00% sched-messaging [kernel.vmlinux] [k] pipe_write
>>>> 1.82% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
>>>> 1.73% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
>>>> 1.52% sched-messaging [kernel.vmlinux] [k] __entry_text_start
>>>> 1.49% sched-messaging [kernel.vmlinux] [k] osq_lock
>>>> 1.45% sched-messaging libc.so.6 [.] write
>>>> 1.44% swapper [kernel.vmlinux] [k] native_sched_clock
>>>> 1.38% sched-messaging [kernel.vmlinux] [k] psi_group_change
>>>> 1.38% sched-messaging [kernel.vmlinux] [k] pipe_read
>>>> 1.37% sched-messaging libc.so.6 [.] read
>>>> 1.06% sched-messaging [kernel.vmlinux] [k] vfs_read
>>>> 1.01% swapper [kernel.vmlinux] [k] psi_group_change
>>>> 1.00% sched-messaging [kernel.vmlinux] [k] update_curr
>>>>
>>>> # Base + mm_count_padding
>>>> Threads:
>>>> Total time: 11.384 [sec]
>>>>
>>>> 4.43% sched-messaging [kernel.vmlinux] [k] copyin
>>>> 4.39% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>> 4.07% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
>>>> 4.07% sched-messaging [kernel.vmlinux] [k] copyout
>>>> 2.49% sched-messaging [kernel.vmlinux] [k] entry_SYSCALL_64
>>>> 2.37% sched-messaging [kernel.vmlinux] [k] update_cfs_group
>>>> 2.19% sched-messaging [kernel.vmlinux] [k] pipe_write
>>>> 2.00% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
>>>> 1.93% swapper [kernel.vmlinux] [k] update_load_avg
>>>> 1.81% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
>>>> 1.69% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
>>>> 1.58% sched-messaging libc.so.6 [.] write
>>>> 1.53% sched-messaging [kernel.vmlinux] [k] psi_group_change
>>>> 1.50% sched-messaging libc.so.6 [.] read
>>>> 1.50% sched-messaging [kernel.vmlinux] [k] pipe_read
>>>> 1.39% sched-messaging [kernel.vmlinux] [k] update_load_avg
>>>> 1.39% sched-messaging [kernel.vmlinux] [k] osq_lock
>>>> 1.30% sched-messaging [kernel.vmlinux] [k] update_curr
>>>> 1.28% swapper [kernel.vmlinux] [k] psi_group_change
>>>> 1.16% sched-messaging [kernel.vmlinux] [k] vfs_read
>>>> 1.12% sched-messaging [kernel.vmlinux] [k] vfs_write
>>>> 1.10% sched-messaging [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
>>>> 1.09% sched-messaging [kernel.vmlinux] [k] __switch_to_asm
>>>> 1.08% sched-messaging [kernel.vmlinux] [k] do_syscall_64
>>>> 1.06% sched-messaging [kernel.vmlinux] [k] select_task_rq_fair
>>>> 1.03% swapper [kernel.vmlinux] [k] update_cfs_group
>>>> 1.00% swapper [kernel.vmlinux] [k] rb_insert_color
>>>>
>>>> # Base + reverted_new_mm_cid
>>>> Threads:
>>>> Total time: 7.847 [sec]
>>>>
>>>> 12.14% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>> 8.86% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>> 6.13% sched-messaging [kernel.vmlinux] [k] copyin
>>>> 5.54% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
>>>> 3.59% sched-messaging [kernel.vmlinux] [k] copyout
>>>> 2.61% sched-messaging [kernel.vmlinux] [k] osq_lock
>>>> 2.48% sched-messaging [kernel.vmlinux] [k] pipe_write
>>>> 2.33% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
>>>> 2.01% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
>>>> 1.96% sched-messaging [kernel.vmlinux] [k] __entry_text_start
>>>> 1.91% sched-messaging libc.so.6 [.] write
>>>> 1.77% sched-messaging libc.so.6 [.] read
>>>> 1.64% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
>>>> 1.58% sched-messaging [kernel.vmlinux] [k] pipe_read
>>>> 1.52% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
>>>> 1.38% sched-messaging [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
>>>> 1.35% sched-messaging [kernel.vmlinux] [k] vfs_write
>>>> 1.28% sched-messaging [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
>>>> 1.28% sched-messaging [kernel.vmlinux] [k] vfs_read
>>>> 1.25% sched-messaging [kernel.vmlinux] [k] do_syscall_64
>>>> 1.22% sched-messaging [kernel.vmlinux] [k] __fget_light
>>>> 1.18% sched-messaging [kernel.vmlinux] [k] mutex_lock
>>>> 1.12% sched-messaging [kernel.vmlinux] [k] file_update_time
>>>> 1.04% sched-messaging [kernel.vmlinux] [k] _copy_from_iter
>>>> 1.01% sched-messaging [kernel.vmlinux] [k] current_time
>>>>
>>>> So with the new_mm_cid patch reverted, a lot of time is spent in
>>>> native_queued_spin_lock_slowpath, and yet hackbench finishes faster.
>>>>
>>>> I will keep digging into this; please let me know if you have any pointers for me.
>>>
>>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>>
>> Sorry for the delay in responding; my system was busy running some workloads.
>> I tried running hackbench with apparmor disabled, and it looks like apparmor
>> is not the culprit here. Below are the results with apparmor disabled:
>>
>> Test:         Base               Base + Reverted_new_mmcid    Base + Apparmor_disabled
>> 1-groups:     2.81 (0.00 pct)    2.79 (0.71 pct)              2.79 (0.71 pct)
>> 2-groups:     3.25 (0.00 pct)    3.25 (0.00 pct)              3.20 (1.53 pct)
>> 4-groups:     3.44 (0.00 pct)    3.28 (4.65 pct)              3.43 (0.29 pct)
>> 8-groups:     3.52 (0.00 pct)    3.42 (2.84 pct)              3.53 (-0.28 pct)
>> 16-groups:    5.65 (0.00 pct)    4.52 (20.00 pct)             5.67 (-0.35 pct)
>
> Can you provide the kernel config file associated with this
> test? I would also need to see IBS profiles showing the
> functions using the most CPU, especially spinlocks and their
> callers.
>
I have attached the kernel config to this mail.
> My working hypothesis is that adding the rseq-mm-cid spinlock
> in the scheduler improves the performance of your benchmark
> because it lessens the contention on _another_ lock somewhere
> else.
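>
> To make that hypothesis concrete, below is a toy userspace sketch
> (an assumed illustration, not the scheduler code): funneling the
> threads through an extra lock A before they contend on lock B
> staggers their arrival at B, which can reduce the number of
> expensive contended acquisitions of B. Whether the net effect is
> a win is entirely workload- and machine-dependent:
>
> #include <pthread.h>
> #include <stdatomic.h>
> #include <stdio.h>
>
> #define NTHREADS 8
> #define ITERS    100000
>
> static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
> static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
> static int use_extra_lock;	/* 0: contend on B only; 1: take A first */
> static atomic_long total;
>
> static void *worker(void *arg)
> {
> 	(void)arg;
> 	for (int i = 0; i < ITERS; i++) {
> 		if (use_extra_lock) {
> 			/* the "extra" serialization point */
> 			pthread_mutex_lock(&lock_a);
> 			pthread_mutex_unlock(&lock_a);
> 		}
> 		/* the lock whose contention we actually care about */
> 		pthread_mutex_lock(&lock_b);
> 		atomic_fetch_add(&total, 1);	/* short critical section */
> 		pthread_mutex_unlock(&lock_b);
> 	}
> 	return NULL;
> }
>
> int main(int argc, char **argv)
> {
> 	pthread_t t[NTHREADS];
>
> 	use_extra_lock = argc > 1;	/* any argument enables lock A */
> 	for (int i = 0; i < NTHREADS; i++)
> 		pthread_create(&t[i], NULL, worker, NULL);
> 	for (int i = 0; i < NTHREADS; i++)
> 		pthread_join(t[i], NULL);
> 	printf("total = %ld\n", (long)atomic_load(&total));
> 	return 0;
> }
>
> Comparing the two modes with time(1) on a large machine shows how
> an added lock can reshape, rather than simply add to, the overall
> contention.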
>
> Note that we've just received a brand new 2-socket,
> 96 cores/socket AMD machine at EfficiOS. We bought it to
> increase our coverage of scalability testing. With it I should
> be able to reproduce those regressions on my end, which should
> facilitate the investigation.
>
> Thanks!
>
> Mathieu
>
>
>>
>> Thanks,
>> Swapnil
>>
>>> I notice that apparmor_file_permission appears near the top of your
>>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>>> which could possibly explain the top hits for
>>> native_queued_spin_lock_slowpath. My current suspicion is that
>>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>>> changed the contention pattern on the apparmor lock enough to
>>> speed things up by pure accident.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>>>
>>>>>
>>>>> This patch has recently been merged into the mm tree.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mathieu
>>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Swapnil
>>>
>
--
Thanks and regards,
Swapnil
[Attachment: "mmcid.config" (text/plain, 277598 bytes)]