linux-kernel - Re: [tip: sched/core] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ddbd1564-8135-5bc3-72b4-afb7c6e9caba@amd.com>
Date:   Wed, 21 Jun 2023 22:06:34 +0530
From:   Swapnil Sapkal <Swapnil.Sapkal@....com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        Aaron Lu <aaron.lu@...el.com>, x86@...nel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by
 mm_cid

Hello Mathieu,

On 6/20/2023 4:21 PM, Mathieu Desnoyers wrote:
> On 6/20/23 06:35, Swapnil Sapkal wrote:
>> Hello Peter,
>>
>> On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
>>> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>>>> Hello Mathieu,
>>>>
>>>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>>>> The following commit has been merged into the sched/core branch of tip:
>>>>>
>>>>> Commit-ID:     223baf9d17f25e2608dbdff7232c095c1e612268
>>>>> Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>>>>> Author:        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>>>> AuthorDate:    Thu, 20 Apr 2023 10:55:48 -04:00
>>>>> Committer:     Peter Zijlstra <peterz@...radead.org>
>>>>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>>>>
>>>>> sched: Fix performance regression introduced by mm_cid
>>>>>
>>>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>>>> sysbench regression reported by Aaron Lu.
>>>>>
>>>>> Keep track of the currently allocated mm_cid for each mm/cpu rather than
>>>>> freeing them immediately on context switch. This eliminates most atomic
>>>>> operations when context switching back and forth between threads
>>>>> belonging to different memory spaces in multi-threaded scenarios (many
>>>>> processes, each with many threads). The per-mm/per-cpu mm_cid values are
>>>>> serialized by their respective runqueue locks.
>>>>>
>>>>> Thread migration is handled by introducing invocation to
>>>>> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
>>>>> activate_task() for migrating tasks. If the destination cpu's mm_cid is
>>>>> unset, and if the source runqueue is not actively using its mm_cid, then
>>>>> the source cpu's mm_cid is moved to the destination cpu on migration.
>>>>>
>>>>> Introduce a task-work executed periodically, similarly to NUMA work,
>>>>> which delays reclaim of cid values when they are unused for a period of
>>>>> time.
>>>>>
>>>>> Keep track of the allocation time for each per-cpu cid, and let the task
>>>>> work clear them when they are observed to be older than
>>>>> SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
>>>>> mm_cids which are greater or equal to the Hamming weight of the mm
>>>>> cidmask to keep concurrency ids compact.
>>>>>
>>>>> Because we want to ensure the mm_cid converges towards the smaller
>>>>> values as migrations happen, the prior optimization that was done when
>>>>> context switching between threads belonging to the same mm is removed,
>>>>> because it could delay the lazy release of the destination runqueue
>>>>> mm_cid after it has been replaced by a migration. Removing this prior
>>>>> optimization is not an issue performance-wise because the introduced
>>>>> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>>>>>
>>>>> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
>>>>> Reported-by: Aaron Lu <aaron.lu@...el.com>
>>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>>>> Tested-by: Aaron Lu <aaron.lu@...el.com>
>>>>> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>>>>
>>>> I run standard benchmarks as a part of kernel performance regression
>>>> testing. When I run these benchmarks against v6.3.0 to v6.4-rc1,
>>>> I have seen performance regression in hackbench running with threads. When I did
>>>> git bisect it pointed to this commit and reverting this commit helps regains
>>>> the performance. This regression is not seen with hackbench processes.
>>>
>>> Well, *this* commit was supposed to help fix the horrible contention on
>>> cid_lock that was introduced with af7f588d8f73.
>>
>> I went back and tested the commit that introduced mm_cid and I found that the
>> original implementation actually helped hackbench. Following are numbers from
>> 2 Socket Zen3 Server (2 X 64C/128T):
>>
>> Test:           base (v6.2-rc1)      base + orig_mm_cid
>>   1-groups:     4.29 (0.00 pct)     4.32 (-0.69 pct)
>>   2-groups:     4.96 (0.00 pct)     4.94 (0.40 pct)
>>   4-groups:     5.21 (0.00 pct)     4.10 (21.30 pct)
>>   8-groups:     5.44 (0.00 pct)     4.50 (17.27 pct)
>> 16-groups:     7.09 (0.00 pct)     5.28 (25.52 pct)
>>
>> I see following IBS traces in this case:
>>
>> Base:
>>
>>     6.69%  sched-messaging  [kernel.vmlinux]          [k] copy_user_generic_string
>>     5.38%  sched-messaging  [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
>>     3.73%  swapper          [kernel.vmlinux]          [k] __switch_to_asm
>>     3.23%  sched-messaging  [kernel.vmlinux]          [k] __calc_delta
>>     2.93%  sched-messaging  [kernel.vmlinux]          [k] try_to_wake_up
>>     2.63%  sched-messaging  [kernel.vmlinux]          [k] dequeue_task_fair
>>     2.56%  sched-messaging  [kernel.vmlinux]          [k] osq_lock
>>
>> Base + orig_mm_cid:
>>
>>    13.70%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>    11.87%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     8.99%  sched-messaging  [kernel.vmlinux]      [k] copy_user_generic_string
>>     6.08%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>     4.79%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>     3.71%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>     3.66%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>     3.11%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>
>>>
>>>> Following are the results from 1 Socket 4th generation EPYC
>>>> Processor(1 X 96C/192T) configured in NPS1 mode. This regression
>>>> becomes more severe as the number of core count increases.
>>>>
>>>> The numbers on a 1 Socket Bergamo (1 X 128 cores/256 threads) is significantly worse.
>>>>
>>>> Threads:
>>>>
>>>> Test:             With-mmcid-patch        Without-mmcid-patch
>>>>   1-groups:         5.23 (0.00 pct)         4.61 (+11.85 pct)
>>>>   2-groups:         4.99 (0.00 pct)         4.72 (+5.41 pct)
>>>>   4-groups:         5.96 (0.00 pct)         4.87 (+18.28 pct)
>>>>   8-groups:         6.58 (0.00 pct)         5.44 (+17.32 pct)
>>>> 16-groups:        11.48 (0.00 pct)         8.07 (+29.70 pct)
>>>
>>> I'm really confused, so you're saying that having a process wide
>>> spinlock is better than what this patch does? Or are you testing against
>>> something without mm-cid entirely?
>>
>> It does look like the lock contention introduced by the original mm_cid patch helped
>> hackbench in this case. In that case, I see hackbench threads run for longer on average (avg_atom)
>> and total idle entries are down significantly. Even on disabling C1 and C2, I see
>> similar behavior. With the new mm_cid patch that gets rid of the lock contention, we see a drop
>> in the hackbench performance.
>>
>> I will go dig into this further meanwhile if you have any pointers please do let me know.
> 
> I suspect the baseline don't have spinlock contention because the test-case
> schedules between threads belonging to the same process, for which the initial
> mm_cid patch had an optimization which skips the spinlock entirely.
> 
> This optimization for inter-thread scheduling had to be removed in the following
> patch to address the performance issue more generally, covering the inter-process
> scheduling.
> 
> I suspect the regression is caused by the mm_count cache line bouncing.
> 
> Please try with this additional patch applied:
> 
> https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@efficios.com/

Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
improvement in hackbench numbers with mm_count padding. But this is not matching
with what we achieved through reverting the new mm_cid patch.

Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):

Threads:

Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
  1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct)        5.11 (2.29 pct)
  2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct)         5.00 (-0.20 pct)
  4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct)        5.86 (1.67 pct)
  8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct)        6.20 (5.77 pct)
16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct)       10.68 (6.96 pct)

Processes:

Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
  1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct)         5.19 (0.00 pct)
  2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct)         5.39 (0.91 pct)
  4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct)         5.64 (0.87 pct)
  8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct)         6.04 (0.65 pct)
16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct)        10.93 (-0.55 pct)

The ibs profile shows that function __switch_to_asm() is coming at top in baseline
run and is not seen with mm_count padding patch. Will be attaching full ibs profile
data for all the 3 runs:

# Base (v6.4-rc1)
Threads:
Total time: 11.486 [sec]

    5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
    4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
    4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
    3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
    2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
    2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
    2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
    2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
    2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
    2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
    1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
    1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
    1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
    1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
    1.45%  sched-messaging  libc.so.6             [.] write
    1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
    1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
    1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
    1.37%  sched-messaging  libc.so.6             [.] read
    1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
    1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
    1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr

# Base + mm_count_padding
Threads:
Total time: 11.384 [sec]

    4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
    4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
    4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
    4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
    2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
    2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
    2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
    2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
    1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
    1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
    1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
    1.58%  sched-messaging  libc.so.6                [.] write
    1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
    1.50%  sched-messaging  libc.so.6                [.] read
    1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
    1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
    1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
    1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
    1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
    1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
    1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
    1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
    1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
    1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
    1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
    1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
    1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color

# Base + reverted_new_mm_cid
Threads:
Total time: 7.847 [sec]

   12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
    5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
    3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
    2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
    2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
    2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
    2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
    1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
    1.91%  sched-messaging  libc.so.6             [.] write
    1.77%  sched-messaging  libc.so.6             [.] read
    1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
    1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
    1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
    1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
    1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
    1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
    1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
    1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
    1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
    1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
    1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
    1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
    1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time

So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
native_queued_spin_lock_slowpath and yet, hackbench finishes faster.

I keep further digging into this please let me know if you have any pointers for me.

> 
> This patch has recently been merged into the mm tree.
> 
> Thanks,
> 
> Mathieu
> 
--
Thanks and Regards,
Swapnil