Date:   Tue, 20 Jun 2023 16:05:45 +0530
From:   Swapnil Sapkal <Swapnil.Sapkal@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        Aaron Lu <aaron.lu@...el.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        x86@...nel.org
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by
 mm_cid

Hello Peter,

On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>> The following commit has been merged into the sched/core branch of tip:
>>>
>>> Commit-ID:     223baf9d17f25e2608dbdff7232c095c1e612268
>>> Gitweb:        https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>>> Author:        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>> AuthorDate:    Thu, 20 Apr 2023 10:55:48 -04:00
>>> Committer:     Peter Zijlstra <peterz@...radead.org>
>>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>>
>>> sched: Fix performance regression introduced by mm_cid
>>>
>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>> sysbench regression reported by Aaron Lu.
>>>
>>> Keep track of the currently allocated mm_cid for each mm/cpu rather than
>>> freeing them immediately on context switch. This eliminates most atomic
>>> operations when context switching back and forth between threads
>>> belonging to different memory spaces in multi-threaded scenarios (many
>>> processes, each with many threads). The per-mm/per-cpu mm_cid values are
>>> serialized by their respective runqueue locks.
>>>
>>> Thread migration is handled by introducing invocation to
>>> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
>>> activate_task() for migrating tasks. If the destination cpu's mm_cid is
>>> unset, and if the source runqueue is not actively using its mm_cid, then
>>> the source cpu's mm_cid is moved to the destination cpu on migration.
>>>
>>> Introduce a task-work executed periodically, similarly to NUMA work,
>>> which delays reclaim of cid values when they are unused for a period of
>>> time.
>>>
>>> Keep track of the allocation time for each per-cpu cid, and let the task
>>> work clear them when they are observed to be older than
>>> SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
>>> mm_cids which are greater or equal to the Hamming weight of the mm
>>> cidmask to keep concurrency ids compact.
>>>
>>> Because we want to ensure the mm_cid converges towards the smaller
>>> values as migrations happen, the prior optimization that was done when
>>> context switching between threads belonging to the same mm is removed,
>>> because it could delay the lazy release of the destination runqueue
>>> mm_cid after it has been replaced by a migration. Removing this prior
>>> optimization is not an issue performance-wise because the introduced
>>> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>>>
>>> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
>>> Reported-by: Aaron Lu <aaron.lu@...el.com>
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>> Tested-by: Aaron Lu <aaron.lu@...el.com>
>>> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>>
>> I run standard benchmarks as part of kernel performance regression
>> testing. When I ran these benchmarks comparing v6.3.0 against v6.4-rc1,
>> I saw a performance regression in hackbench running with threads. A git
>> bisect pointed to this commit, and reverting it regains the performance.
>> The regression is not seen with hackbench processes.
> 
> Well, *this* commit was supposed to help fix the horrible contention on
> cid_lock that was introduced with af7f588d8f73.

I went back and tested the commit that introduced mm_cid, and I found that
the original implementation actually helped hackbench. The following numbers
are from a 2-socket Zen3 server (2 x 64C/128T):

Test:           base (v6.2-rc1)      base + orig_mm_cid
  1-groups:     4.29 (0.00 pct)     4.32 (-0.69 pct)
  2-groups:     4.96 (0.00 pct)     4.94 (0.40 pct)
  4-groups:     5.21 (0.00 pct)     4.10 (21.30 pct)
  8-groups:     5.44 (0.00 pct)     4.50 (17.27 pct)
16-groups:     7.09 (0.00 pct)     5.28 (25.52 pct)

I see the following IBS traces in this case:

Base:

    6.69%  sched-messaging  [kernel.vmlinux]          [k] copy_user_generic_string
    5.38%  sched-messaging  [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
    3.73%  swapper          [kernel.vmlinux]          [k] __switch_to_asm
    3.23%  sched-messaging  [kernel.vmlinux]          [k] __calc_delta
    2.93%  sched-messaging  [kernel.vmlinux]          [k] try_to_wake_up
    2.63%  sched-messaging  [kernel.vmlinux]          [k] dequeue_task_fair
    2.56%  sched-messaging  [kernel.vmlinux]          [k] osq_lock

Base + orig_mm_cid:

   13.70%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   11.87%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
    8.99%  sched-messaging  [kernel.vmlinux]      [k] copy_user_generic_string
    6.08%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
    4.79%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
    3.71%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
    3.66%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
    3.11%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter

> 
>> Following are the results from a 1-socket 4th generation EPYC
>> processor (1 x 96C/192T) configured in NPS1 mode. The regression
>> becomes more severe as the core count increases.
>>
>> The numbers on a 1-socket Bergamo (1 x 128 cores/256 threads) are
>> significantly worse.
>>
>> Threads:
>>
>> Test:             With-mmcid-patch        Without-mmcid-patch
>>   1-groups:         5.23 (0.00 pct)         4.61 (+11.85 pct)
>>   2-groups:         4.99 (0.00 pct)         4.72 (+5.41 pct)
>>   4-groups:         5.96 (0.00 pct)         4.87 (+18.28 pct)
>>   8-groups:         6.58 (0.00 pct)         5.44 (+17.32 pct)
>> 16-groups:        11.48 (0.00 pct)         8.07 (+29.70 pct)
> 
> I'm really confused, so you're saying that having a process wide
> spinlock is better than what this patch does? Or are you testing against
> something without mm-cid entirely?

It does look like the lock contention introduced by the original mm_cid patch
helped hackbench in this case. There, I see hackbench threads run for longer
on average (avg_atom), and total idle entries are down significantly. Even
with C1 and C2 disabled, I see similar behavior. With the new mm_cid patch,
which gets rid of the lock contention, we see a drop in hackbench performance.

I will dig into this further; meanwhile, if you have any pointers, please let
me know.

--
Thanks and Regards,
Swapnil
