Message-ID: <44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com>
Date: Tue, 20 Jun 2023 16:05:45 +0530
From: Swapnil Sapkal <Swapnil.Sapkal@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
Aaron Lu <aaron.lu@...el.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
x86@...nel.org
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by
mm_cid
Hello Peter,
On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>> The following commit has been merged into the sched/core branch of tip:
>>>
>>> Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
>>> Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>>> Author: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>> AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
>>> Committer: Peter Zijlstra <peterz@...radead.org>
>>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>>
>>> sched: Fix performance regression introduced by mm_cid
>>>
>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>> sysbench regression reported by Aaron Lu.
>>>
>>> Keep track of the currently allocated mm_cid for each mm/cpu rather than
>>> freeing them immediately on context switch. This eliminates most atomic
>>> operations when context switching back and forth between threads
>>> belonging to different memory spaces in multi-threaded scenarios (many
>>> processes, each with many threads). The per-mm/per-cpu mm_cid values are
>>> serialized by their respective runqueue locks.
>>>
>>> Thread migration is handled by introducing an invocation of
>>> sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
>>> activate_task() for migrating tasks. If the destination cpu's mm_cid is
>>> unset, and if the source runqueue is not actively using its mm_cid, then
>>> the source cpu's mm_cid is moved to the destination cpu on migration.
>>>
>>> Introduce a task-work executed periodically, similarly to NUMA work,
>>> which delays reclaim of cid values when they are unused for a period of
>>> time.
>>>
>>> Keep track of the allocation time for each per-cpu cid, and let the task
>>> work clear them when they are observed to be older than
>>> SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
>>> mm_cids which are greater or equal to the Hamming weight of the mm
>>> cidmask to keep concurrency ids compact.
>>>
>>> Because we want to ensure the mm_cid converges towards the smaller
>>> values as migrations happen, the prior optimization that was done when
>>> context switching between threads belonging to the same mm is removed,
>>> because it could delay the lazy release of the destination runqueue
>>> mm_cid after it has been replaced by a migration. Removing this prior
>>> optimization is not an issue performance-wise because the introduced
>>> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>>>
>>> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
>>> Reported-by: Aaron Lu <aaron.lu@...el.com>
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>> Tested-by: Aaron Lu <aaron.lu@...el.com>
>>> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>>
>> I run standard benchmarks as part of kernel performance regression
>> testing. When I ran these benchmarks against v6.3.0 to v6.4-rc1,
>> I saw a performance regression in hackbench running with threads. A
>> git bisect pointed to this commit, and reverting it regains the
>> performance. This regression is not seen with hackbench processes.
>
> Well, *this* commit was supposed to help fix the horrible contention on
> cid_lock that was introduced with af7f588d8f73.
I went back and tested the commit that introduced mm_cid, and I found that the
original implementation actually helped hackbench. The following numbers are from a
2-socket Zen3 server (2 x 64C/128T):
Test:            base (v6.2-rc1)     base + orig_mm_cid
1-groups:        4.29 (0.00 pct)     4.32 (-0.69 pct)
2-groups:        4.96 (0.00 pct)     4.94 (0.40 pct)
4-groups:        5.21 (0.00 pct)     4.10 (21.30 pct)
8-groups:        5.44 (0.00 pct)     4.50 (17.27 pct)
16-groups:       7.09 (0.00 pct)     5.28 (25.52 pct)
I see the following IBS traces in this case:
Base:
6.69% sched-messaging [kernel.vmlinux] [k] copy_user_generic_string
5.38% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
3.73% swapper [kernel.vmlinux] [k] __switch_to_asm
3.23% sched-messaging [kernel.vmlinux] [k] __calc_delta
2.93% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
2.63% sched-messaging [kernel.vmlinux] [k] dequeue_task_fair
2.56% sched-messaging [kernel.vmlinux] [k] osq_lock
Base + orig_mm_cid:
13.70% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
11.87% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
8.99% sched-messaging [kernel.vmlinux] [k] copy_user_generic_string
6.08% sched-messaging [kernel.vmlinux] [k] osq_lock
4.79% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
3.71% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
3.66% sched-messaging [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
3.11% sched-messaging [kernel.vmlinux] [k] _copy_from_iter
>
>> The following results are from a 1-socket 4th generation EPYC
>> processor (1 x 96C/192T) configured in NPS1 mode. This regression
>> becomes more severe as the core count increases.
>>
>> The numbers on a 1-socket Bergamo (1 x 128 cores/256 threads) are significantly worse.
>>
>> Threads:
>>
>> Test:            With-mmcid-patch    Without-mmcid-patch
>> 1-groups:        5.23 (0.00 pct)     4.61 (+11.85 pct)
>> 2-groups:        4.99 (0.00 pct)     4.72 (+5.41 pct)
>> 4-groups:        5.96 (0.00 pct)     4.87 (+18.28 pct)
>> 8-groups:        6.58 (0.00 pct)     5.44 (+17.32 pct)
>> 16-groups:       11.48 (0.00 pct)    8.07 (+29.70 pct)
>
> I'm really confused, so you're saying that having a process wide
> spinlock is better than what this patch does? Or are you testing against
> something without mm-cid entirely?
It does look like the lock contention introduced by the original mm_cid patch helped
hackbench in this case. With that patch, I see hackbench threads run for longer on
average (avg_atom), and total idle entries are down significantly. Even with C1 and C2
disabled, I see similar behavior. With the new mm_cid patch that gets rid of the lock
contention, we see a drop in hackbench performance.
I will dig into this further; meanwhile, if you have any pointers, please do let me know.
--
Thanks and Regards,
Swapnil