Message-ID: <3e9eaed6-4708-9e58-c80d-143760d6b23a@efficios.com>
Date: Tue, 20 Jun 2023 06:51:01 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Swapnil Sapkal <Swapnil.Sapkal@....com>,
Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
Aaron Lu <aaron.lu@...el.com>, x86@...nel.org,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/20/23 06:35, Swapnil Sapkal wrote:
> Hello Peter,
>
> On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
>> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>>> Hello Mathieu,
>>>
>>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>>> The following commit has been merged into the sched/core branch of tip:
>>>>
>>>> Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
>>>> Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>>>> Author: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>>> AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
>>>> Committer: Peter Zijlstra <peterz@...radead.org>
>>>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>>>
>>>> sched: Fix performance regression introduced by mm_cid
>>>>
>>>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>>>> sysbench regression reported by Aaron Lu.
>>>>
>>>> Keep track of the currently allocated mm_cid for each mm/cpu rather than
>>>> freeing them immediately on context switch. This eliminates most atomic
>>>> operations when context switching back and forth between threads
>>>> belonging to different memory spaces in multi-threaded scenarios (many
>>>> processes, each with many threads). The per-mm/per-cpu mm_cid values are
>>>> serialized by their respective runqueue locks.
>>>>
>>>> Thread migration is handled by introducing invocation to
>>>> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
>>>> activate_task() for migrating tasks. If the destination cpu's mm_cid is
>>>> unset, and if the source runqueue is not actively using its mm_cid, then
>>>> the source cpu's mm_cid is moved to the destination cpu on migration.
>>>>
>>>> Introduce a task-work executed periodically, similarly to NUMA work,
>>>> which delays reclaim of cid values when they are unused for a period of
>>>> time.
>>>>
>>>> Keep track of the allocation time for each per-cpu cid, and let the task
>>>> work clear them when they are observed to be older than
>>>> SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
>>>> mm_cids which are greater or equal to the Hamming weight of the mm
>>>> cidmask to keep concurrency ids compact.
>>>>
>>>> Because we want to ensure the mm_cid converges towards the smaller
>>>> values as migrations happen, the prior optimization that was done when
>>>> context switching between threads belonging to the same mm is removed,
>>>> because it could delay the lazy release of the destination runqueue
>>>> mm_cid after it has been replaced by a migration. Removing this prior
>>>> optimization is not an issue performance-wise because the introduced
>>>> per-mm/per-cpu mm_cid tracking also covers this more specific case.
>>>>
>>>> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
>>>> Reported-by: Aaron Lu <aaron.lu@...el.com>
>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>>> Tested-by: Aaron Lu <aaron.lu@...el.com>
>>>> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
>>>
>>> I run standard benchmarks as a part of kernel performance regression
>>> testing. When I run these benchmarks against v6.3.0 to v6.4-rc1,
>>> I have seen performance regression in hackbench running with threads.
>>> When I did git bisect it pointed to this commit and reverting this commit
>>> helps regains the performance. This regression is not seen with hackbench
>>> processes.
>>
>> Well, *this* commit was supposed to help fix the horrible contention on
>> cid_lock that was introduced with af7f588d8f73.
>
> I went back and tested the commit that introduced mm_cid and I found that
> the original implementation actually helped hackbench. Following are
> numbers from 2 Socket Zen3 Server (2 X 64C/128T):
>
> Test:         base (v6.2-rc1)     base + orig_mm_cid
>  1-groups:    4.29 (0.00 pct)     4.32 (-0.69 pct)
>  2-groups:    4.96 (0.00 pct)     4.94 (0.40 pct)
>  4-groups:    5.21 (0.00 pct)     4.10 (21.30 pct)
>  8-groups:    5.44 (0.00 pct)     4.50 (17.27 pct)
> 16-groups:    7.09 (0.00 pct)     5.28 (25.52 pct)
>
> I see following IBS traces in this case:
>
> Base:
>
>    6.69%  sched-messaging  [kernel.vmlinux]  [k] copy_user_generic_string
>    5.38%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>    3.73%  swapper          [kernel.vmlinux]  [k] __switch_to_asm
>    3.23%  sched-messaging  [kernel.vmlinux]  [k] __calc_delta
>    2.93%  sched-messaging  [kernel.vmlinux]  [k] try_to_wake_up
>    2.63%  sched-messaging  [kernel.vmlinux]  [k] dequeue_task_fair
>    2.56%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>
> Base + orig_mm_cid:
>
>   13.70%  sched-messaging  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>   11.87%  swapper          [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>    8.99%  sched-messaging  [kernel.vmlinux]  [k] copy_user_generic_string
>    6.08%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>    4.79%  sched-messaging  [kernel.vmlinux]  [k] apparmor_file_permission
>    3.71%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
>    3.66%  sched-messaging  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
>    3.11%  sched-messaging  [kernel.vmlinux]  [k] _copy_from_iter
>
>>
>>> Following are the results from 1 Socket 4th generation EPYC
>>> Processor(1 X 96C/192T) configured in NPS1 mode. This regression
>>> becomes more severe as the number of core count increases.
>>>
>>> The numbers on a 1 Socket Bergamo (1 X 128 cores/256 threads) is
>>> significantly worse.
>>>
>>> Threads:
>>>
>>> Test:         With-mmcid-patch     Without-mmcid-patch
>>>  1-groups:     5.23 (0.00 pct)      4.61 (+11.85 pct)
>>>  2-groups:     4.99 (0.00 pct)      4.72 (+5.41 pct)
>>>  4-groups:     5.96 (0.00 pct)      4.87 (+18.28 pct)
>>>  8-groups:     6.58 (0.00 pct)      5.44 (+17.32 pct)
>>> 16-groups:    11.48 (0.00 pct)      8.07 (+29.70 pct)
>>
>> I'm really confused, so you're saying that having a process wide
>> spinlock is better than what this patch does? Or are you testing against
>> something without mm-cid entirely?
>
> It does look like the lock contention introduced by the original mm_cid
> patch helped hackbench in this case. In that case, I see hackbench threads
> run for longer on average (avg_atom) and total idle entries are down
> significantly. Even on disabling C1 and C2, I see similar behavior. With
> the new mm_cid patch that gets rid of the lock contention, we see a drop
> in the hackbench performance.
>
> I will go dig into this further meanwhile if you have any pointers
> please do let me know.

I suspect the baseline doesn't have spinlock contention because the test case
schedules between threads belonging to the same process, for which the initial
mm_cid patch had an optimization that skips the spinlock entirely.

This optimization for inter-thread scheduling had to be removed in the following
patch to address the performance issue more generally, covering inter-process
scheduling.
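
To make the same-process fast path concrete, here is a toy user-space model
of the idea, not the kernel code. Every name in it (toy_mm, toy_task,
toy_cid_get, toy_cid_put, toy_switch_mm_cid) is hypothetical, and a plain
counter stands in for taking the per-mm lock. It only shows why switching
between two threads of one process never reaches the locked path, while
switching between processes does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins only; these are NOT the kernel's data structures. */
struct toy_mm {
	unsigned long cid_bitmap;        /* one bit per allocated cid */
	unsigned long lock_acquisitions; /* counts entries into the "locked" path */
};

struct toy_task {
	struct toy_mm *mm; /* NULL models a task without a user address space */
	int mm_cid;        /* -1 when the task holds no cid */
};

/* Slow path: allocate the lowest free cid (a real version would lock here). */
static int toy_cid_get(struct toy_mm *mm)
{
	assert(mm != NULL);
	mm->lock_acquisitions++;
	for (int cid = 0; cid < (int)(8 * sizeof(mm->cid_bitmap)); cid++) {
		if (!(mm->cid_bitmap & (1UL << cid))) {
			mm->cid_bitmap |= 1UL << cid;
			return cid;
		}
	}
	return -1; /* no free cid in this toy bitmap */
}

/* Slow path: release a cid (a real version would lock here as well). */
static void toy_cid_put(struct toy_mm *mm, int cid)
{
	assert(mm != NULL);
	if (cid < 0)
		return;
	mm->lock_acquisitions++;
	mm->cid_bitmap &= ~(1UL << cid);
}

/* Model of a context switch from prev to next. */
static void toy_switch_mm_cid(struct toy_task *prev, struct toy_task *next)
{
	if (prev->mm && next->mm == prev->mm && prev->mm_cid >= 0) {
		/* Same address space: hand the cid over, no lock involved. */
		next->mm_cid = prev->mm_cid;
		prev->mm_cid = -1;
		return;
	}
	if (prev->mm) {
		toy_cid_put(prev->mm, prev->mm_cid);
		prev->mm_cid = -1;
	}
	if (next->mm)
		next->mm_cid = toy_cid_get(next->mm);
}
```

In this model, a workload that only bounces between threads of one process
keeps lock_acquisitions constant after the first allocation, which matches
the reasoning above for why the baseline would show no contention.
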

I suspect the regression is caused by mm_count cache line bouncing.

Please try with this additional patch applied:

https://lore.kernel.org/lkml/20230515143536.114960-1-mathieu.desnoyers@efficios.com/

This patch has recently been merged into the mm tree.
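
As a general illustration of cache line bouncing (this is a sketch, not the
content of the patch linked above): when a frequently updated counter shares
a cache line with fields that other CPUs mostly read, every update
invalidates that line on the readers. Giving the counter its own line avoids
this. The struct and field names below are hypothetical, and the 64-byte line
size is an assumption.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumed cache line size in bytes */

/* Packed layout: the hot counter sits next to read-mostly fields. */
struct toy_mm_packed {
	atomic_int users; /* hypothetical read-mostly neighbour */
	atomic_int count; /* hot: updated frequently from many CPUs */
	void *pgd;        /* hypothetical read-mostly neighbour */
};

/* Split layout: the hot counter is alone on its own cache line. */
struct toy_mm_split {
	atomic_int users;
	void *pgd;
	alignas(TOY_CACHE_LINE) atomic_int count;
	alignas(TOY_CACHE_LINE) char next_field; /* starts the following line */
};

/* The struct itself must be line-aligned for offsets to map onto lines. */
static_assert(alignof(struct toy_mm_split) == TOY_CACHE_LINE,
	      "split layout must be cache-line aligned");

/*
 * True if two field offsets fall in the same cache line, assuming the
 * enclosing object starts on a cache line boundary.
 */
static bool toy_same_line(size_t off_a, size_t off_b)
{
	return off_a / TOY_CACHE_LINE == off_b / TOY_CACHE_LINE;
}
```

Comparing offsetof() values with toy_same_line() shows that count and pgd
share a line in the packed layout and do not in the split one. The cost is a
larger structure.
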
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com