[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6b8e63ab-e81e-470c-e03f-f3860c83bdb1@efficios.com>
Date: Thu, 13 Apr 2023 09:56:38 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>,
Aaron Lu <aaron.lu@...el.com>
Cc: linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
michael.christie@...cle.com
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
mm_cid
On 2023-04-13 07:10, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
>> On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
>>> On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
>>>
>>>> I *guess* you might be able to see some contention with hackbench on
>>>> that HSW-EX system with v4.
>>>
>>> Indeed! Notably it seems to be the wakeup from idle that trips it
>>> hardest:
>>
>> Could it because for idle cpus, the per-cpu/mm cid is no longer valid for
>> the sake of compacting and when task wakes there, it will have to
>> re-allocate a new cid through mm_get_cid() which needs to acquire
>> mm->cid_lock?
>
> Yup. And I'm thinking it is futile (and counter productive) to strive
> for compactness in this (nr_threads >= nr_cpus) case.
>
> The below on v4 solves the contention I see with hackbench (which runs
> 400 threads which is well above the 144 cpu count on that machine).
>
> This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
> but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpus_ids+1)
> perhaps). Also, I would be thinking that's not something that typically
> happens.
>
> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
I hate it with passion :-)
It is quite specific to your workload/configuration.
If we take for instance a process with a large mm_users count which is
eventually affined to a subset of the cpus with cpusets or
sched_setaffinity, your patch will prevent compaction of the concurrency
ids when it really should not.
I'm working on a lock-free cid-get, hopefully my next version will
eliminate the performance regression.
Thanks,
Mathieu
>
> ---
> include/linux/mm_types.h | 8 ++++++++
> kernel/fork.c | 4 +++-
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/sched.h | 2 ++
> 4 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4160ff5c6ebd..598d1b657afa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,7 @@ struct mm_struct {
> * were being concurrently updated by the updaters.
> */
> raw_spinlock_t cid_lock;
> + int cid_saturated;
> /**
> * @pcpu_cid: Per-cpu current cid.
> *
> @@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
> return cid & ~MM_CID_LAZY_PUT;
> }
>
> +static inline void mm_cid_desaturate(struct mm_struct *mm)
> +{
> + if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
> + mm->cid_saturated = 0;
> +}
> +
> /* Accessor for struct mm_struct's cidmask. */
> static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
> {
> @@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
> int i;
>
> raw_spin_lock_init(&mm->cid_lock);
> + mm->cid_saturated = 0;
> for_each_possible_cpu(i)
> *per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
> cpumask_clear(mm_cidmask(mm));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3832bea713c4..a5233e450435 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
> might_sleep();
>
> if (atomic_dec_and_test(&mm->mm_users))
> - __mmput(mm);
> + return __mmput(mm);
> +
> + mm_cid_desaturate(mm);
> }
> EXPORT_SYMBOL_GPL(mmput);
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..d5004d179531 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
> if (last_mm_cid == -1)
> return;
>
> + /*
> + * When nr_threads > nr_cpus, there is no point in moving anything
> + * around to keep it compact.
> + */
> + if (mm->cid_saturated) {
> + t->last_mm_cid = -1;
> + return;
> + }
> +
> src_rq = task_rq(t);
> src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> src_cid = READ_ONCE(*src_pcpu_cid);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..6c4af2992e79 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
> }
> raw_spin_lock(&mm->cid_lock);
> cid = __mm_cid_get_locked(mm);
> + if (cid == nr_cpu_ids - 1)
> + mm->cid_saturated = 1;
> raw_spin_unlock(&mm->cid_lock);
> WRITE_ONCE(*pcpu_cid, cid);
> return cid;
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists