linux-kernel - Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230413111047.GB83892@hirez.programming.kicks-ass.net>
Date:   Thu, 13 Apr 2023 13:10:47 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
        michael.christie@...cle.com
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
 mm_cid

On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> > 
> > > I *guess* you might be able to see some contention with hackbench on
> > > that HSW-EX system with v4.
> > 
> > Indeed! Notably it seems to be the wakeup from idle that trips it
> > hardest:
> 
> Could it because for idle cpus, the per-cpu/mm cid is no longer valid for
> the sake of compacting and when task wakes there, it will have to
> re-allocate a new cid through mm_get_cid() which needs to acquire
> mm->cid_lock?

Yup. And I'm thinking it is futile (and counter productive) to strive
for compactness in this (nr_threads >= nr_cpus) case.

The below on v4 solves the contention I see with hackbench (which runs
400 threads which is well above the 144 cpu count on that machine).

This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpus_ids+1)
perhaps). Also, I would be thinking that's not something that typically
happens.

Mathieu, WDYT? -- other than that the patch is an obvious hack :-)

---
 include/linux/mm_types.h | 8 ++++++++
 kernel/fork.c            | 4 +++-
 kernel/sched/core.c      | 9 +++++++++
 kernel/sched/sched.h     | 2 ++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4160ff5c6ebd..598d1b657afa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,7 @@ struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		int cid_saturated;
 		/**
 		 * @pcpu_cid: Per-cpu current cid.
 		 *
@@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
 	return cid & ~MM_CID_LAZY_PUT;
 }
 
+static inline void mm_cid_desaturate(struct mm_struct *mm)
+{
+	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
+		mm->cid_saturated = 0;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
 	int i;
 
 	raw_spin_lock_init(&mm->cid_lock);
+	mm->cid_saturated = 0;
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
 	cpumask_clear(mm_cidmask(mm));
diff --git a/kernel/fork.c b/kernel/fork.c
index 3832bea713c4..a5233e450435 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users))
-		__mmput(mm);
+		return __mmput(mm);
+
+	mm_cid_desaturate(mm);
 }
 EXPORT_SYMBOL_GPL(mmput);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425766cc1300..d5004d179531 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
 	if (last_mm_cid == -1)
 		return;
 
+	/*
+	 * When nr_threads > nr_cpus, there is no point in moving anything
+	 * around to keep it compact.
+	 */
+	if (mm->cid_saturated) {
+		t->last_mm_cid = -1;
+		return;
+	}
+
 	src_rq = task_rq(t);
 	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
 	src_cid = READ_ONCE(*src_pcpu_cid);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..6c4af2992e79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
 	}
 	raw_spin_lock(&mm->cid_lock);
 	cid = __mm_cid_get_locked(mm);
+	if (cid == nr_cpu_ids - 1)
+		mm->cid_saturated = 1;
 	raw_spin_unlock(&mm->cid_lock);
 	WRITE_ONCE(*pcpu_cid, cid);
 	return cid;