linux-kernel - Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230413134912.GB214119@ziqianlu-desk2>
Date:   Thu, 13 Apr 2023 21:49:12 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        <linux-kernel@...r.kernel.org>, Olivier Dion <odion@...icios.com>,
        <michael.christie@...cle.com>
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
 mm_cid

On Thu, Apr 13, 2023 at 01:10:47PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> > On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> > > 
> > > > I *guess* you might be able to see some contention with hackbench on
> > > > that HSW-EX system with v4.
> > > 
> > > Indeed! Notably it seems to be the wakeup from idle that trips it
> > > hardest:
> > 
> > Could it because for idle cpus, the per-cpu/mm cid is no longer valid for
> > the sake of compacting and when task wakes there, it will have to
> > re-allocate a new cid through mm_get_cid() which needs to acquire
> > mm->cid_lock?
> 
> Yup. And I'm thinking it is futile (and counter productive) to strive
> for compactness in this (nr_threads >= nr_cpus) case.
> 
> The below on v4 solves the contention I see with hackbench (which runs
> 400 threads which is well above the 144 cpu count on that machine).

I also tested on Icelake and SPR using hackbench and the contention is
also gone :-)

I then tested with sysbench_postgres, the contention is still there on
SPR:

    11.63%    11.62%  [kernel.vmlinux]        [k]     native_queued_spin_lock_slowpath
    7.91%     native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
    3.17%     native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait

> 
> This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
> but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpus_ids+1)
> perhaps). Also, I would be thinking that's not something that typically
> happens.

Let me see if adding some fuzz helps here.

For sysbench_postgres, the server uses process model while the client
uses thread model and the client(sysbench) started nr_cpus threads so
it's likely didn't trigger the (nr_cpus-1) bar here. I doubt the bar
may have to be as low as 1/2 nr_cpus considering that some of the
cpus are running the server processes.

> 
> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
> 
> ---
>  include/linux/mm_types.h | 8 ++++++++
>  kernel/fork.c            | 4 +++-
>  kernel/sched/core.c      | 9 +++++++++
>  kernel/sched/sched.h     | 2 ++
>  4 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4160ff5c6ebd..598d1b657afa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,7 @@ struct mm_struct {
>  		 * were being concurrently updated by the updaters.
>  		 */
>  		raw_spinlock_t cid_lock;
> +		int cid_saturated;
>  		/**
>  		 * @pcpu_cid: Per-cpu current cid.
>  		 *
> @@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
>  	return cid & ~MM_CID_LAZY_PUT;
>  }
>  
> +static inline void mm_cid_desaturate(struct mm_struct *mm)
> +{
> +	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
> +		mm->cid_saturated = 0;
> +}
> +
>  /* Accessor for struct mm_struct's cidmask. */
>  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
>  {
> @@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
>  	int i;
>  
>  	raw_spin_lock_init(&mm->cid_lock);
> +	mm->cid_saturated = 0;
>  	for_each_possible_cpu(i)
>  		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
>  	cpumask_clear(mm_cidmask(mm));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3832bea713c4..a5233e450435 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
>  	might_sleep();
>  
>  	if (atomic_dec_and_test(&mm->mm_users))
> -		__mmput(mm);
> +		return __mmput(mm);
> +
> +	mm_cid_desaturate(mm);
>  }
>  EXPORT_SYMBOL_GPL(mmput);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..d5004d179531 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>  	if (last_mm_cid == -1)
>  		return;
>  
> +	/*
> +	 * When nr_threads > nr_cpus, there is no point in moving anything
> +	 * around to keep it compact.
> +	 */
> +	if (mm->cid_saturated) {
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +
>  	src_rq = task_rq(t);
>  	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>  	src_cid = READ_ONCE(*src_pcpu_cid);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..6c4af2992e79 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  	}
>  	raw_spin_lock(&mm->cid_lock);
>  	cid = __mm_cid_get_locked(mm);
> +	if (cid == nr_cpu_ids - 1)
> +		mm->cid_saturated = 1;
>  	raw_spin_unlock(&mm->cid_lock);
>  	WRITE_ONCE(*pcpu_cid, cid);
>  	return cid;