linux-kernel - Re: [PATCH v1 1/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

Message-ID: <20241009090744.GK17263@noisy.programming.kicks-ass.net>
Date: Wed, 9 Oct 2024 11:07:44 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Mel Gorman <mgorman@...e.de>, Steven Rostedt <rostedt@...dmis.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Yury Norov <yury.norov@...il.com>,
	Rasmus Villemoes <linux@...musvillemoes.dk>,
	Marco Elver <elver@...gle.com>, Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [PATCH v1 1/2] sched: Improve cache locality of RSEQ concurrency
 IDs for intermittent workloads

On Thu, Oct 03, 2024 at 08:44:38PM -0400, Mathieu Desnoyers wrote:
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
> 
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
> 
> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
> 
> Introduce the following changes to improve per-mm-cid cache locality:
> 
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
>   track of which mm_cid value was last used, and use it as a hint to
>   attempt re-allocating the same concurrency ID the next time this
>   mm/cpu needs to allocate a concurrency ID,
> 
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
>   CPUs allowed for all threads belonging to this mm. This cpumask is
>   only set during the lifetime of the mm, never cleared, so it
>   represents the union of all the CPUs allowed since the beginning of
>   the mm lifetime. (note that the mm_cpumask() is really arch-specific
>   and tailored to the TLB flush needs, and is thus _not_ a viable
>   approach for this)

Because my morning juice came with an excessive dose of pedantry -- the
previous and next items end with a comma due to this being an
enumeration; but this one ends with a full stop, suggesting the
iteration is at an end.

> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
>   per-mm CPUs allowed mask (for fast access),
> 
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
>   ID allocated for the mm. This is used for expanding the concurrency ID
>   allocation within the upper bound defined by:

The description and naming disagree -- while from vague memories they
end up being similar -- it is a stumbling block this morning. The
description seems to suggest this should be called max_cid or somesuch.

Also, is it actually used for anything? I found the tracking code in
__mm_cid_try_get(), but it's not actually doing anything?

>     min(mm->nr_cpus_allowed, mm->mm_users)
> 
>   When the next unused CID value reaches this threshold, stop trying
>   to expand the cid allocation and use the first available cid value
>   instead.
> 
> Spreading allocation to use all the cid values within the range
> 
>   [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
> 
> improves cache locality while preserving mm_cid compactness within the
> expected user limits.

This paragraph seems to rudely interrupt the iteration? Or has (Fred)
Colon gone missing again to start a new iteration?

(Damn, and now I need me a Nobby reference somehow)

Anyway, I have vague memories I strongly suggested keeping the CID space
dense at some point :-)
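
For reference, the allocation order described in the changelog above can
be modelled in userspace roughly as follows. This is only a sketch under
assumptions, not the patch's code: struct mm_model, try_claim(),
cid_put() and cid_get() are hypothetical names, the cidmask is a single
64-bit word, the hint is only bounded by nr_cpus_allowed, and nothing
here handles concurrency.

```c
#include <assert.h>
#include <stdint.h>

#define CID_UNSET (-1)

/* Hypothetical single-threaded model of one mm; not struct mm_struct. */
struct mm_model {
	uint64_t cidmask;	/* bit n set: cid n is currently in use */
	int nr_cids_used;	/* highest cid ever handed out, plus one */
	int nr_cpus_allowed;	/* weight of the per-mm allowed mask (<= 64) */
	int mm_users;
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Claim @cid if it is free and track the high-water mark. */
static int try_claim(struct mm_model *mm, int cid)
{
	uint64_t bit = UINT64_C(1) << cid;

	if (mm->cidmask & bit)
		return 0;
	mm->cidmask |= bit;
	if (cid >= mm->nr_cids_used)
		mm->nr_cids_used = cid + 1;
	return 1;
}

static void cid_put(struct mm_model *mm, int cid)
{
	mm->cidmask &= ~(UINT64_C(1) << cid);
}

static int cid_get(struct mm_model *mm, int recent_cid)
{
	int max_cids = min_int(mm->nr_cpus_allowed, mm->mm_users);
	int cid;

	assert(mm->nr_cpus_allowed > 0 && mm->nr_cpus_allowed <= 64);

	/* 1. Prefer the cid this CPU used last (cache-locality hint). */
	if (recent_cid >= 0 && recent_cid < mm->nr_cpus_allowed &&
	    try_claim(mm, recent_cid))
		return recent_cid;

	/* 2. Expand into a never-used cid while below the bound. */
	if (mm->nr_cids_used < max_cids && try_claim(mm, mm->nr_cids_used))
		return mm->nr_cids_used - 1;

	/* 3. Fall back to the first free cid below nr_cpus_allowed. */
	for (cid = 0; cid < mm->nr_cpus_allowed; cid++)
		if (try_claim(mm, cid))
			return cid;

	return CID_UNSET;
}
```

In this model nr_cids_used only matters in step 2, where it decides
whether the allocator may still grow into a fresh cid or has to reuse a
free one.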

> - In __mm_cid_try_get, only return cid values within the range
>   [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
>   prevents allocating cids above the number of allowed cpus in
>   rare scenarios where cid allocation races with a concurrent
>   remote-clear of the per-mm/cpu cid. This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.

and no comma to continue the iteration.

> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
>   t->nr_cpus_allowed. This criterion was really meant to compare
>   the number of mm->mm_users to the number of CPUs allowed for the
>   entire mm. Therefore, the prior comparison worked fine when all
>   threads shared the same CPUs allowed mask, but not so much in
>   scenarios where those threads have different masks (e.g. each
>   thread pinned to a single CPU). This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.
> 

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..8b5a185b4d5a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -782,6 +782,7 @@ struct vm_area_struct {
>  struct mm_cid {
>  	u64 time;
>  	int cid;
> +	int recent_cid;
>  };
>  #endif
>  
> @@ -852,6 +853,27 @@ struct mm_struct {
>  		 * When the next mm_cid scan is due (in jiffies).
>  		 */
>  		unsigned long mm_cid_next_scan;
> +		/**
> +		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
> +		 *
> +		 * Number of CPUs allowed in the union of all mm's
> +		 * threads allowed CPUs.
> +		 */
> +		atomic_t nr_cpus_allowed;
> +		/**
> +		 * @nr_cids_used: Number of used concurrency IDs.
> +		 *
> +		 * Track the highest concurrency ID allocated for the
> +		 * mm: nr_cids_used - 1.
> +		 */
> +		atomic_t nr_cids_used;
> +		/**
> +		 * @cpus_allowed_lock: Lock protecting mm cpus_allowed.
> +		 *
> +		 * Provide mutual exclusion for mm cpus_allowed and
> +		 * mm nr_cpus_allowed updates.

If nr_cpus_allowed update is serialized by this here thing, why is it an
atomic_t? A quick search seems to suggest you're only using atomic_set()
/ atomic_read() on it, which is a big fat clue it shouldn't be atomic_t.

Am I missing something?
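
To make the point concrete, here is a userspace sketch of the pattern I
would expect instead: a plain int, written only under the lock, read
locklessly with a single untorn load. Everything in it is an assumption
for illustration: struct mm_model and the mm_model_*() helpers are
hypothetical, READ_ONCE()/WRITE_ONCE() are simplified stand-ins for the
kernel macros (GNU C __typeof__ assumed), and a C11 atomic_flag plays
the role of the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel macros. */
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))

/* Hypothetical model; the atomic_flag stands in for the kernel lock. */
struct mm_model {
	atomic_flag lock;
	uint64_t cpus_allowed;	/* union of all threads' allowed CPUs */
	int nr_cpus_allowed;	/* cached weight of cpus_allowed */
};

static void mm_model_init(struct mm_model *mm, uint64_t mask)
{
	assert(mask != 0);	/* a task always has at least one CPU */
	atomic_flag_clear(&mm->lock);
	mm->cpus_allowed = mask;
	mm->nr_cpus_allowed = __builtin_popcountll(mask);
}

/* Writers are serialized by the lock; bits are only ever added. */
static void mm_model_set_cpus_allowed(struct mm_model *mm, uint64_t mask)
{
	while (atomic_flag_test_and_set_explicit(&mm->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	mm->cpus_allowed |= mask;
	WRITE_ONCE(mm->nr_cpus_allowed,
		   __builtin_popcountll(mm->cpus_allowed));
	atomic_flag_clear_explicit(&mm->lock, memory_order_release);
}

/* Lockless reader: one load, no read-modify-write needed. */
static int mm_model_nr_cpus_allowed(struct mm_model *mm)
{
	return READ_ONCE(mm->nr_cpus_allowed);
}
```

No read-modify-write ever happens on the counter outside the lock, so
atomic_t buys nothing over the plain int here.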

> +		 */
> +		spinlock_t cpus_allowed_lock;
>  #endif
>  #ifdef CONFIG_MMU
>  		atomic_long_t pgtables_bytes;	/* size of all page tables */
> @@ -1170,18 +1192,30 @@ static inline int mm_cid_clear_lazy_put(int cid)
>  	return cid & ~MM_CID_LAZY_PUT;
>  }
>  
> +/*
> + * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
> + */
> +static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
> +{
> +	unsigned long bitmap = (unsigned long)mm;
> +
> +	bitmap += offsetof(struct mm_struct, cpu_bitmap);
> +	/* Skip cpu_bitmap */
> +	bitmap += cpumask_size();
> +	return (struct cpumask *)bitmap;
> +}
> +
>  /* Accessor for struct mm_struct's cidmask. */
>  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
>  {
> -	unsigned long cid_bitmap = (unsigned long)mm;
> +	unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
>  
> -	cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
> -	/* Skip cpu_bitmap */
> +	/* Skip mm_cpus_allowed */
>  	cid_bitmap += cpumask_size();
>  	return (struct cpumask *)cid_bitmap;
>  }
>  
> -static inline void mm_init_cid(struct mm_struct *mm)
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
>  {
>  	int i;
>  
> @@ -1189,17 +1223,22 @@ static inline void mm_init_cid(struct mm_struct *mm)
>  		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>  
>  		pcpu_cid->cid = MM_CID_UNSET;
> +		pcpu_cid->recent_cid = MM_CID_UNSET;
>  		pcpu_cid->time = 0;
>  	}
> +	atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
> +	atomic_set(&mm->nr_cids_used, 0);
> +	spin_lock_init(&mm->cpus_allowed_lock);
> +	cpumask_copy(mm_cpus_allowed(mm), p->cpus_ptr);

Should that not be using p->cpus_mask? I mean, it is unlikely this code
is run during migrate_disable(), but just in case that ever does
happen, we'll be getting a spurious single-CPU mask.
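
A toy userspace model of the difference, in case it helps. The names
(struct task_model, task_*(), mm_seed_from_*()) are hypothetical and the
masks are plain 64-bit words; the only thing it is meant to show is that
cpus_ptr is redirected to a single-CPU mask while migration is disabled,
whereas cpus_mask keeps the requested affinity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two affinity views of a task. */
struct task_model {
	uint64_t cpus_mask;		/* affinity as requested */
	uint64_t pinned_mask;		/* single-CPU mask while pinned */
	const uint64_t *cpus_ptr;	/* mask currently in effect */
};

static void task_init(struct task_model *t, uint64_t mask)
{
	assert(mask != 0);
	t->cpus_mask = mask;
	t->pinned_mask = 0;
	t->cpus_ptr = &t->cpus_mask;
}

/* While migration is disabled, cpus_ptr covers only the current CPU. */
static void task_migrate_disable(struct task_model *t, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	t->pinned_mask = UINT64_C(1) << cpu;
	t->cpus_ptr = &t->pinned_mask;
}

static void task_migrate_enable(struct task_model *t)
{
	t->cpus_ptr = &t->cpus_mask;
}

/* Seeding the per-mm mask from cpus_ptr can see the pinned mask... */
static uint64_t mm_seed_from_ptr(const struct task_model *t)
{
	return *t->cpus_ptr;
}

/* ...whereas cpus_mask always reflects the full requested affinity. */
static uint64_t mm_seed_from_mask(const struct task_model *t)
{
	return t->cpus_mask;
}
```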

>  	cpumask_clear(mm_cidmask(mm));
>  }
>  
> -static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
> +static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
>  {
>  	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
>  	if (!mm->pcpu_cid)
>  		return -ENOMEM;
> -	mm_init_cid(mm);
> +	mm_init_cid(mm, p);
>  	return 0;
>  }
>  #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
> @@ -1212,16 +1251,31 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
>  
>  static inline unsigned int mm_cid_size(void)
>  {
> -	return cpumask_size();
> +	return 2 * cpumask_size();	/* mm_cpus_allowed(), mm_cidmask(). */
> +}
> +
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> +{
> +	struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> +
> +	if (!mm)
> +		return;
> +	/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
> +	spin_lock(&mm->cpus_allowed_lock);
> +	cpumask_or(mm_allowed, mm_allowed, cpumask);
> +	atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
> +	spin_unlock(&mm->cpus_allowed_lock);

We have a problem here: you call this from __do_set_cpus_allowed(),
which is holding rq->lock, which is a raw_spinlock_t. A spinlock_t
cannot be taken inside a raw_spinlock_t, since it becomes a sleeping
lock on PREEMPT_RT.

>  }
>  #else /* CONFIG_SCHED_MM_CID */
> -static inline void mm_init_cid(struct mm_struct *mm) { }
> -static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
> +static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
>  static inline void mm_destroy_cid(struct mm_struct *mm) { }
> +
>  static inline unsigned int mm_cid_size(void)
>  {
>  	return 0;
>  }
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
>  #endif /* CONFIG_SCHED_MM_CID */
>  
>  struct mmu_gather;
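
For readers following the accessor changes in this hunk: the two masks
live back to back in the space that trails cpu_bitmap, which is why
mm_cid_size() doubles. A userspace sketch of that layout, with
hypothetical names (struct mm_model, mm_model_*()) and a fixed
MASK_LONGS standing in for cpumask_size():

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MASK_LONGS 2	/* hypothetical bitmap length, in longs */

struct mm_model {
	int placeholder;		/* stands in for the other fields */
	unsigned long cpu_bitmap[];	/* must stay the last member */
};

static size_t mask_size(void)
{
	return MASK_LONGS * sizeof(unsigned long);
}

/* First trailing mask: skip over cpu_bitmap. */
static unsigned long *mm_model_cpus_allowed(struct mm_model *mm)
{
	char *p = (char *)mm + offsetof(struct mm_model, cpu_bitmap);

	return (unsigned long *)(p + mask_size());
}

/* Second trailing mask: skip over cpus_allowed as well. */
static unsigned long *mm_model_cidmask(struct mm_model *mm)
{
	char *p = (char *)mm_model_cpus_allowed(mm);

	return (unsigned long *)(p + mask_size());
}

/* Room for cpu_bitmap plus the two trailing masks, all zeroed. */
static struct mm_model *mm_model_alloc(void)
{
	return calloc(1, sizeof(struct mm_model) + 3 * mask_size());
}
```

The accessors only stay valid as long as the allocation size and the
order of the skips agree, so any further mask added here has to update
both.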

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 43e453ab7e20..772a3daf784a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2691,6 +2691,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
>  		put_prev_task(rq, p);
>  
>  	p->sched_class->set_cpus_allowed(p, ctx);
> +	mm_set_cpus_allowed(p->mm, ctx->new_mask);

This here runs with p->pi_lock and rq->lock held -- both are
raw_spinlock_t.

>  
>  	if (queued)
>  		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);