Message-ID: <20250326093841.GC25239@noisy.programming.kicks-ass.net>
Date: Wed, 26 Mar 2025 10:38:41 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
tim.c.chen@...ux.intel.com, tglx@...utronix.de, len.brown@...el.com,
gautham.shenoy@....com, mingo@...nel.org, kprateek.nayak@....com,
yu.chen.surf@...mail.com
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
> > +static void task_tick_cache(struct rq *rq, struct task_struct *p)
> > +{
> > + struct callback_head *work = &p->cache_work;
> > + struct mm_struct *mm = p->mm;
> > +
> > + if (!mm || !mm->pcpu_sched)
> > + return;
> > +
> > + if (mm->mm_sched_epoch == rq->cpu_epoch)
> > + return;
> > +
>
> [1]
>
> > + guard(raw_spinlock)(&mm->mm_sched_lock);
> > +
> > + if (mm->mm_sched_epoch == rq->cpu_epoch)
> > + return;
>
> Remove the duplicated check above [1] and keep only the check here?
Why? That just adds locking overhead for no reason.
> > +
> > + if (work->next == work) {
> > + task_work_add(p, work, TWA_RESUME);
> > + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> > + }
> > +}
> > +
> > +static void task_cache_work(struct callback_head *work)
> > +{
> > + struct task_struct *p = current;
> > + struct mm_struct *mm = p->mm;
> > + unsigned long m_a_occ = 0;
> > + int cpu, m_a_cpu = -1;
> > + cpumask_var_t cpus;
> > +
> > + WARN_ON_ONCE(work != &p->cache_work);
> > +
> > + work->next = work;
> > +
> > + if (p->flags & PF_EXITING)
> > + return;
> > +
> > + if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> > + return;
> > +
> > + scoped_guard (cpus_read_lock) {
> > + cpumask_copy(cpus, cpu_online_mask);
> > +
> > + for_each_cpu(cpu, cpus) {
> > + /* XXX sched_cluster_active */
> > + struct sched_domain *sd = per_cpu(sd_llc, cpu);
> > + unsigned long occ, m_occ = 0, a_occ = 0;
> > + int m_cpu = -1, nr = 0, i;
> > +
> > + for_each_cpu(i, sched_domain_span(sd)) {
> > + occ = fraction_mm_sched(cpu_rq(i),
> > + per_cpu_ptr(mm->pcpu_sched, i));
> > + a_occ += occ;
> > + if (occ > m_occ) {
> > + m_occ = occ;
> > + m_cpu = i;
> > + }
> > + nr++;
> > + trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
> > + per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
> > + }
> > +
> > + a_occ /= nr;
>
>
> In systems with a larger number of CPUs within a single LLC, the division
> may lose accuracy.
May, sure, but does it actually matter? We're tracking occupancy in ns
per EPOCH_PERIOD, which is 1e7. Let's assume 'large' here is something
daft like 100 CPUs per LLC (let's just really hope nobody actually goes
and does that, please), then you're still only losing like 1e2 ns from
your average.
> Can we either use multiplication for the comparison, or just use the
> accumulated total CPU occupancy of that LLC (by removing a_occ /= nr)?
You can't remove the division; you get into trouble when the LLCs don't
have an equal number of CPUs. You can carry that multiplication around of
course, but again, why bother?
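
If you really wanted exactness, carrying the multiplication around would
look roughly like this (sketch only; m_a_nr is a new local, initialised
to 1, tracking the CPU count of the currently-best LLC, and the later
threshold check and per-cpu store would need matching scaling):

	/*
	 * Compare a_occ/nr against m_a_occ/m_a_nr by cross-multiplying,
	 * so LLCs with different CPU counts still compare correctly and
	 * nothing is truncated. Occupancy is ns per EPOCH_PERIOD (1e7),
	 * so the products fit comfortably in unsigned long on 64bit.
	 */
	if (a_occ * m_a_nr > m_a_occ * (unsigned long)nr) {
		m_a_occ = a_occ;
		m_a_nr  = nr;
		m_a_cpu = m_cpu;
	}

But the truncation from a_occ /= nr is bounded by nr-1 ns per epoch,
which is the ~1e2 ns above -- not worth it.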
> > + if (a_occ > m_a_occ) {
> > + m_a_occ = a_occ;
> > + m_a_cpu = m_cpu;
> > + }
> > +
> > + trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
> > + per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
> > +
> > + for_each_cpu(i, sched_domain_span(sd)) {
> > + /* XXX threshold ? */
> > + per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
> > + }
> > +
> > + cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> > + }
> > + }
> > +
> > + /*
> > + * If the max average cache occupancy is 'small' we don't care.
> > + */
> > + if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
> > + m_a_cpu = -1;
> > +
> > + mm->mm_sched_cpu = m_a_cpu;
>
> There might be an issue with mm_sched_cpu bouncing. Perhaps adding a
> threshold to compare the old a_occ of mm->mm_sched_cpu with the new a_occ
> of m_a_cpu could help. For example, if the latter (new a_occ) is at least
> twice as large as the former (old a_occ), we update mm->mm_sched_cpu to
> the new m_a_cpu; otherwise, we keep the old value.
Some hysteresis might make sense, but I don't think waiting for it to
double is right; that might be too aggressive.
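
Something milder along these lines, perhaps (sketch only; the 1/8 margin
is a guess, and this sits just before the mm->mm_sched_cpu update, reusing
the LLC average that task_cache_work() just wrote into ->occ):

	/*
	 * Hysteresis: only move the preferred LLC when the new winner
	 * beats the occupancy of the current mm_sched_cpu's LLC by some
	 * margin (12.5% here, pulled out of thin air), so mm_sched_cpu
	 * doesn't bounce between LLCs of nearly equal occupancy.
	 */
	if (m_a_cpu >= 0 && mm->mm_sched_cpu >= 0) {
		unsigned long prev_occ =
			per_cpu_ptr(mm->pcpu_sched, mm->mm_sched_cpu)->occ;

		if (m_a_occ <= prev_occ + (prev_occ >> 3))
			m_a_cpu = mm->mm_sched_cpu;
	}

	mm->mm_sched_cpu = m_a_cpu;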
> > +#ifdef CONFIG_SCHED_CACHE
> > +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
> > +
> > +static int select_cache_cpu(struct task_struct *p, int prev_cpu)
> > +{
> > + struct mm_struct *mm = p->mm;
> > + int cpu;
> > +
> > + if (!mm || p->nr_cpus_allowed == 1)
> > + return prev_cpu;
> > +
> > + cpu = mm->mm_sched_cpu;
> > + if (cpu < 0)
> > + return prev_cpu;
> > +
>
>
> We observed frequent task migrations during some context-switch-heavy
> benchmarks, which led to a performance regression when the LLC was
> saturated. Could we avoid task migration in cases where the previous CPU
> and the preferred CPU are within the same LLC?
>
> if (cpus_share_cache(prev_cpu, cpu))
> return prev_cpu;
Sure.
> > +
> > + if (static_branch_likely(&sched_numa_balancing) &&
> > + __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
> > + /*
> > + * XXX look for max occupancy inside prev_cpu's node
> > + */
> > + return prev_cpu;
> > + }
> > +
>
> Tim found that spreading tasks within the preferred LLC might help avoid
> task stacking:
>
> sd = rcu_dereference(per_cpu(sd_llc, cpu));
> if (likely(sd))
> return cpumask_any(sched_domain_span(sd));
You know that cpumask_any() is implemented using cpumask_first(), right?
If anything, you're causing stacking with that.
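
If you want spreading, you'd have to actually go look for an idle CPU in
there, something like (sketch only; minimal fallback handling, and the
regular idle-sibling search may well make this redundant):

	struct sched_domain *sd;
	int i;

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (!sd)
		return prev_cpu;

	/*
	 * Walk the preferred LLC starting at mm_sched_cpu and pick the
	 * first allowed idle CPU, instead of piling everything onto
	 * cpumask_first() of the span; fall back to prev_cpu when the
	 * whole LLC is busy.
	 */
	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
		if (cpumask_test_cpu(i, p->cpus_ptr) && available_idle_cpu(i))
			return i;
	}

	return prev_cpu;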
> > @@ -9288,6 +9556,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
> > if (sysctl_sched_migration_cost == 0)
> > return 0;
> > +#ifdef CONFIG_SCHED_CACHE
> > + if (p->mm && p->mm->pcpu_sched) {
> > + /*
> > + * XXX things like Skylake have non-inclusive L3 and might not
> > + * like this L3 centric view. What to do about L2 stickyness ?
> > + */
> > + return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
> > + per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
>
> This might encourage more task migration within the preferred LLC, leading
> to some performance regression. Is it possible to raise the threshold for
> task migration within the preferred LLC and use the original delta time to
> determine whether a task is cache-hot?
>
> if (p->mm && p->mm->pcpu_sched &&
> cpus_share_cache(env->dst_cpu, env->src_cpu))
> return 2*per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
> per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
> }
Nah, the saner thing to do is to preserve the topology averages and look
at those instead of the per-cpu values.
E.g. have task_cache_work() compute and store averages in the
sched_domain structure and then use those.
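
Roughly this direction (sketch only; since the occupancy is per-mm, the
average here lives in a made-up second pcpu_sched field rather than in
struct sched_domain itself, and the llc_occ name is invented):

	/*
	 * task_cache_work(): keep the LLC average in its own field
	 * instead of clobbering the per-cpu value.
	 */
	for_each_cpu(i, sched_domain_span(sd))
		per_cpu_ptr(mm->pcpu_sched, i)->llc_occ = a_occ;

	/*
	 * task_hot(): compare the LLC averages, not the per-cpu values,
	 * so two CPUs in the same LLC compare equal and don't generate
	 * intra-LLC migration churn.
	 */
	if (p->mm && p->mm->pcpu_sched) {
		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->llc_occ >
		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->llc_occ;
	}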