Message-ID: <38126b7c-7782-46c9-ae07-86a669a710e0@intel.com>
Date: Thu, 27 Mar 2025 10:48:23 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <len.brown@...el.com>,
	<gautham.shenoy@....com>, <mingo@...nel.org>, <kprateek.nayak@....com>,
	<yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing

On 3/26/2025 5:38 PM, Peter Zijlstra wrote:
> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
> 
>>> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>> +{
>>> +	struct callback_head *work = &p->cache_work;
>>> +	struct mm_struct *mm = p->mm;
>>> +
>>> +	if (!mm || !mm->pcpu_sched)
>>> +		return;
>>> +
>>> +	if (mm->mm_sched_epoch == rq->cpu_epoch)
>>> +		return;
>>> +
>>
>> [1]
>>
>>> +	guard(raw_spinlock)(&mm->mm_sched_lock);
>>> +
>>> +	if (mm->mm_sched_epoch == rq->cpu_epoch)
>>> +		return;
>>
>> Remove above duplicated [1] and keep the check here?
> 
> Why? That just adds locking overhead for no reason.
> 

I thought mm->mm_sched_epoch is updated under the protection of
mm->mm_sched_lock, so shouldn't the read side also be protected by
this lock?
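
If the intent is that the unlocked check is only an opportunistic fast
path that gets re-validated under the lock, perhaps the read could at
least be annotated with READ_ONCE() to pair with the WRITE_ONCE() below,
something like (just an annotation, not a functional change):

	/*
	 * Opportunistic, unlocked fast path: a stale value only delays
	 * the update to the next tick; READ_ONCE() pairs with the
	 * WRITE_ONCE() done under mm_sched_lock.
	 */
	if (READ_ONCE(mm->mm_sched_epoch) == rq->cpu_epoch)
		return;

	guard(raw_spinlock)(&mm->mm_sched_lock);

	/* authoritative re-check under the lock */
	if (mm->mm_sched_epoch == rq->cpu_epoch)
		return;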

>>> +
>>> +	if (work->next == work) {
>>> +		task_work_add(p, work, TWA_RESUME);
>>> +		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
>>> +	}
>>> +}
>>> +
>>> +static void task_cache_work(struct callback_head *work)
>>> +{
>>> +	struct task_struct *p = current;
>>> +	struct mm_struct *mm = p->mm;
>>> +	unsigned long m_a_occ = 0;
>>> +	int cpu, m_a_cpu = -1;
>>> +	cpumask_var_t cpus;
>>> +
>>> +	WARN_ON_ONCE(work != &p->cache_work);
>>> +
>>> +	work->next = work;
>>> +
>>> +	if (p->flags & PF_EXITING)
>>> +		return;
>>> +
>>> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>>> +		return;
>>> +
>>> +	scoped_guard (cpus_read_lock) {
>>> +		cpumask_copy(cpus, cpu_online_mask);
>>> +
>>> +		for_each_cpu(cpu, cpus) {
>>> +			/* XXX sched_cluster_active */
>>> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
>>> +			unsigned long occ, m_occ = 0, a_occ = 0;
>>> +			int m_cpu = -1, nr = 0, i;
>>> +
>>> +			for_each_cpu(i, sched_domain_span(sd)) {
>>> +				occ = fraction_mm_sched(cpu_rq(i),
>>> +							per_cpu_ptr(mm->pcpu_sched, i));
>>> +				a_occ += occ;
>>> +				if (occ > m_occ) {
>>> +					m_occ = occ;
>>> +					m_cpu = i;
>>> +				}
>>> +				nr++;
>>> +				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
>>> +					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
>>> +			}
>>> +
>>> +			a_occ /= nr;
>>
>>
>> In systems with a larger number of CPUs within a single LLC, the division
>> may lose accuracy.
> 
> May, sure, but does it actually matter? We're tracking occupancy in ns
> per EPOCH_PERIOD, which is 1e7. Lets assume 'large' here is something
> daft like 100 CPUs per LLC (lets just really hope nobody actually goes
> and do that, please), then you're still only loosing like 1e2 ns from
> your average.
> 

I see, losing at most ~100 ns to the integer division should not matter
much compared to the 10 ms sample period.

>> Can we either use multiplication for comparison, or just use the accumulated
>> total CPU occupancy
>> of that LLC for comparison (by removing a_occ /= nr)?
> 
> You can't remove the devision, you get into trouble when the LLCs do not
> have equal number of CPUs. 

On a hybrid system, suppose there are two LLCs: LLC1 has 8 CPUs with a
sum_occ of 1024, so its avg_occ is 128; LLC2 has 4 CPUs with a sum_occ
of 768, so its avg_occ is 192. If we compare avg_occ we would choose
LLC2, even though LLC1 might have more idle CPUs and more active
cache-hot threads.
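
(For completeness, the multiplication I mentioned earlier would keep the
per-LLC comparison exact without assuming equal CPU counts; the helper
name and the sum/nr parameters below are just for illustration:)

	/*
	 * Compare the average occupancy of two LLCs without dividing:
	 *   sum1 / nr1 > sum2 / nr2  <=>  sum1 * nr2 > sum2 * nr1
	 * The sums are bounded by nr * EPOCH_PERIOD, so the products
	 * fit easily in an unsigned long on 64-bit.
	 */
	static inline bool llc_avg_occ_gt(unsigned long sum1, unsigned int nr1,
					  unsigned long sum2, unsigned int nr2)
	{
		return sum1 * nr2 > sum2 * nr1;
	}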

> You can carry that multiplication around ofc,
> but again, why bother?
> 
>>> +			if (a_occ > m_a_occ) {
>>> +				m_a_occ = a_occ;
>>> +				m_a_cpu = m_cpu;
>>> +			}
>>> +
>>> +			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
>>> +				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
>>> +
>>> +			for_each_cpu(i, sched_domain_span(sd)) {
>>> +				/* XXX threshold ? */
>>> +				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
>>> +			}
>>> +
>>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the max average cache occupancy is 'small' we don't care.
>>> +	 */
>>> +	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
>>> +		m_a_cpu = -1;
>>> +
>>> +	mm->mm_sched_cpu = m_a_cpu;
>>
>> There might be an issue with mm_sched_cpu bouncing. Perhaps adding a
>> threshold to compare the old a_occ of mm->mm_sched_cpu with the new a_occ of
>> m_a_cpu could help. For example, if the latter (new_a_occ) is twice larger
>> than the former (old a_occ), we can update mm->mm_sched_cpu to the new
>> m_a_cpu. Otherwise, we keep the old value.
> 
> Some hysteresis might make sense, but I don't think waiting for it to
> double makes sense, that might be too agressive.
> 

OK, I might need to do some evaluation to figure out a reasonable threshold.
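
Something softer than doubling could work, e.g. a 25% margin (the 5/4
ratio and the old_a_occ local below are placeholders until I have data):

	/*
	 * Hysteresis sketch: keep the currently preferred CPU unless the
	 * new busiest LLC's average occupancy (m_a_occ) beats the
	 * occupancy of the currently preferred LLC (old_a_occ, tracked in
	 * the same scan) by at least 25%.
	 */
	if (mm->mm_sched_cpu >= 0 && m_a_cpu != mm->mm_sched_cpu &&
	    m_a_occ * 4 < old_a_occ * 5)
		m_a_cpu = mm->mm_sched_cpu;

	mm->mm_sched_cpu = m_a_cpu;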

>>> +#ifdef CONFIG_SCHED_CACHE
>>> +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
>>> +
>>> +static int select_cache_cpu(struct task_struct *p, int prev_cpu)
>>> +{
>>> +	struct mm_struct *mm = p->mm;
>>> +	int cpu;
>>> +
>>> +	if (!mm || p->nr_cpus_allowed == 1)
>>> +		return prev_cpu;
>>> +
>>> +	cpu = mm->mm_sched_cpu;
>>> +	if (cpu < 0)
>>> +		return prev_cpu;
>>> +
>>
>>
>> We observed frequent task migrations during some highly context-switch
>> benchmarks, which led to performance regression when the LLC was saturated.
>> Could we avoid task migration in cases where the previous CPU and the
>> preferred CPU are within the same LLC?
>>
>> if (cpus_share_cache(prev_cpu, cpu))
>> 	return prev_cpu;
> 
> Sure.
> 
>>> +
>>> +	if (static_branch_likely(&sched_numa_balancing) &&
>>> +	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
>>> +		/*
>>> +		 * XXX look for max occupancy inside prev_cpu's node
>>> +		 */
>>> +		return prev_cpu;
>>> +	}
>>> +
>>
>> Tim found that spreading tasks within the preferred LLC might help avoid
>> task stacking:
>>
>> sd = rcu_dereference(per_cpu(sd_llc, cpu));
>> if (likely(sd))
>> 	return cpumask_any(sched_domain_span(sd));
> 
> You know that cpumask_any() is implemented using cpumask_first(), right?
> You're causing stacking if anything with that.
> 

I see, this would still cause the stacking issue. Let me try to find
another way to spread tasks within their preferred LLC.
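
One idea I will try is picking a CPU inside the preferred LLC with
cpumask_any_and_distribute(), which rotates its starting point instead
of always returning the first CPU of the span; a rough sketch (idle-CPU
preference and further checks still to be added):

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd) {
		int new_cpu = cpumask_any_and_distribute(sched_domain_span(sd),
							 p->cpus_ptr);
		if (new_cpu < nr_cpu_ids)
			return new_cpu;
	}
	return prev_cpu;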

>>> @@ -9288,6 +9556,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>>    	if (sysctl_sched_migration_cost == 0)
>>>    		return 0;
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +	if (p->mm && p->mm->pcpu_sched) {
>>> +		/*
>>> +		 * XXX things like Skylake have non-inclusive L3 and might not
>>> +		 * like this L3 centric view. What to do about L2 stickyness ?
>>> +		 */
>>> +		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
>>> +		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
>>
>> This might encourage more task migration within the preferred LLC, leading
>> to some performance regression. Is it possible to raise the threshold for
>> task migration within the preferred LLC and use the original delta time to
>> determine whether a task is cache-hot?
>>
>> if (p->mm && p->mm->pcpu_sched &&
>>      cpus_share_cache(env->dst_cpu, env->src_cpu))
>> 	return  2*per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
>> 		  per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
>> }

On second thought, my change above might be incorrect, because every CPU
in the same LLC has the same per-CPU occ.

> Nah, the saner thing to do is to preserve the topology averages and look
> at those instead of the per-cpu values.
> 
> Eg. have task_cache_work() compute and store averages in the
> sched_domain structure and then use those.
Sorry, I did not quite catch this. Doesn't
per_cpu_ptr(p->mm->pcpu_sched, cpu)->occ
already hold the average occupancy of the whole LLC domain? Every CPU
in the same LLC should have the same average occ, because
task_cache_work() does:

for_each_cpu(i, sched_domain_span(sd)) {
	/* XXX threshold ? */
	per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
}


If the goal is to avoid migrating a task to its non-preferred LLC during
load balance, maybe we can treat p as cache hot when:
1. env->src_cpu is already in p's preferred LLC, and
2. env->dst_cpu is not in p's preferred LLC, and
3. p's preferred LLC is not overloaded.

The preferred LLC of a task would be defined as something like:
p->mm->mm_preferred_llc = llc_id(p->mm->mm_sched_cpu)
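
In task_hot() the check could then look roughly like below, where
llc_id() is the same placeholder helper as above and llc_overloaded()
is another placeholder for the "not overloaded" test:

#ifdef CONFIG_SCHED_CACHE
	if (p->mm && p->mm->mm_sched_cpu >= 0) {
		int pref_llc = llc_id(p->mm->mm_sched_cpu);

		/*
		 * Pulling p out of its preferred LLC while that LLC still
		 * has spare capacity is treated as migrating a cache-hot
		 * task.
		 */
		if (llc_id(env->src_cpu) == pref_llc &&
		    llc_id(env->dst_cpu) != pref_llc &&
		    !llc_overloaded(pref_llc))
			return 1;
	}
#endif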


I'll add some trace/schedstat on top of your patch to investigate.

thanks,
Chenyu

