Message-ID: <38126b7c-7782-46c9-ae07-86a669a710e0@intel.com>
Date: Thu, 27 Mar 2025 10:48:23 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <len.brown@...el.com>,
<gautham.shenoy@....com>, <mingo@...nel.org>, <kprateek.nayak@....com>,
<yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
On 3/26/2025 5:38 PM, Peter Zijlstra wrote:
> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>
>>> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>> +{
>>> + struct callback_head *work = &p->cache_work;
>>> + struct mm_struct *mm = p->mm;
>>> +
>>> + if (!mm || !mm->pcpu_sched)
>>> + return;
>>> +
>>> + if (mm->mm_sched_epoch == rq->cpu_epoch)
>>> + return;
>>> +
>>
>> [1]
>>
>>> + guard(raw_spinlock)(&mm->mm_sched_lock);
>>> +
>>> + if (mm->mm_sched_epoch == rq->cpu_epoch)
>>> + return;
>>
>> Remove above duplicated [1] and keep the check here?
>
> Why? That just adds locking overhead for no reason.
>
I thought mm->mm_sched_epoch was updated under the protection of
mm->mm_sched_lock, so shouldn't the read side also be protected by
the same lock?
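If the unlocked check is meant as the usual double-checked-locking fast
path, maybe the read side just needs a READ_ONCE() to pair with the
WRITE_ONCE() in the writer? A sketch of what I mean (untested, just to
confirm my understanding of the pattern):

	/*
	 * Racy fast path: a stale read is harmless because the
	 * check is repeated under mm_sched_lock below.
	 */
	if (READ_ONCE(mm->mm_sched_epoch) == rq->cpu_epoch)
		return;

	guard(raw_spinlock)(&mm->mm_sched_lock);

	/* Authoritative check, serialized against the writer. */
	if (mm->mm_sched_epoch == rq->cpu_epoch)
		return;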
>>> +
>>> + if (work->next == work) {
>>> + task_work_add(p, work, TWA_RESUME);
>>> + WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
>>> + }
>>> +}
>>> +
>>> +static void task_cache_work(struct callback_head *work)
>>> +{
>>> + struct task_struct *p = current;
>>> + struct mm_struct *mm = p->mm;
>>> + unsigned long m_a_occ = 0;
>>> + int cpu, m_a_cpu = -1;
>>> + cpumask_var_t cpus;
>>> +
>>> + WARN_ON_ONCE(work != &p->cache_work);
>>> +
>>> + work->next = work;
>>> +
>>> + if (p->flags & PF_EXITING)
>>> + return;
>>> +
>>> + if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>>> + return;
>>> +
>>> + scoped_guard (cpus_read_lock) {
>>> + cpumask_copy(cpus, cpu_online_mask);
>>> +
>>> + for_each_cpu(cpu, cpus) {
>>> + /* XXX sched_cluster_active */
>>> + struct sched_domain *sd = per_cpu(sd_llc, cpu);
>>> + unsigned long occ, m_occ = 0, a_occ = 0;
>>> + int m_cpu = -1, nr = 0, i;
>>> +
>>> + for_each_cpu(i, sched_domain_span(sd)) {
>>> + occ = fraction_mm_sched(cpu_rq(i),
>>> + per_cpu_ptr(mm->pcpu_sched, i));
>>> + a_occ += occ;
>>> + if (occ > m_occ) {
>>> + m_occ = occ;
>>> + m_cpu = i;
>>> + }
>>> + nr++;
>>> + trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
>>> + per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
>>> + }
>>> +
>>> + a_occ /= nr;
>>
>>
>> In systems with a larger number of CPUs within a single LLC, the division
>> may lose accuracy.
>
> May, sure, but does it actually matter? We're tracking occupancy in ns
> per EPOCH_PERIOD, which is 1e7. Lets assume 'large' here is something
> daft like 100 CPUs per LLC (lets just really hope nobody actually goes
> and does that, please), then you're still only losing like 1e2 ns from
> your average.
>
I see, the 100 ns should not matter much compared to the 10 ms sample period.
>> Can we either use multiplication for comparison, or just use the accumulated
>> total CPU occupancy
>> of that LLC for comparison (by removing a_occ /= nr)?
>
> You can't remove the division, you get into trouble when the LLCs do not
> have equal number of CPUs.
On a hybrid system, suppose there are two LLCs: LLC1 has 8 CPUs with a
sum_occ of 1024, so its avg_occ is 128; LLC2 has 4 CPUs with a sum_occ
of 768, so its avg_occ is 192. If we compare avg_occ, we would choose
LLC2, even though LLC1 might have more idle CPUs and more active
cache-hot threads.
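Just to make the numbers concrete (toy values, not patch code):

	/* LLC1: 8 CPUs, sum_occ 1024 -> avg_occ 128 */
	unsigned long llc1_sum = 1024, llc1_nr = 8;
	/* LLC2: 4 CPUs, sum_occ  768 -> avg_occ 192 */
	unsigned long llc2_sum =  768, llc2_nr = 4;

	/* Average comparison (what the patch does): LLC2 wins. */
	bool by_avg = (llc2_sum / llc2_nr) > (llc1_sum / llc1_nr);

	/* Cross-multiplied form: no division, same ordering, exact. */
	bool by_mul = llc2_sum * llc1_nr > llc1_sum * llc2_nr;

The cross-multiplied form avoids the precision loss of the integer
division, though as you said it is probably not worth the trouble.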
> You can carry that multiplication around ofc,
> but again, why bother?
>
>>> + if (a_occ > m_a_occ) {
>>> + m_a_occ = a_occ;
>>> + m_a_cpu = m_cpu;
>>> + }
>>> +
>>> + trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
>>> + per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
>>> +
>>> + for_each_cpu(i, sched_domain_span(sd)) {
>>> + /* XXX threshold ? */
>>> + per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
>>> + }
>>> +
>>> + cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * If the max average cache occupancy is 'small' we don't care.
>>> + */
>>> + if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
>>> + m_a_cpu = -1;
>>> +
>>> + mm->mm_sched_cpu = m_a_cpu;
>>
>> There might be an issue with mm_sched_cpu bouncing. Perhaps adding a
>> threshold to compare the old a_occ of mm->mm_sched_cpu with the new a_occ of
>> m_a_cpu could help. For example, if the latter (new_a_occ) is twice larger
>> than the former (old a_occ), we can update mm->mm_sched_cpu to the new
>> m_a_cpu. Otherwise, we keep the old value.
>
> Some hysteresis might make sense, but I don't think waiting for it to
> double makes sense, that might be too agressive.
>
OK, might need to do some evaluation to figure out a reasonable threshold.
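Something like this, perhaps (untested; old_a_occ would be the a_occ of
the LLC that currently holds mm->mm_sched_cpu, and the margin is just a
placeholder to be tuned by evaluation):

	/*
	 * Rough idea only: switch the preferred CPU to the new LLC
	 * only when its average occupancy beats the current preferred
	 * LLC by some margin, e.g. 25%.
	 */
	if (mm->mm_sched_cpu < 0 ||
	    m_a_occ > old_a_occ + (old_a_occ >> 2))
		mm->mm_sched_cpu = m_a_cpu;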
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
>>> +
>>> +static int select_cache_cpu(struct task_struct *p, int prev_cpu)
>>> +{
>>> + struct mm_struct *mm = p->mm;
>>> + int cpu;
>>> +
>>> + if (!mm || p->nr_cpus_allowed == 1)
>>> + return prev_cpu;
>>> +
>>> + cpu = mm->mm_sched_cpu;
>>> + if (cpu < 0)
>>> + return prev_cpu;
>>> +
>>
>>
>> We observed frequent task migrations during some highly context-switch
>> benchmarks, which led to performance regression when the LLC was saturated.
>> Could we avoid task migration in cases where the previous CPU and the
>> preferred CPU are within the same LLC?
>>
>> if (cpus_share_cache(prev_cpu, cpu))
>> return prev_cpu;
>
> Sure.
>
>>> +
>>> + if (static_branch_likely(&sched_numa_balancing) &&
>>> + __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
>>> + /*
>>> + * XXX look for max occupancy inside prev_cpu's node
>>> + */
>>> + return prev_cpu;
>>> + }
>>> +
>>
>> Tim found that spreading tasks within the preferred LLC might help avoid
>> task stacking:
>>
>> sd = rcu_dereference(per_cpu(sd_llc, cpu));
>> if (likely(sd))
>> return cpumask_any(sched_domain_span(sd));
>
> You know that cpumask_any() is implemented using cpumask_first(), right?
> You're causing stacking if anything with that.
>
I see, this would still cause the stacking issue. Let me try to find
another way to spread tasks within their preferred LLC.
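One option might be cpumask_any_and_distribute(), which rotates its
starting point instead of always returning the first CPU of the span
(sketch only, not tested):

	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

	if (likely(sd)) {
		/*
		 * Distribute the pick across the preferred LLC rather
		 * than stacking on cpumask_first().  Whether this is
		 * good enough still needs evaluation.
		 */
		int new_cpu = cpumask_any_and_distribute(sched_domain_span(sd),
							 p->cpus_ptr);
		if (new_cpu < nr_cpu_ids)
			return new_cpu;
	}
	return prev_cpu;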
>>> @@ -9288,6 +9556,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>> if (sysctl_sched_migration_cost == 0)
>>> return 0;
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + if (p->mm && p->mm->pcpu_sched) {
>>> + /*
>>> + * XXX things like Skylake have non-inclusive L3 and might not
>>> + * like this L3 centric view. What to do about L2 stickyness ?
>>> + */
>>> + return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
>>> + per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
>>
>> This might encourage more task migration within the preferred LLC, leading
>> to some performance regression. Is it possible to raise the threshold for
>> task migration within the preferred LLC and use the original delta time to
>> determine whether a task is cache-hot?
>>
>> if (p->mm && p->mm->pcpu_sched &&
>> cpus_share_cache(env->dst_cpu, env->src_cpu))
>> return 2*per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
>> per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
>> }
After a second thought, my suggested change might be incorrect, because
every CPU in the same LLC has the same per-CPU occ.
> Nah, the saner thing to do is to preserve the topology averages and
> look at those instead of the per-cpu values.
>
> Eg. have task_cache_work() compute and store averages in the
> sched_domain structure and then use those.
Sorry, I did not quite follow this. Doesn't
per_cpu_ptr(p->mm->pcpu_sched, cpu)->occ
already hold the average occupancy of the whole LLC domain? Every
CPU in the same LLC should have the same average occ, because in
task_cache_work():

	for_each_cpu(i, sched_domain_span(sd)) {
		/* XXX threshold ? */
		per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
	}
If the goal is to avoid migrating a task to a non-preferred LLC during
load balance, maybe we can treat p as cache hot when:
1. env->src_cpu is already in p's preferred LLC, and
2. env->dst_cpu is not in p's preferred LLC, and
3. p's preferred LLC is not overloaded.
The preferred LLC of a task would be defined as something like:
	p->mm->mm_preferred_llc = llc_id(p->mm->mm_sched_cpu)
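A rough sketch of the corresponding task_hot() check (llc_overloaded()
is a hypothetical helper, not something in your patch):

	/*
	 * Treat p as cache hot when the balancer would pull it out of
	 * its preferred LLC while that LLC is not overloaded.
	 */
	if (p->mm && p->mm->mm_sched_cpu >= 0 &&
	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
	    !cpus_share_cache(env->dst_cpu, p->mm->mm_sched_cpu) &&
	    !llc_overloaded(p->mm->mm_sched_cpu))
		return 1;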
I'll add some trace/schedstat on top of your patch to investigate.
thanks,
Chenyu