Message-ID: <09ba5932-a256-4cdd-94dc-4f2b6569c855@intel.com>
Date: Mon, 31 Mar 2025 14:25:32 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Hillf Danton <hdanton@...a.com>
CC: <vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>,
<kprateek.nayak@....com>, <yu.chen.surf@...mail.com>, Peter Zijlstra
<peterz@...radead.org>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
On 3/27/2025 7:20 PM, Hillf Danton wrote:
> On Wed, Mar 26, 2025 at 11:25:53AM +0100, Peter Zijlstra wrote:
>> On Wed, Mar 26, 2025 at 10:38:41AM +0100, Peter Zijlstra wrote:
>>
>>> Nah, the saner thing to do is to preserve the topology averages and look
>>> at those instead of the per-cpu values.
>>>
>>> Eg. have task_cache_work() compute and store averages in the
>>> sched_domain structure and then use those.
>>
>> A little something like so perhaps ?
>>
> My $.02 followup with the assumption that l2 cache temperature can not
> make sense without comparing. Just for idea show.
>
> Hillf
>
> --- m/include/linux/sched.h
> +++ n/include/linux/sched.h
> @@ -1355,6 +1355,11 @@ struct task_struct {
> unsigned long numa_pages_migrated;
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_SCHED_CACHE
> +#define LXC_SIZE 64 /* should be setup by parsing topology */
> + unsigned long lxc_temp[LXC_SIZE]; /* x > 1, l2 cache temperature for instance */
> +#endif
> +
> #ifdef CONFIG_RSEQ
> struct rseq __user *rseq;
> u32 rseq_len;
> --- m/kernel/sched/fair.c
> +++ n/kernel/sched/fair.c
> @@ -7953,6 +7953,22 @@ static int select_idle_sibling(struct ta
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> +#ifdef CONFIG_SCHED_CACHE
> + /*
> + * 2, lxc temp can not make sense without comparing
> + *
> + * target can be any cpu if lxc is cold
> + */
> + if ((unsigned int)prev_aff < nr_cpumask_bits)
> + if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)prev_aff)] >
> + p->lxc_temp[per_cpu(sd_share_id, target)])
> + target = prev_aff;
> + if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> + if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)recent_used_cpu)] >
> + p->lxc_temp[per_cpu(sd_share_id, target)])
> + target = recent_used_cpu;
> + p->lxc_temp[per_cpu(sd_share_id, target)] += 1;
> +#else
> /*
> * For cluster machines which have lower sharing cache like L2 or
> * LLC Tag, we tend to find an idle CPU in the target's cluster
> @@ -7963,6 +7979,7 @@ static int select_idle_sibling(struct ta
> return prev_aff;
> if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> return recent_used_cpu;
> +#endif
>
> return target;
> }
> @@ -13059,6 +13076,13 @@ static void task_tick_fair(struct rq *rq
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> +#ifdef CONFIG_SCHED_CACHE
> + /*
> + * 0, lxc is defined cold after 2-second nap
> + * 1, task migrate across NUMA node makes lxc cold
> + */
> + curr->lxc_temp[per_cpu(sd_share_id, rq->cpu)] += 5;
If lxc_temp is per task, this goes in a different direction: it tracks each
task's activity rather than the whole process's activity. I think the idea is
applicable for overwriting target with another CPU when the latter sits in a
cache-hot LLC, so that select_idle_cpu() can then search for an idle CPU
within that hot LLC.
thanks,
Chenyu