Message-ID: <522c6c09-9d07-4417-b9d3-925bd2224627@amd.com>
Date: Wed, 29 Oct 2025 10:02:01 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Tim Chen <tim.c.chen@...ux.intel.com>, "Chen, Yu C" <yu.c.chen@...el.com>
CC: Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
Rostedt" <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Madadi Vineeth
Reddy" <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, "Shrikanth
Hegde" <sshegde@...ux.ibm.com>, Jianyong Wu <jianyong.wu@...look.com>, Yangyu
Chen <cyy@...self.name>, Tingyin Duan <tingyin.duan@...il.com>, "Vern Hao"
<vernhao@...cent.com>, Len Brown <len.brown@...el.com>, Aubrey Li
<aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>, Chen Yu
<yu.chen.surf@...il.com>, Adam Li <adamli@...amperecomputing.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>
Subject: Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
Hello Tim,
On 10/28/2025 9:16 PM, Tim Chen wrote:
> On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
>> On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
>>> Hello Tim,
>>>
>>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>>> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>> struct rq *rq = rq_of(cfs_rq);
>>>>
>>>> account_numa_enqueue(rq, task_of(se));
>>>> + account_llc_enqueue(rq, task_of(se));
>>>> list_add(&se->group_node, &rq->cfs_tasks);
>>>> }
>>>> cfs_rq->nr_queued++;
>>>> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>> update_load_sub(&cfs_rq->load, se->load.weight);
>>>> if (entity_is_task(se)) {
>>>> account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>>>> + account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>>>> list_del_init(&se->group_node);
>>>> }
>>>> cfs_rq->nr_queued--;
>>>> +
>>>> + /* safeguard to clear the cache aware data */
>>>> + if (!parent_entity(se) && !cfs_rq->nr_queued)
>>>> + reset_llc_stats(rq_of(cfs_rq));
>>>
>>> Instead of relying on reset_llc_stats() hack, I think a better approach
>>> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
>>> has "uc_se->active" and we set this in account_llc_enqueue() which will
>>> still check for sched_cache_enabled() but account_llc_dequeue() would
>>> only check for "p->se.llc_sched_active" to decrement the stats and then
>>> unset the flag.
>>>
>>> That way, we cannot have an imbalanced accounting. Thoughts?
>>>
>>
>> I suppose what you mean is to avoid the race condition between
>> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
>>
>> enqueue(taskA)
>> // sched_cache gets enabled
>> enqueue(taskB)
>> dequeue(taskA)
>> // Must not decrement rq->llc_pref for taskA
>
> For this case, task A is already on rq when sched cache get
> enabled. But task A's preferred_llc is still -1.
>
> If we dequeue it while its preferred_llc is still -1, it won't
> affect rq->llc_pref.
>
> If we change its preferred_llc to llc_i before we dequeue it,
> then rq->llc_pref[llc_i] will be incremented first.
>
> Then when we dequeue task A, we will decrement it. We are
> still accounting rq->llc_pref[llc_i] correctly with current
> code.
So what I really disliked was having reset_llc_stats() reset the
stat, but looking at it again, that too is guarded by the
sched_cache_enabled() check, so I think the counters can still go
out of balance if:
/* Cache aware scheduling enabled */
enqueue(TaskA) /* nr_llc_running = 1 */
enqueue(TaskB) /* nr_llc_running = 2 */
enqueue(TaskC) /* nr_llc_running = 3 */
dequeue(TaskA) /* nr_llc_running = 2 */
/* Cache aware scheduling disabled */
dequeue(TaskB) /* nr_llc_running = 2 */
dequeue(TaskC) /* nr_llc_running = 2 */
/* nr_running == 0; nr_llc_running = 2 */
/* Cache aware scheduling enabled again */
enqueue(TaskD) /* nr_llc_running = 3 */
enqueue(TaskE) /* nr_llc_running = 4 */
...
At some later point, if nr_running reaches 0 again,
"nr_llc_running" is finally reset to 0, but until then it can
show an inaccurate value if users repeatedly toggle the feature
depending on the workload running.
>
> The trickier case is if we need to dynamically resize
> rq->llc_pref[]. We need to make sure that we lock the rq
> to prevent enqueue/dequeue, switch it to a larger size
> rq->llc_pref[], copy the old data over, then switch over
> to the larger sized rq->llc_pref[] and unlock rq to keep
> the accounting straight.
When that happens, we'll have to modify the sched domains since
something has changed in the system / cpuset, and rq_attach_root()
could be a good place to do it, when the rq is offlined and onlined
once again with the rq_lock held.

The only issue is if the partition splits the LLC, in which case
we'll have two LLC indices - one for the first half and another for
the second half - and we'll have to re-account the tasks to these
new indices.
--
Thanks and Regards,
Prateek