Message-ID: <522c6c09-9d07-4417-b9d3-925bd2224627@amd.com>
Date: Wed, 29 Oct 2025 10:02:01 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Tim Chen <tim.c.chen@...ux.intel.com>, "Chen, Yu C" <yu.c.chen@...el.com>
CC: Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
Rostedt" <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Madadi Vineeth
Reddy" <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, "Shrikanth
Hegde" <sshegde@...ux.ibm.com>, Jianyong Wu <jianyong.wu@...look.com>, Yangyu
Chen <cyy@...self.name>, Tingyin Duan <tingyin.duan@...il.com>, "Vern Hao"
<vernhao@...cent.com>, Len Brown <len.brown@...el.com>, Aubrey Li
<aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>, Chen Yu
<yu.chen.surf@...il.com>, Adam Li <adamli@...amperecomputing.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>
Subject: Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
Hello Tim,
On 10/28/2025 9:16 PM, Tim Chen wrote:
> On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
>> On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
>>> Hello Tim,
>>>
>>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>>> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>> struct rq *rq = rq_of(cfs_rq);
>>>>
>>>> account_numa_enqueue(rq, task_of(se));
>>>> + account_llc_enqueue(rq, task_of(se));
>>>> list_add(&se->group_node, &rq->cfs_tasks);
>>>> }
>>>> cfs_rq->nr_queued++;
>>>> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>> update_load_sub(&cfs_rq->load, se->load.weight);
>>>> if (entity_is_task(se)) {
>>>> account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>>>> + account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>>>> list_del_init(&se->group_node);
>>>> }
>>>> cfs_rq->nr_queued--;
>>>> +
>>>> + /* safeguard to clear the cache aware data */
>>>> + if (!parent_entity(se) && !cfs_rq->nr_queued)
>>>> + reset_llc_stats(rq_of(cfs_rq));
>>>
>>> Instead of relying on reset_llc_stats() hack, I think a better approach
>>> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
>>> has "uc_se->active" and we set this in account_llc_enqueue() which will
>>> still check for sched_cache_enabled() but account_llc_dequeue() would
>>> only check for "p->se.llc_sched_active" to decrement the stats and then
>>> unset the flag.
>>>
>>> That way, we cannot have an imbalanced accounting. Thoughts?
>>>
>>
>> I suppose what you mean is to avoid the race condition between
>> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
>>
>> enqueue(taskA)
>> // sched_cache gets enabled
>> enqueue(taskB)
>> dequeue(taskA)
>> // Must not decrement rq->llc_pref for taskA
>
> For this case, task A is already on rq when sched cache get
> enabled. But task A's preferred_llc is still -1.
>
> If we dequeue it while its preferred_llc is still -1, it won't
> affect rq->llc_pref.
>
> If we change its preferred_llc to llc_i before we dequeue it,
> then rq->llc_pref[llc_i] will be incremented first.
>
> Then when we dequeue task A, we will decrement it. We are
> still accounting rq->llc_pref[llc_i] correctly with current
> code.
So what I really disliked was having reset_llc_stats() reset the
stat, but looking at it again, that too is guarded by the
sched_cache_enabled() check, so I think the counters can still go
out of balance if:
/* Cache aware scheduling enabled */
enqueue(TaskA) /* nr_llc_running = 1 */
enqueue(TaskB) /* nr_llc_running = 2 */
enqueue(TaskC) /* nr_llc_running = 3 */
dequeue(TaskA) /* nr_llc_running = 2 */
/* Cache aware scheduling disabled */
dequeue(TaskB) /* nr_llc_running = 2 */
dequeue(TaskC) /* nr_llc_running = 2 */
/* nr_running == 0; nr_llc_running = 2 */
/* Cache aware scheduling enabled again */
enqueue(TaskD) /* nr_llc_running = 3 */
enqueue(TaskE) /* nr_llc_running = 4 */
...
At some later point, if nr_running reaches 0 again,
"nr_llc_running" is finally reset to 0, but until then it can
show an inaccurate value if users repeatedly toggle the feature
depending on the workload running.
>
> The trickier case is if we need to dynamically resize
> rq->llc_pref[]. We need to make sure that we lock the rq
> to prevent enqueue/dequeue, switch it to a larger size
> rq->llc_pref[], copy the old data over, then switch over
> to the larger sized rq->llc_pref[] and unlock rq to keep
> the accounting straight.
When that happens, we'll have to modify the sched domains since
something has changed in the system / cpuset, and rq_attach_root()
could be a good place to do it, when the rq is offlined and onlined
once again with the rq_lock held.

The only issue is if the partition splits the LLC, in which case
we'll have two LLC indices - one for the first half and another for
the second half - and we'll have to re-account the tasks to these
new indices.
--
Thanks and Regards,
Prateek