Message-ID: <20250523075640.GA1168183@bytedance>
Date: Fri, 23 May 2025 15:56:40 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>
Subject: Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class
 change for throttled task

On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
> On 2025/5/20 18:41, Aaron Lu wrote:
> > On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
> > core will dequeue it and then requeue it.
> > 
> > A throttled task is still considered queued by core because p->on_rq
> > is still set, so core will dequeue it; but since fair has already
> > dequeued the task on throttle, handle this case properly.
> > 
> > Affinity and sched class changes are similar.
> 
> How about setting p->on_rq to 0 when throttled? That would reflect the
> fact that the task is not on a cfs queue anymore. Does this approach
> cause any problems?
> 

On task group change, affinity change, etc., if the throttled task is
regarded as !on_rq, it will miss the chance to be enqueued on the
new (and correct) cfs_rq; instead, it will be enqueued back to its
original cfs_rq on unthrottle, which breaks the affinity or task group
settings. We might be able to do something in tg_unthrottle_up() to take
special care of these situations, but that seems like a lot of headache.
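
To make that concrete, here is a condensed sketch of the core-side move
pattern this relies on (modeled on sched_move_task() in
kernel/sched/core.c, simplified and not the exact upstream code):

	void sched_move_task_sketch(struct task_struct *tsk, struct task_group *group)
	{
		struct rq *rq = task_rq(tsk);
		/* true only while p->on_rq == TASK_ON_RQ_QUEUED */
		bool queued = task_on_rq_queued(tsk);

		if (queued)
			dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);

		/* switch tsk to the new task group's cfs_rq */
		sched_change_group(tsk, group);

		if (queued)
			enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);
	}

If throttle cleared p->on_rq, queued would be false here, so the task
would never be enqueued on the new cfs_rq and would only reappear on its
old cfs_rq's throttled_limbo_list at unthrottle time.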

Also, for a task group change, if the new task group has no throttle
settings, the throttled task should be allowed to run immediately instead
of waiting for its old cfs_rq's unthrottle event. The same is true when
a throttled task changes its sched class, e.g. from fair to rt.
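
Roughly, with this patch applied, the flow for such a task becomes
(a condensed sketch, not the exact code; the comments are mine):

	/* core side: p->on_rq is still TASK_ON_RQ_QUEUED, so core dequeues */
	dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_MOVE);
		/* -> dequeue_task_fair(): */
		if (unlikely(task_is_throttled(p))) {
			/* unlink from the old cfs_rq's throttled_limbo_list */
			dequeue_throttled_task(p, flags);
			return true;
		}

	/* core switches p to the new group / sched class here */

	enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_MOVE);
		/*
		 * The new cfs_rq is not throttled, so p is runnable
		 * immediately; unthrottle of the old cfs_rq no longer
		 * touches p since it left the limbo list above.
		 */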

Makes sense?

Thanks,
Aaron

> > 
> > Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
> > ---
> >   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> >   1 file changed, 24 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 74bc320cbc238..4c66fd8d24389 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
> >   		update_rq_clock(rq);
> >   		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> >   		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		/*
> > +		 * Must not add it to limbo list before dequeue or dequeue will
> > +		 * mistakenly regard this task as an already throttled one.
> > +		 */
> >   		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >   		resched_curr(rq);
> >   	}
> > @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
> >   	INIT_LIST_HEAD(&p->throttle_node);
> >   }
> > +static void dequeue_throttled_task(struct task_struct *p, int flags)
> > +{
> > +	/*
> > +	 * Task is throttled and someone wants to dequeue it again:
> > +	 * it must be sched/core when core needs to do things like
> > +	 * task affinity change, task group change, task sched class
> > +	 * change etc.
> > +	 */
> > +	WARN_ON_ONCE(p->se.on_rq);
> > +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> > +
> > +	list_del_init(&p->throttle_node);
> > +}
> > +
> >   static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> >   static int tg_unthrottle_up(struct task_group *tg, void *data)
> >   {
> > @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> >   static void task_throttle_setup_work(struct task_struct *p) {}
> >   static bool task_is_throttled(struct task_struct *p) { return false; }
> > +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
> >   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >   {
> > @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >    */
> >   static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   {
> > +	if (unlikely(task_is_throttled(p))) {
> > +		dequeue_throttled_task(p, flags);
> > +		return true;
> > +	}
> > +
> >   	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> >   		util_est_dequeue(&rq->cfs, p);
