Message-ID: <20250523095350.GA1215853@bytedance>
Date: Fri, 23 May 2025 17:53:50 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>
Subject: Re: [PATCH 2/7] sched/fair: prepare throttle path for task based
throttle
On Thu, May 22, 2025 at 08:40:02PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 01:54:18PM +0200, Peter Zijlstra wrote:
> > On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> > > On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > > > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > >
> > > > > static void throttle_cfs_rq_work(struct callback_head *work)
> > > > > {
> > > > > + struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > > > + struct sched_entity *se;
> > > > > + struct cfs_rq *cfs_rq;
> > > > > + struct rq *rq;
> > > > > +
> > > > > + WARN_ON_ONCE(p != current);
> > > > > + p->sched_throttle_work.next = &p->sched_throttle_work;
> > > > > +
> > > > > + /*
> > > > > + * If task is exiting, then there won't be a return to userspace, so we
> > > > > + * don't have to bother with any of this.
> > > > > + */
> > > > > + if ((p->flags & PF_EXITING))
> > > > > + return;
> > > > > +
> > > > > + scoped_guard(task_rq_lock, p) {
> > > > > + se = &p->se;
> > > > > + cfs_rq = cfs_rq_of(se);
> > > > > +
> > > > > + /* Raced, forget */
> > > > > + if (p->sched_class != &fair_sched_class)
> > > > > + return;
> > > > > +
> > > > > + /*
> > > > > + * If not in limbo, then either replenish has happened or this
> > > > > + * task got migrated out of the throttled cfs_rq, move along.
> > > > > + */
> > > > > + if (!cfs_rq->throttle_count)
> > > > > + return;
> > > > > + rq = scope.rq;
> > > > > + update_rq_clock(rq);
> > > > > + WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > > > + dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > > + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > > > + resched_curr(rq);
> > > > > + }
> > > > > +
> > > > > + cond_resched_tasks_rcu_qs();
> > > > > }
> > > >
> > > > What's that cond_resched thing about? The general plan is to make
> > > > cond_resched go away.
> > >
> > > Got it.
> > >
> > > The purpose is to let throttled task schedule and also mark a task rcu
> > > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > > will be scheduled by cond_resched() in task_work_run() and since that is
> > > a preempt schedule, it didn't mark a task rcu quiescent state.
> > >
> > > Any suggestion here? Perhaps a plain schedule()? Thanks.
> >
> > I am confused, this is task_work_run(), that is ran from
> > exit_to_user_mode_loop(), which contains a schedule().
>
I should probably have added that the schedule() call in
exit_to_user_mode_loop() comes early in that loop, at a point where the
to-be-throttled task does not have its need_resched bit set yet.
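
To illustrate, the relevant ordering in exit_to_user_mode_loop() in
kernel/entry/common.c is roughly (trimmed):

	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		...
		if (ti_work & _TIF_NEED_RESCHED)
			schedule();
		...
		if (ti_work & _TIF_NOTIFY_RESUME)
			resume_user_mode_work(regs); /* runs task works, incl. throttle_cfs_rq_work() */
		...
		ti_work = read_thread_flags();
	}

So by the time the throttle work sets need_resched via resched_curr(),
the schedule() check of that iteration has already been passed and the
first thing that reacts is the cond_resched() in task_work_run() quoted
below.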
> There is a cond_resched() in task_work_run() loop:
>
> do {
> next = work->next;
> work->func(work);
> work = next;
> cond_resched();
> } while (work);
>
> And when this throttle work returns with need_resched bit set,
> cond_resched() will cause a schedule but that didn't mark a task
> quiescent state...
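For reference, cond_resched_tasks_rcu_qs() in include/linux/rcupdate.h
is roughly:

	#define cond_resched_tasks_rcu_qs()				\
	do {								\
		rcu_tasks_qs(current, false);				\
		cond_resched();						\
	} while (0)

i.e. it explicitly reports the task RCU quiescent state before
scheduling, which the preemption driven schedule from a plain
cond_resched() does not.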
Another approach I can think of is to add a test of task_is_throttled()
in rcu_tasks_is_holdout() (see the sketch further below). I remember that
when I tried this before, I could hit the following path:
exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
try_to_block_task() -> dequeue_task_fair().
I would like to avoid this path, because it doesn't feel right for a
throttled task to go through another dequeue (cases like task group
change, affinity change, etc. are special cases that have to be dealt
with, though).
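
For clarity, the rcu_tasks_is_holdout() change I mean above would look
roughly like the sketch below; the existing on_rq check is kept for
context and task_is_throttled() is assumed to test whether the task sits
on a throttled cfs_rq's limbo list:

	static bool rcu_tasks_is_holdout(struct task_struct *t)
	{
		/* Has the task been seen voluntarily sleeping? */
		if (!READ_ONCE(t->on_rq))
			return false;

		/*
		 * A task parked on a throttled cfs_rq's limbo list is blocked
		 * until the next replenish, don't treat it as a holdout.
		 */
		if (task_is_throttled(t))
			return false;
		...
	}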
It looks to me that a schedule() call (or some other form of it) that
makes sure this throttled task gets scheduled while still in its task
work is the safest thing to do.
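
Something along these lines at the end of throttle_cfs_rq_work(),
replacing the cond_resched_tasks_rcu_qs() there (sketch only):

	/* report the task RCU quiescent state explicitly, then block */
	rcu_tasks_qs(current, false);
	schedule();	/* need_resched is set, the throttled task switches out here */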
Thoughts?
Thanks,
Aaron