Message-ID: <e9aae247-5347-4748-ac5e-9f54f733e230@amd.com>
Date: Tue, 1 Apr 2025 08:47:02 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>, Ingo
Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, Dietmar
Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Chengming Zhou <chengming.zhou@...ux.dev>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based
throttle
Hello Aaron,
On 3/31/2025 11:49 AM, Aaron Lu wrote:
>>> Tasks throttled on the exit-to-user path are scheduled by cond_resched()
>>> in task_work_run(), but that is a preempt schedule and doesn't mark a
>>> Tasks RCU quiescent state.
So browsing through kernel/rcu/tasks.h, I found
cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
state before rescheduling. The question is, is it safe to use in the
task_work_run() context? I have no idea, but you could do a resched_curr()
followed by a cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() and
that should give you the same effect as doing a schedule().
Thoughts?
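
Something like the below is what I have in mind -- an untested sketch
only, with all the actual throttle bookkeeping and locking in this
series elided:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = current;
	struct rq *rq;

	/* ... existing throttle handling; rq is obtained and locked here ... */
	rq = task_rq(p);

	/*
	 * Instead of an open-coded schedule(): set TIF_NEED_RESCHED and
	 * fold it into the preempt count via set_preempt_need_resched()
	 * so that the cond_resched() below actually reschedules.
	 */
	resched_curr(rq);

	/* ... rq lock released here ... */

	/*
	 * Report a Tasks RCU quiescent state (clearing the holdout) and
	 * then cond_resched() into the scheduler.
	 */
	cond_resched_tasks_rcu_qs();
}
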
>>>
>>> Fix this by directly calling schedule() in throttle_cfs_rq_work().
>> Perhaps that can be gotten around by just using resched_curr()? Apart
>> from setting the thread flag via set_ti_thread_flag(), resched_curr()
>> will also call set_preempt_need_resched(), which allows cond_resched()
>> to resched the task.
>>
>> Since exit_to_user_mode_loop() will run once again seeing that
>> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?
>>
> I tried this and noticed an unexpected consequence: get_signal() will
> also run task works, and if the signal is a kill, then this can happen:
> exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
> do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
> try_to_block_task() -> dequeue_task_fair().
>
> I would like to avoid this path, at least for now. I want to make sure
> that for throttled tasks, only events like a task group change, an
> affinity change, etc. can cause them to be dequeued from the core;
> that's why I added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in
> dequeue_task_fair(). I think this can help me catch any unexpected
> events like this.
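
If I read that right, the guard you describe would look roughly like the
below -- sketch only, with the throttled-task check (task_is_throttled()
here) being a made-up helper name just for illustration:

static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/*
	 * A throttled task should only leave the runqueue for events
	 * like a task group or affinity change, never because it is
	 * going to sleep.
	 */
	if (task_is_throttled(p))
		SCHED_WARN_ON(flags & DEQUEUE_SLEEP);

	/* ... rest of the existing dequeue path ... */
	return true;
}
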
>
> Besides, I think explicitly calling schedule() has the advantage of not
> relying on other code, i.e. it doesn't matter if someone removes
> cond_resched() from task_work_run() some time later or changes the
> logic in exit_to_user_mode_loop().
>
> So I hope you don't mind if I keep schedule() in throttle_cfs_rq_work(),
> but if you see anything wrong with doing this, feel free to let me know.
> Thanks.
I don't have any strong feelings. It's just that the open-coded
schedule() stuck out like a sore thumb, and since you mention future
changes, "local_irq_enable_exit_to_user(ti_work)" could perhaps one day
be extended to not enable IRQs for some ti_work in
exit_to_user_mode_loop(), in which case a direct call to schedule() can
cause scheduling-while-atomic warnings.
IMO, using cond_resched_tasks_rcu_qs() is cleaner and makes it obvious
that the throttle work wants to schedule() ASAP while also clearing the
Tasks RCU holdout, but I'll wait for others to comment on whether it is
safe to do so or whether we are missing something.
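
For reference, cond_resched_tasks_rcu_qs() is, as far as I can see,
essentially the below (quoting from memory of include/linux/rcupdate.h,
so please double-check the exact definition):

#define cond_resched_tasks_rcu_qs() \
do { \
	rcu_tasks_qs(current, false); \
	cond_resched(); \
} while (0)

i.e. it reports the Tasks RCU quiescent state for current and then does
a regular cond_resched(), which is the "clear the holdout and give up
the CPU" behaviour the throttle work wants.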
--
Thanks and Regards,
Prateek