linux-kernel - Re: [PATCH sched_ext/for-6.12] sched_ext: TASK_DEAD tasks must be switched into SCX on ops

Message-ID: <ZtjB35bd6YKZriNU@slm.duckdns.org>
Date: Wed, 4 Sep 2024 10:23:59 -1000
From: Tejun Heo <tj@...nel.org>
To: David Vernet <void@...ifault.com>
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH sched_ext/for-6.12] sched_ext: TASK_DEAD tasks must be
 switched into SCX on ops_enable

On Fri, Aug 30, 2024 at 10:02:34PM -1000, Tejun Heo wrote:
> During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
> on every task. To do this, it does get_task_struct() on each iterated task,
> drops the lock and then calls ops.init_task().
> 
> However, a TASK_DEAD task may already have lost all its usage count and be
> waiting for an RCU grace period to be freed. If get_task_struct() is called
> on such a task, a use-after-free can happen. To avoid such situations,
> scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
> as they are never going to be scheduled again.
> 
> Unfortunately, a racing sched_setscheduler(2) can grab the task before the
> task is unhashed and then continue to e.g. move the task from RT to SCX
> after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
> gone through scx_ops_init_task(), scx_ops_enable_task() called from
> switching_to_scx() triggers the following warning:
> 
>   sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
>   WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
>   ...
>   RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
>   ...
>    switching_to_scx+0x13/0xa0
>    __sched_setscheduler+0x84e/0xa50
>    do_sched_setscheduler+0x104/0x1c0
>    __x64_sys_sched_setscheduler+0x18/0x30
>    do_syscall_64+0x7b/0x140
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> As in the ops_disable path, it just doesn't seem like a good idea to leave
> any task in an inconsistent state, even when the task is dead. The root
> cause is that ops_enable could not reliably tell whether a task is truly
> dead (no one else is looking at it and it's about to be freed) and tested
> TASK_DEAD instead. Fix it by testing the task's usage count directly.
> 
> - ops_init no longer ignores TASK_DEAD tasks. As all users now iterate all
>   tasks, @include_dead is removed from scx_task_iter_next_locked() along
>   with dead task filtering.
> 
> - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
>   fails.
> 
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Cc: David Vernet <void@...ifault.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
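
The "tryget" idea from the second bullet above can be illustrated with a small user-space sketch. It is not the patch's code: it assumes C11 atomics in place of the kernel's reference counting, and struct obj, usage_inc_not_zero(), tryget_obj() and put_obj() are hypothetical names chosen for illustration. The point is only that a reference is taken if and only if the usage count has not already reached zero, so an object that is waiting to be freed is skipped rather than revived.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only the usage count matters here. */
struct obj {
	atomic_int usage;
};

/* Increment *usage unless it is already zero; report whether we did. */
static bool usage_inc_not_zero(atomic_int *usage)
{
	int old = atomic_load(usage);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(usage, &old, old + 1))
			return true;
	}
	return false;	/* count hit zero: the object is on its way out */
}

/* Return o with an extra reference held, or NULL if o must be skipped. */
static struct obj *tryget_obj(struct obj *o)
{
	return usage_inc_not_zero(&o->usage) ? o : NULL;
}

/* Drop a reference taken by tryget_obj(). Freeing is omitted here. */
static void put_obj(struct obj *o)
{
	int old = atomic_fetch_sub(&o->usage, 1);

	assert(old > 0);	/* catch unbalanced puts */
}
```

An iterator built on this would skip any object for which tryget_obj() returns NULL and would call put_obj() after finishing the sleepable work on the others.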

Applied to sched_ext/for-6.12.

Thanks.

-- 
tejun