linux-kernel - Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT

Message-ID: <20251106150717.cZuPZnF5@linutronix.de>
Date: Thu, 6 Nov 2025 16:07:17 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Tejun Heo <tj@...nel.org>
Cc: Calvin Owens <calvin@...nvd.org>, linux-kernel@...r.kernel.org,
	Dan Schatzberg <dschatzberg@...a.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid
 context warning on PREEMPT_RT

On 2025-11-05 09:03:55 [-1000], Tejun Heo wrote:
> +#ifdef CONFIG_PREEMPT_RT
> +/*
> + * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
> + * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
> + * this lead to sleeping in the invalid context warning bug. css_set_lock is too
> + * big to become a raw_spinlock. The task_dead path doesn't need to run
> + * synchronously. Bounce through irq_work instead.
> + */
> +static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
> +static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
> +
> +static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
> +{
> +	struct llist_node *lnode;
> +	struct task_struct *task, *next;
> +
> +	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
> +	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
> +		do_cgroup_task_dead(task);
> +		put_task_struct(task);
> +	}
> +}
> +
> +static void __init cgroup_rt_init(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
> +		init_irq_work(per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu),
> +			      cgrp_dead_tasks_iwork_fn);

How important is it that this happens right away? As written, this
raises an interrupt which wakes the irq_work/$cpu thread, which then runs
this callback. That thread runs as SCHED_FIFO-1. This means the
termination of SCHED_OTHER tasks on a single CPU will run as follows:
 - TASK_DEAD
   schedule()
     - queue IRQ_WORK
     -> INTERRUPT
     -> WAKE irq_work
   -> preempt to irq_work/
      -> handle one callback
      schedule()
 back to next TASK_DEAD

So cgrp_dead_tasks_iwork_fn() will never have the opportunity to batch,
unless the exiting task's priority is > 1. In that case the callback will
be delayed until all RT tasks are done.

My proposal would be to init the irq_work item with
	*per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu) = IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);

instead, which won't raise an IRQ immediately and will delay the callback
until the next timer tick. That way it could batch multiple tasks.
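
To illustrate the batching argument, here is a toy userspace model. It is
not kernel code: struct model, enqueue(), drain() and simulate() are
made-up names, and the "tick" is just a final drain() call. It only counts
how many callback invocations it takes to process N exiting tasks when the
callback runs after every enqueue versus once at a deferred point.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */

struct node {
	struct node *next;
};

struct model {
	struct node *head;      /* stands in for the per-CPU llist */
	unsigned int callbacks; /* callback invocations that found work */
	unsigned int handled;   /* tasks processed in total */
};

/* Push one exiting task onto the list. */
static void enqueue(struct model *m, struct node *n)
{
	n->next = m->head;
	m->head = n;
}

/* Stands in for the callback: detach the whole list, then walk it. */
static void drain(struct model *m)
{
	struct node *n = m->head;

	m->head = NULL;
	if (!n)
		return;

	m->callbacks++;
	for (; n; n = n->next)
		m->handled++;
}

/*
 * lazy == 0: the callback runs right after every enqueue, modelling the
 *            immediate interrupt plus preemption by the irq_work thread.
 * lazy != 0: the callback runs once at the end, modelling the next tick.
 * Returns the number of callback invocations that processed tasks.
 */
static unsigned int simulate(unsigned int ntasks, int lazy)
{
	static struct node nodes[64];
	struct model m = { NULL, 0, 0 };
	unsigned int i;

	if (ntasks > 64)
		ntasks = 64;

	for (i = 0; i < ntasks; i++) {
		enqueue(&m, &nodes[i]);
		if (!lazy)
			drain(&m);
	}
	drain(&m); /* simulated timer tick */

	assert(m.handled == ntasks);
	return m.callbacks;
}
```

In this model simulate(8, 0) needs eight callback invocations while
simulate(8, 1) needs one. The real behaviour depends on how many tasks
exit between two ticks, which the model does not try to capture.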

[ queue_work() should work, too, but the overhead to schedule is greater
  imho, so this makes sense ]

Sebastian