linux-kernel - Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT

Message-ID: <0d66a966-0b89-416a-8712-6a6131af355e@siemens.com>
Date: Mon, 1 Dec 2025 22:51:50 +0100
From: Jan Kiszka <jan.kiszka@...mens.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
 stable-rt <stable-rt@...r.kernel.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Davidlohr Bueso <dave@...olabs.net>,
 Frederic Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...nel.org>,
 Josh Triplett <josh@...htriplett.org>, Thomas Gleixner <tglx@...utronix.de>,
 Florian Bezdeka <florian.bezdeka@...mens.com>, Pavel Machek <pavel@...x.de>
Subject: Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups
 on PREEMPT_RT.

On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
> A timer/hrtimer softirq is raised in-IRQ context. With threaded
> interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
> for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
> means it will compete with other tasks for CPU resources.
> This can introduce long delays for timer processing on heavily loaded
> systems and is not desired.
> 
> Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
> timers thread and let it run at the lowest SCHED_FIFO priority.
> Wake-ups for RT tasks happen from hardirq context so only timer_list timers
> and hrtimers for "regular" tasks are processed here. The higher priority
> ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
> 
> Using a dedicated variable to store the pending softirq bits
> ensures that the timers are not accidentally picked up by ksoftirqd and
> other threaded interrupts.
> They shouldn't be picked up by ksoftirqd since it runs at lower priority.
> However if ksoftirqd is already running while a timer fires, then
> ksoftirqd will be PI-boosted due to the BH-lock to ktimer's priority.
> Ideally we try to avoid having ksoftirqd running.
> 
> The timer thread can pick up pending softirqs from ksoftirqd but only
> if the softirq load is high. It is not desired that the picked up
> softirqs are processed at SCHED_FIFO priority under high softirq load
> but this can already happen by a PI-boost by a force-threaded interrupt.
> 
> [ frederic@...nel.org: rcutorture.c fixes, storm fix by introduction of
>   local_timers_pending() for tick_nohz_next_event() ]
> 
> [ junxiao.chang@...el.com: Ensure ktimersd gets woken up even if a
>   softirq is currently served. ]
> 
> Reviewed-by: Paul E. McKenney <paulmck@...nel.org> [rcutorture]
> Reviewed-by: Frederic Weisbecker <frederic@...nel.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>

This went into 6.13 and was never backported to 6.12-lts. And that is
why you can easily stall the latter with a workload like this and
CONFIG_PREEMPT_RT enabled:

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/stalltest.sub1
mkdir /sys/fs/cgroup/stalltest.sub2
sleep 10000000 &
pid=$!

systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
    while true; do
        echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
        echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
    done"

echo "1000 20000" > /sys/fs/cgroup/stalltest.slice/cpu.max

This triggers a lock-up if a SCHED_OTHER holder of cgroup_file_kn_lock
is scheduled out after using up its timeslice and
cgroup_file_notify_timer then fires, also over a SCHED_OTHER context,
tries to take this lock, fails, and is then never able to reactivate
the lock holder.

I've reliably reproduced this with upstream 6.12.58, while Debian's latest
6.12-rt does not trigger it because it additionally has the downstream -rt
patches on board.
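
To tell which of the two cases a given kernel falls into, one can look
for the per-CPU timer threads. This is only a minimal sketch: it assumes
the dedicated thread shows up as ktimers/<cpu>, as it does with the -rt
patches, and find_ktimers is a hypothetical helper name, not an existing
tool.

```shell
# Print every line whose 4th column (the command name) starts with
# "ktimers/". Return non-zero if no such line was seen.
find_ktimers() {
    awk '$4 ~ /^ktimers\// { print; found = 1 } END { exit !found }'
}

# Columns: PID, scheduling class, RT priority, command name.
ps -eo pid,cls,rtprio,comm | find_ktimers \
    || echo "no ktimers threads found"
```

If the threads are present, the class and priority columns show how they
are scheduled; if nothing is listed, timer softirqs are presumably still
handled by ksoftirqd.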

How should we handle this? Consider 6.12 mainline with -rt and cgroups
as potentially broken, asking people to use 6.12-rt? Or port this back?

BTW, the original report of this issue came from an older
5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
feature introduced to the -rt patches? Was it ever backported to 5.10-rt
or other -rt versions?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center