Message-ID: <87tsyokaug.ffs@tglx>
Date: Thu, 20 Nov 2025 16:00:39 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Frederic Weisbecker <frederic@...nel.org>, Marek Szyprowski
<m.szyprowski@...sung.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Marco Crivellari
<marco.crivellari@...e.com>, Waiman Long <llong@...hat.com>,
cgroups@...r.kernel.org
Subject: Re: [PATCH 1/2] genirq: Fix IRQ threads affinity VS cpuset isolated
partitions
On Thu, Nov 20 2025 at 14:20, Frederic Weisbecker wrote:
> On Thu, Nov 20, 2025 at 12:51:31PM +0100, Marek Szyprowski wrote:
>> On 18.11.2025 15:30, Frederic Weisbecker wrote:
>> > When a cpuset isolated partition is created, updated or destroyed,
>> > the IRQ threads are blindly affined to all the non-isolated CPUs.
>> > This happens without taking the IRQ thread's initial affinity into
>> > account, which ends up being ignored.
>> >
>> > For example, in a system with 8 CPUs, if an IRQ and its kthread are
>> > initially affine to CPU 5, creating an isolated partition containing
>> > only CPU 2 will eventually end up affining the IRQ kthread to all
>> > CPUs but CPU 2 (that is, CPUs 0-1 and 3-7), losing the kthread's
>> > preference for CPU 5.
>> >
>> > Besides this blind re-affining, it doesn't take care of the actual
>> > low level interrupt, which isn't migrated. As of today the only way
>> > to isolate non-managed interrupts, along with their kthreads, is to
>> > overwrite their affinity separately, for example through /proc/irq/
>> >
>> > To avoid doing that manually, future development should focus on
>> > updating the IRQs' affinity whenever cpuset isolated partitions are
>> > updated.
>> >
>> > In the meantime, cpuset shouldn't fiddle with IRQ threads directly.
>> > To prevent that, set the PF_NO_SETAFFINITY flag on them.
>> >
>> > Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
>>
>> This patch landed in today's linux-next as commit 844dcacab287 ("genirq:
>> Fix interrupt threads affinity vs. cpuset isolated partitions"). In my
>> tests I found that it triggers warnings on some of my test systems.
>> Here is an example of such a warning:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1 at kernel/kthread.c:599 kthread_bind_mask+0x2c/0x84
>
> Erm, does this mean that the IRQ thread got woken before the first official
> wakeup in wake_up_and_wait_for_irq_thread_ready()? This looks wrong...
>
> irq_startup() may be called on a few occasions before that. So perhaps
> the IRQ already fired and woke up the kthread once before the "official"
> first wakeup?
>
> There seems to be an initialization ordering issue here...
That's unavoidable because of locking and chicken-and-egg ordering problems.
When the thread is created the interrupt is not yet started up and
therefore the effective affinity is not known. So doing a bind there is
pointless.
The action has to be made visible _before_ starting the interrupt up,
as once the interrupt is unmasked it might fire. That's even more true
for shared interrupts.
The wakeup/bind/... muck cannot be invoked with the descriptor lock
held, so it has to be done _after_ the thing went live.
So yes, an interrupt which fires before kthread_bind() is invoked might
have exactly that effect. It wakes the thread from the kthread() wait
and it goes straight through to the thread function.
That's why the original code just set the affinity bit and let the
thread itself handle it.
There are three options to solve that:

 1) Bind the thread right after kthread_create() to a random
    housekeeping CPU and let it move itself over to the real place
    once it comes up (at that point the affinity is established)

 2) Serialize the setup so that the thread (even when woken up
    early) gets stuck in UNINTERRUPTIBLE state _before_ it can
    reach wait_for_interrupt(), which waits with INTERRUPTIBLE

 3) Teach kthread_create() that the thread is subject to being
    bound to something later on and implement that serialization
    there
#1 is pretty straightforward. See the untested patch below.
#2 is ugly (I tried)
#3 could be useful in general, but that needs more thought
Thanks
tglx
---
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -133,7 +133,15 @@ void __irq_wake_thread(struct irq_desc *
 	 */
 	atomic_inc(&desc->threads_active);
 
-	wake_up_process(action->thread);
+	/*
+	 * This might be a premature wakeup before the thread reached the
+	 * thread function and set the IRQTF_READY bit. It's waiting in
+	 * kthread code with state UNINTERRUPTIBLE. Once it reaches the
+	 * thread function it waits with INTERRUPTIBLE. The wakeup is not
+	 * lost in that case because the thread is guaranteed to observe
+	 * the RUN flag before it goes to sleep in wait_for_interrupt().
+	 */
+	wake_up_state(action->thread, TASK_INTERRUPTIBLE);
 }
 
 static DEFINE_STATIC_KEY_FALSE(irqhandler_duration_check_enabled);
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1026,16 +1026,8 @@ static void irq_thread_check_affinity(st
 	set_cpus_allowed_ptr(current, mask);
 	free_cpumask_var(mask);
 }
-
-static inline void irq_thread_set_affinity(struct task_struct *t,
-					   struct irq_desc *desc)
-{
-	kthread_bind_mask(t, irq_data_get_effective_affinity_mask(&desc->irq_data));
-}
 #else
 static inline void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
-static inline void irq_thread_set_affinity(struct task_struct *t,
-					   struct irq_desc *desc) { }
 #endif
 
 static int irq_wait_for_interrupt(struct irq_desc *desc,
@@ -1220,7 +1212,6 @@ static void wake_up_and_wait_for_irq_thr
 	if (!action || !action->thread)
 		return;
 
-	irq_thread_set_affinity(action->thread, desc);
 	wake_up_process(action->thread);
 	wait_event(desc->wait_for_threads,
 		   test_bit(IRQTF_READY, &action->thread_flags));
@@ -1389,26 +1380,35 @@ static void irq_nmi_teardown(struct irq_
 static int
 setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 {
+	/*
+	 * At this point the interrupt affinity is not known. Just assume
+	 * that the current CPU is not isolated and valid to bring the
+	 * thread up. The thread will move itself over to the right place
+	 * once the whole setup is complete.
+	 */
+	unsigned int cpu = raw_smp_processor_id();
 	struct task_struct *t;
 
-	if (!secondary) {
-		t = kthread_create(irq_thread, new, "irq/%d-%s", irq,
-				   new->name);
-	} else {
-		t = kthread_create(irq_thread, new, "irq/%d-s-%s", irq,
-				   new->name);
-	}
+	if (!secondary)
+		t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-%s", irq, new->name);
+	else
+		t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-s-%s", irq, new->name);
 
 	if (IS_ERR(t))
 		return PTR_ERR(t);
 
 	/*
-	 * We keep the reference to the task struct even if
-	 * the thread dies to avoid that the interrupt code
-	 * references an already freed task_struct.
+	 * We keep the reference to the task struct even if the thread dies
+	 * to avoid that the interrupt code references an already freed
+	 * task_struct.
 	 */
 	new->thread = get_task_struct(t);
 
+	/*
+	 * Ensure the thread adjusts the affinity once it reaches the
+	 * thread function.
+	 */
+	new->thread_flags = BIT(IRQTF_AFFINITY);
+
 	return 0;
 }