Message-ID: <87tsyokaug.ffs@tglx>
Date: Thu, 20 Nov 2025 16:00:39 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Frederic Weisbecker <frederic@...nel.org>, Marek Szyprowski
 <m.szyprowski@...sung.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Marco Crivellari
 <marco.crivellari@...e.com>, Waiman Long <llong@...hat.com>,
 cgroups@...r.kernel.org
Subject: Re: [PATCH 1/2] genirq: Fix IRQ threads affinity VS cpuset isolated
 partitions

On Thu, Nov 20 2025 at 14:20, Frederic Weisbecker wrote:
> On Thu, Nov 20, 2025 at 12:51:31PM +0100, Marek Szyprowski wrote:
>> On 18.11.2025 15:30, Frederic Weisbecker wrote:
>> > When a cpuset isolated partition is created, updated or destroyed,
>> > the IRQ threads are blindly affined to all the non-isolated CPUs.
>> > This happens without taking the IRQ thread's initial affinity into
>> > account, which is then simply ignored.
>> >
>> > For example, in a system with 8 CPUs, if an IRQ and its kthread are
>> > initially affined to CPU 5, creating an isolated partition with only
>> > CPU 2 inside will eventually end up affining the IRQ kthread to all
>> > CPUs but CPU 2 (that is, CPUs 0,1,3-7), losing the kthread's
>> > preference for CPU 5.
>> >
>> > Besides the blind re-affinity, this doesn't take care of the actual
>> > low level interrupt, which isn't migrated. As of today the only way
>> > to isolate non-managed interrupts, along with their kthreads, is to
>> > override their affinity separately, for example through /proc/irq/
>> >
>> > To avoid doing that manually, future development should focus on
>> > updating the IRQs' affinity whenever cpuset isolated partitions are
>> > updated.
>> >
>> > In the meantime, cpuset shouldn't fiddle with IRQ threads directly.
>> > To prevent that, set the PF_NO_SETAFFINITY flag on them.
>> >
>> > Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
>> 
>> This patch landed in today's linux-next as commit 844dcacab287 ("genirq:
>> Fix interrupt threads affinity vs. cpuset isolated partitions"). In my
>> tests I found that it triggers warnings on some of my test systems.
>> Here is an example of such a warning:
>> 
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1 at kernel/kthread.c:599 kthread_bind_mask+0x2c/0x84
>
> Erm, does this mean that the IRQ thread got woken before the first official
> wakeup in wake_up_and_wait_for_irq_thread_ready()? This looks wrong...
>
> irq_startup() may be called on a few occasions before that. So perhaps
> the IRQ already fired and woke up the kthread once before the "official"
> first wakeup?
>
> There seems to be an initialization ordering issue here...

That's unavoidable because of locking and chicken-and-egg ordering problems.

When the thread is created, the interrupt is not yet started up and
therefore the effective affinity is not known. So doing a bind there is
pointless.

The action has to be made visible _before_ the interrupt is started up,
as once the interrupt is unmasked it might fire. That's even more true
for shared interrupts.

The wakeup/bind/... muck cannot be invoked with the descriptor lock
held, so it has to be done _after_ the thing has gone live.

So yes, an interrupt which fires before kthread_bind() is invoked can
have exactly that effect. It wakes the thread out of the kthread() wait
and the thread goes straight through to the thread function.

That's why the original code just set the affinity bit and let the
thread itself handle it.
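
That existing self-move mechanism looks roughly like this (trimmed
sketch of irq_thread_check_affinity() in kernel/irq/manage.c, error
handling omitted):

	static void irq_thread_check_affinity(struct irq_desc *desc,
					      struct irqaction *action)
	{
		cpumask_var_t mask;

		/* IRQTF_AFFINITY is set whenever the IRQ affinity changes */
		if (!test_and_clear_bit(IRQTF_AFFINITY, &action->thread_flags))
			return;
		if (!alloc_cpumask_var(&mask, GFP_KERNEL))
			return;

		raw_spin_lock_irq(&desc->lock);
		cpumask_copy(mask, irq_data_get_effective_affinity_mask(&desc->irq_data));
		raw_spin_unlock_irq(&desc->lock);

		/* The thread re-affines itself, no external bind required */
		set_cpus_allowed_ptr(current, mask);
		free_cpumask_var(mask);
	}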

There are three options to solve that:

      1) Bind the thread right after kthread_create() to a random
         housekeeping CPU and let it move itself over to the real place
         once it comes up (at that point the affinity is established)

      2) Serialize the setup so that the thread (even when woken up
         early) gets stuck in UNINTERRUPTIBLE state _before_ it can
         reach wait_for_interrupt(), which waits with INTERRUPTIBLE

      3) Teach kthread_create() that the thread is subject to being
         bound to something later on and implement that serialization
         there.

#1 is pretty straightforward. See the untested patch below.

#2 is ugly (I tried)

#3 could be useful in general, but that needs more thought

Thanks

        tglx
---
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -133,7 +133,15 @@ void __irq_wake_thread(struct irq_desc *
 	 */
 	atomic_inc(&desc->threads_active);
 
-	wake_up_process(action->thread);
+	/*
+	 * This might be a premature wakeup before the thread reached the
+	 * thread function and set the IRQTF_READY bit. It's waiting in
+	 * kthread code with state UNINTERRUPTIBLE. Once it reaches the
+	 * thread function it waits with INTERRUPTIBLE. The wakeup is not
+	 * lost in that case because the thread is guaranteed to observe
+	 * the RUN flag before it goes to sleep in wait_for_interrupt().
+	 */
+	wake_up_state(action->thread, TASK_INTERRUPTIBLE);
 }
 
 static DEFINE_STATIC_KEY_FALSE(irqhandler_duration_check_enabled);
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1026,16 +1026,8 @@ static void irq_thread_check_affinity(st
 	set_cpus_allowed_ptr(current, mask);
 	free_cpumask_var(mask);
 }
-
-static inline void irq_thread_set_affinity(struct task_struct *t,
-					   struct irq_desc *desc)
-{
-	kthread_bind_mask(t, irq_data_get_effective_affinity_mask(&desc->irq_data));
-}
 #else
 static inline void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
-static inline void irq_thread_set_affinity(struct task_struct *t,
-					   struct irq_desc *desc) { }
 #endif
 
 static int irq_wait_for_interrupt(struct irq_desc *desc,
@@ -1220,7 +1212,6 @@ static void wake_up_and_wait_for_irq_thr
 	if (!action || !action->thread)
 		return;
 
-	irq_thread_set_affinity(action->thread, desc);
 	wake_up_process(action->thread);
 	wait_event(desc->wait_for_threads,
 		   test_bit(IRQTF_READY, &action->thread_flags));
@@ -1389,26 +1380,35 @@ static void irq_nmi_teardown(struct irq_
 static int
 setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 {
+	/*
+	 * At this point the interrupt affinity is not known. Just assume
+	 * that the current CPU is not isolated and valid to bring the
+	 * thread up. The thread will move itself over to the right place
+	 * once the whole setup is complete.
+	 */
+	unsigned int cpu = raw_smp_processor_id();
 	struct task_struct *t;
 
-	if (!secondary) {
-		t = kthread_create(irq_thread, new, "irq/%d-%s", irq,
-				   new->name);
-	} else {
-		t = kthread_create(irq_thread, new, "irq/%d-s-%s", irq,
-				   new->name);
-	}
+	if (!secondary)
+		t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-%s", irq, new->name);
+	else
+		t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-s-%s", irq, new->name);
 
 	if (IS_ERR(t))
 		return PTR_ERR(t);
 
 	/*
-	 * We keep the reference to the task struct even if
-	 * the thread dies to avoid that the interrupt code
-	 * references an already freed task_struct.
+	 * We keep the reference to the task struct even if the thread dies
+	 * to avoid that the interrupt code references an already freed
+	 * task_struct.
 	 */
 	new->thread = get_task_struct(t);
 
+	/*
+	 * Ensure the thread adjusts the affinity once it reaches the
+	 * thread function.
+	 */
+	new->thread_flags = BIT(IRQTF_AFFINITY);
 	return 0;
 }
 


