Message-ID: <13ba4125-8972-4ddb-b630-96e89bdd7248@samsung.com>
Date: Thu, 20 Nov 2025 17:34:43 +0100
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Thomas Gleixner <tglx@...utronix.de>, Frederic Weisbecker
<frederic@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Marco Crivellari
<marco.crivellari@...e.com>, Waiman Long <llong@...hat.com>,
cgroups@...r.kernel.org
Subject: Re: [PATCH 1/2] genirq: Fix IRQ threads affinity VS cpuset isolated
partitions
On 20.11.2025 16:00, Thomas Gleixner wrote:
> On Thu, Nov 20 2025 at 14:20, Frederic Weisbecker wrote:
>> Le Thu, Nov 20, 2025 at 12:51:31PM +0100, Marek Szyprowski a écrit :
>>> On 18.11.2025 15:30, Frederic Weisbecker wrote:
>>>> When a cpuset isolated partition is created / updated or destroyed,
>>>> the IRQ threads are affine blindly to all the non-isolated CPUs. And
>>>> this happens without taking into account the IRQ thread initial
>>>> affinity that becomes ignored.
>>>>
>>>> For example in a system with 8 CPUs, if an IRQ and its kthread are
>>>> initially affine to CPU 5, creating an isolated partition with only
>>>> CPU 2 inside will eventually end up affining the IRQ kthread to all
>>>> CPUs but CPU 2 (that is CPUs 0,1,3-7), losing the kthread preference for
>>>> CPU 5.
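>>>>
>>>> Roughly, the blind re-affinity amounts to something like the
>>>> following (a condensed sketch; "isolated_mask" stands for the
>>>> partition's isolated CPUs and is illustrative, not a real kernel
>>>> symbol):
>>>>
>>>>         cpumask_var_t mask;
>>>>
>>>>         if (alloc_cpumask_var(&mask, GFP_KERNEL)) {
>>>>                 /* All CPUs except the isolated ones: 0,1,3-7 here */
>>>>                 cpumask_andnot(mask, cpu_possible_mask, isolated_mask);
>>>>                 /* Overwrites the kthread's preference for CPU 5 */
>>>>                 set_cpus_allowed_ptr(action->thread, mask);
>>>>                 free_cpumask_var(mask);
>>>>         }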
>>>>
>>>> Besides the blind re-affinity, this doesn't take care of the actual
>>>> low level interrupt which isn't migrated. As of today the only way to
>>>> isolate non managed interrupts, along with their kthreads, is to
>>>> overwrite their affinity separately, for example through /proc/irq/
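>>>> (for instance "echo 0-1,3-7 > /proc/irq/<N>/smp_affinity_list", with
>>>> <N> the interrupt number, keeps both the hard interrupt and its
>>>> kthread off the isolated CPU 2 from the example above).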
>>>>
>>>> To avoid doing that manually, future development should focus on
>>>> updating the IRQs affinity whenever cpuset isolated partitions are
>>>> updated.
>>>>
>>>> In the meantime, cpuset shouldn't fiddle with IRQ threads directly.
>>>> To prevent from that, set the PF_NO_SETAFFINITY flag to them.
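>>>>
>>>> The intended effect, as a minimal sketch: kthread_bind_mask() (used
>>>> by the patch, see the hunk removed further below) applies the mask
>>>> and sets PF_NO_SETAFFINITY as a side effect, after which
>>>> set_cpus_allowed_ptr() refuses to move the thread:
>>>>
>>>>         kthread_bind_mask(action->thread,
>>>>                           irq_data_get_effective_affinity_mask(&desc->irq_data));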
>>>>
>>>> Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
>>> This patch landed in today's linux-next as commit 844dcacab287 ("genirq:
>>> Fix interrupt threads affinity vs. cpuset isolated partitions"). In my
>>> tests I found that it triggers a warning on some of my test systems.
>>> Here is an example of such a warning:
>>>
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 1 at kernel/kthread.c:599 kthread_bind_mask+0x2c/0x84
>> Erm, does this mean that the IRQ thread got woken up before the first official
>> wakeup in wake_up_and_wait_for_irq_thread_ready()? This looks wrong...
>>
>> irq_startup() may be called on a few occasions before. So perhaps
>> the IRQ already fired and woke up the kthread once before the "official"
>> first wake up?
>>
>> There seems to be an initialization ordering issue here...
> That's unavoidable because of locking and chicken-and-egg ordering problems.
>
> When the thread is created the interrupt is not yet started up and
> therefore the effective affinity is not known. So doing a bind there is
> pointless.
>
> The action has to be made visible _before_ starting it up as once the
> interrupt is unmasked it might fire. That's even more true for shared
> interrupts.
>
> The wakeup/bind/... muck cannot be invoked with the descriptor lock
> held, so it has to be done _after_ the thing went live.
>
> So yes, an interrupt which fires before kthread_bind() is invoked might
> have exactly that effect. It wakes the thread from the kthread() wait and it
> goes straight through to the thread function.
>
> That's why the original code just set the affinity bit and let the
> thread itself handle it.
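>
> Roughly, that scheme looked like this (a condensed sketch of the
> pre-existing kernel/irq/manage.c mechanics, not literal code):
>
>         /* whoever changes the affinity: */
>         set_bit(IRQTF_AFFINITY, &action->thread_flags);
>         wake_up_process(action->thread);
>
>         /* the IRQ thread, on its next loop iteration: */
>         if (test_and_clear_bit(IRQTF_AFFINITY, &action->thread_flags))
>                 set_cpus_allowed_ptr(current, mask);  /* desc's affinity */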
>
> There are three options to solve that:
>
> 1) Bind the thread right after kthread_create() to a random
> housekeeping CPU and let it move itself over to the real place
> once it has come up (at that point the affinity is established)
>
> 2) Serialize the setup so that the thread (even when woken up
> early) gets stuck in UNINTERRUPTIBLE state _before_ it can
> reach wait_for_interrupt() which waits with INTERRUPTIBLE
>
> 3) Teach kthread_create() that the thread is subject to being
> bound to something later on and implement that serialization
> there.
>
> #1 is pretty straightforward. See the untested patch below.
This fixes the observed issue. The only problem with that patch is the
incorrect set of arguments to kthread_create_on_cpu() (it doesn't
support printf-style args).
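
One way around that (an untested sketch; it assumes new->name never
contains a '%' and that truncation to the task comm length is
acceptable):

        char name[TASK_COMM_LEN];

        snprintf(name, sizeof(name),
                 secondary ? "irq/%d-s-%s" : "irq/%d-%s", irq, new->name);
        t = kthread_create_on_cpu(irq_thread, new, cpu, name);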
If that matters:
Tested-by: Marek Szyprowski <m.szyprowski@...sung.com>
> #2 is ugly (I tried)
>
> #3 could be useful in general, but that needs more thoughts
>
> Thanks
>
> tglx
> ---
> --- a/kernel/irq/handle.c
> +++ b/kernel/irq/handle.c
> @@ -133,7 +133,15 @@ void __irq_wake_thread(struct irq_desc *
> */
> atomic_inc(&desc->threads_active);
>
> - wake_up_process(action->thread);
> + /*
> + * This might be a premature wakeup before the thread reached the
> + * thread function and set the IRQTF_READY bit. It's waiting in
> + * kthread code with state UNINTERRUPTIBLE. Once it reaches the
> + * thread function it waits with INTERRUPTIBLE. The wakeup is not
> + * lost in that case because the thread is guaranteed to observe
> + * the RUN flag before it goes to sleep in wait_for_interrupt().
> + */
> + wake_up_state(action->thread, TASK_INTERRUPTIBLE);
> }
>
> static DEFINE_STATIC_KEY_FALSE(irqhandler_duration_check_enabled);
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1026,16 +1026,8 @@ static void irq_thread_check_affinity(st
> set_cpus_allowed_ptr(current, mask);
> free_cpumask_var(mask);
> }
> -
> -static inline void irq_thread_set_affinity(struct task_struct *t,
> - struct irq_desc *desc)
> -{
> - kthread_bind_mask(t, irq_data_get_effective_affinity_mask(&desc->irq_data));
> -}
> #else
> static inline void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
> -static inline void irq_thread_set_affinity(struct task_struct *t,
> - struct irq_desc *desc) { }
> #endif
>
> static int irq_wait_for_interrupt(struct irq_desc *desc,
> @@ -1220,7 +1212,6 @@ static void wake_up_and_wait_for_irq_thr
> if (!action || !action->thread)
> return;
>
> - irq_thread_set_affinity(action->thread, desc);
> wake_up_process(action->thread);
> wait_event(desc->wait_for_threads,
> test_bit(IRQTF_READY, &action->thread_flags));
> @@ -1389,26 +1380,35 @@ static void irq_nmi_teardown(struct irq_
> static int
> setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
> {
> + /*
> + * At this point the interrupt affinity is not known. Just assume that
> + * the current CPU is not isolated and valid to bring the thread
> + * up. The thread will move itself over to the right place once the
> + * whole setup is complete.
> + */
> + unsigned int cpu = raw_smp_processor_id();
> struct task_struct *t;
>
> - if (!secondary) {
> - t = kthread_create(irq_thread, new, "irq/%d-%s", irq,
> - new->name);
> - } else {
> - t = kthread_create(irq_thread, new, "irq/%d-s-%s", irq,
> - new->name);
> - }
> + if (!secondary)
> + t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-%s", irq, new->name);
> + else
> + t = kthread_create_on_cpu(irq_thread, new, cpu, "irq/%d-s-%s", irq, new->name);
>
> if (IS_ERR(t))
> return PTR_ERR(t);
>
> /*
> - * We keep the reference to the task struct even if
> - * the thread dies to avoid that the interrupt code
> - * references an already freed task_struct.
> + * We keep the reference to the task struct even if the thread dies
> + * to avoid that the interrupt code references an already freed
> + * task_struct.
> */
> new->thread = get_task_struct(t);
>
> + /*
> + * Ensure the thread adjusts the affinity once it reaches the
> + * thread function.
> + */
> + new->thread_flags = BIT(IRQTF_AFFINITY);
> return 0;
> }
>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland