linux-kernel - Re: [PATCH] irqchip/riscv-imsic: Fix irq migration failure issue when cpu hotplug.


Message-Id: <39cbfcdf-db96-4e2a-bcca-ef10298492fd@riscv-computing.com>
Date: Wed, 4 Feb 2026 09:59:43 +0800
From: "Yingjun Ni" <yingjun.ni@...cv-computing.com>
To: "Thomas Gleixner" <tglx@...nel.org>, <anup@...infault.org>, 
	<pjw@...nel.org>, <palmer@...belt.com>, <aou@...s.berkeley.edu>, 
	<alex@...ti.fr>
Cc: <linux-riscv@...ts.infradead.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] irqchip/riscv-imsic: Fix irq migration failure issue when cpu hotplug.

> On Tue, Feb 03 2026 at 16:02, Yingjun Ni wrote:
>> Add a null pointer check for irq_write_msi_msg to fix NULL pointer
>> dereference issue when migrating irq.
>>
>> Modify the return value of imsic_irq_set_affinity to let the subdomain
>> PCI-MSIX migrate the irq to a new cpu when cpu hotplug.
>>
>> Don't set vec->move_next in imsic_vector_move_update when the cpu is
>> offline, because it will never be cleared.
> You completely fail to explain the actual problem and the root
> cause. See
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
>
>>   drivers/irqchip/irq-riscv-imsic-platform.c | 8 ++++++--
>>   drivers/irqchip/irq-riscv-imsic-state.c    | 5 +++++
>>   2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
>> index 643c8e459611..131e4f2b5431 100644
>> --- a/drivers/irqchip/irq-riscv-imsic-platform.c
>> +++ b/drivers/irqchip/irq-riscv-imsic-platform.c
>> @@ -93,9 +93,13 @@ static void imsic_irq_compose_msg(struct irq_data *d, struct msi_msg *msg)
>>   static void imsic_msi_update_msg(struct irq_data *d, struct imsic_vector *vec)
>>   {
>>   	struct msi_msg msg = { };
>> +	struct irq_chip *irq_chip = irq_data_get_irq_chip(d);
>> +
>> +	if (!irq_chip->irq_write_msi_msg)
>> +		return;
> I have no idea how this ever worked. The irq_data pointer belongs to the
> IMSIC base domain, which definitely does not have a irq_write_msi_msg()
> callback and never can have one.
>
> The write message callback is always implemented by the top most domain,
> in this case the PCI/MSI[x] per device domain.
>
> So this code is simply broken and your NULL pointer check just makes it
> differently broken.
Sorry, my mistake. The NULL pointer issue has already been fixed by
commit c475c0b71314 ("irqchip/riscv-imsic: Remove redundant irq_data lookups").
>>   	imsic_irq_compose_vector_msg(vec, &msg);
>> -	irq_data_get_irq_chip(d)->irq_write_msi_msg(d, &msg);
>> +	irq_chip->irq_write_msi_msg(d, &msg);
>>   }
>>   
>>   static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
>> @@ -173,7 +177,7 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
>>   	/* Move state of the old vector to the new vector */
>>   	imsic_vector_move(old_vec, new_vec);
>>   
>> -	return IRQ_SET_MASK_OK_DONE;
>> +	return IRQ_SET_MASK_OK;
> Have you actually looked at the consequences of this change?
>
>>   }
>>   
>>   static void imsic_irq_force_complete_move(struct irq_data *d)
>> diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
>> index b6cebfee9461..cd1bf9516878 100644
>> --- a/drivers/irqchip/irq-riscv-imsic-state.c
>> +++ b/drivers/irqchip/irq-riscv-imsic-state.c
>> @@ -362,6 +362,10 @@ static bool imsic_vector_move_update(struct imsic_local_priv *lpriv,
>>   	/* Update enable and move details */
>>   	enabled = READ_ONCE(vec->enable);
>>   	WRITE_ONCE(vec->enable, new_enable);
>> +
>> +	if (!cpu_online(vec->cpu) && is_old_vec)
>> +		goto out;
> This is definitely not correct as this should still cleanup software
> state, no?

If vec->move_next is set while the old CPU is offline, it is never
cleared, and the next migration of that irq fails as follows:

cat /proc/interrupts
     CPU0  CPU1  CPU2  CPU3
23:     0     0     0    66   PCI-MSIX-0000:00:01.0 eth0-rx-0

echo 0 > /sys/bus/cpu/devices/cpu3/online

cat /proc/interrupts
     CPU0  CPU1  CPU2
23:     0     0    66   PCI-MSIX-0000:00:01.0 eth0-rx-0

echo 0 > /sys/bus/cpu/devices/cpu2/online
[   35.697380] IRQ23: set affinity failed(-16).
[   35.698381] CPU2: off
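
The sequence above can be sketched as a small userspace model. This is
not kernel code: struct model_irq, model_set_affinity(),
model_local_sync() and the skip_if_old_offline flag are hypothetical
names that only mimic the bookkeeping described here. It assumes a
pending move is completed only by the old CPU, and that a new affinity
request is rejected with -EBUSY (-16) while a move is still pending. It
says nothing about whether other software state still needs cleanup.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-irq state; not the kernel's struct imsic_vector. */
struct model_irq {
	int cpu;		/* CPU the irq currently targets */
	int move_from_cpu;	/* CPU that still has to complete a move, or -1 */
};

static bool model_online[MODEL_NR_CPUS];

/* Runs "on" a CPU and completes a pending move; an offline CPU never does. */
static void model_local_sync(struct model_irq *irq, int cpu)
{
	if (model_online[cpu] && irq->move_from_cpu == cpu)
		irq->move_from_cpu = -1;
}

/* Assumed behaviour: refuse a new move while the previous one is pending. */
static int model_set_affinity(struct model_irq *irq, int new_cpu,
			      bool skip_if_old_offline)
{
	int old_cpu = irq->cpu;

	if (irq->move_from_cpu >= 0)
		return -EBUSY;

	irq->cpu = new_cpu;
	/* Record a pending move unless the old CPU can never complete it. */
	if (!(skip_if_old_offline && !model_online[old_cpu]))
		irq->move_from_cpu = old_cpu;
	return 0;
}

/* Offline CPU3, then CPU2; returns the result of the second migration. */
static int model_hotplug_sequence(bool skip_if_old_offline)
{
	struct model_irq irq = { .cpu = 3, .move_from_cpu = -1 };
	int cpu, ret;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_online[cpu] = true;

	model_online[3] = false;
	ret = model_set_affinity(&irq, 2, skip_if_old_offline);
	if (ret)
		return ret;
	model_local_sync(&irq, 3);	/* no effect: CPU3 is offline */

	model_online[2] = false;
	return model_set_affinity(&irq, 1, skip_if_old_offline);
}
```

In this model, model_hotplug_sequence(false) returns -EBUSY for the
second migration, matching the "set affinity failed(-16)" message, while
model_hotplug_sequence(true) returns 0.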

> Thanks,
>
>          tglx