Message-ID: <9c0791ea-edb7-44b3-b8ec-ae0452400497@redhat.com>
Date: Sat, 10 Jan 2026 16:47:58 -0500
From: Waiman Long <llong@...hat.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 Clark Williams <clrkwllms@...nel.org>, Steven Rostedt <rostedt@...dmis.org>,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-rt-devel@...ts.linux.dev
Subject: Re: [PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in
 allocate_vpe_l1_table()

On 1/8/26 3:26 AM, Marc Zyngier wrote:
> On Wed, 07 Jan 2026 21:53:53 +0000,
> Waiman Long <longman@...hat.com> wrote:
>> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
>> the following bug report was produced at bootup time.
>>
>>    BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
>>    in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
>>    preempt_count: 1, expected: 0
>>    RCU nest depth: 1, expected: 1
>>     :
>>    CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G        W           6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
>>    Tainted: [W]=WARN
>>    Call trace:
>>      :
>>     rt_spin_lock+0xe4/0x408
>>     rmqueue_bulk+0x48/0x1de8
>>     __rmqueue_pcplist+0x410/0x650
>>     rmqueue.constprop.0+0x6a8/0x2b50
>>     get_page_from_freelist+0x3c0/0xe68
>>     __alloc_frozen_pages_noprof+0x1dc/0x348
>>     alloc_pages_mpol+0xe4/0x2f8
>>     alloc_frozen_pages_noprof+0x124/0x190
>>     allocate_slab+0x2f0/0x438
>>     new_slab+0x4c/0x80
>>     ___slab_alloc+0x410/0x798
>>     __slab_alloc.constprop.0+0x88/0x1e0
>>     __kmalloc_cache_noprof+0x2dc/0x4b0
>>     allocate_vpe_l1_table+0x114/0x788
>>     its_cpu_init_lpis+0x344/0x790
>>     its_cpu_init+0x60/0x220
>>     gic_starting_cpu+0x64/0xe8
>>     cpuhp_invoke_callback+0x438/0x6d8
>>     __cpuhp_invoke_callback_range+0xd8/0x1f8
>>     notify_cpu_starting+0x11c/0x178
>>     secondary_start_kernel+0xc8/0x188
>>     __secondary_switched+0xc0/0xc8
>>
>> This is because allocate_vpe_l1_table() calls kzalloc() to allocate
>> a cpumask_t when the first CPU of the second node of the 72-CPU
>> Grace system is brought up from the CPUHP_AP_MIPS_GIC_TIMER_STARTING
>> state inside the starting section of
> Surely *not* that particular state.

My mistake, it should be CPUHP_AP_IRQ_GIC_STARTING. There are three
static gic_starting_cpu() functions in the tree, which is what confused
me.


>> the CPU hotplug bringup pipeline, where interrupts are disabled. This
>> is an atomic context where sleeping is not allowed, and acquiring a
>> sleeping rt_spin_lock within kzalloc() may lead to a system hang if
>> there is lock contention.
>>
>> To work around this issue, when running a PREEMPT_RT kernel, a static
>> buffer is used for cpumask allocation via the newly introduced
>> vpe_alloc_cpumask() helper. The static buffer is currently 4 kbytes in
>> size. As only one cpumask is needed per node, that should be big
>> enough as long as (cpumask_size() * nr_node_ids) does not exceed 4k.
> What role does the node play here? The GIC topology has nothing to do
> with NUMA. It may be true on your particular toy, but that's
> definitely not true architecturally. You could, at worst, end up with
> one such cpumask per *CPU*. That'd be a braindead system, but this
> code is written to support the architecture, not any particular
> implementation.
>
That is just what I observed on the hardware I used to reproduce the
problem. I agree that it may be different on other arm64 systems.
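
Back-of-the-envelope, assuming a typical NR_CPUS=512 config without
CPUMASK_OFFSTACK (so cpumask_size() returns 512/8 = 64 bytes): two
nodes need just 2 * 64 = 128 bytes, but the architectural worst case
of one mask per CPU would need 72 * 64 = 4608 bytes on this very
machine, already overflowing the 4k buffer. Point taken.
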
>> Signed-off-by: Waiman Long <longman@...hat.com>
>> ---
>>   drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>> index ada585bfa451..9185785524dc 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
>>   	return true;
>>   }
>>   
>> +static void *vpe_alloc_cpumask(void)
>> +{
>> +	/*
>> +	 * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
>> +	 * may acquire a sleeping rt_spin_lock in an atomic context. So use
>> +	 * a pre-allocated buffer instead.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> +		static unsigned long mask_buf[512];
>> +		static atomic_t	alloc_idx;
>> +		int idx, mask_size = cpumask_size();
>> +		int nr_cpumasks = sizeof(mask_buf)/mask_size;
>> +
>> +		/*
>> +		 * Fetch an allocation index and if it points to a buffer within
>> +		 * mask_buf[], return that. Fall back to kzalloc() otherwise.
>> +		 */
>> +		idx = atomic_fetch_inc(&alloc_idx);
>> +		if (idx < nr_cpumasks)
>> +			return &mask_buf[idx * mask_size/sizeof(long)];
>> +	}
> Err, no. That's horrible. I can see three ways to address this in a
> more appealing way:
>
> - you give RT a generic allocator that works for (small) atomic
>    allocations. I appreciate that's not easy, and even probably
>    contrary to the RT goals. But I'm also pretty sure that the GIC code
>    is not the only pile of crap being caught doing that.
>
> - you pre-compute upfront how many cpumasks you are going to require,
>    based on the actual GIC topology. You do that on CPU0, outside of
>    the hotplug constraints, and allocate what you need. This is
>    difficult as you need to ensure the RD<->CPU matching without the
>    CPUs having booted, which means wading through the DT/ACPI gunk to
>    try and guess what you have.
>
> - you delay the allocation of L1 tables to a context where you can
>    perform allocations, and before we have a chance of running a guest
>    on this CPU. That's probably the simplest option (though dealing
>    with late onlining while guests are already running could be
>    interesting...).
>
> But I'm always going to say no to something that is a poor hack and
> ultimately falls back to the same broken behaviour.

Thanks for the suggestions. I will try the first alternative, a more
generic memory allocator.
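
Roughly what I have in mind is sketched below. It is untested, the
names (vpe_mask_pool, vpe_mask_pool_init) are made up for illustration,
and exactly where the init hook has to run relative to the GIC probe is
hand-waved. It also assumes that lib/genalloc's cmpxchg-based
allocation path remains lock-free under PREEMPT_RT, so that
gen_pool_alloc() stays safe in the interrupts-disabled hotplug path:

#include <linux/cpumask.h>
#include <linux/genalloc.h>
#include <linux/log2.h>
#include <linux/slab.h>

/* Hypothetical pool, pre-populated from a sleepable context. */
static struct gen_pool *vpe_mask_pool;

static int __init vpe_mask_pool_init(void)
{
	/* Cover the architectural worst case: one mask per possible CPU. */
	size_t size = cpumask_size() * num_possible_cpus();
	void *backing = kzalloc(size, GFP_KERNEL);

	if (!backing)
		return -ENOMEM;

	/* 8-byte granules; gen_pool_alloc() rounds requests up. */
	vpe_mask_pool = gen_pool_create(ilog2(sizeof(unsigned long)), -1);
	if (!vpe_mask_pool) {
		kfree(backing);
		return -ENOMEM;
	}

	/*
	 * Adding a chunk to the pool can sleep, so it must happen here
	 * rather than in the CPU hotplug starting section.
	 */
	return gen_pool_add(vpe_mask_pool, (unsigned long)backing, size, -1);
}
early_initcall(vpe_mask_pool_init);

static void *vpe_alloc_cpumask(void)
{
	/* Lockless bitmap allocation: no rt_spin_lock is taken here. */
	return (void *)gen_pool_alloc(vpe_mask_pool, cpumask_size());
}

The backing buffer is kzalloc()'ed and nothing ever frees a mask back
to the pool, so each mask handed out is zeroed, just as with the
current kzalloc() call.
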

Cheers,
Longman

>
> Thanks,
>
> 	M.
>

