Message-ID: <f1b83c6f-850e-42f7-9750-0a0d45e44623@nvidia.com>
Date: Tue, 10 Jun 2025 11:47:11 -0400
From: Joel Fernandes <joelagnelf@...dia.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
 Joel Fernandes <joel@...lfernandes.org>,
 Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
 Uladzislau Rezki <urezki@...il.com>, Steven Rostedt <rostedt@...dmis.org>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>,
 Xiongfeng Wang <wangxiongfeng2@...wei.com>, rcu@...r.kernel.org,
 bpf@...r.kernel.org
Subject: Re: [PATCH 2/2] rcu: Fix lockup when RCU reader used while IRQ
 exiting



On 6/10/2025 8:23 AM, Frederic Weisbecker wrote:
> On Mon, Jun 09, 2025 at 02:01:24PM -0400, Joel Fernandes wrote:
>> If rcu_read_unlock_special() is invoked while exiting an IRQ via
>> irq_exit(), we can lock up if an IPI is issued, because the IPI itself
>> triggers the irq_exit() path again, causing a recursive lockup.
>>
>> This is precisely what Xiongfeng found when invoking a BPF program on
>> the trace_tick_stop() tracepoint, as shown in the trace below. Fix this
>> by using context tracking to tell us if we're still in an IRQ: context
>> tracking keeps track of the IRQ until after the tracepoint, so it cures
>> the issue.
>>
>> irq_exit()
>>   __irq_exit_rcu()
>>     /* in_hardirq() returns false after this */
>>     preempt_count_sub(HARDIRQ_OFFSET)
>>     tick_irq_exit()
>>       tick_nohz_irq_exit()
>> 	    tick_nohz_stop_sched_tick()
>> 	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
>> 		   __bpf_trace_tick_stop()
>> 		      bpf_trace_run2()
>> 			    rcu_read_unlock_special()
>>                               /* will send an IPI to itself */
>> 			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>
>> A simple reproducer can also be obtained by doing the following in
>> tick_irq_exit(). It will hang on boot without the patch:
>>
>>   static inline void tick_irq_exit(void)
>>   {
>>  +	rcu_read_lock();
>>  +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
>>  +	rcu_read_unlock();
>>  +
>>
>> While at it, add some comments to this code.
>>
>> Reported-by: Xiongfeng Wang <wangxiongfeng2@...wei.com>
>> Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa070@huawei.com/
>> Tested-by: Xiongfeng Wang <wangxiongfeng2@...wei.com>
>> Signed-off-by: Joel Fernandes <joelagnelf@...dia.com>
> 
> Acked-by: Frederic Weisbecker <frederic@...nel.org>

Thanks.

> * What is the point of ->defer_qs_iw_pending ? If the irq work is already
>   queued, it won't be requeued because the irq work code already prevent from
>   that.

Sure, but maybe we should not even attempt to queue the irq_work if
defer_qs_iw_pending is already set? I understand there's no harm, but we'd
be relying on irq_work internals for the intended behavior.
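
For reference, the requeue protection we'd be leaning on is the claim step
in kernel/irq_work.c, which looks roughly like this (paraphrased from
memory, so the exact flags may differ in the tree this patch targets):

static bool irq_work_claim(struct irq_work *work)
{
	int oflags;

	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK,
				 &work->node.a_flags);
	/* Already pending: this queue attempt becomes a no-op. */
	if (oflags & IRQ_WORK_PENDING)
		return false;

	return true;
}

So a second irq_work_queue_on() while the work is pending is indeed
harmless; it just seems nicer for RCU not to rely on that implicitly.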

> 
> * CONFIG_PREEMPT_RT && !CONFIG_RCU_STRICT_GRACE_PERIOD would queue a lazy irq
>   work but still raise a hardirq to wake up softirq to handle it. It's pointless
>   because there is nothing to execute in softirq, all we care about is the
>   hardirq.
>   Also since the work is empty it might as well be executed in hard irq, that
>   shouldn't induce more latency in RT.

Oh, hm. So your irq_work_kick() on PREEMPT_RT would only trigger the hard irq?

That does make sense to me. Let's add the RT folks (Sebastian) as well to
confirm this behavior is sound.
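
For anyone skimming: the distinction under discussion is only how the kick
work is initialized. A sketch, reusing the empty kick_func() from your
patch below; both initializers exist in <linux/irq_work.h>:

/* Marked hard: runs straight from the raising hardirq, even on PREEMPT_RT. */
static DEFINE_PER_CPU(struct irq_work, hard_kick) = IRQ_WORK_INIT_HARD(kick_func);

/* Default init: on PREEMPT_RT this one is not run from the hardirq itself
 * but deferred, which is the pointless extra step you describe given the
 * handler is empty. */
static DEFINE_PER_CPU(struct irq_work, soft_kick) = IRQ_WORK_INIT(kick_func);

Since kick_func() does nothing, always using the hard variant as your patch
below does seems fine latency-wise, but an ack from the RT side would be
good to have.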

> 
> * Empty hard irq work raised to trigger something on irq exit also exist
>   elsewhere (see nohz_full_kick_func()). Would it make sense to have that
>   implemented in irq_work.c instead and trigger that through a simple
>   irq_work_kick()?

Yeah, sure. We'd probably need some serious testing to make sure we don't
break anything else, and perhaps some new test cases. From past experience,
this code path seems easy to break. But nice change!
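
For the new test cases, something as small as this would at least exercise
the new helpers on every CPU (just a sketch, assuming the irq_work_kick()/
irq_work_kick_on() signatures from your patch below; the selftest name is
made up):

static void irq_work_kick_selftest(void)
{
	int cpu;

	/* Local kick: returns false only if a kick is already pending here. */
	pr_info("irq_work_kick() queued: %d\n", irq_work_kick());

	/* Remote kicks: duplicates coalesce on the pending claim flag. */
	cpus_read_lock();
	for_each_online_cpu(cpu)
		irq_work_kick_on(cpu);
	cpus_read_unlock();
}

The interesting coverage is still the RCU side (deferred QS from
irq_exit()), which the tick_irq_exit() reproducer above already stresses.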

thanks,

 - Joel


> And then this would look like (only built-tested):
> 
> diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
> index 136f2980cba3..4149ed516524 100644
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -57,6 +57,9 @@ static inline bool irq_work_is_hard(struct irq_work *work)
>  bool irq_work_queue(struct irq_work *work);
>  bool irq_work_queue_on(struct irq_work *work, int cpu);
>  
> +bool irq_work_kick(void);
> +bool irq_work_kick_on(int cpu);
> +
>  void irq_work_tick(void);
>  void irq_work_sync(struct irq_work *work);
>  
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index 73f7e1fd4ab4..383a3e9050d9 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -181,6 +181,22 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
>  #endif /* CONFIG_SMP */
>  }
>  
> +static void kick_func(struct irq_work *work)
> +{
> +}
> +
> +static DEFINE_PER_CPU(struct irq_work, kick_work) = IRQ_WORK_INIT_HARD(kick_func);
> +
> +bool irq_work_kick(void)
> +{
> +	return irq_work_queue(this_cpu_ptr(&kick_work));
> +}
> +
> +bool irq_work_kick_on(int cpu)
> +{
> +	return irq_work_queue_on(per_cpu_ptr(&kick_work, cpu), cpu);
> +}
> +
>  bool irq_work_needs_cpu(void)
>  {
>  	struct llist_head *raised, *lazy;
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index a9a811d9d7a3..b33888071e41 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -191,7 +191,6 @@ struct rcu_data {
>  					/*  during and after the last grace */
>  					/* period it is aware of. */
>  	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
> -	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
>  	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
>  
>  	/* 2) batch handling */
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c0bbbbb686f..0c7b7c220b46 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -619,17 +619,6 @@ notrace void rcu_preempt_deferred_qs(struct task_struct *t)
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
>  }
>  
> -/*
> - * Minimal handler to give the scheduler a chance to re-evaluate.
> - */
> -static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> -{
> -	struct rcu_data *rdp;
> -
> -	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
> -	rdp->defer_qs_iw_pending = false;
> -}
> -
>  /*
>   * Handle special cases during rcu_read_unlock(), such as needing to
>   * notify RCU core processing or task having blocked during the RCU
> @@ -673,18 +662,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> -			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> +			    expboost && cpu_online(rdp->cpu)) {
>  				// Get scheduler to re-evaluate and call hooks.
>  				// If !IRQ_WORK, FQS scan will eventually IPI.
> -				if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> -				    IS_ENABLED(CONFIG_PREEMPT_RT))
> -					rdp->defer_qs_iw = IRQ_WORK_INIT_HARD(
> -								rcu_preempt_deferred_qs_handler);
> -				else
> -					init_irq_work(&rdp->defer_qs_iw,
> -						      rcu_preempt_deferred_qs_handler);
> -				rdp->defer_qs_iw_pending = true;
> -				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> +				irq_work_kick();
>  			}
>  		}
>  		local_irq_restore(flags);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c527b421c865..84170656334d 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -377,14 +377,6 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
>  	return true;
>  }
>  
> -static void nohz_full_kick_func(struct irq_work *work)
> -{
> -	/* Empty, the tick restart happens on tick_nohz_irq_exit() */
> -}
> -
> -static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) =
> -	IRQ_WORK_INIT_HARD(nohz_full_kick_func);
> -
>  /*
>   * Kick this CPU if it's full dynticks in order to force it to
>   * re-evaluate its dependency on the tick and restart it if necessary.
> @@ -396,7 +388,7 @@ static void tick_nohz_full_kick(void)
>  	if (!tick_nohz_full_cpu(smp_processor_id()))
>  		return;
>  
> -	irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
> +	irq_work_kick();
>  }
>  
>  /*
> @@ -408,7 +400,7 @@ void tick_nohz_full_kick_cpu(int cpu)
>  	if (!tick_nohz_full_cpu(cpu))
>  		return;
>  
> -	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> +	irq_work_kick_on(cpu);
>  }
>  
>  static void tick_nohz_kick_task(struct task_struct *tsk)
> 
>   
>   

