Message-Id: <9086EA4F-0393-4E01-AE74-1B428904BFCE@joelfernandes.org>
Date: Wed, 7 Jan 2026 22:35:44 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Joel Fernandes <joelagnelf@...dia.com>,
Paul E McKenney <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
rcu@...r.kernel.org, Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
Josh Triplett <josh@...htriplett.org>, Uladzislau Rezki <urezki@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>, Zqiang <qiang.zhang@...ux.dev>,
Shuah Khan <shuah@...nel.org>, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, Kai Yao <yaokai34@...wei.com>,
Tengda Wu <wutengda2@...wei.com>
Subject: Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
> On Jan 7, 2026, at 8:35 PM, Joel Fernandes <joel@...lfernandes.org> wrote:
>
>
>
>> On Jan 7, 2026, at 8:02 PM, Joel Fernandes <joel@...lfernandes.org> wrote:
>>
>>
>>
>>>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@...nel.org> wrote:
>>>
>>> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
>>>> From: Yao Kai <yaokai34@...wei.com>
>>>>
>>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
>>>> __rcu_read_unlock()") removes the recursion-protection code from
>>>> __rcu_read_unlock(). As a result, with ftrace enabled, an infinite loop
>>>> (deadloop) can be triggered through raise_softirq_irqoff() as follows:
>>>>
>>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
>>>> Modules linked in: my_irq_work(O)
>>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
>>>> Tainted: [O]=OOT_MODULE
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
>>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
>>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
>>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
>>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
>>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
>>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
>>>> FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
>>>> PKRU: 55555554
>>>> Call Trace:
>>>> <IRQ>
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> unwind_next_frame+0x203/0x9b0
>>>> __unwind_start+0x15d/0x1c0
>>>> arch_stack_walk+0x62/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> raise_softirq_irqoff+0x6e/0xa0
>>>> rcu_read_unlock_special+0xb1/0x160
>>>> __is_insn_slot_addr+0x54/0x70
>>>> kernel_text_address+0x48/0xc0
>>>> __kernel_text_address+0xd/0x40
>>>> unwind_get_return_address+0x1e/0x40
>>>> arch_stack_walk+0x9c/0xf0
>>>> stack_trace_save+0x48/0x70
>>>> __ftrace_trace_stack.constprop.0+0x144/0x180
>>>> trace_buffer_unlock_commit_regs+0x6d/0x220
>>>> trace_event_buffer_commit+0x5c/0x260
>>>> trace_event_raw_event_softirq+0x47/0x80
>>>> __raise_softirq_irqoff+0x61/0x80
>>>> __flush_smp_call_function_queue+0x115/0x420
>>>> __sysvec_call_function_single+0x17/0xb0
>>>> sysvec_call_function_single+0x8c/0xc0
>>>> </IRQ>
>>>>
>>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
>>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
>>>> setting a flag before calling irq_work_queue_on(). We fix this issue by
>>>> setting the same flag before calling raise_softirq_irqoff() and renaming
>>>> the flag to defer_qs_pending to reflect its more general use.
>>>>
>>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
>>>> Reported-by: Tengda Wu <wutengda2@...wei.com>
>>>> Signed-off-by: Yao Kai <yaokai34@...wei.com>
>>>> Reviewed-by: Joel Fernandes <joelagnelf@...dia.com>
>>>> Signed-off-by: Joel Fernandes <joelagnelf@...dia.com>
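
[ Note for readers following along: the shape of the fix, as I read the quoted
patch, is roughly the sketch below. This is a paraphrase rather than the actual
diff; the helper name is made up, and the field/constant names
(defer_qs_pending, DEFER_QS_PENDING, defer_qs_iw) are approximate. ]

/*
 * Paraphrased sketch: mark the deferred-QS request as pending *before*
 * arming either mechanism, so that a recursive rcu_read_unlock_special()
 * triggered from the raise_softirq tracepoint (or from the IRQ-work
 * enqueue path) sees the flag and does not re-arm -- which was the
 * deadloop in the splat above.
 */
static void rcu_arm_deferred_qs(struct rcu_data *rdp, bool use_softirq)
{
        if (rdp->defer_qs_pending == DEFER_QS_PENDING)
                return;                 /* Already armed, do not recurse. */
        rdp->defer_qs_pending = DEFER_QS_PENDING;       /* Set flag first... */
        if (use_softirq)
                raise_softirq_irqoff(RCU_SOFTIRQ);      /* ...then raise softirq, */
        else
                irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu); /* ...or queue IRQ work. */
}
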
>>>
>>> Looks good but, BTW, what happens if rcu_qs() is called
>>> before rcu_preempt_deferred_qs() had a chance to be called?
>>
>> Could you provide an example of when that can happen?
>>
>> If rcu_qs() results in reporting of a quiescent state up the node tree before the deferred reporting work had a chance to act, then indeed we should be clearing the flag (after canceling the pending raise_softirq_irqoff()).
>>
>> As far as I can see, even if that were to happen (and I think you are right that it can), we will still go through the path to report deferred quiescent states and cancel the pending work (reset the flag).
>>
>>> current->rcu_read_unlock_special.b.need_qs is reset by rcu_qs()
>>> so subsequent calls to rcu_read_unlock() won't issue rcu_read_unlock_special()
>>> (unless the task is blocked). And further calls to rcu_preempt_deferred_qs()
>>> through rcu_core() will be ignored as well.
>>
>> I am not sure this implies that the deferred quiescent state gets cancelled just because we have already called unlock once. We still have to go through the deferred quiescent state path on all subsequent quiescent state reporting, even if need_qs is reset. How else would the GP complete?
>>>
>>> But rdp->defer_qs_pending will remain in the DEFER_QS_PENDING state until
>>> the next grace period. And if rcu_read_unlock_special() is called again
>>> during the next GP on unfortunate place needing deferred qs, the state machine
>>> will spuriously assume that either rcu_core or the irq_work are pending, when
>>> none are anymore.
>>>
>>> The state should be reset by rcu_qs().
>>
>> In fact I would say that if a deferred QS is pending, we should absolutely not reset its state from rcu_qs().
>>
>> Maybe we should reset it from rcu_report_qs_rdp/rnp?
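
[ Expanding on the above while re-reading this thread: what I mean by resetting
from rcu_report_qs_rdp()/rnp() is roughly the untested strawman below; the
helper name is made up and the field/state names (defer_qs_pending,
DEFER_QS_PENDING/DEFER_QS_IDLE) are approximate. ]

/*
 * Strawman only (untested): the idea is to call this from
 * rcu_report_qs_rdp(), where the normal QS for this CPU has just been
 * reported up the tree, so any still-armed deferred-QS marking is stale
 * and can be cleared so a reader in the next GP can re-arm.
 */
static void rcu_clear_deferred_qs(struct rcu_data *rdp)
{
        lockdep_assert_irqs_disabled();
        if (rdp->defer_qs_pending == DEFER_QS_PENDING)
                rdp->defer_qs_pending = DEFER_QS_IDLE;
}
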
>>
>> Unfortunately, all of this is coming from me being on a phone and not at a computer, so I will revise my response, but probably tomorrow, because today the human body is not cooperating.
>>
>> thanks,
>>
>> - Joel
>
>
> By the way, when I last tried doing the reset from rcu_qs(), it did not fix the original bug with the IRQ work recursion.
>
> I found that it was always resetting the flag, and it is probably not the right place to do it in the first place anyway.

I think we need to reset the flag in rcu_report_exp_rdp() as well, when exp_hint
is set and we have reported the expedited QS (rough sketch below).

I am working on a series to cover all the cases and will send an RFC soon.
However, the patch we are reviewing here can go in for this merge window, and I
am preparing the rest (further improvements) for the next merge window, if that
sounds good.
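
For reference, what I have in mind for the expedited side is roughly the
untested sketch below, to be called from the end of rcu_report_exp_rdp(); the
helper name is made up and the flag/state names (defer_qs_pending,
DEFER_QS_PENDING/DEFER_QS_IDLE) are approximate pending the RFC:

/* Untested strawman, called from rcu_report_exp_rdp() for the local rdp. */
static void rcu_exp_clear_deferred_qs(struct rcu_data *rdp, struct task_struct *t)
{
        /*
         * If the reader armed the deferred QS because of an expedited GP
         * (exp_hint) and we have now reported the expedited QS for this
         * CPU, the pending marking is stale: clear it so the next
         * deferred-QS request can re-arm the softirq/IRQ work.
         */
        if (t->rcu_read_unlock_special.b.exp_hint &&
            rdp->defer_qs_pending == DEFER_QS_PENDING)
                rdp->defer_qs_pending = DEFER_QS_IDLE;
}
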
Thanks!
- Joel
>
> Thanks,
>
> - Joel
>>>
>>> Thanks.
>>>
>>> --
>>> Frederic Weisbecker
>>> SUSE Labs
>>>