Message-ID: <e7354668-2573-4564-834b-44d76d983222@nvidia.com>
Date: Thu, 5 Jun 2025 14:56:25 -0400
From: Joel Fernandes <joelagnelf@...dia.com>
To: paulmck@...nel.org, Xiongfeng Wang <wangxiongfeng2@...wei.com>
Cc: Joel Fernandes <joel@...lfernandes.org>, ankur.a.arora@...cle.com,
Frederic Weisbecker <frederic@...nel.org>, Boqun Feng
<boqun.feng@...il.com>, neeraj.upadhyay@...nel.org, urezki@...il.com,
rcu@...r.kernel.org, linux-kernel@...r.kernel.org, xiqi2@...wei.com,
"Wangshaobo (bobo)" <bobo.shaobowang@...wei.com>,
Xie XiuQi <xiexiuqi@...wei.com>
Subject: Re: [QUESTION] problems report: rcu_read_unlock_special() called in
irq_exit() causes dead loop
On 6/4/2025 8:26 AM, Paul E. McKenney wrote:
>>>>>>> Or just don't send subsequent self-IPIs if we just sent one for the
>>>>>>> rdp. Chances are, if we did not get the scheduler's attention with
>>>>>>> the first one, we won't with subsequent ones either, I think. Plus we
>>>>>>> already send other IPIs if the grace period is overextended (from the
>>>>>>> FQS loop); maybe we can tweak that?
>>>>>> Thanks a lot for your reply. I think it would be hard for me to fix this
>>>>>> issue as described above without introducing new bugs; I barely understand
>>>>>> the RCU code. But I'm very glad to help test any code modification you
>>>>>> need tested. I have the VM and the syzkaller benchmark that can reproduce
>>>>>> the problem.
>>>>> Sure, I understand. This is already incredibly valuable, so thank you again.
>>>>> I will ask for your testing help soon. I also have a test module now that
>>>>> can sort of reproduce this. I'll keep you posted!
>>>>
>>>> Oh sorry, I meant to ask: could you provide the full kernel log, and is
>>>> there a standalone syzkaller reproducer binary one can run in a VM?
>>
>> Sorry, I checked with the team that maintains the syzkaller tools. They said
>> I can't send the syzkaller binary outside the company. But I can help
>> reproduce the problem; it's neither complicated nor time-consuming.
>>
>> I found the original log, which used kernel v6.6, but it's incomplete.
>> I then reproduced the problem using the latest kernel.
>> Both logs are attached.
>>
> Looking at both the v6.6 version and Joel's fix, I am forced to conclude
> that this bug has been there for a very long time. Thank you for your
> testing efforts and Joel for the fix!
Thanks. I am still polishing the fix Xiongfeng tested and hope to have it out
next week for review. As we discussed, I will split the context-tracking API
change into a separate patch, and will also add a separate documentation
comment patch explaining why we need the irq_work.
thanks,
- Joel