[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <16F11103-75C4-4DCE-8272-D619ECAA96BD@joelfernandes.org>
Date: Mon, 28 Aug 2023 17:51:09 -0400
From: Joel Fernandes <joel@...lfernandes.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Z qiang <qiang.zhang1211@...il.com>,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org
Subject: Re: [PATCH] rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()
> On Aug 28, 2023, at 11:17 AM, Frederic Weisbecker <frederic@...nel.org> wrote:
>
> Le Sat, Aug 26, 2023 at 06:06:20AM -0700, Paul E. McKenney a écrit :
>> On Sat, Aug 26, 2023 at 02:13:39PM +0800, Z qiang wrote:
>>>>
>>>> On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
>>>>>>
>>>>>>> On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
>>>>>>>> Currently, the maxcpu is set by traversing online CPUs, however, if
>>>>>>>> the rcutorture.onoff_holdoff is set zero and onoff_interval is set
>>>>>>>> non-zero, and the some CPUs with larger cpuid has been offline before
>>>>>>>> setting maxcpu, for these CPUs, even if they are online again, also
>>>>>>>> cannot be offload or deoffload.
>>>>>>>>
>>>>>>>> This commit therefore use for_each_possible_cpu() instead of
>>>>>>>> for_each_online_cpu() in rcu_nocb_toggle().
>>>>>>>>
>>>>>>>> Signed-off-by: Zqiang <qiang.zhang1211@...il.com>
>>>>>>>> ---
>>>>>>>> kernel/rcu/rcutorture.c | 2 +-
>>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
>>>>>>>> index a58372bdf0c1..b75d0fe558ce 100644
>>>>>>>> --- a/kernel/rcu/rcutorture.c
>>>>>>>> +++ b/kernel/rcu/rcutorture.c
>>>>>>>> @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
>>>>>>>> VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
>>>>>>>> while (!rcu_inkernel_boot_has_ended())
>>>>>>>> schedule_timeout_interruptible(HZ / 10);
>>>>>>>> - for_each_online_cpu(cpu)
>>>>>>>> + for_each_possible_cpu(cpu)
>>>>>>>
>>>>>>> Last I checked, bad things could happen if the code attempted to
>>>>>>> nocb_toggle a CPU that had not yet come online. Has that changed?
>>>>>>
>>>>>> For example, there are 8 online CPUs in the system, before we traversing online
>>>>>> CPUs and set maxcpu, CPU7 has been offline, this causes us to miss nocb_toggle
>>>>>> for CPU7(maxcpu=6)
>>>>>>
>>>>>> Even though we still use for_each_online_cpu(), the things described
>>>>>> above also happen. before we toggle the CPU, this CPU has been offline.
>>>>>
>>>>> Suppose we have a system whose possible CPUs are 0, 1, 2, and 3. However,
>>>>> only 0 and 1 are present in this system, and until some manual action is
>>>>> taken, only 0 and 1 will ever be online. (Yes, this really can happen!)
>>>>> In that state, won't toggling CPU 2 and 3 result in failures?
>>>>>
>>>
>>> Agree.
>>> As long as we enabled rcutorture.onoff_interval, regardless of whether we use
>>> online CPUs or possible CPUs to set maxcpu, It is all possible to
>>> toggling the CPUs failure
>>> and print "NOCB: Cannot CB-offload offline CPU" log. but the failures
>>> due to CPU offline are acceptable.
>>>
>>> but at least the toggling operation on CPU7 will not be missed. when
>>> CPU7 comes online again.
>>>
>>> Would it be better to use for_each_present_cpu() ?
>>
>> The problem we face is that RCU and rcutorture have no reasonable way
>> of knowing when the boot-time CPU bringup has completed. If there was a
>> way of knowing that, then my approach would be to make rcutorture react
>> to a holdoff of zero by waiting for all the CPUs to come online.
>>
>> Failing that, for_each_present_cpu() with a holdoff of zero will likely
>> get us transient failures between the time rcutorture starts and the
>> last CPU has come online.
>>
>> Or is there now a way for in-kernel code know when boot-time CPU onlining
>> has completed?
>
> We don't need to wait for all CPUs to be online though. Toggling
> already handles well failures due to offline CPUs and since toggling
> happens concurrently with offlining in rcutorture, we already see lots
> of failures reported in the logs.
I think the issue is the loop later in the function does
not try to toggle cpus that came online too late.
So it does not test offloading on all CPUs just because max got updated too late.
One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
ever got onlined. If it did, update the maxcpu.
Thanks.
>
> Thanks.
Powered by blists - more mailing lists