Message-ID: <8f6fc868-b420-bcf1-6b4d-1ca616aa6e4c@cn.fujitsu.com>
Date: Thu, 24 Jan 2019 15:00:37 +0800
From: Su Yue <suy.fnst@...fujitsu.com>
To: <paulmck@...ux.ibm.com>
CC: <linux-kernel@...r.kernel.org>, <josh@...htriplett.org>,
<rostedt@...dmis.org>, <mathieu.desnoyers@...icios.com>,
<jiangshanlai@...il.com>, "Li, Philip" <philip.li@...el.com>,
<lkp-developer@...ists.intel.com>
Subject: Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"
On 1/23/19 11:22 AM, Paul E. McKenney wrote:
> On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
>> Thanks for your quick reply! Paul
>>
>> On 1/22/19 12:01 PM, Paul E. McKenney wrote:
>>> On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
>>>> Hi, guys
>>>> While running rcutorture tests with "onoff_interval", some tests
>>>> failed and results show like:
>>>>
>>>> =====================================================================
>>>> [ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
>>>> nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
>>>> test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\
>>>> s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
>>>> test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
>>>> stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\
>>>> rrier_cbs=0 onoff_interval=3 onoff_holdoff=0
>>>> ====================================================================
>>>>
>>>> I am wondering about the meaning of "RCU_HOTPLUG". Is it expected
>>>> because CPU hotplug is enabled in the test? Or does it represent
>>>> another type of failure?
>>>
>>> This says that at least one CPU hotplug operation failed, that is,
>>> the CPU didn't actually come online or go offline as requested. If you
>>> are introducing CPU hotplug to an architecture, this usually indicates
>>> that you have bugs in your CPU-hotplug code. Or it might be that
>>
>> It should be the CPU-hotplug case, since there are no RCU CPU stall warnings.
>>
>>> RCU grace periods failed to progress -- though this would normally
>>> also result in RCU CPU stall warnings.
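For the archives: the banner string seems to be picked at cleanup time
from the hotplug counters. Roughly, and paraphrased from memory rather
than quoted verbatim, rcu_torture_cleanup() in kernel/rcu/rcutorture.c
does something like:

	/* Hotplug failures get their own end-of-test tag. */
	if (atomic_read(&n_rcu_torture_error) || n_rcu_torture_barrier_error)
		rcu_torture_print_module_parms(cur_ops, "End of test: FAILURE");
	else if (torture_onoff_failures())
		rcu_torture_print_module_parms(cur_ops,
					       "End of test: RCU_HOTPLUG");
	else
		rcu_torture_print_module_parms(cur_ops, "End of test: SUCCESS");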
>>>
>>> There should be lines containing "ver:" in your console output. What
>>> does one of the later ones say?
>>>
>>
>> The line says:
>> ======================================================================
>> [ 318.850175] busted_srcud-torture: rtc: (null) ver:
>> 27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
>> 0 rtbre: 0 rtbf: 0 rtb: 0 \
>> nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
>> (HZ=1000) barrier: 0/0:0
>
> Yes, you have many more offline attempts than successes, which is
> why RCU_HOTPLUG was printed.
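For anyone decoding that "ver:" line: the onoff fields appear to be
printed by torture_onoff_stats() in kernel/torture.c, roughly as in the
sketch below (paraphrased, not verbatim), so "2639/2639" above is online
successes/attempts and "2640/5310" is offline successes/attempts:

	/*
	 * onoff: online_ok/online_tries:offline_ok/offline_tries
	 *        min,max online : min,max offline (jiffies per operation)
	 *        total online : total offline (jiffies)
	 */
	pr_cont("onoff: %ld/%ld:%ld/%ld %d,%d:%d,%d %lu:%lu (HZ=%d) ",
		n_online_successes, n_online_attempts,
		n_offline_successes, n_offline_attempts,
		min_online, max_online, min_offline, max_offline,
		sum_online, sum_offline, HZ);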
>
>> =====================================================================
>>
>> And here are useful errors:
>> =====================================================================
>> kern :info : [ 135.379693] KVM setup async PF for cpu 1
>> kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
>> kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff
>
> Just so you know, busted_srcud can sometimes fail by design. Hence
> the "busted" in the name. But failure didn't happen this time.
>
Yes. The corner case I mentioned actually happens in every "onoff"
test, whatever the torture_type is.
>> task: onlined 1
>> kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff
>> task: offlining 1
>> kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
>> kern :info : [ 135.427940] smpboot: CPU 1 is now offline
>> kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff
>> task: offlined 1
>> kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff
>> task: offlining 0
>> kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff
>> task: offline 0 failed: errno -16
>> kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff
>> task: offlining 0
>> kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff
>> task: offline 0 failed: errno -16
>>
>>
>> =====================================================================
>> There are only two CPUs on the VM. The torture test tried to offline
>> the last one, but -EBUSY occurred.
>>
>> I spent some time reading kernel/torture.c.
>> There is torture_onoff():
>>
>>         while (!torture_must_stop()) {
>>                 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
>>                 if (!torture_offline(cpu,
>>                                      &n_offline_attempts, &n_offline_successes,
>>                                      &sum_offline, &min_offline, &max_offline))
>>                         torture_online(cpu,
>>                                        &n_online_attempts, &n_online_successes,
>>                                        &sum_online, &min_online, &max_online);
>>                 schedule_timeout_interruptible(onoff_interval);
>>         }
>>
>> torture_offline() and torture_online() don't check beforehand whether
>> the CPU being offlined is the only usable one.
>
> That does appear to be the case, and that would be a problem with
> the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.
>
> Good catch!
>
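For context on why the counters diverge: cpu_down() refuses to remove
the last online CPU and returns -EBUSY, while torture_offline() counts
the attempt before calling it. The relevant part of torture_offline()
in kernel/torture.c looks roughly like this (paraphrased, not verbatim):

	(*n_offl_attempts)++;			/* counted even if it fails */
	ret = cpu_down(cpu);
	if (ret) {
		if (verbose)
			pr_alert("%s" TORTURE_FLAG
				 "torture_onoff task: offline %d failed: errno %d\n",
				 torture_type, cpu, ret);
	} else {
		if (verbose > 1)
			pr_alert("%s" TORTURE_FLAG
				 "torture_onoff task: offlined %d\n",
				 torture_type, cpu);
		(*n_offl_successes)++;		/* only after a clean offline */
	}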
>> Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
>> there is only one online and hotpluggable CPU left, then
>> n_offline_successes != n_offline_attempts, which causes "End of test:
>> RCU_HOTPLUG".
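That comparison appears to be exactly what torture_onoff_failures() in
kernel/torture.c reports back to rcutorture; roughly, and paraphrased
rather than quoted:

	bool torture_onoff_failures(void)
	{
		/* Any on/off attempt that did not succeed counts as a failure. */
		return n_online_successes != n_online_attempts ||
		       n_offline_successes != n_offline_attempts;
	}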
>>
>> Did I misunderstand something above? Feel free to correct me.
>
> Does the following patch help?
>
Yes, there are no more "errno -16" messages in dmesg, and "End of test:
SUCCESS" appears at the end.
Thanks for your patch.
If the patch is to be sent formally, you can add:
Tested-by: Su Yue <suy.fnst@...fujitsu.com>
---
Su
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/torture.c b/kernel/torture.c
> index a03ff722352b..2b6700ca2a43 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>
> if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
> return false;
> + if (num_online_cpus() <= 1)
> + return false; /* Can't offline the last CPU. */
>
> if (verbose > 1)
> pr_alert("%s" TORTURE_FLAG
>
>
>