Message-ID: <8f6fc868-b420-bcf1-6b4d-1ca616aa6e4c@cn.fujitsu.com>
Date: Thu, 24 Jan 2019 15:00:37 +0800
From: Su Yue <suy.fnst@...fujitsu.com>
To: <paulmck@...ux.ibm.com>
CC: <linux-kernel@...r.kernel.org>, <josh@...htriplett.org>,
<rostedt@...dmis.org>, <mathieu.desnoyers@...icios.com>,
<jiangshanlai@...il.com>, "Li, Philip" <philip.li@...el.com>,
<lkp-developer@...ists.intel.com>
Subject: Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"
On 1/23/19 11:22 AM, Paul E. McKenney wrote:
> On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
>> Thanks for your quick reply! Paul
>>
>> On 1/22/19 12:01 PM, Paul E. McKenney wrote:
>>> On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
>>>> Hi, guys
>>>> While running rcutorture tests with "onoff_interval", some tests
>>>> failed and results show like:
>>>>
>>>> =====================================================================
>>>> [ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
>>>> nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
>>>> test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\
>>>> s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
>>>> test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
>>>> stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\
>>>> rrier_cbs=0 onoff_interval=3 onoff_holdoff=0
>>>> ====================================================================
>>>>
>>>> I am wondering about the meaning of "RCU_HOTPLUG". Is it expected
>>>> because CPU hotplug is enabled in the test? Or does it represent
>>>> another type of failure?
>>>
>>> This says that at least one CPU hotplug operation failed, that is,
>>> the CPU didn't actually come online or go offline as requested. If you
>>> are introducing CPU hotplug to an architecture, this usually indicates
>>> that you have bugs in your CPU-hotplug code. Or it might be that
>>
>> It should be the CPU-hotplug case, since there are no RCU CPU stall warnings.
>>
>>> RCU grace periods failed to progress -- though this would normally
>>> also result in RCU CPU stall warnings.
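For the archives: the banner string seems to be picked at cleanup time
from the hotplug counters. Roughly, and paraphrased from memory rather
than quoted verbatim, rcu_torture_cleanup() in kernel/rcu/rcutorture.c
does something like:

	/* Hotplug failures get their own end-of-test tag. */
	if (atomic_read(&n_rcu_torture_error) || n_rcu_torture_barrier_error)
		rcu_torture_print_module_parms(cur_ops, "End of test: FAILURE");
	else if (torture_onoff_failures())
		rcu_torture_print_module_parms(cur_ops,
					       "End of test: RCU_HOTPLUG");
	else
		rcu_torture_print_module_parms(cur_ops, "End of test: SUCCESS");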
>>>
>>> There should be lines containing "ver:" in your console output. What
>>> does one of the later ones say?
>>>
>>
>> The line says:
>> ======================================================================
>> [ 318.850175] busted_srcud-torture: rtc: (null) ver:
>> 27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
>> 0 rtbre: 0 rtbf: 0 rtb: 0 \
>> nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
>> (HZ=1000) barrier: 0/0:0
>
> Yes, you have many more offline attempts than successes, which is
> why RCU_HOTPLUG was printed.
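For anyone decoding that "ver:" line: the onoff fields appear to be
printed by torture_onoff_stats() in kernel/torture.c, roughly as in the
sketch below (paraphrased, not verbatim), so "2639/2639" above is online
successes/attempts and "2640/5310" is offline successes/attempts:

	/*
	 * onoff: online_ok/online_tries:offline_ok/offline_tries
	 *        min,max online : min,max offline (jiffies per operation)
	 *        total online : total offline (jiffies)
	 */
	pr_cont("onoff: %ld/%ld:%ld/%ld %d,%d:%d,%d %lu:%lu (HZ=%d) ",
		n_online_successes, n_online_attempts,
		n_offline_successes, n_offline_attempts,
		min_online, max_online, min_offline, max_offline,
		sum_online, sum_offline, HZ);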
>
>> =====================================================================
>>
>> And here are useful errors:
>> =====================================================================
>> kern :info : [ 135.379693] KVM setup async PF for cpu 1
>> kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
>> kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff
>
> Just so you know, busted_srcud can sometimes fail by design. Hence
> the "busted" in the name. But failure didn't happen this time.
>
Yes. The corner case I mentioned actually happens in every "onoff"
test, whatever the torture_type is.
>> task: onlined 1
>> kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff
>> task: offlining 1
>> kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
>> kern :info : [ 135.427940] smpboot: CPU 1 is now offline
>> kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff
>> task: offlined 1
>> kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff
>> task: offlining 0
>> kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff
>> task: offline 0 failed: errno -16
>> kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff
>> task: offlining 0
>> kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff
>> task: offline 0 failed: errno -16
>>
>>
>> =====================================================================
>> There are only two CPUs on the VM. The torture test tried to offline
>> the last one, but -EBUSY occurred.
>>
>> I spent some time reading kernel/torture.c.
>> There is torture_onoff():
>>
>>         while (!torture_must_stop()) {
>>                 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
>>                 if (!torture_offline(cpu,
>>                                      &n_offline_attempts, &n_offline_successes,
>>                                      &sum_offline, &min_offline, &max_offline))
>>                         torture_online(cpu,
>>                                        &n_online_attempts, &n_online_successes,
>>                                        &sum_online, &min_online, &max_online);
>>                 schedule_timeout_interruptible(onoff_interval);
>>         }
>>
>> torture_offline() and torture_online() don't check beforehand whether
>> the CPU being offlined is the only usable one.
>
> That does appear to be the case, and that would be a problem with
> the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.
>
> Good catch!
>
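For context on why the counters diverge: cpu_down() refuses to remove
the last online CPU and returns -EBUSY, while torture_offline() counts
the attempt before calling it. The relevant part of torture_offline()
in kernel/torture.c looks roughly like this (paraphrased, not verbatim):

	(*n_offl_attempts)++;			/* counted even if it fails */
	ret = cpu_down(cpu);
	if (ret) {
		if (verbose)
			pr_alert("%s" TORTURE_FLAG
				 "torture_onoff task: offline %d failed: errno %d\n",
				 torture_type, cpu, ret);
	} else {
		if (verbose > 1)
			pr_alert("%s" TORTURE_FLAG
				 "torture_onoff task: offlined %d\n",
				 torture_type, cpu);
		(*n_offl_successes)++;		/* only after a clean offline */
	}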
>> Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
>> there is only one online and hotpluggable CPU left, then
>> n_offline_successes != n_offline_attempts, which causes "End of test:
>> RCU_HOTPLUG".
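That comparison appears to be exactly what torture_onoff_failures() in
kernel/torture.c reports back to rcutorture; roughly, and paraphrased
rather than quoted:

	bool torture_onoff_failures(void)
	{
		/* Any on/off attempt that did not succeed counts as a failure. */
		return n_online_successes != n_online_attempts ||
		       n_offline_successes != n_offline_attempts;
	}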
>>
>> Did I misunderstand something above? Feel free to correct me.
>
> Does the following patch help?
>
Yes, there are no more "errno -16" messages in dmesg, and "End of test:
SUCCESS" appears at the end.
Thanks for your patch.
If the patch is to be sent formally, you can add:
Tested-by: Su Yue <suy.fnst@...fujitsu.com>
---
Su
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/torture.c b/kernel/torture.c
> index a03ff722352b..2b6700ca2a43 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>
> if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
> return false;
> + if (num_online_cpus() <= 1)
> + return false; /* Can't offline the last CPU. */
>
> if (verbose > 1)
> pr_alert("%s" TORTURE_FLAG
>
>
>