linux-kernel - Re: rcutorture: meaning of "End of test: RCU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20190124205452.GU4240@linux.ibm.com>
Date:   Thu, 24 Jan 2019 12:54:52 -0800
From:   "Paul E. McKenney" <paulmck@...ux.ibm.com>
To:     Su Yue <suy.fnst@...fujitsu.com>
Cc:     linux-kernel@...r.kernel.org, josh@...htriplett.org,
        rostedt@...dmis.org, mathieu.desnoyers@...icios.com,
        jiangshanlai@...il.com, "Li, Philip" <philip.li@...el.com>,
        lkp-developer@...ists.intel.com
Subject: Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

On Thu, Jan 24, 2019 at 03:00:37PM +0800, Su Yue wrote:
> On 1/23/19 11:22 AM, Paul E. McKenney wrote:
> >On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
> >>Thanks for your quick reply! Paul
> >>
> >>On 1/22/19 12:01 PM, Paul E. McKenney wrote:
> >>>On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
> >>>>Hi, guys
> >>>>   While running rcutorture tests with "onoff_interval", some tests
> >>>>failed and results show like:
> >>>>
> >>>>=====================================================================
> >>>>[  316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
> >>>>nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
> >>>>test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\
> >>>>s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
> >>>>test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
> >>>>stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\
> >>>>rrier_cbs=0 onoff_interval=3 onoff_holdoff=0
> >>>>====================================================================
> >>>>
> >>>>I am wondering that meaning of "RCU_HOTPLUG". Is it expected because
> >>>>cpu hotplug is enabled in the test? Or just represents another type of
> >>>>failure?
> >>>
> >>>This says that at least one CPU hotplug operation failed, that is,
> >>>the CPU didn't actually come online or go offline as requested.  If you
> >>>are introducing CPU hotplug to an architecture, this usually indicates
> >>>that you have bugs in your CPU-hotplug code.  Or it nmight be that
> >>
> >>It should hit the case since there is no RCU CPU stall warnings.
> >>
> >>>RCU grace periods failed to progress -- though this would normally
> >>>also result in RCU CPU stall warnings.
> >>>
> >>>There should be lines containing "ver:" in your console output.  What
> >>>does one of the later one of these say?
> >>>
> >>
> >>The line says:
> >>======================================================================
> >>[  318.850175] busted_srcud-torture: rtc:           (null) ver:
> >>27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
> >>0 rtbre: 0 rtbf: 0 rtb: 0 \
> >>nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
> >>(HZ=1000) barrier: 0/0:0
> >
> >Yes, you have many more offline attempts than successes, which is
> >why RCU_HOTPLUG was printed.
> >
> >>=====================================================================
> >>
> >>And here are useful errors:
> >>=====================================================================
> >>kern  :info  : [  135.379693] KVM setup async PF for cpu 1
> >>kern  :info  : [  135.381412] kvm-stealtime: cpu 1, msr 23fd16180
> >>kern  :alert : [  135.386897] busted_srcud-torture:torture_onoff
> >
> >Just so your know, busted_srcud can sometimes fail by design.  Hence
> >the "busted" in the name.  But failure didn't happen this time.
> >
> 
> Yes..The corner case I mentioned actually happened in every "onoff"
> tests whatever the torture_type is.
> 
> >>task: onlined 1
> >>kern  :alert : [  135.408241] busted_srcud-torture:torture_onoff
> >>task: offlining 1
> >>kern  :info  : [  135.423310] Unregister pv shared memory for cpu 1
> >>kern  :info  : [  135.427940] smpboot: CPU 1 is now offline
> >>kern  :alert : [  135.430106] busted_srcud-torture:torture_onoff
> >>task: offlined 1
> >>kern  :alert : [  135.436404] busted_srcud-torture:torture_onoff
> >>task: offlining 0
> >>kern  :alert : [  135.446173] busted_srcud-torture:torture_onoff
> >>task: offline 0 failed: errno -16
> >>kern  :alert : [  135.453076] busted_srcud-torture:torture_onoff
> >>task: offlining 0
> >>kern  :alert : [  135.457461] busted_srcud-torture:torture_onoff
> >>task: offline 0 failed: errno -16
> >>
> >>
> >>=====================================================================
> >>There are only two CPUs on the VM. Torture try to offline the last one
> >>but -EBUSY occured.
> >>
> >>I spent time to understand kernel/torture.c.
> >>There is torture_onoff():
> >>
> >>225        while (!torture_must_stop()) {
> >>226                cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
> >>227                if (!torture_offline(cpu,
> >>228                                     &n_offline_attempts,
> >>&n_offline_successes,
> >>229                                     &sum_offline, &min_offline,
> >>&max_offline))
> >>230                        torture_online(cpu,
> >>231                                       &n_online_attempts,
> >>&n_online_successes,
> >>232                                       &sum_online, &min_online,
> >>&max_online);
> >>233                schedule_timeout_interruptible(onoff_interval);
> >>234        }
> >>235
> >>
> >>torture_offline() and torture_offline() don't pre judge if the current
> >>cpu is only one usable.
> >
> >That does appear to be the case, and that would be a problem with
> >the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.
> >
> >Good catch!
> >
> >>Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
> >>there are only one oneline and hotplugable cpux, then
> >>n_offline_successes != n_offline_attempts which caused "End of test:
> >>RCU_HOTPLUG".
> >>
> >>Does I misunderstand something above? Feel free to correct me.
> >
> >Does the following patch help?
> >
> 
> Yes, no more "errnor: -16" in dmesg and "End of test: SUCCESS" is in
> the end.
> 
> Thanks for your patch.
> If the patch is to be sent in format, you can add:
> 
> Tested-By: Su Yue <suy.fnst@...fujitsu.com>

Very good, applied!  And especially thank you for finding this and
bringing it to my attention!!!

							Thanx, Paul

> ---
> Su
> >							Thanx, Paul
> >
> >------------------------------------------------------------------------
> >
> >diff --git a/kernel/torture.c b/kernel/torture.c
> >index a03ff722352b..2b6700ca2a43 100644
> >--- a/kernel/torture.c
> >+++ b/kernel/torture.c
> >@@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> >  	if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
> >  		return false;
> >+	if (num_online_cpus() <= 1)
> >+		return false;  /* Can't offline the last CPU. */
> >  	if (verbose > 1)
> >  		pr_alert("%s" TORTURE_FLAG
> >
> >
> >
> 
>