Message-ID: <4F9FA50C.9070600@wanadoo.fr>
Date: Tue, 01 May 2012 10:55:40 +0200
From: Pascal Chapperon <pascal.chapperon@...adoo.fr>
To: paulmck@...ux.vnet.ibm.com
CC: Josh Boyer <jwboyer@...hat.com>, linux-kernel@...r.kernel.org,
kernel-team@...oraproject.org
Subject: Re: RCU related performance regression in 3.3
On 01/05/2012 02:02, Paul E. McKenney wrote:
> On Fri, Apr 27, 2012 at 08:42:58PM -0700, Paul E. McKenney wrote:
>> On Fri, Apr 27, 2012 at 02:15:20PM +0200, Pascal Chapperon wrote:
>>> On 18/04/2012 17:23, Paul E. McKenney wrote:
>>>> On Wed, Apr 18, 2012 at 05:00:14PM +0200, Pascal Chapperon wrote:
>>>>> On 18/04/2012 16:01, Paul E. McKenney wrote:
>>>>>> On Wed, Apr 18, 2012 at 11:37:28AM +0200, Pascal Chapperon wrote:
>>>>>>> Mount and umount operations are not slower with RCU_FAST_NO_HZ during
>>>>>>> runtime; systemctl start and stop operations are also not slower. In
>>>>>>> fact, I couldn't find a single operation slower during runtime with
>>>>>>> RCU_FAST_NO_HZ.
>>>>>>
>>>>>> Your boot-time setup is such that all CPUs are online before the
>>>>>> boot-time mount operations take place, right?
>>>>> Yes:
>>>>> [ 0.242697] Brought up 8 CPUs
>>>>> [ 0.242699] Total of 8 processors activated (35118.33 BogoMIPS).
>>>>>
>>>>>> Struggling to understand
>>>>>> how RCU can tell the difference between post-CPU-bringup boot time
>>>>>> and run time...
>>>>>>
>>>>> systemd is controlling the whole boot process, including mount
>>>>> operations (apart from the root filesystem), and as far as I can
>>>>> see it makes heavy use of sockets to do so (not to mention
>>>>> cpu-affinity). It also controls most of the umount operations. Is
>>>>> it possible that your patch hits a systemd bug?
>>>>
>>>> Is it possible that systemd is using network operations that include
>>>> synchronize_rcu()? Then if you did the same operation from the
>>>> command line at runtime, you might not see the slowdown.
>>>>
>>>> Is it possible for you to convince systemd to collect RCU event tracing
>>>> during the slow operation? RCU event tracing is available under
>>>> /sys/kernel/debug/tracing/rcu.
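>>>>
>>>> For example, a minimal sketch (assuming CONFIG_RCU_TRACE is set and
>>>> the standard ftrace debugfs layout, where the rcu events live under
>>>> events/rcu/):
>>>>
>>>> echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
>>>> cat /sys/kernel/debug/tracing/trace > rcu-events.log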
>>>>
>>> I have collected the RCU event tracing during a slow boot with
>>> FAST_NO_HZ (and the same without FAST_NO_HZ, same kernel config).
>>> The full logs and associated "systemd-analyze plot" can be found
>>> (in comment 32) at:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=806548
>>>
>>> With FAST_NO_HZ, almost every rcu_prep_idle event is followed by a
>>> ksoftirqd wakeup (75000 ksoftirqd lines with FAST_NO_HZ, 4000 without).
>>>
>>> Sorry, the logs are huge, but I can't figure out which parts are of
>>> interest.
>>
>> Thank you for collecting them! I clearly will need to do some scripting. ;-)
>
> And it appears that your system is migrating timers without waking up
> the CPU on which the timer was posted. This explains the slowdowns:
> RCU assumes that the timer will either fire on the CPU that it was posted
> on or that that CPU will be awakened when it goes offline. If the timer
> does not fire on that CPU and that CPU is not otherwise awakened, then
> that CPU's RCU callbacks can be indefinitely postponed, which could account
> for the slowdowns that you were seeing.
>
> Please see below for a lightly tested patch that should address this
> problem, and thank you again for your patient testing efforts!
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make RCU_FAST_NO_HZ handle timer migration
>
> The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
> CPU goes offline, in which case it assumes that the CPU will have to come
> out of dyntick-idle mode (cancelling the timer) in order to go offline.
> This is important because when RCU_FAST_NO_HZ permits a CPU to enter
> dyntick-idle mode despite having RCU callbacks pending, it posts a timer
> on that CPU to force a wakeup on that CPU. This wakeup ensures that the
> CPU will eventually handle the end of the grace period, including invoking
> its RCU callbacks.
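>
> As a rough sketch of that arrangement (illustrative only: the helper
> name and the 4-jiffy delay are made up here, although rcu_idle_gp_timer
> is the real per-CPU timer used in the patch below):
>
> static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer);
>
> /* Hypothetical hook: CPU enters dyntick-idle with callbacks pending. */
> static void rcu_arm_idle_gp_timer(int cpu)
> {
>         /* Runs on "cpu", so the timer is posted locally on that CPU. */
>         mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), jiffies + 4);
> }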
>
> However, Pascal Chapperon's test setup shows that the timer handler
> rcu_idle_gp_timer_func() really does get invoked in some cases,
> evidently because the timer migrated and fired on some other CPU. This
> is problematic because the intended wakeup is then lost: the CPU that
> entered dyntick-idle mode despite still having RCU callbacks pending
> can remain in dyntick-idle mode indefinitely, which means that its RCU
> callbacks might never be invoked. This situation can result in
> grace-period delays or even system hangs, which matches Pascal's
> observations of slow boot-up and shutdown
> (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=806548
>
> This commit therefore causes the "should never be invoked" timer handler
> rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
> the CPU for which the timer was intended, allowing that CPU to invoke
> its RCU callbacks in a timely manner.
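>
> For reference, smp_call_function_single(cpu, func, info, wait) runs
> func(info) on the specified CPU by sending it an IPI, and that IPI is
> itself enough to bring a dyntick-idle CPU out of idle; passing wait=0
> means the timer handler does not block until the wakeup completes:
>
> smp_call_function_single(cpu, rcu_idle_demigrate, NULL, 0);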
>
> Reported-by: Pascal Chapperon<pascal.chapperon@...adoo.fr>
> Signed-off-by: Paul E. McKenney<paul.mckenney@...aro.org>
> Signed-off-by: Paul E. McKenney<paulmck@...ux.vnet.ibm.com>
> ---
>
> include/trace/events/rcu.h | 1 +
> kernel/rcutree_plugin.h | 23 ++++++++++++++++++++---
> 2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> index aaa55e1..1480900 100644
> --- a/include/trace/events/rcu.h
> +++ b/include/trace/events/rcu.h
> @@ -292,6 +292,7 @@ TRACE_EVENT(rcu_dyntick,
> * "More callbacks": Still more callbacks, try again to clear them out.
> * "Callbacks drained": All callbacks processed, off to dyntick idle!
> * "Timer": Timer fired to cause CPU to continue processing callbacks.
> + * "Demigrate": Timer fired on wrong CPU, woke up correct CPU.
> * "Cleanup after idle": Idle exited, timer canceled.
> */
> TRACE_EVENT(rcu_prep_idle,
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index dc12efc..bbd064a 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -1994,16 +1994,33 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu)
> }
>
> /*
> + * Wake up the CPU whose wakeup timer was migrated away from it;
> + * the tracing call records that the wakeup happened.
> + */
> +void rcu_idle_demigrate(void *unused)
> +{
> + trace_rcu_prep_idle("Demigrate");
> +}
> +
> +/*
> * Timer handler used to force CPU to start pushing its remaining RCU
> * callbacks in the case where it entered dyntick-idle mode with callbacks
> * pending. The handler doesn't really need to do anything because the
> * real work is done upon re-entry to idle, or by the next scheduling-clock
> * interrupt should idle not be re-entered.
> + *
> + * One special case: the timer gets migrated without awakening the CPU
> + * on which the timer was scheduled. In this case, we must wake up
> + * that CPU. We do so with smp_call_function_single().
> */
> -static void rcu_idle_gp_timer_func(unsigned long unused)
> +static void rcu_idle_gp_timer_func(unsigned long cpu_in)
> {
> - WARN_ON_ONCE(1); /* Getting here can hang the system... */
> + int cpu = (int)cpu_in;
> +
> trace_rcu_prep_idle("Timer");
> + if (cpu != smp_processor_id())
> + smp_call_function_single(cpu, rcu_idle_demigrate, NULL, 0);
> + else
> + WARN_ON_ONCE(1); /* Getting here can hang the system... */
> }
>
> /*
> @@ -2012,7 +2029,7 @@ static void rcu_idle_gp_timer_func(unsigned long unused)
> static void rcu_prepare_for_idle_init(int cpu)
> {
> setup_timer(&per_cpu(rcu_idle_gp_timer, cpu),
> - rcu_idle_gp_timer_func, 0);
> + rcu_idle_gp_timer_func, cpu);
> }
>
> /*
>
>
Paul, I can't apply your patch on top of the master branch; perhaps I
need to pull from your own git repository?
Among other things, you have:
static void rcu_idle_gp_timer_func(unsigned long unused)
and I have:
static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)
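
In case it is useful, here is a sketch (only a sketch: the wrapper
struct and its cpu field are hypothetical, not from your patch or my
tree) of how I imagine the same wakeup could be wired into the
hrtimer-based handler:

struct rcu_idle_gp_timer_wrap {
        struct hrtimer timer;
        int cpu;                /* CPU the timer was posted for */
};

static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)
{
        struct rcu_idle_gp_timer_wrap *w =
                container_of(hrtp, struct rcu_idle_gp_timer_wrap, timer);

        trace_rcu_prep_idle("Timer");
        if (w->cpu != smp_processor_id())
                smp_call_function_single(w->cpu, rcu_idle_demigrate, NULL, 0);
        else
                WARN_ON_ONCE(1);        /* Getting here can hang the system... */
        return HRTIMER_NORESTART;
}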
Pascal