linux-kernel - Re: [Bug #12650] Strange load average and ksoftirqd behavior with 2.6.29-rc2-git1

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 17 Feb 2009 05:34:23 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Damien Wyart <damien.wyart@...e.fr>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mike Galbraith <efault@....de>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>
Subject: Re: [Bug #12650] Strange load average and ksoftirqd behavior with
	2.6.29-rc2-git1

On Mon, Feb 16, 2009 at 02:39:44PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 16, 2009 at 09:09:23PM +0100, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@...ux.vnet.ibm.com> wrote:
> > 
> > > Here the calls to rcu_process_callbacks() are only 75 
> > > microseconds apart, so that this function is consuming more 
> > > than 10% of a CPU.  The strange thing is that I don't see a 
> > > raise_softirq() in between, though perhaps it gets inlined or 
> > > something that makes it invisible to ftrace.
> > 
> > look at the latest trace please, that has even the most inline 
> > raise-softirq method instrumented, so all the raising is 
> > visible.
> 
> Ah, my apologies!  This time looking at:
> 
> http://damien.wyart.free.fr/ksoftirqd_pb/trace_tip_2009.02.16_ksoftirqd_pb_abstime_proc.txt.gz
> 
> 
>   799.521187 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521371 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521555 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521738 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521934 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522068 |   1)  ksoftir-2324  |               |                rcu_check_callbacks() {
>   799.522208 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522392 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522575 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522759 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522956 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523074 |   1)  ksoftir-2324  |               |                  rcu_check_callbacks() {
>   799.523214 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523397 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523579 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523762 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523960 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524079 |   1)  ksoftir-2324  |               |                  rcu_check_callbacks() {
>   799.524220 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524403 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524587 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524770 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
> [ . . . ]
> 
> Yikes!!!
> 
> Why is rcu_check_callbacks() being invoked so often?  It should be called
> but once per jiffy, and here it is called no less than 22 times in about
> 3.5 milliseconds, meaning one call every 160 microseconds or so.
> 
> Hmmm...
> 
> Looks like we never return from:
> 
>   799.521142 |   1)    <idle>-0    |          | tick_nohz_stop_sched_tick() {
> 
> Perhaps we are taking an interrupt immediately after the
> local_irq_restore()?  And at 799.521209 deciding to exit nohz mode.
> And then deciding to go back into nohz mode at 799.521326, 117
> microseconds later, after which we re-invoke rcu_check_callbacks(),
> which again raises RCU's softirq.
> 
> And the reason we are invoking rcu_check_callbacks() so often appears
> to be in in arch/x86/kernel/process_32.c cpu_idle() near line 107,
> which explains my failure to reproduce on a 64-bit system:
> 
> 	void cpu_idle(void)
> 	{
> 		int cpu = smp_processor_id();
> 
> 		current_thread_info()->status |= TS_POLLING;
> 
> 		/* endless idle loop with no priority at all */
> 		while (1) {
> 			tick_nohz_stop_sched_tick(1);
> 			while (!need_resched()) {
> 
> 				check_pgt_cache();
> 				rmb();
> 
> 				if (rcu_pending(cpu))
> 					rcu_check_callbacks(cpu, 0);
> 
> 				if (cpu_is_offline(cpu))
> 					play_dead();
> 
> 				local_irq_disable();
> 				__get_cpu_var(irq_stat).idle_timestamp = jiffies;
> 				/* Don't trace irqs off for idle */
> 				stop_critical_timings();
> 				pm_idle();
> 				start_critical_timings();
> 			}
> 			tick_nohz_restart_sched_tick();
> 			preempt_enable_no_resched();
> 			schedule();
> 			preempt_disable();
> 		}
> 	}
> 
> If we go in and out of nohz mode quickly, we will invoke rcu_pending()
> each time.  I would expect rcu_pending() to return 0 most of the time,
> but that apparently isn't the case with treercu...
> 
> What is the easiest way for me to make it easy to trace the return path
> from __rcu_pending()?  Make each return path call an empty function
> located off where the compiler cannot see it, I guess...  Diagnostic
> patch along these lines below.  Frederic, Damien, could you please give
> it a go?  (And of course please let me know if something else is
> needed.)


No, you don't need that, you can use ftrace_printk, it will generate a C-comment like
inside the functions, ie:

__rcu_pending() {
	 /* pending_qs */
}

I've converted your below patch with ftrace_printks and tested it under an old P2
with rcu_tree and 1000 Hz. I made a trace during an idle state, and well, looks like I'm
lucky :-) 
I guess I successfully reproduced the softirq/rcu overhead.
Please find the below patch to trace the rcu_pending return path, as well as the trace I made.
Sorry, the trace is a bit buggy with sometimes flying orphans C like comments.
When I will have more time, I will fix that.

The trace is here http://dl.free.fr/uyWGgCbx4

It looks like it mostly returns 1 because of the waiting for quiescent state:

$ cat rcutrace | grep "/* pending_none" | wc -l
221
$ cat rcutrace | grep "/* pending_qs" | wc -l
248
$ cat rcutrace | grep "/* pending" | wc -l
469


diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b2fd602..c9e78f6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -45,6 +45,7 @@
 #include <linux/cpu.h>
 #include <linux/mutex.h>
 #include <linux/time.h>
+#include <linux/ftrace.h>
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 static struct lock_class_key rcu_lock_key;
@@ -1249,31 +1250,44 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 	check_cpu_stall(rsp, rdp);
 
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
-	if (rdp->qs_pending)
+	if (rdp->qs_pending) {
+		ftrace_printk("pending_qs\n");
 		return 1;
+	}
 
 	/* Does this CPU have callbacks ready to invoke? */
-	if (cpu_has_callbacks_ready_to_invoke(rdp))
+	if (cpu_has_callbacks_ready_to_invoke(rdp)) {
+		ftrace_printk("pending_ready_invoke\n");
 		return 1;
+	}
 
 	/* Has RCU gone idle with this CPU needing another grace period? */
-	if (cpu_needs_another_gp(rsp, rdp))
+	if (cpu_needs_another_gp(rsp, rdp)) {
+		ftrace_printk("pending_gp\n");
 		return 1;
+	}
 
 	/* Has another RCU grace period completed?  */
-	if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */
+	if (ACCESS_ONCE(rsp->completed) != rdp->completed) {/* outside of lock */
+		ftrace_printk("pending_gp_completed\n");
 		return 1;
+	}
 
 	/* Has a new RCU grace period started? */
-	if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */
+	if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) { /* outside of lock */
+		ftrace_printk("pending_gp_new_started\n");
 		return 1;
+	}
 
 	/* Has an RCU GP gone long enough to send resched IPIs &c? */
 	if (ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum) &&
 	    ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0 ||
-	     (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0))
+	     (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0)) {
+		ftrace_printk("pending_ipi\n");
 		return 1;
+	}
 
+	ftrace_printk("pending_none\n");
 	/* nothing to do */
 	return 0;
 }

 
> Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> ---
> 
>  rcupdate.c |   23 +++++++++++++++++++++++
>  rcutree.c  |   31 +++++++++++++++++++++++++------
>  2 files changed, 48 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index d92a76a..42bbf03 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -175,3 +175,26 @@ void __init rcu_init(void)
>  	__rcu_init();
>  }
>  
> +void __rcu_pending_qs_pending(void)
> +{
> +}
> +
> +void __rcu_pending_callbacks_ready(void)
> +{
> +}
> +
> +void __rcu_pending_needs_gp(void)
> +{
> +}
> +
> +void __rcu_pending_new_completed(void)
> +{
> +}
> +
> +void __rcu_pending_new_gp(void)
> +{
> +}
> +
> +void __rcu_pending_fqs(void)
> +{
> +}
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index b2fd602..e2d72c3 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1234,6 +1234,13 @@ void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
>  }
>  EXPORT_SYMBOL_GPL(call_rcu_bh);
>  
> +extern void __rcu_pending_qs_pending(void);
> +extern void __rcu_pending_callbacks_ready(void);
> +extern void __rcu_pending_needs_gp(void);
> +extern void __rcu_pending_new_completed(void);
> +extern void __rcu_pending_new_gp(void);
> +extern void __rcu_pending_fqs(void);
> +
>  /*
>   * Check to see if there is any immediate RCU-related work to be done
>   * by the current CPU, for the specified type of RCU, returning 1 if so.
> @@ -1249,30 +1256,42 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
>  	check_cpu_stall(rsp, rdp);
>  
>  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> -	if (rdp->qs_pending)
> +	if (rdp->qs_pending) {
> +		__rcu_pending_qs_pending();
>  		return 1;
> +	}
>  
>  	/* Does this CPU have callbacks ready to invoke? */
> -	if (cpu_has_callbacks_ready_to_invoke(rdp))
> +	if (cpu_has_callbacks_ready_to_invoke(rdp)) {
> +		__rcu_pending_callbacks_ready();
>  		return 1;
> +	}
>  
>  	/* Has RCU gone idle with this CPU needing another grace period? */
> -	if (cpu_needs_another_gp(rsp, rdp))
> +	if (cpu_needs_another_gp(rsp, rdp)) {
> +		__rcu_pending_needs_gp();
>  		return 1;
> +	}
>  
>  	/* Has another RCU grace period completed?  */
> -	if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */
> +	if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */ {
> +		__rcu_pending_new_completed();
>  		return 1;
> +	}
>  
>  	/* Has a new RCU grace period started? */
> -	if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */
> +	if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */ {
> +		__rcu_pending_new_gp();
>  		return 1;
> +	}
>  
>  	/* Has an RCU GP gone long enough to send resched IPIs &c? */
>  	if (ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum) &&
>  	    ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0 ||
> -	     (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0))
> +	     (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0)) {
> +		__rcu_pending_fqs();
>  		return 1;
> +	}
>  
>  	/* nothing to do */
>  	return 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/