linux-kernel - Re: [PATCH] tracing/osnoise: Force quiescent states while tracing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220301175621.GP4285@paulmck-ThinkPad-P17-Gen-1>
Date:   Tue, 1 Mar 2022 09:56:21 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Nicolas Saenz Julienne <nsaenzju@...hat.com>
Cc:     rostedt@...dmis.org, bristot@...nel.org, mingo@...hat.com,
        linux-kernel@...r.kernel.org, mtosatti@...hat.com
Subject: Re: [PATCH] tracing/osnoise: Force quiescent states while tracing

On Tue, Mar 01, 2022 at 11:00:08AM +0100, Nicolas Saenz Julienne wrote:
> On Mon, 2022-02-28 at 14:11 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 28, 2022 at 03:14:23PM +0100, Nicolas Saenz Julienne wrote:
> > > At the moment running osnoise on an isolated CPU and a PREEMPT_RCU
> > > kernel might have the side effect of extending grace periods too much.
> > > This will eventually entice RCU to schedule a task on the isolated CPU
> > > to end the overly extended grace period, adding unwarranted noise to the
> > > CPU being traced in the process.

Ah, I misread the above paragraph.  Apologies!

Nevertheless, could you please add something explicit to the effect that
RCU is completing grace periods as required?

> > > So, check if we're the only ones running on this isolated CPU and that
> > > we're on a PREEMPT_RCU setup. If so, let's force quiescent states in
> > > between measurements.

And yes, if you don't want RCU to try to forcibly extract a quiescent
state from you, you must supply a quiescent state to RCU.  ;-)

> > > Non-PREEMPT_RCU setups don't need to worry about this as osnoise main
> > > loop's cond_resched() will go though a quiescent state for them.
> > > 
> > > Note that this same exact problem is what extended quiescent states were
> > > created for. But adapting them to this specific use-case isn't trivial
> > > as it'll imply reworking entry/exit and dynticks/context tracking code.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne <nsaenzju@...hat.com>
> > > ---
> > >  kernel/trace/trace_osnoise.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index 870a08da5b48..4928358f6e88 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -21,7 +21,9 @@
> > >  #include <linux/uaccess.h>
> > >  #include <linux/cpumask.h>
> > >  #include <linux/delay.h>
> > > +#include <linux/tick.h>
> > >  #include <linux/sched/clock.h>
> > > +#include <linux/sched/isolation.h>
> > >  #include <uapi/linux/sched/types.h>
> > >  #include <linux/sched.h>
> > >  #include "trace.h"
> > > @@ -1295,6 +1297,7 @@ static int run_osnoise(void)
> > >  	struct osnoise_sample s;
> > >  	unsigned int threshold;
> > >  	u64 runtime, stop_in;
> > > +	unsigned long flags;
> > >  	u64 sum_noise = 0;
> > >  	int hw_count = 0;
> > >  	int ret = -1;
> > > @@ -1386,6 +1389,22 @@ static int run_osnoise(void)
> > >  					osnoise_stop_tracing();
> > >  		}
> > >  
> > > +		/*
> > > +		 * Check if we're the only ones running on this nohz_full CPU
> > > +		 * and that we're on a PREEMPT_RCU setup. If so, let's fake a
> > > +		 * QS since there is no way for RCU to know we're not making
> > > +		 * use of it.
> > > +		 *
> > > +		 * Otherwise it'll be done through cond_resched().
> > > +		 */
> > > +		if (IS_ENABLED(CONFIG_PREEMPT_RCU) &&
> > > +		    !housekeeping_cpu(raw_smp_processor_id(), HK_FLAG_MISC) &&
> > > +		    tick_nohz_tick_stopped()) {
> > > +			local_irq_save(flags);
> > > +			rcu_momentary_dyntick_idle();

And yes, rcu_momentary_dyntick_idle() is a good way to supply a quiescent
state to RCU.  This won't help Tasks RCU or Tasks Rude RCU, but you can
avoid those by avoiding changing tracing state while running osnoise.

> > > +			local_irq_restore(flags);
> > 
> > What is supposed to happen in this case is that RCU figures out that
> > there is a nohz_full CPU running for an extended period of time in the
> > kernel and takes matters into its own hands.  This goes as follows on
> > a HZ=1000 kernel with default RCU settings:
> > 
> > o	At about 20 milliseconds into the grace period, RCU makes
> > 	cond_resched() report quiescent states, among other things.
> > 	As you say, this does not help for CONFIG_PREEMPT=n kernels.
> > 
> > o	At about 30 milliseconds into the grace period, RCU forces an
> > 	explicit context switch on the wayward CPU.  This should get
> > 	the CPU's attention even in CONFIG_PREEMPT=y kernels.
> > 
> > So what is happening for you instead?
> 
> Well, that's exactly what I'm seeing, but it doesn't play well with osnoise.

Whew!!!  ;-)

> Here's a simplified view of what the tracer does:
> 
> 	time1 = get_time();
> 	while(1) {
> 		time2 = get_time();
> 		if (time2 - time1 > threshold)
> 			trace_noise();
> 		cond_resched();
> 		time1 = time2;
> 	}
> 
> This is pinned to a specific CPU, and in the most extreme cases is expected to
> take 100% of CPU time. Eventually, some SMI, NMI/interrupt, or process
> execution will trigger the threshold, and osnoise will provide some nice traces
> explaining what happened.
> 
> RCU forcing a context switch on the wayward CPU is introducing unwarranted
> noise as it's triggered by the fact we're measuring and wouldn't happen
> otherwise.
> 
> If this were user-space, we'd be in an EQS, which would make this problem go
> away. An option would be mimicking this behaviour (assuming irq entry/exit code
> did the right thing):
> 
> 	rcu_eqs_enter(); <--
> 	time1 = get_time();
> 	while(1) {
> 		time2 = get_time();
> 		if (time2 - time1 > threshold)
> 			trace_noise();
> 		rcu_eqs_exit(); <--
> 		cond_resched();
> 		rcu_eqs_enter(); <--
> 		time1 = time2;
> 	}
> 
> But given the tight loop this isn't much different than what I'm proposing at
> the moment, isn't it? rcu_momentary_dyntick_idle() just emulates a really fast
> EQS entry/exit.

And that is in fact exactly what rcu_momentary_dyntick_idle() was
intended for:

Acked-by: Paul E. McKenney <paulmck@...nel.org>