linux-kernel - Re: RCU vs NOHZ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <809045ca-0adc-9ec2-af5b-03d68ff34c2b@joelfernandes.org>
Date:   Thu, 29 Sep 2022 12:05:45 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     paulmck@...nel.org, Peter Zijlstra <peterz@...radead.org>
Cc:     Frederic Weisbecker <fweisbec@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org, Boqun Feng <boqun.feng@...il.com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>
Subject: Re: RCU vs NOHZ



On 9/29/2022 11:46 AM, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 29, 2022 at 12:55:58PM +0200, Peter Zijlstra wrote:
>>> On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wro
[..]
>>>> And this of course means that any additional schemes to reduce RCU's
>>>> power consumption must be compared (with real measurements on real
>>>> hardware!) to Joel et al.'s work, whether in combination or as an
>>>> alternative.  And either way, the power savings must of course justify
>>>> the added code and complexity.
>>>
>>> Well, Joel's lazy scheme has the difficulty that you can wreck things by
>>> improperly marking the callback as lazy when there's an explicit
>>> dependency on it. The talk even called that out.
>>>
>>> I was hoping to construct a scheme that doesn't need the whole lazy
>>> approach.

Peter, when constructing such scheme, please do consider that the power savings
needs to be comparable to power testing done with large jiffies_till_first_fqs
values. Otherwise, such solution is 'not that good' (IMO). In other words, the
ideal savings is one you get when not having to ask for RCU's services too soon
(rather than optimizing RCU itself).  Of course, the tick being turned off also
could/should be optimized for when you do need RCU's services.

> I agree that this is a risk that must be addressed.

Right, it is encouraging to see that we're making good progress on this. And
also Thomas mentioned in LPC that if call_rcu() users are expecting time-bounded
callback invocation, then _that_ needs to be fixed.

thanks,

 - Joel


> 
>>> To recap; we want the CPU to go into deeper idle states, no?
>>>
>>> RCU can currently inhibit this by having callbacks pending for this CPU
>>> -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
>>> selected or less effective.
>>>
>>> Now, deep idle states actually purge the caches, so cache locality
>>> cannot be an argument to keep the callbacks local.
>>>
>>> We know when we're doing deep idle we stop the tick.
>>>
>>> So why not, when stopping the tick, move the RCU pending crud elsewhere
>>> and let the CPU get on with going idle instead of inhibiting the
>>> stopping of the tick and wrecking deep idle?
>>
>> Because doing so in the past has cost more energy than is saved.
> 
> And I should hasten to add that I have no intention of sending this
> commit upstream unless/until it is demonstrated to save real energy on
> real hardware.  In the meantime, please see below for an updated version
> that avoids indefinitely postponing the tick on systems having CPUs that
> enter and exit idle frequently.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit e30960e87d58db50bbe4fd09a2ff1e5eeeaad754
> Author: Paul E. McKenney <paulmck@...nel.org>
> Date:   Wed Sep 21 13:30:24 2022 -0700
> 
>     rcu: Let non-offloaded idle CPUs with callbacks defer tick
>     
>     When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
>     not RCU needs the scheduler-clock tick to keep interrupting.  Right now,
>     RCU keeps the tick on for a given idle CPU if there are any non-offloaded
>     callbacks queued on that CPU.
>     
>     But if all of these callbacks are waiting for a grace period to finish,
>     there is no point in scheduling a tick before that grace period has any
>     reasonable chance of completing.  This commit therefore delays the tick
>     in the case where all the callbacks are waiting for a specific grace
>     period to elapse.  In theory, this should result in a 50-70% reduction in
>     RCU-induced scheduling-clock ticks on mostly-idle CPUs.  In practice, TBD.
>     /bin/bash: fm: command not found
>     
>     Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
>     Cc: Peter Zijlstra <peterz@...radead.org>
> 
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 9bc025aa79a3..84e930c11065 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
>  		rcu_tasks_qs(current, (preempt)); \
>  	} while (0)
>  
> -static inline int rcu_needs_cpu(void)
> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>  {
>  	return 0;
>  }
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 70795386b9ff..3066e0975022 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -19,7 +19,7 @@
>  
>  void rcu_softirq_qs(void);
>  void rcu_note_context_switch(bool preempt);
> -int rcu_needs_cpu(void);
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
>  void rcu_cpu_stall_reset(void);
>  
>  /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..1930cee1ccdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -676,12 +676,40 @@ void __rcu_irq_enter_check_tick(void)
>   * scheduler-clock interrupt.
>   *
>   * Just check whether or not this CPU has non-offloaded RCU callbacks
> - * queued.
> + * queued that need immediate attention.
>   */
> -int rcu_needs_cpu(void)
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>  {
> -	return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> -		!rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> +	unsigned long j;
> +	unsigned long jlast;
> +	unsigned long jwait;
> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> +	struct rcu_segcblist *rsclp = &rdp->cblist;
> +
> +	// Disabled, empty, or offloaded means nothing to do.
> +	if (!rcu_segcblist_is_enabled(rsclp) ||
> +	    rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> +		*nextevt = KTIME_MAX;
> +		return 0;
> +	}
> +
> +	// Callbacks ready to invoke or that have not already been
> +	// assigned a grace period need immediate attention.
> +	if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> +	    !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> +		return 1;
> +
> +	// There are callbacks waiting for some later grace period.
> +	// Wait for about a grace period or two since the last tick, at which
> +	// point there is high probability that this CPU will need to do some
> +	// work for RCU.
> +	j = jiffies;
> +	jlast = __this_cpu_read(rcu_data.last_sched_clock);
> +	jwait = READ_ONCE(jiffies_till_first_fqs) + READ_ONCE(jiffies_till_next_fqs) + 1;
> +	if (time_after(j, jlast + jwait))
> +		return 1;
> +	*nextevt = basemono + TICK_NSEC * (jlast + jwait - j);
> +	return 0;
>  }
>  
>  /*
> @@ -2324,11 +2352,9 @@ void rcu_sched_clock_irq(int user)
>  {
>  	unsigned long j;
>  
> -	if (IS_ENABLED(CONFIG_PROVE_RCU)) {
> -		j = jiffies;
> -		WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
> -		__this_cpu_write(rcu_data.last_sched_clock, j);
> -	}
> +	j = jiffies;
> +	WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
> +	__this_cpu_write(rcu_data.last_sched_clock, j);
>  	trace_rcu_utilization(TPS("Start scheduler-tick"));
>  	lockdep_assert_irqs_disabled();
>  	raw_cpu_inc(rcu_data.ticks_this_gp);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index b0e3c9205946..303ea15cdb96 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
>  
>  static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>  {
> -	u64 basemono, next_tick, delta, expires;
> +	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
>  	unsigned long basejiff;
>  	unsigned int seq;
>  
> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>  	 * minimal delta which brings us back to this place
>  	 * immediately. Lather, rinse and repeat...
>  	 */
> -	if (rcu_needs_cpu() || arch_needs_cpu() ||
> +	if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
>  	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
>  		next_tick = basemono + TICK_NSEC;
>  	} else {
> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>  		 * disabled this also looks at the next expiring
>  		 * hrtimer.
>  		 */
> -		next_tick = get_next_timer_interrupt(basejiff, basemono);
> -		ts->next_timer = next_tick;
> +		next_tmr = get_next_timer_interrupt(basejiff, basemono);
> +		ts->next_timer = next_tmr;
> +		/* Take the next rcu event into account */
> +		next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
>  	}
>  
>  	/*