linux-kernel - Re: RCU vs NOHZ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-Id: <9BE889A7-C196-47A5-9C10-9409C0FD3DE2@joelfernandes.org>
Date:   Fri, 23 Sep 2022 15:47:06 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Boqun Feng <boqun.feng@...il.com>,
        Frederic Weisbecker <fweisbec@...il.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: RCU vs NOHZ

Hi  Paul,

On Fri, Sep 23, 2022 at 2:15 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> 
> On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
>> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>> [...]
>>>>> And here is an untested patch that in theory might allow much of the
>>>>> reduction in power with minimal complexity/overhead for kernels without
>>>>> rcu_nocbs CPUs.  On the off-chance you know of someone who would be
>>>>> willing to do a realistic evaluation of it.
>>>>>                                                    Thanx, Paul
>>>>> ------------------------------------------------------------------------
>>>>> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
>>>>> Author: Paul E. McKenney <paulmck@...nel.org>
>>>>> Date:   Wed Sep 21 13:30:24 2022 -0700
>>>>>    rcu: Let non-offloaded idle CPUs with callbacks defer tick
>>>>>    When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
>>>>>    not RCU needs the scheduler-clock tick to keep interrupting.  Right now,
>>>>>    RCU keeps the tick on for a given idle CPU if there are any non-offloaded
>>>>>    callbacks queued on that CPU.
>>>>>    But if all of these callbacks are waiting for a grace period to finish,
>>>>>    there is no point in scheduling a tick before that grace period has any
>>>>>    reasonable chance of completing.  This commit therefore delays the tick
>>>>>    in the case where all the callbacks are waiting for a specific grace
>>>>>    period to elapse.  In theory, this should result in a 50-70% reduction in
>>>>>    RCU-induced scheduling-clock ticks on mostly-idle CPUs.  In practice, TBD.
>>>>>    Signed-off-by: Paul E. McKenney <paulmck@...nel.org>
>>>>>    Cc: Peter Zijlstra <peterz@...radead.org>
>>>>> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
>>>>> index 9bc025aa79a3..84e930c11065 100644
>>>>> --- a/include/linux/rcutiny.h
>>>>> +++ b/include/linux/rcutiny.h
>>>>> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
>>>>>            rcu_tasks_qs(current, (preempt)); \
>>>>>    } while (0)
>>>>> -static inline int rcu_needs_cpu(void)
>>>>> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>>    return 0;
>>>>> }
>>>>> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>>>>> index 70795386b9ff..3066e0975022 100644
>>>>> --- a/include/linux/rcutree.h
>>>>> +++ b/include/linux/rcutree.h
>>>>> @@ -19,7 +19,7 @@
>>>>> void rcu_softirq_qs(void);
>>>>> void rcu_note_context_switch(bool preempt);
>>>>> -int rcu_needs_cpu(void);
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
>>>>> void rcu_cpu_stall_reset(void);
>>>>> /*
>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>> index 5ec97e3f7468..47cd3b0d2a07 100644
>>>>> --- a/kernel/rcu/tree.c
>>>>> +++ b/kernel/rcu/tree.c
>>>>> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
>>>>>  * scheduler-clock interrupt.
>>>>>  *
>>>>>  * Just check whether or not this CPU has non-offloaded RCU callbacks
>>>>> - * queued.
>>>>> + * queued that need immediate attention.
>>>>>  */
>>>>> -int rcu_needs_cpu(void)
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> -   return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
>>>>> -           !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
>>>>> +   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>>> +   struct rcu_segcblist *rsclp = &rdp->cblist;
>>>>> +
>>>>> +   // Disabled, empty, or offloaded means nothing to do.
>>>>> +   if (!rcu_segcblist_is_enabled(rsclp) ||
>>>>> +       rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
>>>>> +           *nextevt = KTIME_MAX;
>>>>> +           return 0;
>>>>> +   }
>>>>> +
>>>>> +   // Callbacks ready to invoke or that have not already been
>>>>> +   // assigned a grace period need immediate attention.
>>>>> +   if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
>>>>> +       !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
>>>>> +           return 1;> +
>>>>> +   // There are callbacks waiting for some later grace period.
>>>>> +   // Wait for about a grace period or two for the next tick, at which
>>>>> +   // point there is high probability that this CPU will need to do some
>>>>> +   // work for RCU.
>>>>> +   *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) > +                                           READ_ONCE(jiffies_till_next_fqs) + 1);
>>>> Looks like nice idea. Could this race with the main GP thread on another CPU
>>>> completing the grace period, then on this CPU there is actually some work to do
>>>> but rcu_needs_cpu() returns 0.
>>>> I think it is plausible but not common, in which case the extra delay is
>>>> probably Ok.
>>> Glad you like it!
>>> Yes, that race can happen, but it can also happen today.
>>> A scheduling-clock interrupt might arrive at a CPU just as a grace
>>> period finishes.  Yes, the delay is longer with this patch.  If this
>>> proves to be a problem, then the delay heuristic might expanded to
>>> include the age of the current grace period.
>>> But keeping it simple to start with.
>> Sure sounds good and yes I agree to the point of the existing issue
>> but the error is just 1 jiffie there as you pointed.
> 
> One jiffy currently, but it would typically be about seven jiffies with
> the patch

Yes exactly, that’s what I meant.

> .  Systems with smaller values of HZ would have fewer jiffies,
> and systems with more than 128 CPUs would have more jiffies.  Systems
> booted with explicit values for the rcutree.jiffies_till_first_fqs and
> rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
> the administrator wanted.  ;-)


Makes sense, thanks for clarifying.

> But the key point is that the grace period itself can be extended by
> that value just due to timing and distribution of idle CPUs.
> 
>>>> Also, if the RCU readers take a long time, then we'd still wake up the system
>>>> periodically although with the above change, much fewer times, which is a good
>>>> thing.
>>> And the delay heuristic could also be expanded to include a digitally
>>> filtered estimate of grace-period duration.  But again, keeping it simple
>>> to start with.  ;-)
>>> My guess is that offloading gets you more power savings, but I don't
>>> have a good way of testing this guess.
>> I could try to run turbostat on Monday on our Intel SoCs, and see how
>> it reacts, but I was thinking of tracing this first to see the
>> behavior. Another thing I was thinking of was updating (the future)
>> rcutop to see how many 'idle ticks' are RCU related, vs others; and
>> then see how this patch effects that.
> 
> Such testing would be very welcome, thank you!
> 
> This patch might also need to keep track of the last tick on a given
> CPU in order to prevent frequent short idle periods from indefinitely
> delaying the tick.

I know what you mean! I had the exact same issue with the lazy timer initially, now the behavior is any lazy enqueue which happened will piggy back onto the existing timer already running.

- Joel


> 
> 
>                                                        Thanx, Paul
> 
>> thanks,
>> - Joel
>>>>>    unsigned long basejiff;
>>>>>    unsigned int seq;
>>>>> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>>     * minimal delta which brings us back to this place
>>>>>     * immediately. Lather, rinse and repeat...
>>>>>     */
>>>>> -   if (rcu_needs_cpu() || arch_needs_cpu() ||
>>>>> +   if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
>>>>>        irq_work_needs_cpu() || local_timer_softirq_pending()) {
>>>>>            next_tick = basemono + TICK_NSEC;
>>>>>    } else {
>>>>> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>>             * disabled this also looks at the next expiring
>>>>>             * hrtimer.
>>>>>             */
>>>>> -           next_tick = get_next_timer_interrupt(basejiff, basemono);
>>>>> -           ts->next_timer = next_tick;
>>>>> +           next_tmr = get_next_timer_interrupt(basejiff, basemono);
>>>>> +           ts->next_timer = next_tmr;
>>>>> +           /* Take the next rcu event into account */
>>>>> +           next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
>>>>>    }
>>>>>    /*