Date:	Wed, 24 Nov 2010 18:38:45 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	paulmck@...ux.vnet.ibm.com
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Ingo Molnar <mingo@...e.hu>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH 1/2] rcu: Don't chase unnecessary quiescent states after
 extended grace periods

2010/11/24 Paul E. McKenney <paulmck@...ux.vnet.ibm.com>:
> On Wed, Nov 24, 2010 at 04:45:11PM +0100, Frederic Weisbecker wrote:
>> 2010/11/24 Paul E. McKenney <paulmck@...ux.vnet.ibm.com>:
>> > On Wed, Nov 24, 2010 at 02:48:46PM +0100, Frederic Weisbecker wrote:
>> CPU 1, the one that was idle :-D
>>
>> So CPU 1's rdp did catch up with the node and state for its completed field,
>> but not for its gpnum yet.
>
> OK, I will need to take a closer look at the rdp->gpnum setting.

Ok, do you want me to resend the patch with the changelog updated according
to our discussion?


>> >> >> > Now we call rcu_check_quiescent_state() -> check_for_new_grace_period()
>> >> >> > -> note_new_gpnum() and then we end up with a requested quiescent state
>> >> >> > while every grace period is completed.
>> >> >>
>> >> >> Sorry, I should have described that in the changelogs, but my ideas
>> >> >> weren't as clear as they are now (at least I feel they are, which
>> >> >> doesn't mean they actually are ;). Chasing these RCU bugs for too many
>> >> >> hours has toasted my brain...
>> >> >
>> >> > Welcome to my world!!!  But keep in mind that an extra timer tick
>> >> > or two is much preferable to a potential hang!  And you only get
>> >> > the extra timer tick if there was some other reason that the
>> >> > CPU came out of nohz mode, correct?
>> >>
>> >> Yeah, either because of a timer, hrtimer, or a reschedule.
>> >> But you still generate a spurious softirq in this scheme.
>> >
>> > Eliminating spurious softirqs is a good thing, but let's review the
>> > overall priorities (probably missing a few, but this should give you
>> > an overall view):
>> >
>> > 1.      No too-short grace periods!!!
>>
>> Oh, I did not know that. Is it to avoid too many short series of
>> callback executions, or?
>
> To avoid breaking RCU's fundamental guarantee.  ;-)
>
> It is easy to create bugs where a CPU thinks that it needs to respond
> to the grace period based on a prior quiescent state, but unknown to
> it that grace period has finished and a new one has started.  This
> CPU might then wrongly report a quiescent state against the new grace
> period.  Needless to say, this is very very very very very very bad.

Aaah ok. I thought you were talking about the duration of a correct grace
period, not a bug that would make it seem shorter than it actually is.

Ok. Of course, priority number 1.
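To make Paul's invariant concrete, here is a toy C sketch of the check a CPU would need before reporting a quiescent state. All names here (`gp_state`, `qs_report_valid`) are invented for illustration and are not the kernel's actual code; the `gpnum`/`completed` fields just mirror the pair discussed above (a grace period n is in flight when gpnum == n and completed == n-1):

```c
#include <stdbool.h>

/* Invented stand-in for the rcu_state/rcu_node gpnum/completed pair. */
struct gp_state {
	unsigned long gpnum;     /* most recently started grace period */
	unsigned long completed; /* most recently completed grace period */
};

/*
 * A quiescent state observed under snapshot 'snap_gpnum' may only be
 * reported if that same grace period is still in progress: reporting
 * it against a newer grace period would end that GP too soon.
 */
bool qs_report_valid(const struct gp_state *cur, unsigned long snap_gpnum)
{
	return cur->gpnum == snap_gpnum && cur->completed != snap_gpnum;
}
```

If the snapshot is stale (the GP has completed, or a new one has started), the QS must be thrown away and re-observed against the current grace period.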


>> > 7.      Let CPUs that can do so go to sleep immediately (as opposed to
>> >        waiting a few milliseconds).
>> >
>> > 8.      Avoid spurious softirqs.
>>
>> Avoiding spurious softirqs brings us back to 7, even though it's not a matter
>> of ms but rather µs.
>
> Microseconds should not be a problem as long as they are rare.  Milliseconds
> are a problem for the battery-powered folks as well as for people who care
> about OS jitter.

Right.


>> >> Two in fact: one because of the rnp->completed != rsp->completed condition
>> >> in rcu_pending(), another one because when we update the gpnum, we
>> >> always start chasing a QS, regardless of the last GP being completed or not.
>> >
>> > OK -- is this an example of #8 above, or is it really #7?  I am absolutely
>> > not worried about a pair of back-to-back softirqs, and I don't believe
>> > that you should be, either.  ;-)
>>
>> This seems to be 8 and 7.
>>
>> A pair of back-to-back softirqs is no huge deal, but still time
>> consuming, spurious, etc... If we can easily spot that they are useless,
>> why not get rid of them? At least we can easily and reliably avoid the
>> second one from note_new_gpnum().
>
> As long as I can prove to myself that the patch that gets rid of them
> does not cause other problems.  ;-)

:)

>> >> > Which is why checking the rnp fields makes more sense to me, actually.
>> >> > Acquiring rnp->lock is much less painful than pinning down the rsp state.
>> >>
>> >> Right.
>> >>
>> >> Another thing: we already have the (rnp->gpnum != rdp->gpnum) check in
>> >> rcu_pending(), so why also check (rnp->completed != rdp->completed)?
>> >
>> > Because if (rnp->completed != rdp->completed), we might need to process
>> > some callbacks, either advancing them or invoking them.
>>
>> The rdp->nxttail in rcu is still an obscure part for me, so I just believe you :)
>
> It is just the place that the pending rcu_head callbacks are stored.

Yeah. I mean, I need to read how the code manages the different queues.
But __rcu_process_gp_end() seems to sum it up quite well.
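For the record, a toy model of what __rcu_process_gp_end() does conceptually: when a CPU notices that rdp->completed has fallen behind the node, callbacks that were waiting on the now-finished grace periods become invocable. The struct and function here are simplified stand-ins I made up, not the kernel's real segmented nxttail[] list:

```c
/* Invented two-bucket model of a CPU's callback queue. */
struct cb_queue {
	int waiting; /* callbacks queued behind a grace period in flight */
	int done;    /* callbacks whose grace period has completed */
};

/*
 * Called when the CPU learns from its rcu_node that grace periods up
 * to 'rnp_completed' have finished since its last snapshot: the
 * waiting callbacks ripen, and the CPU catches up with the node.
 */
void process_gp_end(struct cb_queue *q,
		    unsigned long *rdp_completed,
		    unsigned long rnp_completed)
{
	if (*rdp_completed != rnp_completed) {
		q->done += q->waiting;	/* those GPs are over */
		q->waiting = 0;
		*rdp_completed = rnp_completed;
	}
}
```

This is also why the (rnp->completed != rdp->completed) check in rcu_pending() matters: even with no new grace period to chase, there may be callbacks to advance or invoke.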

>> > By the way, have you introduced a config variable for your HPC dynticks
>> > changes?
>>
>> I will, because I'll have to touch some fast paths and I would prefer to
>> make that compile-conditional.
>>
>> But for now I don't care and am just experimenting with drafts :)
>
> Well, if any of the softirq/tick streamlining in RCU is risky, it needs
> to go under that same #ifdef!  ;-)

Sure, if something needs a reorg or a change even in the dyntick-hpc
off-case code, I'll propose that to you, like I did for these two patches.

But I'll try to keep the invasive changes that are only for dyntick-hpc
under that config.

>> I'm quite close to something that seems to work well BTW; those series of
>> softirqs were my latest problem, as they made rcu_pending() return 1 for
>> a good while and I couldn't stop the tick. After the 2nd patch it should be
>> fine now, I hope.
>
> By "couldn't stop the tick" you mean "the tick went on a few times more
> than you like" or do you really mean "couldn't ever stop the tick"?
> My guess is the former.  If there were problems like the latter, Arjan
> would probably have beat me up about it long ago!

It's more like couldn't ever stop the tick. But that doesn't concern mainline.
This is because I have a hook that prevents the tick from being stopped
until rcu_pending() == 0.

In mainline it doesn't prevent the CPU from going nohz idle though, because
the softirq is armed from the tick. Once the softirq is processed, the CPU
can go to sleep. On the next timer tick it would again raise the softirq and
then could again go to sleep, etc..

I still have a trace of that with my rcu_pending() hook in dyntick-hpc: it
kept returning 1 for at least 100 seconds, on each tick.
I did not dig much further into my own code, as I immediately switched to
tip:master to check whether the problem came from my code or not.
And then I discovered that rcu_pending() indeed kept returning 1 for some
while in mainline too (I don't remember how long "some while" could be,
though). I saw all these spurious rcu softirqs on each tick, caused by
rcu_pending(), and for random time slices: probably between a wakeup from
idle and the next grace period, if my theory is right. And I think that
happened mostly with the bh flavour, probably because it's subject to
fewer grace periods.

And this is what the second patch fixes in mainline, and it also seems to
fix my issue in dyntick-hpc.

It probably happened more easily on dyntick-hpc, as I was calling
rcu_pending() after calling rcu_enter_nohz() (a buggy part of mine).

>> >  Longer term, __rcu_pending() for your HPC dynticks case
>> > could check for the current CPU having any callbacks before the call to
>> > check_cpu_stalls(), as in rcu_needs_cpu_quick_check().  That way, the
>> > CPUs with callbacks would drive the RCU core state machine.
>>
>> Ah, because currently every CPU calling rcu_pending() can be pulled into
>> handling the state machine because of this check, right?
>>
>>         /* Has an RCU GP gone long enough to send resched IPIs &c? */
>>         if (rcu_gp_in_progress(rsp) &&
>>             ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
>>                 rdp->n_rp_need_fqs++;
>>                 return 1;
>>
>> So the goal would be to leave this job to the CPUs that have callbacks, right?
>
> In the dynticks-HPC case, yes.
>
>> > We don't
>> > necessarily want this in the common case because it can increase
>> > grace-period latency
>>
>> Because it means fewer CPUs handle the grace-period state machine?
>
> Not quite.  Here I am not worried about the dyntick-HPC case, but rather
> the usual more-runnable-tasks-than-CPUs case.  Here, CPUs are context
> switching frequently, so if they report their own quiescent states (even
> when they don't happen to have any callbacks queued) the grace period will
> complete more quickly.  After all, force_quiescent_state() doesn't even
> get started until the third tick following the start of the grace period.


Ah, I see what you mean. So you would suggest ignoring even those explicit
QS reports, in dyntick-hpc mode, for CPUs that don't have callbacks?

Why not keep them?
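If I understood your __rcu_pending() suggestion, the gate would look roughly like this. This is only a hedged sketch of the idea, with made-up names and flags (the real change would sit in __rcu_pending() before the jiffies_force_qs check quoted above, plus the corresponding force_quiescent_state() adjustments):

```c
#include <stdbool.h>

/*
 * Sketch: should this CPU help drive the force-quiescent-state
 * machinery?  In the dyntick-HPC case, only CPUs that actually have
 * callbacks queued take part, so callback-less CPUs can keep their
 * ticks stopped; in the common case every CPU helps, to keep
 * grace-period latency low.
 */
bool should_drive_fqs(bool gp_in_progress,
		      bool fqs_deadline_passed,
		      bool cpu_has_callbacks,
		      bool dyntick_hpc)
{
	if (dyntick_hpc && !cpu_has_callbacks)
		return false;	/* stay out of the core state machine */
	return gp_in_progress && fqs_deadline_passed;
}
```

The trade-off you describe falls out directly: in the common case the callback-less CPUs still answer true and the grace period completes sooner, while in dyntick-hpc mode they answer false and can sleep.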


>> > but it could be very useful in the HPC-dyntick
>> > case -- eliminate any number of sources of spurious ticks with low
>> > risk to the high-priority RCU properties.  There are some corresponding
>> > changes required on the force_quiescent_state() path, but these are
>> > reasonably straightforward.
>>
>> Ok, I may have a try at it (if I completely understand what you suggest :-)
>> But first I'll try to get that dyntick mode with the current state, which seems
>> to be enough for the basic desired functionality.
>
> If the two changes you have posted thus far are all you need, this does
> sound like a good starting point.

For now, yeah, it seems to work well. I may run into further surprises though :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/