linux-kernel - Re: [RFC PATCH 47/86] rcu: select PREEMPT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <89797f8c-bd1e-4b6f-8efd-44cd0b633523@paulmck-laptop>
Date:   Thu, 7 Dec 2023 20:28:22 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org, peterz@...radead.org,
        torvalds@...ux-foundation.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, jon.grimm@....com,
        bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, David.Laight@...lab.com,
        richard@....at, mjguzik@...il.com,
        Simon Horman <horms@...ge.net.au>,
        Julian Anastasov <ja@....bg>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>
Subject: Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT

On Thu, Dec 07, 2023 at 08:44:57AM -0500, Steven Rostedt wrote:
> On Wed, 6 Dec 2023 20:34:11 -0800
> "Paul E. McKenney" <paulmck@...nel.org> wrote:
> 
> > > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > > in their fastpaths might have workloads that have a contrary opinion.  
> > > 
> > > I don't understand your above statement. Maybe I wasn't clear with my
> > > statement? The above is more about PREEMPT_FULL, as it currently will
> > > preempt immediately. My above comment is that we can have an option for
> > > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > > it would at least hold off until there's no mutex held. Who cares if it's a
> > > fast path when a task needs to give up the CPU for another task? What I
> > > worry about is scheduling out while holding a mutex which increases the
> > > chance of that mutex being contended upon. Which does have drastic impact
> > > on performance.  
> > 
> > As I understand the current mutex_lock() code, the fastpaths leave no
> > scheduler-visible clue that a mutex is in fact held.  If there is no
> > such clue, it is quite likely that those fastpaths will need to do some
> > additional clue-leaving work, increasing their overhead.  And while it
> > is always possible that this overhead will be down in the noise, if it
> > was too far down in the noise there would be no need for those fastpaths.
> > 
> > So it is possible (but by no means certain) that some workloads will end
> > up caring.
> 
> OK, that makes more sense, and I do agree with that statement. It would
> need to do something like spin locks do with preempt disable, but I agree,
> this would need to be done in a way not to cause performance regressions.

Whew!  ;-)

> > > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > > increases latency.  Yes, capping the preemption latency is a wonderful
> > > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > > natural that there would be a corresponding level of concern about the
> > > > > > cases where removing the cond_resched() calls increases latency.    
> > > > > 
> > > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > > the system had to already be in a situation that a schedule needs to
> > > > > happen. There's lots of places in the kernel that run for over a tick
> > > > > without any cond_resched(). The cond_resched() is usually added for
> > > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > > or showed up in some analysis that had a latency that was much greater than
> > > > > a tick).    
> > > > 
> > > > For non-real-time workloads, the average case is important, not just the
> > > > worst case.  In the new lazily preemptible mode of thought, a preemption
> > > > by a non-real-time task will wait a tick.  Earlier, it would have waited
> > > > for the next cond_resched().  Which, in the average case, might have
> > > > arrived much sooner than one tick.  
> > > 
> > > Or much later. It's random. And what's nice about this model, we can add
> > > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > > say "this task needs to preempt immediately" and not just for RT tasks.
> > > 
> > > This allows the user to decide which task preempts more and which does not
> > > (defined by the scheduler), instead of some random cond_resched() that can
> > > also preempt a higher priority task that just finished its quota to run a
> > > low priority task causing latency for the higher priority task.
> > > 
> > > This is what I mean by "think differently".  
> > 
> > I did understand your meaning, and it is a source of some concern.  ;-)
> > 
> > When things become sufficiently stable, larger-scale tests will of course
> > be needed, not just different thought..
> 
> Fair enough.
> 
> > 
> > > > > The point is, if/when we switch to the new preemption model, we would need
> > > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > > added today, should no longer exist with this new model.    
> > > > 
> > > > This I agree with.  Also, with the new paradigm and new mode of thought
> > > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > > that consumes more than a tick of CPU time per iteration.  
> > > 
> > > Why does that matter? Is the loop not important? Why stop it from finishing
> > > for some random task that may not be important, and cond_resched() has no
> > > idea if it is or not.  
> > 
> > Because if it takes more than a tick to reach the next cond_resched(),
> > lazy preemption is likely to preempt before that cond_resched() is
> > reached.  Which suggests that such a cond_resched() would not be all
> > that valuable in the new thought paradigm.  Give or take potential issues
> > with exactly where the preemption happens.
> 
> I'm just saying there's lots of places that the above happens, which is why
> we are still scattering cond_resched() all over the place.

And I agree that greatly reducing (if not eliminating) such scattering
is a great benefit of lazy preemption.

> > > > > > There might be others as well.  These are the possibilities that have
> > > > > > come up thus far.
> > > > > >     
> > > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > > reasons.      
> > > > > > 
> > > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > > be useful and necessary?  ;-)    
> > > > > 
> > > > > They are only useful and necessary because of the way we handle preemption
> > > > > today. With the new preemption model, they are all likely to be useless and
> > > > > unnecessary ;-)    
> > > > 
> > > > The "all likely" needs some demonstration.  I agree that a great many
> > > > of them would be useless and unnecessary.  Maybe even the vast majority.
> > > > But that is different than "all".  ;-)  
> > > 
> > > I'm betting it is "all" ;-) But I also agree that this "needs some
> > > demonstration". We are not there yet, and likely will not be until the
> > > second half of next year. So we have plenty of time to speak rhetorically
> > > to each other!  
> > 
> > You know, we usually find time to engage in rhetorical conversation.  ;-)
> > 
> > > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > > seen. I really believe that those problems would never had been seen if
> > > > > the new paradigm had already been in place.    
> > > > 
> > > > Indeed, that sort of wording does quite the opposite of raising my
> > > > confidence levels.  ;-)  
> > > 
> > > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > > really do like that word. It's just fun to say (and spell)! Paradigm,
> > > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > > be like gnu, and pronounce it when speaking about free software? Although,
> > > that makes the word sound worse. :-p  
> > 
> > Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!
> 
>  Pair a' two-bits : that's all it's worth
> 
> Or
> 
>  Pair a' two-cents : as it's my two cents that I'm giving.

I must confess that the occasional transliteration of paradigm to
pair-of-dimes has been a great sanity-preservation device over the
decades.  ;-)

> > > > You know, the ancient Romans would have had no problem dealing with the
> > > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > > intelligence and machine learning, and who knows what all else.  As the
> > > > Romans used to say, "Beware of geeks bearing grifts."
> > > >   
> > > > > > >   3) Looking at the initial problem Ankur was trying to solve there is
> > > > > > >      absolutely no acceptable solution to solve that unless you think
> > > > > > >      that the semantically invers 'allow_preempt()/disallow_preempt()'
> > > > > > >      is anywhere near acceptable.      
> > > > > > 
> > > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > > argument, you need to find someone else to argue with.  ;-)    
> > > > > 
> > > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > > was a mistake by Ankur to add those removals this early (and he has
> > > > > acknowledged that mistake).    
> > > > 
> > > > OK, that I can live with.  But that seems to be a bit different of a
> > > > take than that of some earlier emails in this thread.  ;-)  
> > > 
> > > Well, we are also stating the final goal as well. I think there's some
> > > confusion to what's going to happen immediately and what's going to happen
> > > in the long run.  
> > 
> > If I didn't know better, I might suspect that in addition to the
> > confusion, there are a few differences of opinion.  ;-)
> 
> Confusion enhances differences of opinion.

That can happen, but then again confusion can also result in the
mere appearance of agreement.  ;-)

> > > > > First we need to get the new preemption modeled implemented. When it is, it
> > > > > can be just a config option at first. Then when that config option is set,
> > > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > > between them at run time as they are just a way to tell the scheduler when
> > > > > to set NEED_RESCHED_LAZY vs NEED_RSECHED.    
> > > > 
> > > > Assuming CONFIG_PREEMPT_RCU=y, agreed.  With CONFIG_PREEMPT_RCU=n,
> > > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > > Which is fine.  
> > > 
> > > But why? Because the run time switches of NONE and VOLUNTARY are no
> > > different than FULL.
> > > 
> > > Why I say that? Because:
> > > 
> > > For all modes, NEED_RESCHED_LAZY is set, the kernel has one tick to get out
> > > or NEED_RESCHED will be set (of course that one tick may be configurable).
> > > Once the NEED_RESCHED is set, then the kernel is converted to PREEMPT_FULL.
> > > 
> > > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > > after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> > > than PREEMPT_FULL.
> > > 
> > > So why make the difference between CONFIG_PREEMPT_RCU=n and limit to only
> > > NONE and VOLUNTARY. It must work with FULL or it will be broken for NONE
> > > and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.  
> > 
> > Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> > combination.  All of the gains from PREEMPT_FULL=y are more than lost
> > due to PREEMPT_RCU=n, especially when the kernel decides to do something
> > like walk a long task list under RCU protection.  We should not waste
> > people's time getting burned by this combination, nor should we waste
> > cycles testing it.
> 
> The issue I see here is that PREEMPT_RCU is not something that we can
> convert at run time, where the NONE, VOLUNTARY, FULL (and more to come) can
> be. And you have stated that PREEMPT_RCU adds some more overhead that
> people may not care about. But even though you say PREEMPT_RCU=n makes no
> sense with PREEMPT_FULL, it doesn't mean we should not allow it. Especially
> if we have to make sure that it still works (even NONE and VOLUNTARY turn
> to FULL after that one-tick).
> 
> Remember, what we are looking at is having:
> 
> N : NEED_RESCHED - schedule at next possible location
> L : NEED_RESCHED_LAZY - schedule when going into user space.
> 
> When to set what for a task needing to schedule?
> 
>  Model           SCHED_OTHER         RT/DL(or user specified)
>  -----           -----------         ------------------------
>  NONE                 L                         L
>  VOLUNTARY            L                         N
>  FULL                 N                         N
> 
> By saying FULL, you are saying that you want the SCHED_OTHER as well as
> RT/DL tasks to schedule as soon as possible and not wait to going into user
> space. This is still applicable even with PREEMPT_RCU=n
> 
> It may be that someone wants better latency for all tasks (like VOLUNTARY)
> but not the overhead that PREEMPT_RCU gives, and is OK with the added
> latency as a result.

Given the additional testing burden and given the likelihood that it won't
do what people want, let's find someone who really needs it (as opposed
to someone who merely wants it) before allowing it to be selected.
It is after all an easy check far from any fastpath to prevent the
combination of PREEMPT_RCU and PREEMPT_FULL.

							Thanx, Paul