Message-ID: <87pm1c3wbn.ffs@tglx>
Date: Wed, 18 Oct 2023 15:16:12 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: paulmck@...nel.org
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Ankur Arora <ankur.a.arora@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
willy@...radead.org, mgorman@...e.de, rostedt@...dmis.org,
jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
jgross@...e.com, andrew.cooper3@...rix.com,
Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Paul!
On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote:
> Belatedly calling out some RCU issues. Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made. The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain
> non-preemptible.
Why? Either I'm confused or you or both of us :)
With this approach the kernel is by definition fully preemptible, which
means rcu_read_lock() is preemptible too. That's pretty much the
same situation as with PREEMPT_DYNAMIC.
For throughput's sake this fully preemptible kernel provides a mechanism
to delay preemption for SCHED_OTHER tasks, i.e. instead of setting
NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY.
That means the preemption points in preempt_enable() and return from
interrupt to kernel will not see NEED_RESCHED and the tasks can run to
completion either to the point where they call schedule() or when they
return to user space. That's pretty much what PREEMPT_NONE does today.
The difference to NONE/VOLUNTARY is that the explicit cond_resched()
points are no longer required because the scheduler can preempt the
long running task by setting NEED_RESCHED instead.
That preemption might be suboptimal in some cases compared to
cond_resched(), but from my initial experimentation that's not really an
issue.
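A minimal userspace sketch of the lazy mechanism described above, not kernel code. The `SKETCH_*` flag values and the `sketch_*` helpers are hypothetical names made up for illustration, and the sketch assumes that tasks outside SCHED_OTHER get NEED_RESCHED directly:

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's TIF_* values. */
enum {
    SKETCH_NEED_RESCHED      = 1u << 0,
    SKETCH_NEED_RESCHED_LAZY = 1u << 1,
};

/*
 * Scheduler side: SCHED_OTHER tasks get the lazy bit. Assumption of this
 * sketch: every other scheduling class gets the immediate bit.
 */
static unsigned int sketch_request_resched(unsigned int flags, bool is_sched_other)
{
    return flags | (is_sched_other ? SKETCH_NEED_RESCHED_LAZY
                                   : SKETCH_NEED_RESCHED);
}

/*
 * Preemption points in preempt_enable() and return from interrupt to
 * kernel: only the immediate bit is acted upon.
 */
static bool sketch_preempt_in_kernel(unsigned int flags)
{
    return (flags & SKETCH_NEED_RESCHED) != 0;
}

/* Return to user space: either bit leads to a reschedule. */
static bool sketch_resched_on_user_return(unsigned int flags)
{
    return (flags & (SKETCH_NEED_RESCHED | SKETCH_NEED_RESCHED_LAZY)) != 0;
}
```

In this model a task with only the lazy bit set runs on through the in-kernel preemption points and is rescheduled when it returns to user space, which mirrors the PREEMPT_NONE-like behaviour described above.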
> With that:
>
> 1. As an optimization, given that preempt_count() would always give
> good information, the scheduling-clock interrupt could sense RCU
> readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
> IPI handlers for expedited grace periods. A nice optimization.
> Except that...
>
> 2. The quiescent-state-forcing code currently relies on the presence
> of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix
> would be to do resched_cpu() more quickly, but some workloads
> might not love the additional IPIs. Another approach is to do #1
> above to replace the quiescent states from cond_resched() with
> scheduler-tick-interrupt-sensed quiescent states.
Right. The tick can see either the lazy resched bit "ignored" or some
magic "RCU needs a quiescent state" and force a reschedule.
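A rough userspace sketch of that tick-side escalation, under stated assumptions. `struct tick_sketch_task` and `tick_sketch_escalate()` are hypothetical names, and the `rcu_needs_qs` argument stands in for whatever signal RCU would actually provide:

```c
#include <stdbool.h>

/* Hypothetical per-task state for this sketch only. */
struct tick_sketch_task {
    bool need_resched;
    bool need_resched_lazy;
};

/*
 * Called from a simulated tick. If the lazy request is still pending
 * (it has been "ignored" so far) or RCU asks for a quiescent state,
 * upgrade to an immediate reschedule request.
 * Returns true if this call escalated the request.
 */
static bool tick_sketch_escalate(struct tick_sketch_task *t, bool rcu_needs_qs)
{
    if (t->need_resched)
        return false;   /* already requested, nothing to do */

    if (t->need_resched_lazy || rcu_needs_qs) {
        t->need_resched = true;
        return true;
    }
    return false;
}
```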
> Plus...
>
> 3. For nohz_full CPUs that run for a long time in the kernel,
> there are no scheduling-clock interrupts. RCU reaches for
> the resched_cpu() hammer a few jiffies into the grace period.
> And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> interrupt-entry code will re-enable its scheduling-clock interrupt
> upon receiving the resched_cpu() IPI.
You can spare the IPI by setting NEED_RESCHED on the remote CPU which
will cause it to preempt.
> So nohz_full CPUs should be OK as far as RCU is concerned.
> Other subsystems might have other opinions.
>
> 4. As another optimization, kvfree_rcu() could unconditionally
> check preempt_count() to sense a clean environment suitable for
> memory allocation.
Correct. All the limitations of preempt count being useless are gone.
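To illustrate why an always-maintained preempt count helps, here is a small userspace model, not the kernel implementation. All `pc_sketch_*` names are hypothetical, and the model deliberately ignores interrupt state and the other bits the real counter carries:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU preempt counter. */
static int pc_sketch_count;

/* Every disable/enable pair is tracked, so nesting depth is always known. */
static void pc_sketch_disable(void) { pc_sketch_count++; }
static void pc_sketch_enable(void)  { pc_sketch_count--; }

/*
 * In this model a count of zero means a "clean" context in which a
 * caller could, for example, attempt a memory allocation.
 */
static bool pc_sketch_may_allocate(void)
{
    return pc_sketch_count == 0;
}
```

When the counter is not maintained, as in configurations where preempt_disable() is only a compiler barrier, a check like this cannot distinguish the two contexts.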
> 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must
> instead say "select TASKS_RCU". This means that the #else
> in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> vanilla RCU must go. There might be some fallout if something
> fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> rcu_tasks_classic_qs() to do something useful.
In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
remaining would be CONFIG_PREEMPT_RT, which should be renamed to
CONFIG_RT or such as it does not really change the preemption
model itself. RT just reduces the preemption disabled sections with the
lock conversions, forced interrupt threading and some more.
> 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace
> or RCU Tasks Rude) would need those pesky cond_resched() calls
> to stick around. The reason is that RCU Tasks readers are ended
> only by voluntary context switches. This means that although a
> preemptible infinite loop in the kernel won't inconvenience a
> real-time task (nor a non-real-time task for all that long),
> and won't delay grace periods for the other flavors of RCU,
> it would indefinitely delay an RCU Tasks grace period.
>
> However, RCU Tasks grace periods seem to be finite in preemptible
> kernels today, so they should remain finite in limited-preemptible
> kernels tomorrow. Famous last words...
That's an issue which you have today with preempt FULL, right? So if it
turns out to be a problem then it's not a problem of the new model.
> 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> any algorithmic difference from this change.
>
> 8. As has been noted elsewhere, in this new limited-preemption
> mode of operation, rcu_read_lock() readers remain preemptible.
> This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no?
> 9. The rcu_preempt_depth() macro could do something useful in
> limited-preemption kernels. Its current lack of ability in
> CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
Correct.
> 10. The cond_resched_rcu() function must remain because we still
> have non-preemptible rcu_read_lock() readers.
Where?
> 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> unchanged, but I must defer to the include/net/ip_vs.h people.
*blink*
> 12. I need to check with the BPF folks on the BPF verifier's
> definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> function might have some redundancy across the board instead
> of just on CONFIG_PREEMPT_RCU=y. Or might not.
>
> 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> might need to do something for non-preemptible RCU to make
> up for the lack of cond_resched() calls. Maybe just drop the
> "IS_ENABLED()" and execute the body of the current "if" statement
> unconditionally.
Again. There is no non-preemptible RCU with this model, unless I'm
missing something important here.
> 15. I must defer to others on the mm/pgtable-generic.c file's
> #ifdef that depends on CONFIG_PREEMPT_RCU.
All those ifdefs should die :)
> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.
Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO.
> I am sure that I am missing something, but I have not yet seen any
> show-stoppers. Just some needed adjustments.
Right. If it works out as I think it can, the main adjustments are to
remove a large amount of #ifdef maze and related gunk :)
Thanks,
tglx