Message-ID: <87il74qghh.fsf@oracle.com>
Date:   Wed, 18 Oct 2023 05:09:46 -0700
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     paulmck@...nel.org
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, rostedt@...dmis.org,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com,
        Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED


Paul E. McKenney <paulmck@...nel.org> writes:

> On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
>> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
>> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> >> That said - I think as a proof of concept and "look, with this we get
>> >> the expected scheduling event counts", that patch is perfect. I think
>> >> you more than proved the concept.
>> >
>> > There is certainly quite some analysis work to do to make this a
>> > one-to-one replacement.
>> >
>> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
>> > is pretty much on par with the current mainline variants (NONE/FULL),
>> > but the memtier benchmark makes a massive dent.
>> >
>> > It sports a whopping 10% regression with the LAZY mode versus the mainline
>> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>> >
>> > That benchmark is really sensitive to the preemption model. With current
>> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
>> > performance drop versus preempt=NONE.
>>
>> That 20% was a tired pilot error. The real number is in the 5% ballpark.
>>
>> > I have no clue what's going on there yet, but that shows that there is
>> > obviously quite some work ahead to get this sorted.
>>
>> It took some head scratching to figure that out. The initial fix broke
>> the handling of the hog issue, i.e. the problem that Ankur tried to
>> solve, but I hacked up a "solution" for that too.
>>
>> With that the memtier benchmark is roughly back to the mainline numbers,
>> but my throughput benchmark know-how is pretty close to zero, so that
>> should be looked at by people who actually understand these things.
>>
>> Likewise the hog prevention is just at the PoC level and clearly beyond
>> my knowledge of scheduler details: It unconditionally forces a
>> reschedule when the looping task is not responding to a lazy reschedule
>> request before the next tick. IOW it forces a reschedule on the second
>> tick, which is obviously different from the cond_resched()/might_sleep()
>> behaviour.
>>
>> The changes vs. the original PoC aside of the bug and thinko fixes:
>>
>>     1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
>>        lazy preempt bit as the trace_entry::flags field is full already.
>>
>>        That obviously breaks the tracer ABI, but if we go there then
>>        this needs to be fixed. Steven?
>>
>>     2) debugfs file to validate that loops can be force preempted w/o
>>        cond_resched()
>>
>>        The usage is:
>>
>>        # taskset -c 1 bash
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>        # echo 1 > /sys/kernel/debug/sched/hog &
>>
>>        top shows ~33% CPU for each of the hogs and tracing confirms that
>>        the crude hack in the scheduler tick works:
>>
>>             bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
>>             bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
>>             bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
>>             bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
>>             bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
>>             bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr
>>
>>        The 'l' instead of the usual 'N' reflects that the lazy resched
>>        bit is set. That makes __update_curr() invoke resched_curr()
>>        instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
>>        and folds it into preempt_count so that preemption happens at the
>>        next possible point, i.e. either in return from interrupt or at
>>        the next preempt_enable().
>
> Belatedly calling out some RCU issues.  Nothing fatal, just a
> (surprisingly) few adjustments that will need to be made.  The key thing
> to note is that from RCU's viewpoint, with this change, all kernels
> are preemptible, though rcu_read_lock() readers remain non-preemptible.

Yeah, in Thomas' patch CONFIG_PREEMPTION=y, and the preemption models
none/voluntary/full are just scheduler tweaks on top of that. So this
would always have PREEMPT_RCU=y. Shouldn't rcu_read_lock() readers
therefore be preemptible?

(An alternate configuration might be:
   config PREEMPT_NONE
      select PREEMPT_COUNT

    config PREEMPT_FULL
      select PREEMPTION

 This would probably allow more configuration flexibility across archs,
 e.g. TREE_RCU=y. That said, so far I've only been working with
 PREEMPT_RCU=y.)
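
Or, spelling that sketch out with the RCU pairing I have in mind
(purely illustrative, I haven't tried to build this):

   config PREEMPT_NONE
      select PREEMPT_COUNT	# dependable preempt_count(); RCU stays
				# non-preemptible, i.e. TREE_RCU=y

   config PREEMPT_FULL
      select PREEMPTION		# and with that, PREEMPT_RCU=y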

> With that:
>
> 1.	As an optimization, given that preempt_count() would always give
> 	good information, the scheduling-clock interrupt could sense RCU
> 	readers for new-age CONFIG_PREEMPT_NONE=y kernels.  As might the
> 	IPI handlers for expedited grace periods.  A nice optimization.
> 	Except that...
>
> 2.	The quiescent-state-forcing code currently relies on the presence
> 	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels.  One fix
> 	would be to do resched_cpu() more quickly, but some workloads
> 	might not love the additional IPIs.  Another approach is to do #1
> 	above to replace the quiescent states from cond_resched() with
> 	scheduler-tick-interrupt-sensed quiescent states.

Right, the call to rcu_all_qs(). Just to see if I have it straight,
something like this for PREEMPT_RCU=n kernels?

   if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0)
           rcu_all_qs();

(Masked because PREEMPT_NONE might not do any folding for
NEED_RESCHED_LAZY in the tick.)

Though the comment around rcu_all_qs() mentions that it reports a
quiescent state only if one is urgently needed. Given that the tick
executes less frequently than calls to cond_resched(), could we just
always report instead? Or am I completely on the wrong track?

   if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) {
             preempt_disable();
             rcu_qs();
             preempt_enable();
   }

On your point about preempt_count() being dependable, there's a
wrinkle. As Linus mentions in
https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/,
that might not be true for architectures that define ARCH_NO_PREEMPT.

My plan was to limit those archs to preempting only at the user-space
boundary, but there are almost certainly RCU implications that I've
missed.

> 	Plus...
>
> 3.	For nohz_full CPUs that run for a long time in the kernel,
> 	there are no scheduling-clock interrupts.  RCU reaches for
> 	the resched_cpu() hammer a few jiffies into the grace period.
> 	And it sets the ->rcu_urgent_qs flag so that the holdout CPU's
> 	interrupt-entry code will re-enable its scheduling-clock interrupt
> 	upon receiving the resched_cpu() IPI.
>
> 	So nohz_full CPUs should be OK as far as RCU is concerned.
> 	Other subsystems might have other opinions.

Ah, that's what I thought from my reading of the RCU comments. Good to
have that confirmed. Thanks.

> 4.	As another optimization, kvfree_rcu() could unconditionally
> 	check preempt_count() to sense a clean environment suitable for
> 	memory allocation.

Had missed this completely. Could you elaborate?

> 5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must
> 	instead say "select TASKS_RCU".  This means that the #else
> 	in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> 	vanilla RCU must go.  There might be some fallout if something
> 	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> 	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> 	rcu_tasks_classic_qs() to do something useful.

Ack.
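
To make sure I follow: each such user would then just select TASKS_RCU
unconditionally. With FOO standing in for whichever Kconfig option
carries the conditional select today (I haven't gone through the list
yet), that's:

   config FOO			# FOO: a stand-in, not a real option
   -	select TASKS_RCU if PREEMPTION
   +	select TASKS_RCU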

> 6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace
> 	or RCU Tasks Rude) would need those pesky cond_resched() calls
> 	to stick around.  The reason is that RCU Tasks readers are ended
> 	only by voluntary context switches.  This means that although a
> 	preemptible infinite loop in the kernel won't inconvenience a
> 	real-time task (nor a non-real-time task for all that long),
> 	and won't delay grace periods for the other flavors of RCU,
> 	it would indefinitely delay an RCU Tasks grace period.
>
> 	However, RCU Tasks grace periods seem to be finite in preemptible
> 	kernels today, so they should remain finite in limited-preemptible
> 	kernels tomorrow.  Famous last words...
>
> 7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> 	any algorithmic difference from this change.

So, essentially, as long as RCU tasks eventually, in the fullness of
time, call schedule(), removing cond_resched() shouldn't have any
effect :).

> 8.	As has been noted elsewhere, in this new limited-preemption
> 	mode of operation, rcu_read_lock() readers remain preemptible.
> 	This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

Ack.

> 9.	The rcu_preempt_depth() macro could do something useful in
> 	limited-preemption kernels.  Its current lack of ability in
> 	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.
>
> 10.	The cond_resched_rcu() function must remain because we still
> 	have non-preemptible rcu_read_lock() readers.

For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need
only be this, right?:

   static inline void cond_resched_rcu(void)
   {
   #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
           rcu_read_unlock();

           rcu_read_lock();
   #endif
   }
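
Usage in callers would stay as it is today: a long RCU-protected loop
that can tolerate the read-side section being broken up, roughly like
the below (scan_more()/do_scan() are made up for illustration):

   rcu_read_lock();
   while (scan_more()) {
           do_scan();
           cond_resched_rcu();     /* may briefly drop rcu_read_lock() */
   }
   rcu_read_unlock();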

> 11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
> 	unchanged, but I must defer to the include/net/ip_vs.h people.
>
> 12.	I need to check with the BPF folks on the BPF verifier's
> 	definition of BTF_ID(func, rcu_read_unlock_strict).
>
> 13.	The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner()
> 	function might have some redundancy across the board instead
> 	of just on CONFIG_PREEMPT_RCU=y.  Or might not.

I don't think I understand any of these well enough to comment. Will
Cc the relevant folks when I send out the RFC.

> 14.	The kernel/trace/trace_osnoise.c file's run_osnoise() function
> 	might need to do something for non-preemptible RCU to make
> 	up for the lack of cond_resched() calls.  Maybe just drop the
> 	"IS_ENABLED()" and execute the body of the current "if" statement
> 	unconditionally.

Aah, yes, this is a good idea. Thanks.

> 15.	I must defer to others on the mm/pgtable-generic.c file's
> 	#ifdef that depends on CONFIG_PREEMPT_RCU.
>
> While in the area, I noted that KLP seems to depend on cond_resched(),
> but on this I must defer to the KLP people.

Yeah, as part of this work, I ended up unhooking most of the KLP
hooks in cond_resched() and, of course, cond_resched() itself.
Will poke the livepatching people.

> I am sure that I am missing something, but I have not yet seen any
> show-stoppers.  Just some needed adjustments.

Appreciate this detailed list. Makes me think that everything might
not go up in smoke after all!

Thanks
Ankur

> Thoughts?
>
> 							Thanx, Paul
>
>> That's as much as I wanted to demonstrate and I'm not going to spend
>> more cycles on it as I already have too many other things in flight and
>> the resulting scheduler woes are clearly outside of my expertise.
>>
>> Though I'm definitely putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>>
>> Thanks,
>>
>>         tglx
>> ---
>>  arch/x86/Kconfig                   |    1
>>  arch/x86/include/asm/thread_info.h |    6 ++--
>>  drivers/acpi/processor_idle.c      |    2 -
>>  include/linux/entry-common.h       |    2 -
>>  include/linux/entry-kvm.h          |    2 -
>>  include/linux/sched.h              |   12 +++++---
>>  include/linux/sched/idle.h         |    8 ++---
>>  include/linux/thread_info.h        |   24 +++++++++++++++++
>>  include/linux/trace_events.h       |    8 ++---
>>  kernel/Kconfig.preempt             |   17 +++++++++++-
>>  kernel/entry/common.c              |    4 +-
>>  kernel/entry/kvm.c                 |    2 -
>>  kernel/sched/core.c                |   51 +++++++++++++++++++++++++------------
>>  kernel/sched/debug.c               |   19 +++++++++++++
>>  kernel/sched/fair.c                |   46 ++++++++++++++++++++++-----------
>>  kernel/sched/features.h            |    2 +
>>  kernel/sched/idle.c                |    3 --
>>  kernel/sched/sched.h               |    1
>>  kernel/trace/trace.c               |    2 +
>>  kernel/trace/trace_output.c        |   16 ++++++++++-
>>  20 files changed, 171 insertions(+), 57 deletions(-)
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct
>>
>>  #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
>>  /*
>> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
>> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
>>   * this avoids any races wrt polling state changes and thereby avoids
>>   * spurious IPIs.
>>   */
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>>  {
>>  	struct thread_info *ti = task_thread_info(p);
>> -	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
>> +
>> +	return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG);
>>  }
>>
>>  /*
>> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas
>>  	for (;;) {
>>  		if (!(val & _TIF_POLLING_NRFLAG))
>>  			return false;
>> -		if (val & _TIF_NEED_RESCHED)
>> +		if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			return true;
>>  		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
>>  			break;
>> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas
>>  }
>>
>>  #else
>> -static inline bool set_nr_and_not_polling(struct task_struct *p)
>> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit)
>>  {
>> -	set_tsk_need_resched(p);
>> +	set_tsk_thread_flag(p, tif_bit);
>>  	return true;
>>  }
>>
>> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head)
>>   * might also involve a cross-CPU call to trigger the scheduler on
>>   * the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +static void __resched_curr(struct rq *rq, int lazy)
>>  {
>> +	int cpu, tif_bit = TIF_NEED_RESCHED + lazy;
>>  	struct task_struct *curr = rq->curr;
>> -	int cpu;
>>
>>  	lockdep_assert_rq_held(rq);
>>
>> -	if (test_tsk_need_resched(curr))
>> +	if (unlikely(test_tsk_thread_flag(curr, tif_bit)))
>>  		return;
>>
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr);
>> -		set_preempt_need_resched();
>> +		set_tsk_thread_flag(curr, tif_bit);
>> +		if (!lazy)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, tif_bit)) {
>> +		if (!lazy)
>> +			smp_send_reschedule(cpu);
>> +	} else {
>>  		trace_sched_wake_idle_without_ipi(cpu);
>> +	}
>> +}
>> +
>> +void resched_curr(struct rq *rq)
>> +{
>> +	__resched_curr(rq, 0);
>> +}
>> +
>> +void resched_curr_lazy(struct rq *rq)
>> +{
>> +	int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ?
>> +		TIF_NEED_RESCHED_LAZY_OFFSET : 0;
>> +
>> +	if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED)))
>> +		return;
>> +
>> +	__resched_curr(rq, lazy);
>>  }
>>
>>  void resched_cpu(int cpu)
>> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu)
>>  	if (cpu == smp_processor_id())
>>  		return;
>>
>> -	if (set_nr_and_not_polling(rq->idle))
>> +	if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED))
>>  		smp_send_reschedule(cpu);
>>  	else
>>  		trace_sched_wake_idle_without_ipi(cpu);
>> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init(
>>  		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>>  		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
>>  	}									 \
>> -	EXPORT_SYMBOL_GPL(preempt_model_##mode)
>>
>>  PREEMPT_MODEL_ACCESSOR(none);
>>  PREEMPT_MODEL_ACCESSOR(voluntary);
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -59,6 +59,16 @@ enum syscall_work_bit {
>>
>>  #include <asm/thread_info.h>
>>
>> +#ifdef CONFIG_PREEMPT_AUTO
>> +# define TIF_NEED_RESCHED_LAZY		TIF_ARCH_RESCHED_LAZY
>> +# define _TIF_NEED_RESCHED_LAZY		_TIF_ARCH_RESCHED_LAZY
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET	(TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
>> +#else
>> +# define TIF_NEED_RESCHED_LAZY		TIF_NEED_RESCHED
>> +# define _TIF_NEED_RESCHED_LAZY		_TIF_NEED_RESCHED
>> +# define TIF_NEED_RESCHED_LAZY_OFFSET	0
>> +#endif
>> +
>>  #ifdef __KERNEL__
>>
>>  #ifndef arch_set_restart_data
>> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res
>>  			     (unsigned long *)(&current_thread_info()->flags));
>>  }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> +		arch_test_bit(TIF_NEED_RESCHED_LAZY,
>> +			      (unsigned long *)(&current_thread_info()->flags));
>> +}
>> +
>>  #else
>>
>>  static __always_inline bool tif_need_resched(void)
>> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res
>>  			(unsigned long *)(&current_thread_info()->flags));
>>  }
>>
>> +static __always_inline bool tif_need_resched_lazy(void)
>> +{
>> +	return IS_ENABLED(CONFIG_PREEMPT_AUTO) &&
>> +		test_bit(TIF_NEED_RESCHED_LAZY,
>> +			 (unsigned long *)(&current_thread_info()->flags));
>> +}
>> +
>>  #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>>
>>  #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -11,6 +11,13 @@ config PREEMPT_BUILD
>>  	select PREEMPTION
>>  	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>>
>> +config PREEMPT_BUILD_AUTO
>> +	bool
>> +	select PREEMPT_BUILD
>> +
>> +config HAVE_PREEMPT_AUTO
>> +	bool
>> +
>>  choice
>>  	prompt "Preemption Model"
>>  	default PREEMPT_NONE
>> @@ -67,9 +74,17 @@ config PREEMPT
>>  	  embedded system with latency requirements in the milliseconds
>>  	  range.
>>
>> +config PREEMPT_AUTO
>> +	bool "Automagic preemption mode with runtime tweaking support"
>> +	depends on HAVE_PREEMPT_AUTO
>> +	select PREEMPT_BUILD_AUTO
>> +	help
>> +	  Add some sensible blurb here
>> +
>>  config PREEMPT_RT
>>  	bool "Fully Preemptible Kernel (Real-Time)"
>>  	depends on EXPERT && ARCH_SUPPORTS_RT
>> +	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
>>  	select PREEMPTION
>>  	help
>>  	  This option turns the kernel into a real-time kernel by replacing
>> @@ -95,7 +110,7 @@ config PREEMPTION
>>
>>  config PREEMPT_DYNAMIC
>>  	bool "Preemption behaviour defined on boot"
>> -	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
>> +	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
>>  	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
>>  	select PREEMPT_BUILD
>>  	default y if HAVE_PREEMPT_DYNAMIC_CALL
>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -60,7 +60,7 @@
>>  #define EXIT_TO_USER_MODE_WORK						\
>>  	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>>  	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |	\
>> -	 ARCH_EXIT_TO_USER_MODE_WORK)
>> +	 _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
>>
>>  /**
>>   * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>>  #define XFER_TO_GUEST_MODE_WORK						\
>>  	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
>> -	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> +	 _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>>
>>  struct kvm_vcpu;
>>
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l
>>
>>  		local_irq_enable_exit_to_user(ti_work);
>>
>> -		if (ti_work & _TIF_NEED_RESCHED)
>> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			schedule();
>>
>>  		if (ti_work & _TIF_UPROBE)
>> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void
>>  		rcu_irq_exit_check_preempt();
>>  		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>>  			WARN_ON_ONCE(!on_thread_stack());
>> -		if (need_resched())
>> +		if (test_tsk_need_resched(current))
>>  			preempt_schedule_irq();
>>  	}
>>  }
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
>>  SCHED_FEAT(LATENCY_WARN, false)
>>
>>  SCHED_FEAT(HZ_BW, true)
>> +
>> +SCHED_FEAT(FORCE_NEED_RESCHED, false)
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void);
>>  extern void reweight_task(struct task_struct *p, int prio);
>>
>>  extern void resched_curr(struct rq *rq);
>> +extern void resched_curr_lazy(struct rq *rq);
>>  extern void resched_cpu(int cpu);
>>
>>  extern struct rt_bandwidth def_rt_bandwidth;
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla
>>  	update_ti_thread_flag(task_thread_info(tsk), flag, value);
>>  }
>>
>> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>>
>> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>>
>> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
>> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>>  {
>>  	return test_ti_thread_flag(task_thread_info(tsk), flag);
>>  }
>> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched(
>>  static inline void clear_tsk_need_resched(struct task_struct *tsk)
>>  {
>>  	clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> +	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> +		clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY);
>>  }
>>
>> -static inline int test_tsk_need_resched(struct task_struct *tsk)
>> +static inline bool test_tsk_need_resched(struct task_struct *tsk)
>>  {
>>  	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>>  }
>> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc
>>
>>  static __always_inline bool need_resched(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(tif_need_resched_lazy() || tif_need_resched());
>>  }
>>
>>  /*
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq
>>   * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>>   * this is probably good enough.
>>   */
>> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick)
>>  {
>> +	struct rq *rq = rq_of(cfs_rq);
>> +
>>  	if ((s64)(se->vruntime - se->deadline) < 0)
>>  		return;
>>
>> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r
>>  	/*
>>  	 * The task has consumed its request, reschedule.
>>  	 */
>> -	if (cfs_rq->nr_running > 1) {
>> -		resched_curr(rq_of(cfs_rq));
>> -		clear_buddies(cfs_rq, se);
>> +	if (cfs_rq->nr_running < 2)
>> +		return;
>> +
>> +	if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) {
>> +		resched_curr(rq);
>> +	} else {
>> +		/* Did the task ignore the lazy reschedule request? */
>> +		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
>> +			resched_curr(rq);
>> +		else
>> +			resched_curr_lazy(rq);
>>  	}
>> +	clear_buddies(cfs_rq, se);
>>  }
>>
>>  #include "pelt.h"
>> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf
>>  /*
>>   * Update the current task's runtime statistics.
>>   */
>> -static void update_curr(struct cfs_rq *cfs_rq)
>> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick)
>>  {
>>  	struct sched_entity *curr = cfs_rq->curr;
>>  	u64 now = rq_clock_task(rq_of(cfs_rq));
>> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c
>>  	schedstat_add(cfs_rq->exec_clock, delta_exec);
>>
>>  	curr->vruntime += calc_delta_fair(delta_exec, curr);
>> -	update_deadline(cfs_rq, curr);
>> +	update_deadline(cfs_rq, curr, tick);
>>  	update_min_vruntime(cfs_rq);
>>
>>  	if (entity_is_task(curr)) {
>> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c
>>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
>>
>> +static inline void update_curr(struct cfs_rq *cfs_rq)
>> +{
>> +	__update_curr(cfs_rq, false);
>> +}
>> +
>>  static void update_curr_fair(struct rq *rq)
>>  {
>>  	update_curr(cfs_rq_of(&rq->curr->se));
>> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>>  	/*
>>  	 * Update run-time statistics of the 'current'.
>>  	 */
>> -	update_curr(cfs_rq);
>> +	__update_curr(cfs_rq, true);
>>
>>  	/*
>>  	 * Ensure that runnable average is periodically updated.
>> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>>  	 * validating it and just reschedule.
>>  	 */
>>  	if (queued) {
>> -		resched_curr(rq_of(cfs_rq));
>> +		resched_curr_lazy(rq_of(cfs_rq));
>>  		return;
>>  	}
>>  	/*
>> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str
>>  	 * hierarchy can be throttled
>>  	 */
>>  	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
>> -		resched_curr(rq_of(cfs_rq));
>> +		resched_curr_lazy(rq_of(cfs_rq));
>>  }
>>
>>  static __always_inline
>> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>>
>>  	/* Determine whether we need to wake up potentially idle CPU: */
>>  	if (rq->curr == rq->idle && rq->cfs.nr_running)
>> -		resched_curr(rq);
>> +		resched_curr_lazy(rq);
>>  }
>>
>>  #ifdef CONFIG_SMP
>> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq
>>
>>  		if (delta < 0) {
>>  			if (task_current(rq, p))
>> -				resched_curr(rq);
>> +				resched_curr_lazy(rq);
>>  			return;
>>  		}
>>  		hrtick_start(rq, delta);
>> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct
>>  	 * prevents us from potentially nominating it as a false LAST_BUDDY
>>  	 * below.
>>  	 */
>> -	if (test_tsk_need_resched(curr))
>> +	if (need_resched())
>>  		return;
>>
>>  	/* Idle tasks are by definition preempted by non-idle tasks. */
>> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct
>>  	return;
>>
>>  preempt:
>> -	resched_curr(rq);
>> +	resched_curr_lazy(rq);
>>  }
>>
>>  #ifdef CONFIG_SMP
>> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct
>>  	 */
>>  	if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 &&
>>  	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>> -		resched_curr(rq);
>> +		resched_curr_lazy(rq);
>>  }
>>
>>  /*
>> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct
>>  	 */
>>  	if (task_current(rq, p)) {
>>  		if (p->prio > oldprio)
>> -			resched_curr(rq);
>> +			resched_curr_lazy(rq);
>>  	} else
>>  		check_preempt_curr(rq, p, 0);
>>  }
>> --- a/drivers/acpi/processor_idle.c
>> +++ b/drivers/acpi/processor_idle.c
>> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces
>>   */
>>  static void __cpuidle acpi_safe_halt(void)
>>  {
>> -	if (!tif_need_resched()) {
>> +	if (!need_resched()) {
>>  		raw_safe_halt();
>>  		raw_local_irq_disable();
>>  	}
>> --- a/include/linux/sched/idle.h
>> +++ b/include/linux/sched/idle.h
>> @@ -63,7 +63,7 @@ static __always_inline bool __must_check
>>  	 */
>>  	smp_mb__after_atomic();
>>
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>
>>  static __always_inline bool __must_check current_clr_polling_and_test(void)
>> @@ -76,7 +76,7 @@ static __always_inline bool __must_check
>>  	 */
>>  	smp_mb__after_atomic();
>>
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>
>>  #else
>> @@ -85,11 +85,11 @@ static inline void __current_clr_polling
>>
>>  static inline bool __must_check current_set_polling_and_test(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>  static inline bool __must_check current_clr_polling_and_test(void)
>>  {
>> -	return unlikely(tif_need_resched());
>> +	return unlikely(need_resched());
>>  }
>>  #endif
>>
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p
>>  	ct_cpuidle_enter();
>>
>>  	raw_local_irq_enable();
>> -	while (!tif_need_resched() &&
>> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> +	while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired()))
>>  		cpu_relax();
>>  	raw_local_irq_disable();
>>
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>>  	if (tif_need_resched())
>>  		trace_flags |= TRACE_FLAG_NEED_RESCHED;
>> +	if (tif_need_resched_lazy())
>> +		trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
>>  	if (test_preempt_need_resched())
>>  		trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
>>  	return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -271,6 +271,7 @@ config X86
>>  	select HAVE_STATIC_CALL
>>  	select HAVE_STATIC_CALL_INLINE		if HAVE_OBJTOOL
>>  	select HAVE_PREEMPT_DYNAMIC_CALL
>> +	select HAVE_PREEMPT_AUTO
>>  	select HAVE_RSEQ
>>  	select HAVE_RUST			if X86_64
>>  	select HAVE_SYSCALL_TRACEPOINTS
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -81,8 +81,9 @@ struct thread_info {
>>  #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
>>  #define TIF_SIGPENDING		2	/* signal pending */
>>  #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
>> -#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
>> -#define TIF_SSBD		5	/* Speculative store bypass disable */
>> +#define TIF_ARCH_RESCHED_LAZY	4	/* Lazy rescheduling */
>> +#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
>> +#define TIF_SSBD		6	/* Speculative store bypass disable */
>>  #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
>>  #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
>>  #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
>> @@ -104,6 +105,7 @@ struct thread_info {
>>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
>>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>> +#define _TIF_ARCH_RESCHED_LAZY	(1 << TIF_ARCH_RESCHED_LAZY)
>>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
>>  #define _TIF_SSBD		(1 << TIF_SSBD)
>>  #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
>> --- a/kernel/entry/kvm.c
>> +++ b/kernel/entry/kvm.c
>> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc
>>  			return -EINTR;
>>  		}
>>
>> -		if (ti_work & _TIF_NEED_RESCHED)
>> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>>  			schedule();
>>
>>  		if (ti_work & _TIF_NOTIFY_RESUME)
>> --- a/include/linux/trace_events.h
>> +++ b/include/linux/trace_events.h
>> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un
>>
>>  enum trace_flag_type {
>>  	TRACE_FLAG_IRQS_OFF		= 0x01,
>> -	TRACE_FLAG_IRQS_NOSUPPORT	= 0x02,
>> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
>> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
>> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
>>  	TRACE_FLAG_HARDIRQ		= 0x08,
>>  	TRACE_FLAG_SOFTIRQ		= 0x10,
>>  	TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
>> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c
>>
>>  static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  static inline unsigned int tracing_gen_ctx(void)
>>  {
>> -	return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
>> +	return tracing_gen_ctx_irq_test(0);
>>  }
>>  #endif
>>
>> --- a/kernel/trace/trace_output.c
>> +++ b/kernel/trace/trace_output.c
>> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
>>  		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
>>  		bh_off ? 'b' :
>> -		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
>> +		!IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
>>  		'.';
>>
>> -	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
>> +	switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
>>  				TRACE_FLAG_PREEMPT_RESCHED)) {
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'B';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'N';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
>> +		need_resched = 'L';
>> +		break;
>> +	case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'b';
>> +		break;
>>  	case TRACE_FLAG_NEED_RESCHED:
>>  		need_resched = 'n';
>>  		break;
>> +	case TRACE_FLAG_NEED_RESCHED_LAZY:
>> +		need_resched = 'l';
>> +		break;
>>  	case TRACE_FLAG_PREEMPT_RESCHED:
>>  		need_resched = 'p';
>>  		break;
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -333,6 +333,23 @@ static const struct file_operations sche
>>  	.release	= seq_release,
>>  };
>>
>> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf,
>> +			       size_t cnt, loff_t *ppos)
>> +{
>> +	unsigned long end = jiffies + 60 * HZ;
>> +
>> +	for (; time_before(jiffies, end) && !signal_pending(current);)
>> +		cpu_relax();
>> +
>> +	return cnt;
>> +}
>> +
>> +static const struct file_operations sched_hog_fops = {
>> +	.write		= sched_hog_write,
>> +	.open		= simple_open,
>> +	.llseek		= default_llseek,
>> +};
>> +
>>  static struct dentry *debugfs_sched;
>>
>>  static __init int sched_init_debug(void)
>> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void)
>>
>>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>
>> +	debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops);
>> +
>>  	return 0;
>>  }
>>  late_initcall(sched_init_debug);
>>


--
ankur
