[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4D9C86E6-C4AC-42A6-AB51-5D90FD4FD95D@oracle.com>
Date: Thu, 7 Aug 2025 16:13:37 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"rostedt@...dmis.org"
<rostedt@...dmis.org>,
"mathieu.desnoyers@...icios.com"
<mathieu.desnoyers@...icios.com>,
"bigeasy@...utronix.de"
<bigeasy@...utronix.de>,
"kprateek.nayak@....com" <kprateek.nayak@....com>,
"vineethr@...ux.ibm.com" <vineethr@...ux.ibm.com>
Subject: Re: [PATCH V7 01/11] sched: Scheduler time slice extension
> On Aug 6, 2025, at 1:34 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> @@ -304,7 +304,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
>> * exit_to_user_mode_loop - do any pending work before leaving to user space
>> */
>> unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> - unsigned long ti_work);
>> + unsigned long ti_work, bool irq);
>
> I know the kernel-doc already lacks the description for the existing
> arguments, but adding more undocumented ones is not the right thing
> either.
>
> Also please name this argument 'from_irq' to make it clear what this is
> about.
Ok, will change it.
>
>> /**
>> * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>> @@ -316,7 +316,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> * EXIT_TO_USER_MODE_WORK are set
>> * 4) check that interrupts are still disabled
>> */
>> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)
>
> New argument not documented in kernel-doc.
Will add necessary documentation.
>
>> {
>> unsigned long ti_work;
>>
>> @@ -327,7 +327,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>>
>> ti_work = read_thread_flags();
>> if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> - ti_work = exit_to_user_mode_loop(regs, ti_work);
>> + ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
>> +
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && irq)
>> + rseq_delay_resched_arm_timer();
>
> This is still an unconditional function call which is a NOOP for
> everyone who does not use this. It's not that hard to inline the
> check. How often do I have to explain that?
Will fix.
>
>> arch_exit_to_user_mode_prepare(regs, ti_work);
>>
>> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>>
>> CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>>
>> + /* Reschedule if scheduler time delay was granted */
>
> This is not rescheduling. It sets NEED_RESCHED, which is a completely
> different thing.
>
>> + rseq_delay_set_need_resched();
>
> I fundamentally hate this hack as it goes out to user space with
> NEED_RESCHED set and absolutely zero debug mechanism which validates
> it. Currently going out with NEED_RESCHED set is a plain bug, rigthfully
> so.
>
> But now this muck comes along and sets the flag, which is semantically
> just wrong and ill defined.
>
> The point is that NEED_RESCHED has been cleared by requesting and
> granting the extension, which means the task can go out to userspace,
> until it either relinquishes the CPU or hrtick() whacks it over the
> head.
>
> And your muck requires this insane hack with sched_yield():
>
>> SYSCALL_DEFINE0(sched_yield)
>> {
>> + /* Reschedule if scheduler time delay was granted */
>> + if (rseq_delay_set_need_resched())
>> + return 0;
>> +
>> do_sched_yield();
>> return 0;
>> }
>
> That's just completely wrong. Relinquishing the CPU should be possible
> by any arbitrary syscall and not require to make sched_yield() more
> ill-defined as it is already.
>
> The obvious way to solve both issues is to clear NEED_RESCHED when
> the delay is granted and then do in syscall_enter_from_user_mode_work()
>
> rseq_delay_sys_enter()
> {
> if (unlikely(current->rseq_delay_resched == GRANTED)) {
> set_tsk_need_resched(current);
> schedule();
> }
> }
>
> No?
>
> It's debatable whether the schedule() there is necessary. Removing it
> would allow the task to either complete the syscall and reschedule on
> exit to user space or go to sleep in the syscall. But that's a trivial
> detail.
>
> The important point is that the NEED_RESCHED semantics stay sane and the
> problem is solved right on the next syscall entry.
>
> This delay is not for extending CPU time accross syscalls, it's solely
> to allow user space to complete a _user space_ critical
> section. Everything else is just wrong and we don't implement it as an
> invitation for abuse.
>
> For the record: I used GRANTED on purpose, because REQUESTED is
> bogus. At the point where __rseq_delay_resched() is invoked _AND_
> observes the user space request, it grants the delay, no?
>
> This random nomenclature is just making this stuff annoyingly hard to
> follow.
>
Ok I can move the check to relinquish cpu in syscall_enter_from_user_mode_work()
instead of in syscall_exit_to_user_mode_work().
>> +static inline bool rseq_delay_resched(unsigned long ti_work)
>> +{
>> + if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + return false;
>> +
>> + if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
>> + return false;
>
> Why unlikely? The majority of applications do not use this.
WIll change to likely().
>
>> +
>> + if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
>> + return false;
>
> The caller already established that one of these flags is set, no?
That is right, will delete this check here.
>
>> + if (__rseq_delay_resched()) {
>> + clear_tsk_need_resched(current);
>
> Why has this to be inline and is not done in __rseq_delay_resched()?
Sure, it could be in __rseq_delay_resched().
>
>> + return true;
>> + }
>> + return false;
>
>> /**
>> * exit_to_user_mode_loop - do any pending work before leaving to user space
>> * @regs: Pointer to pt_regs on entry stack
>> * @ti_work: TIF work flags as read by the caller
>> */
>> __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> - unsigned long ti_work)
>> + unsigned long ti_work, bool irq)
>> {
>
> Same comments as above.
>
>> +
>> +void rseq_delay_resched_tick(void)
>> +{
>> + if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + set_tsk_need_resched(current);
>
> Small enough to inline into hrtick() with a IS_ENABLED() guard, no?
I can move it to hrtick() and delete the rseq_delay_resched_tick() routine.
>
>> +}
>> +#endif /* CONFIG_RSEQ_RESCHED_DELAY */
>> +
>> #ifdef CONFIG_DEBUG_RSEQ
>>
>> /*
>> @@ -493,6 +527,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = NULL;
>> current->rseq_sig = 0;
>> current->rseq_len = 0;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_NONE;
>
> What's that conditional for?
>
> t->rseq_delay_resched is unconditionally available. Your choice of
> optimizing the irrelevant places is amazing.
Will fix.
>
>> return 0;
>> }
>>
>> @@ -561,6 +597,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = rseq;
>> current->rseq_len = rseq_len;
>> current->rseq_sig = sig;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>
> Why is this done unconditionally for rseq?
>
> So that any rseq user needs to do a function call and a copy_from_user()
> just for nothing?
>
> A task, which needs this muck, can very well opt-in for this and leave
> everybody else unaffected, no?
Sure, that seems reasonable.
>
> prctl() exists for a reason and that allows even filtering out the
> request to enable it if the sysadmin sets up filters accordingly.
>
> As code which wants to utilize this has to be modified anyway, adding
> the prctl() is not a unreasonable requirement.
>
>> clear_preempt_need_resched();
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
>> + prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>
> Yet another code conditional for no reason. These are two bits and you can
> use them smart:
>
> #define ENABLED 1
> #define GRANTED 3
>
> So you can just go and do
>
> prev->rseq_delay_resched &= RSEQ_RESCHED_DELAY_ENABLED;
>
> which clears the GRANTED bit without a conditional and that's correct
> whether the ENABLED bit was set or not.
>
> In the syscall exit path you then do:
>
> static inline bool rseq_delay_resched(void)
> {
> if (prev->rseq_delay_resched != ENABLED)
> return false;
> return __rseq_delay_resched();
> }
>
> and __rseq_delay_resched() does:
>
> rseq_delay_resched = GRANTED;
>
> No?
That is a nice. I can add a prctl() call to enable & disable this functionality
which would help avoid unnecessary copy_from_user() calls.
In addition to registering the ‘rseq’ struct the application that needs the functionality
will have to make the prctl() call to enable it, which I think should be reasonable.
Thanks,
-Prakash.
>
> Thanks,
>
> tglx
Powered by blists - more mailing lists