linux-kernel - Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers

Message-ID: <CANRm+CzsjNyd9-QjUupszpULNkJ31U+wPWC81A5jaTFRFdPfMg@mail.gmail.com>
Date: Thu, 13 Nov 2025 20:00:21 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>, 
	Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, 
	Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers

Hi Prateek,

On Wed, 12 Nov 2025 at 14:40, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * High-frequency yield gating to reduce overhead on compute-intensive workloads.
> > + * Returns true if the yield should be skipped due to frequency limits.
> > + *
> > + * Optimized: single threshold with READ_ONCE/WRITE_ONCE, refresh timestamp on every call.
> > + */
> > +static bool yield_deboost_rate_limit(struct rq *rq, u64 now_ns)
> > +{
> > +     u64 last = READ_ONCE(rq->yield_deboost_last_time_ns);
> > +     bool limited = false;
> > +
> > +     if (last) {
> > +             u64 delta = now_ns - last;
> > +             limited = (delta <= 6000ULL * NSEC_PER_USEC);
> > +     }
> > +
> > +     WRITE_ONCE(rq->yield_deboost_last_time_ns, now_ns);
>
> We only look at local rq so READ_ONCE()/WRITE_ONCE() seems
> unnecessary.

You're right. Since we're under rq->lock and only accessing the local
rq's fields, READ_ONCE()/WRITE_ONCE() provide no benefit here. Will
simplify to direct access.
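
Roughly, the simplified helper could look like the sketch below. This is
a userspace mock-up, not the actual next revision: struct rq is a
stand-in carrying only the one field, u64 is replaced by uint64_t, and
the caller is assumed to hold rq->lock.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Stand-in for the kernel's struct rq: only the field used here. */
struct rq {
	uint64_t yield_deboost_last_time_ns;
};

/*
 * Returns true if the yield should be skipped due to frequency limits.
 * Assumes the caller holds rq->lock and rq is the local runqueue, so
 * plain loads and stores are enough.
 */
static bool yield_deboost_rate_limit(struct rq *rq, uint64_t now_ns)
{
	uint64_t last = rq->yield_deboost_last_time_ns;
	bool limited = false;

	/* last == 0 means no yield has been recorded yet. */
	if (last)
		limited = (now_ns - last) <= 6000ULL * NSEC_PER_USEC;

	/* Refresh the timestamp on every call, limited or not. */
	rq->yield_deboost_last_time_ns = now_ns;

	return limited;
}
```

The behavior is unchanged from the posted version: the first call is
never limited, and any call within 6 ms of the previous one is.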

>
> > +     return limited;
> > +}
> > +
> > +/*
> > + * Validate tasks and basic parameters for yield deboost operation.
> > + * Performs comprehensive safety checks including feature enablement,
> > + * NULL pointer validation, task state verification, and same-rq requirement.
> > + * Returns false with appropriate debug logging if any validation fails,
> > + * ensuring only safe and meaningful yield operations proceed.
> > + */
> > +static bool __maybe_unused yield_deboost_validate_tasks(struct rq *rq, struct task_struct *p_target,
> > +                                       struct task_struct **p_yielding_out,
> > +                                       struct sched_entity **se_y_out,
> > +                                       struct sched_entity **se_t_out)
> > +{
> > +     struct task_struct *p_yielding;
> > +     struct sched_entity *se_y, *se_t;
> > +     u64 now_ns;
> > +
> > +     if (!sysctl_sched_vcpu_debooster_enabled)
> > +             return false;
> > +
> > +     if (!rq || !p_target)
> > +             return false;
> > +
> > +     now_ns = rq->clock;
>
> Brief look at Patch 5 suggests we are under the rq_lock so might
> as well use the rq_clock(rq) helper. Also, you have to do a
> update_rq_clock() since it isn't done until yield_task_fair().

Good catch. Since yield_to() holds rq_lock but doesn't call
update_rq_clock() before invoking yield_to_task(), I need to call
update_rq_clock(rq) at the start of yield_to_deboost() and use
rq_clock(rq) instead of direct rq->clock access. This ensures the
clock is current before rate limiting checks.
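
To illustrate the ordering, here is a userspace sketch. Everything in it
is a stand-in: fake_now_ns plays the role of the real time source, the
two helpers only mimic the kernel's update_rq_clock() and rq_clock() in
heavily simplified form, and yield_deboost_now() is a hypothetical name
used for illustration only.

```c
#include <stdint.h>

/* Stand-in for struct rq: only the cached clock. */
struct rq {
	uint64_t clock;
};

/* Fake time source for this sketch; not a kernel symbol. */
static uint64_t fake_now_ns;

/* Simplified stand-in for update_rq_clock(): refresh the cached clock. */
static void update_rq_clock(struct rq *rq)
{
	rq->clock = fake_now_ns;
}

/* Simplified stand-in for rq_clock(): return the cached clock. */
static uint64_t rq_clock(struct rq *rq)
{
	return rq->clock;
}

/*
 * Hypothetical helper: refresh the clock first, then read it, so the
 * rate-limit check never sees a stale timestamp.
 */
static uint64_t yield_deboost_now(struct rq *rq)
{
	update_rq_clock(rq);
	return rq_clock(rq);
}
```

Without the update, rq_clock() would return whatever value was cached
the last time the clock was refreshed on this rq.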

>
> > +
> > +     if (yield_deboost_rate_limit(rq, now_ns))
> > +             return false;
> > +
> > +     p_yielding = rq->curr;
> > +     if (!p_yielding || p_yielding == p_target ||
> > +         p_target->sched_class != &fair_sched_class ||
> > +         p_yielding->sched_class != &fair_sched_class)
> > +             return false;
>
> yield_to() in syscall.c has already checked for the sched
> class matching under double_rq_lock. That cannot change by the
> time we are here.

Correct. The sched_class checks are redundant since yield_to() already
validates curr->sched_class == p->sched_class under double_rq_lock(),
and sched_class cannot change while holding the lock. Will remove.

>
> > +
> > +     se_y = &p_yielding->se;
> > +     se_t = &p_target->se;
> > +
> > +     if (!se_t || !se_y || !se_t->on_rq || !se_y->on_rq)
> > +             return false;
> > +
> > +     if (task_rq(p_yielding) != rq || task_rq(p_target) != rq)
>
> yield_to() has already checked for this under double_rq_lock()
> so this too should be unnecessary.

Right. yield_to() already ensures both tasks are on their expected run
queues under double_rq_lock(), so the task_rq(p_yielding) != rq ||
task_rq(p_target) != rq check is redundant. Will remove.

>
> > +             return false;
> > +
> > +     *p_yielding_out = p_yielding;
> > +     *se_y_out = se_y;
> > +     *se_t_out = se_t;
>
> Why do we need these pointers? Can't the caller simply do:
>
>     if (!yield_deboost_validate_tasks(rq, target))
>         return;
>
>     p_yielding = rq->donor;
>     se_y = &p_yielding->se;
>     se_t = &target->se;

You're right, the output parameters are unnecessary. The caller can
derive them directly:
   p_yielding = rq->donor (accounting for proxy exec)
   se_y = &p_yielding->se
   se_t = &target->se
I'll simplify yield_deboost_validate_tasks() to just return bool and
let the caller obtain these pointers.
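
A rough userspace sketch of that shape, with the redundant sched_class
and task_rq() checks dropped as discussed above. The struct definitions
are minimal stand-ins, the rate-limit call is left out for brevity, and
the yield_to_deboost() signature and bool return are assumptions made
only so the sketch is self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for kernel types: only the fields this sketch touches. */
struct sched_entity {
	int on_rq;
};

struct task_struct {
	struct sched_entity se;
};

struct rq {
	struct task_struct *donor;
};

/* Stand-in for the sysctl knob from the series. */
static bool sysctl_sched_vcpu_debooster_enabled = true;

/* Bool-only validation: no output parameters. */
static bool yield_deboost_validate_tasks(struct rq *rq,
					 struct task_struct *p_target)
{
	struct task_struct *p_yielding;

	if (!sysctl_sched_vcpu_debooster_enabled)
		return false;

	if (!rq || !p_target)
		return false;

	p_yielding = rq->donor;
	if (!p_yielding || p_yielding == p_target)
		return false;

	return p_yielding->se.on_rq && p_target->se.on_rq;
}

/* Hypothetical caller: derives the entities itself after validation. */
static bool yield_to_deboost(struct rq *rq, struct task_struct *p_target)
{
	struct sched_entity *se_y, *se_t;

	if (!yield_deboost_validate_tasks(rq, p_target))
		return false;

	se_y = &rq->donor->se;
	se_t = &p_target->se;

	/* The vruntime/deadline adjustment would operate on se_y and se_t. */
	(void)se_y;
	(void)se_t;
	return true;
}
```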

>
> That reminds me - now that we have proxy execution, you need
> to re-evaluate the usage of rq->curr (running context) vs
> rq->donor (vruntime context) when looking at all this.

Good catch. Since we're manipulating vruntime/deadline/vlag, I should
use rq->donor (scheduling context) instead of rq->curr (execution
context). In the yield_to() path, curr should equal donor (the
yielding task is running), but using donor makes the vruntime
semantics clearer and consistent with
update_curr_fair()/check_preempt_wakeup_fair().

Regards,
Wanpeng