Message-ID: <CANRm+CxZCkF-1Vy5bxAsFuAM99euUtQVmNwJzb+4SWRa47T4LQ@mail.gmail.com>
Date: Thu, 13 Nov 2025 21:25:21 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic
Hi Prateek,
On Wed, 12 Nov 2025 at 15:25, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * Calculate penalty with debounce logic for EEVDF yield deboost.
> > + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> > + * applies queue-size-based caps to prevent excessive penalties in small queues,
> > + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> > + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> > + */
> > +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > + struct sched_entity *se_t_lca, struct sched_entity *se_t,
> > + int nr_queued)
> > +{
> > + u64 gran, need, penalty, maxp;
> > + u64 gran_floor;
> > + u64 weighted_need, base;
> > +
> > + gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> > + /* Low-bound safeguard for gran when slice is abnormally small */
> > + gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> > + if (gran < gran_floor)
>
> Is this even possible?
No. Both calls use the same weight denominator in calc_delta_fair(), so
the half-slice floor can never exceed gran; the check is redundant. Will
remove.
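For reference (simplified; the real __calc_delta() uses fixed-point
inv_weight arithmetic, and calc_delta_fair() returns the delta unchanged
for NICE_0 weight):

	calc_delta_fair(d, se) ~= d * NICE_0_LOAD / se->load.weight

which is monotone in d for a fixed weight, so the half-slice result can
never exceed the full-slice one.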
>
> > + gran = gran_floor;
> > +
> > + need = 0;
> > + if (se_t_lca->vruntime > se_y_lca->vruntime)
> > + need = se_t_lca->vruntime - se_y_lca->vruntime;
>
> So I'm assuming you want the yielding task's vruntime to
> cross the target's vruntime simply because one task somewhere
> down the hierarchy said so.
Yes, this is a known tradeoff. We apply the penalty at the LCA where A
and B compete so that B becomes schedulable immediately. The side effect
is that an independent task C under the yielder's group loses CPU time
(see the layout sketched below). In practice, VMs typically place all
vCPUs in one cgroup, so there is no independent C. If C exists and shares
the lock, the penalty helps it too; if C is truly independent, it loses
roughly one scheduling slice. Your natural-convergence approach avoids
this, but needs multiple yield cycles before B gets sustained preference.
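For concreteness, a hypothetical layout (not taken from the patch):

	root
	 └─ CG0                 <- LCA; penalty applied to CG1's entity here
	     ├─ CG1: vCPU-A (yielder), task C
	     └─ CG2: vCPU-B (target)

Pushing CG1's vruntime back at the LCA lets CG2 (and thus B) run next,
but C, queued under the same penalized entity, is delayed along with A.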
>
> > +
> > + /* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> > + penalty = gran;
>
> So at the very least I see it getting weighted(base_slice / 2) penalty
> ...
>
> > + if (need) {
> > + /* weighted_need = need + 10% */
> > + weighted_need = need + need / 10;
> > + /* clamp to avoid overflow when adding to gran (still capped later) */
> > + if (weighted_need > U64_MAX - penalty)
> > + weighted_need = U64_MAX - penalty;
> > + penalty += weighted_need;
>
> ... if not more ...
Yes, the floor is gran (weighted, ~700µs). Empirically, smaller values
did not sustain preference: the yielder would re-preempt the target
within 1-2 decisions in dbench testing. This is a workload-specific
heuristic. If it is too aggressive for general use, I can lower it or tie
it to h_nr_queued. Thoughts?
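To make the arithmetic concrete (illustrative numbers only, ignoring the
debounce step, and assuming NICE_0 weight so gran is the ~700µs figure
above): with need = 200µs, the code above computes weighted_need = 220µs
and penalty = 700 + 220 = 920µs; for nr_queued == 2 the cap is
maxp = 6 * 700µs = 4.2ms, so 920µs survives the clamp, and the gran floor
is already met.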
>
> > + }
> > +
> > + /* Apply debounce via helper to avoid ping-pong */
> > + penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);
>
> ... since without debounce, penalty remains same.
>
> > +
> > + /* Upper bound (cap): slightly more aggressive for mid-size queues */
> > + if (nr_queued == 2)
> > + maxp = gran * 6; /* Strongest push for 2-task ping-pong */
> > + else if (nr_queued == 3)
> > + maxp = gran * 4; /* 4.0 * gran */
> > + else if (nr_queued <= 6)
> > + maxp = (gran * 5) / 2; /* 2.5 * gran */
> > + else if (nr_queued <= 8)
> > + maxp = gran * 2; /* 2.0 * gran */
> > + else if (nr_queued <= 12)
> > + maxp = (gran * 3) / 2; /* 1.5 * gran */
> > + else
> > + maxp = gran; /* 1.0 * gran */
>
> And all the nr_queued calculations are based on the entities queued
> and not the "h_nr_queued" so we can have a boat load of tasks to
> run above but since one task decided to call yield_to() let us make
> them all starve a little?
You're absolutely right. Using nr_queued (entity count) instead of
h_nr_queued (hierarchical task count) is wrong:

CG0 (nr_queued=2, h_nr_queued=100)
├─ CG1 (50 tasks)
└─ CG2 (50 tasks)

My code sees 2 entities and applies maxp = 6×gran (the strongest
penalty), but 100 tasks are competing, which starves unrelated tasks.
Will switch to cfs_rq_common->h_nr_queued; the caps should reflect the
actual task count, not the group count. Something like the sketch below.
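A minimal sketch of the corrected cap selection (untested, assuming
cfs_rq_common is the LCA cfs_rq as in this patch):

	/* Scale the cap by hierarchical task count, not entity count. */
	unsigned int nr = cfs_rq_common->h_nr_queued;

	if (nr == 2)
		maxp = gran * 6;	/* strongest push for a true 2-task ping-pong */
	else if (nr == 3)
		maxp = gran * 4;
	else if (nr <= 6)
		maxp = (gran * 5) / 2;
	else if (nr <= 8)
		maxp = gran * 2;
	else if (nr <= 12)
		maxp = (gran * 3) / 2;
	else
		maxp = gran;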
>
> > +
> > + if (penalty < gran)
> > + penalty = gran;
> > + if (penalty > maxp)
> > + penalty = maxp;
> > +
> > + /* If no need, apply refined baseline push (low risk + mid risk combined). */
> > + if (need == 0) {
> > + /*
> > + * Baseline multiplier for need==0:
> > + * 2 -> 1.00 * gran
> > + * 3 -> 0.9375 * gran
> > + * 4–6 -> 0.625 * gran
> > + * 7–8 -> 0.50 * gran
> > + * 9–12 -> 0.375 * gran
> > + * >12 -> 0.25 * gran
> > + */
> > + base = gran;
> > + if (nr_queued == 3)
> > + base = (gran * 15) / 16; /* 0.9375 */
> > + else if (nr_queued >= 4 && nr_queued <= 6)
> > + base = (gran * 5) / 8; /* 0.625 */
> > + else if (nr_queued >= 7 && nr_queued <= 8)
> > + base = gran / 2; /* 0.5 */
> > + else if (nr_queued >= 9 && nr_queued <= 12)
> > + base = (gran * 3) / 8; /* 0.375 */
> > + else if (nr_queued > 12)
> > + base = gran / 4; /* 0.25 */
> > +
> > + if (penalty < base)
> > + penalty = base;
> > + }
> > +
> > + return penalty;
> > +}
> > +
> > +/*
> > + * Apply penalty and update EEVDF fields for scheduler consistency.
> > + * Safely applies vruntime penalty with overflow protection, then updates
> > + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> > + * scheduler state consistency. Returns true on successful application,
> > + * false if penalty cannot be safely applied.
> > + */
> > +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > + struct cfs_rq *cfs_rq_common, u64 penalty)
> > +{
> > + u64 new_vruntime;
> > +
> > + /* Overflow protection */
> > + if (se_y_lca->vruntime > (U64_MAX - penalty))
> > + return;
> > +
> > + new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > + /* Validity check */
> > + if (new_vruntime <= se_y_lca->vruntime)
> > + return;
> > +
> > + se_y_lca->vruntime = new_vruntime;
> > + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
>
> And with that we update vruntime to an arbitrary value simply
> because one task in the hierarchy decided to call yield_to().
Yes, modifying vruntime at se_y_lca affects the entire hierarchy beneath
it, not just the calling task. This is the cost of making yield_to() work
in hierarchical scheduling. Is it worth it? We believe yes, because:
1. yield_to() is already a hierarchy-wide decision: when vCPU-A yields to
vCPU-B, it is not just task A helping task B; it is the entire VM (the
hierarchy) requesting that another vCPU make progress. Lock-holder
scenarios are VM-wide problems, not individual task problems.
2. The alternative is broken semantics: without hierarchy-level
adjustment, yield_to() silently fails in cgroup configurations. Users
call yield_to() expecting it to work, but it does not; that is worse than
documented unfairness.
3. Bounded impact: the penalty scales conservatively with h_nr_queued
(larger hierarchies get a 1.0x gran cap, not 6.0x), limiting the blast
radius.
If the position is that hierarchy-wide vruntime perturbation is never
acceptable regardless of use case, then yield_to() should explicitly fail
or be disabled in cgroup configurations rather than pretending to work; a
guard along the lines sketched below would make that explicit.
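A hypothetical guard (illustrative only; yield_to_task_fair() is the
existing entry point, but the policy check itself is an assumption about
the desired behavior):

	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		/* Hypothetical: refuse to yield across cgroup boundaries. */
		if (cfs_rq_of(se) != task_cfs_rq(rq->curr))
			return false;
		...
	}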
>
> Since we are on the topic, you are also missing an update_curr()
> which is only done in yield_task_fair() so you are actually
> looking at old vruntime for the yielding entity.
Will fix it.
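A minimal sketch of the fix (assumption: it runs at the top of the
deboost path, before the vruntime comparison, mirroring what
yield_task_fair() already does):

	struct sched_entity *se = &rq->curr->se;

	/* Bring the yielding hierarchy's vruntime up to date. */
	update_rq_clock(rq);
	for_each_sched_entity(se)
		update_curr(cfs_rq_of(se));

so that se_y_lca->vruntime is current before computing need.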
Regards,
Wanpeng