Message-ID: <CANRm+CxZCkF-1Vy5bxAsFuAM99euUtQVmNwJzb+4SWRa47T4LQ@mail.gmail.com>
Date: Thu, 13 Nov 2025 21:25:21 +0800
From: Wanpeng Li <kernellwp@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Wanpeng Li <wanpengli@...cent.com>
Subject: Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic
Hi Prateek,
On Wed, 12 Nov 2025 at 15:25, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> Hello Wanpeng,
>
> On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> > +/*
> > + * Calculate penalty with debounce logic for EEVDF yield deboost.
> > + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> > + * applies queue-size-based caps to prevent excessive penalties in small queues,
> > + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> > + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> > + */
> > +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > + struct sched_entity *se_t_lca, struct sched_entity *se_t,
> > + int nr_queued)
> > +{
> > + u64 gran, need, penalty, maxp;
> > + u64 gran_floor;
> > + u64 weighted_need, base;
> > +
> > + gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> > + /* Low-bound safeguard for gran when slice is abnormally small */
> > + gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> > + if (gran < gran_floor)
>
> Is this even possible?
No. Both calls use the same weight denominator in calc_delta_fair(), so
the half-slice floor can never exceed gran; the check is redundant. Will
remove.
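For reference (simplified; the real __calc_delta() uses fixed-point
inv_weight arithmetic, and calc_delta_fair() returns the delta unchanged
for NICE_0 weight):

	calc_delta_fair(d, se) ~= d * NICE_0_LOAD / se->load.weight

which is monotone in d for a fixed weight, so the half-slice result can
never exceed the full-slice one.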
>
> > + gran = gran_floor;
> > +
> > + need = 0;
> > + if (se_t_lca->vruntime > se_y_lca->vruntime)
> > + need = se_t_lca->vruntime - se_y_lca->vruntime;
>
> So I'm assuming you want the yielding task's vruntime to
> cross the target's vruntime simply because one task somewhere
> down the hierarchy said so.
Yes, this is a known tradeoff. We apply the penalty at the LCA where A
and B compete so that B becomes schedulable immediately. The side effect
is that an independent task C under the yielder's group loses CPU time
(see the layout sketched below). In practice, VMs typically place all
vCPUs in one cgroup, so there is no independent C. If C exists and shares
the lock, the penalty helps it too; if C is truly independent, it loses
roughly one scheduling slice. Your natural-convergence approach avoids
this, but needs multiple yield cycles before B gets sustained preference.
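For concreteness, a hypothetical layout (not taken from the patch):

	root
	 └─ CG0                 <- LCA; penalty applied to CG1's entity here
	     ├─ CG1: vCPU-A (yielder), task C
	     └─ CG2: vCPU-B (target)

Pushing CG1's vruntime back at the LCA lets CG2 (and thus B) run next,
but C, queued under the same penalized entity, is delayed along with A.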
>
> > +
> > + /* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> > + penalty = gran;
>
> So at the very least I see it getting weighted(base_slice / 2) penalty
> ...
>
> > + if (need) {
> > + /* weighted_need = need + 10% */
> > + weighted_need = need + need / 10;
> > + /* clamp to avoid overflow when adding to gran (still capped later) */
> > + if (weighted_need > U64_MAX - penalty)
> > + weighted_need = U64_MAX - penalty;
> > + penalty += weighted_need;
>
> ... if not more ...
Yes, the floor is gran (weighted, ~700µs). Empirically, smaller values
did not sustain preference: the yielder would re-preempt the target
within 1-2 decisions in dbench testing. This is a workload-specific
heuristic. If it is too aggressive for general use, I can lower it or tie
it to h_nr_queued. Thoughts?
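To make the arithmetic concrete (illustrative numbers only, ignoring the
debounce step, and assuming NICE_0 weight so gran is the ~700µs figure
above): with need = 200µs, the code above computes weighted_need = 220µs
and penalty = 700 + 220 = 920µs; for nr_queued == 2 the cap is
maxp = 6 * 700µs = 4.2ms, so 920µs survives the clamp, and the gran floor
is already met.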
>
> > + }
> > +
> > + /* Apply debounce via helper to avoid ping-pong */
> > + penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);
>
> ... since without debounce, penalty remains same.
>
> > +
> > + /* Upper bound (cap): slightly more aggressive for mid-size queues */
> > + if (nr_queued == 2)
> > + maxp = gran * 6; /* Strongest push for 2-task ping-pong */
> > + else if (nr_queued == 3)
> > + maxp = gran * 4; /* 4.0 * gran */
> > + else if (nr_queued <= 6)
> > + maxp = (gran * 5) / 2; /* 2.5 * gran */
> > + else if (nr_queued <= 8)
> > + maxp = gran * 2; /* 2.0 * gran */
> > + else if (nr_queued <= 12)
> > + maxp = (gran * 3) / 2; /* 1.5 * gran */
> > + else
> > + maxp = gran; /* 1.0 * gran */
>
> And all the nr_queued calculations are based on the entities queued
> and not the "h_nr_queued" so we can have a boat load of tasks to
> run above but since one task decided to call yield_to() let us make
> them all starve a little?
You're absolutely right. Using nr_queued (entity count) instead of
h_nr_queued (hierarchical task count) is wrong:

CG0 (nr_queued=2, h_nr_queued=100)
├─ CG1 (50 tasks)
└─ CG2 (50 tasks)

My code sees 2 entities and applies maxp = 6×gran (the strongest
penalty), but 100 tasks are competing, which starves unrelated tasks.
Will switch to cfs_rq_common->h_nr_queued; the caps should reflect the
actual task count, not the group count. Something like the sketch below.
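A minimal sketch of the corrected cap selection (untested, assuming
cfs_rq_common is the LCA cfs_rq as in this patch):

	/* Scale the cap by hierarchical task count, not entity count. */
	unsigned int nr = cfs_rq_common->h_nr_queued;

	if (nr == 2)
		maxp = gran * 6;	/* strongest push for a true 2-task ping-pong */
	else if (nr == 3)
		maxp = gran * 4;
	else if (nr <= 6)
		maxp = (gran * 5) / 2;
	else if (nr <= 8)
		maxp = gran * 2;
	else if (nr <= 12)
		maxp = (gran * 3) / 2;
	else
		maxp = gran;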
>
> > +
> > + if (penalty < gran)
> > + penalty = gran;
> > + if (penalty > maxp)
> > + penalty = maxp;
> > +
> > + /* If no need, apply refined baseline push (low risk + mid risk combined). */
> > + if (need == 0) {
> > + /*
> > + * Baseline multiplier for need==0:
> > + * 2 -> 1.00 * gran
> > + * 3 -> 0.9375 * gran
> > + * 4–6 -> 0.625 * gran
> > + * 7–8 -> 0.50 * gran
> > + * 9–12 -> 0.375 * gran
> > + * >12 -> 0.25 * gran
> > + */
> > + base = gran;
> > + if (nr_queued == 3)
> > + base = (gran * 15) / 16; /* 0.9375 */
> > + else if (nr_queued >= 4 && nr_queued <= 6)
> > + base = (gran * 5) / 8; /* 0.625 */
> > + else if (nr_queued >= 7 && nr_queued <= 8)
> > + base = gran / 2; /* 0.5 */
> > + else if (nr_queued >= 9 && nr_queued <= 12)
> > + base = (gran * 3) / 8; /* 0.375 */
> > + else if (nr_queued > 12)
> > + base = gran / 4; /* 0.25 */
> > +
> > + if (penalty < base)
> > + penalty = base;
> > + }
> > +
> > + return penalty;
> > +}
> > +
> > +/*
> > + * Apply penalty and update EEVDF fields for scheduler consistency.
> > + * Safely applies vruntime penalty with overflow protection, then updates
> > + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> > + * scheduler state consistency. Returns true on successful application,
> > + * false if penalty cannot be safely applied.
> > + */
> > +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> > + struct cfs_rq *cfs_rq_common, u64 penalty)
> > +{
> > + u64 new_vruntime;
> > +
> > + /* Overflow protection */
> > + if (se_y_lca->vruntime > (U64_MAX - penalty))
> > + return;
> > +
> > + new_vruntime = se_y_lca->vruntime + penalty;
> > +
> > + /* Validity check */
> > + if (new_vruntime <= se_y_lca->vruntime)
> > + return;
> > +
> > + se_y_lca->vruntime = new_vruntime;
> > + se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);
>
> And with that we update vruntime to an arbitrary value simply
> because one task in the hierarchy decided to call yield_to().
Yes, modifying vruntime at se_y_lca affects the entire hierarchy beneath
it, not just the calling task. This is the cost of making yield_to() work
in hierarchical scheduling. Is it worth it? We believe yes, because:
1. yield_to() is already a hierarchy-wide decision: when vCPU-A yields to
vCPU-B, it is not just task A helping task B; it is the entire VM (the
hierarchy) requesting that another vCPU make progress. Lock-holder
scenarios are VM-wide problems, not individual task problems.
2. The alternative is broken semantics: without hierarchy-level
adjustment, yield_to() silently fails in cgroup configurations. Users
call yield_to() expecting it to work, but it does not; that is worse than
documented unfairness.
3. Bounded impact: the penalty scales conservatively with h_nr_queued
(larger hierarchies get a 1.0x gran cap, not 6.0x), limiting the blast
radius.
If the position is that hierarchy-wide vruntime perturbation is never
acceptable regardless of use case, then yield_to() should explicitly fail
or be disabled in cgroup configurations rather than pretending to work; a
guard along the lines sketched below would make that explicit.
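A hypothetical guard (illustrative only; yield_to_task_fair() is the
existing entry point, but the policy check itself is an assumption about
the desired behavior):

	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		/* Hypothetical: refuse to yield across cgroup boundaries. */
		if (cfs_rq_of(se) != task_cfs_rq(rq->curr))
			return false;
		...
	}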
>
> Since we are on the topic, you are also missing an update_curr()
> which is only done in yield_task_fair() so you are actually
> looking at old vruntime for the yielding entity.
Will fix it.
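A minimal sketch of the fix (assumption: it runs at the top of the
deboost path, before the vruntime comparison, mirroring what
yield_task_fair() already does):

	struct sched_entity *se = &rq->curr->se;

	/* Bring the yielding hierarchy's vruntime up to date. */
	update_rq_clock(rq);
	for_each_sched_entity(se)
		update_curr(cfs_rq_of(se));

so that se_y_lca->vruntime is current before computing need.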
Regards,
Wanpeng