[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANDhNCq7SETQ7j6ifUoF_Pwiv42RNfv9V3AV+=OWg_U4+gZVbA@mail.gmail.com>
Date: Mon, 21 Apr 2025 14:00:34 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Suleiman Souhlal <suleiman@...gle.com>, kernel-team@...roid.com
Subject: Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec &
sched contexts
On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Apr 11, 2025 at 11:02:38PM -0700, John Stultz wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e43993a4e5807..da8b0970c6655 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > }
> > #endif /* CONFIG_SMP */
> >
> > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
>
> update_se()
Sure thing!
> > {
> > u64 now = rq_clock_task(rq);
> > s64 delta_exec;
> >
> > - delta_exec = now - curr->exec_start;
> > + delta_exec = now - se->exec_start;
> > if (unlikely(delta_exec <= 0))
> > return delta_exec;
> >
> > - curr->exec_start = now;
> > - curr->sum_exec_runtime += delta_exec;
> > + se->exec_start = now;
> > + if (entity_is_task(se)) {
> > + struct task_struct *running = rq->curr;
> > + /*
> > + * If se is a task, we account the time against the running
> > + * task, as w/ proxy-exec they may not be the same.
> > + */
> > + running->se.exec_start = now;
> > + running->se.sum_exec_runtime += delta_exec;
> > + } else {
> > + /* If not task, account the time against se */
> > + se->sum_exec_runtime += delta_exec;
> > + }
>
>
> So I am confused; you're accounting runtime to the actual running task,
> but then accounting the same runtime to the cgroup of the donor.
>
> This seems somewhat irregular.
So, apologies, as it's been a bit since I've deeply thought on this.
In general we want to charge the donor for everything since it's
donating its time, etc. However, without this change, we got some
strange behavior in top, etc, because the proxy tasks that actually
ran didn't seem to gain any exec_runtime. So the split of charging
everything to the donor except the sum_exec_runtime to the actually
running process (the proxy) made sense.
Now, for cgroup accounting, it seems like we'd still want to charge
the donor's cgroup, so whatever restrictions there are in place apply
to the donor, but it's just when we get to the leaf task we charge the
proxy instead.
Does that sound reasonable? Or am I making a bad assumption here
around the cgroup logic?
> Please consider all of update_curr_task(), and if they all want to be
> against rq->curr, rather than rq->donor then more changes are needed.
So I think we are ok here, but it is confusing... see more below.
> > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> > struct task_struct *donor = rq->donor;
> > s64 delta_exec;
> >
> > - delta_exec = update_curr_se(rq, &donor->se);
> > + delta_exec = update_se_times(rq, &donor->se);
> > if (likely(delta_exec > 0))
> > update_curr_task(donor, delta_exec);
> >
> > @@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> > if (unlikely(!curr))
> > return;
> >
> > - delta_exec = update_curr_se(rq, curr);
> > + delta_exec = update_se_times(rq, curr);
> > if (unlikely(delta_exec <= 0))
> > return;
>
> I think I've tripped over this before, on how update_curr_common() uses
> donor and update_curr() curr. This definitely needs a comment. Because
> at first glance they're not the same.
I suspect part of the incongruity/dissonance comes from the
cfs_rq->curr is actually the rq->donor (where rq->donor and rq->curr
are different), as its what the sched-class picked to run.
Renaming that I think might clarify things, but I have been hesitant
to cause too much naming churn in the series, but maybe it's the right
time to do it if it's causing confusion.
My other hesitancy there, is around wanting the proxy logic to be
focused in the core, so the sched-class "curr" can still be what the
class selected to run, its just proxy might pick something else to
actually run. But the top level rq->curr not being the cfs_rq->curr is
prone to confusion, and we already do have rq->donor references in
fair.c so its not like it's perfectly encapsulated and layered.
But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
you object.
thanks
-john
Powered by blists - more mailing lists