lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANDhNCq7SETQ7j6ifUoF_Pwiv42RNfv9V3AV+=OWg_U4+gZVbA@mail.gmail.com>
Date: Mon, 21 Apr 2025 14:00:34 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
	Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>, 
	Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Suleiman Souhlal <suleiman@...gle.com>, kernel-team@...roid.com
Subject: Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec &
 sched contexts

On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Apr 11, 2025 at 11:02:38PM -0700, John Stultz wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e43993a4e5807..da8b0970c6655 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
>
> update_se()

Sure thing!

> >  {
> >       u64 now = rq_clock_task(rq);
> >       s64 delta_exec;
> >
> > -     delta_exec = now - curr->exec_start;
> > +     delta_exec = now - se->exec_start;
> >       if (unlikely(delta_exec <= 0))
> >               return delta_exec;
> >
> > -     curr->exec_start = now;
> > -     curr->sum_exec_runtime += delta_exec;
> > +     se->exec_start = now;
> > +     if (entity_is_task(se)) {
> > +             struct task_struct *running = rq->curr;
> > +             /*
> > +              * If se is a task, we account the time against the running
> > +              * task, as w/ proxy-exec they may not be the same.
> > +              */
> > +             running->se.exec_start = now;
> > +             running->se.sum_exec_runtime += delta_exec;
> > +     } else {
> > +             /* If not task, account the time against se */
> > +             se->sum_exec_runtime += delta_exec;
> > +     }
>
>
> So I am confused; you're accounting runtime to the actual running task,
> but then accounting the same runtime to the cgroup of the donor.
>
> This seems somewhat irregular.

So, apologies, as it's been a bit since I've deeply thought on this.
In general we want to charge the donor for everything since it's
donating its time, etc. However, without this change, we got some
strange behavior in top, etc, because the proxy tasks that actually
ran didn't seem to gain any exec_runtime. So the split of charging
everything to the donor except the sum_exec_runtime to the actually
running process (the proxy) made sense.

Now, for cgroup accounting, it seems like we'd still want to charge
the donor's cgroup, so whatever restrictions there are in place apply
to the donor, but it's just when we get to the leaf task we charge the
proxy instead.

Does that sound reasonable? Or am I making a bad assumption here
around the cgroup logic?

> Please consider all of update_curr_task(), and if they all want to be
> against rq->curr, rather than rq->donor then more changes are needed.

So I think we are ok here, but it is confusing... see more below.


> > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> >       struct task_struct *donor = rq->donor;
> >       s64 delta_exec;
> >
> > -     delta_exec = update_curr_se(rq, &donor->se);
> > +     delta_exec = update_se_times(rq, &donor->se);
> >       if (likely(delta_exec > 0))
> >               update_curr_task(donor, delta_exec);
> >
> > @@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> >       if (unlikely(!curr))
> >               return;
> >
> > -     delta_exec = update_curr_se(rq, curr);
> > +     delta_exec = update_se_times(rq, curr);
> >       if (unlikely(delta_exec <= 0))
> >               return;
>
> I think I've tripped over this before, on how update_curr_common() uses
> donor and update_curr() curr. This definitely needs a comment. Because
> at first glance they're not the same.

I suspect part of the incongruity/dissonance comes from the
cfs_rq->curr is actually the rq->donor (where rq->donor and rq->curr
are different), as its what the sched-class picked to run.

Renaming that I think might clarify things, but I have been hesitant
to cause too much naming churn in the series, but maybe it's the right
time to do it if it's causing confusion.

My other hesitancy there, is around wanting the proxy logic to be
focused in the core, so the sched-class "curr" can still be what the
class selected to run, its just proxy might pick something else to
actually run. But the top level rq->curr not being the cfs_rq->curr is
prone to confusion, and we already do have rq->donor references in
fair.c so its not like it's perfectly encapsulated and layered.

But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
you object.

thanks
-john

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ