[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANDhNCq_PG6L68Qj0WoV5LyCcoNwjdxv3wOt+EnLFnkU4NAv+w@mail.gmail.com>
Date: Fri, 25 Apr 2025 20:34:32 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Suleiman Souhlal <suleiman@...gle.com>, kernel-team@...roid.com
Subject: Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec &
sched contexts
On Thu, Apr 24, 2025 at 6:37 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Mon, Apr 21, 2025 at 02:00:34PM -0700, John Stultz wrote:
> > On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > > So I am confused; you're accounting runtime to the actual running task,
> > > but then accounting the same runtime to the cgroup of the donor.
> > >
> > > This seems somewhat irregular.
> >
> > So, apologies, as it's been a bit since I've deeply thought on this.
> > In general we want to charge the donor for everything since it's
> > donating its time, etc. However, without this change, we got some
> > strange behavior in top, etc, because the proxy tasks that actually
> > ran didn't seem to gain any exec_runtime. So the split of charging
> > everything to the donor except the sum_exec_runtime to the actually
> > running process (the proxy) made sense.
> >
> > Now, for cgroup accounting, it seems like we'd still want to charge
> > the donor's cgroup, so whatever restrictions there are in place apply
> > to the donor, but it's just when we get to the leaf task we charge the
> > proxy instead.
> >
> > Does that sound reasonable? Or am I making a bad assumption here
> > around the cgroup logic?
>
> Its all rather confusing one way or the other I'm afraid :/
>
> This way when people go add up the task times and compare to cgroups
> it doesn't match up.
So, inherently the point of proxy is that tasks get to run on borrowed time.
Priority inversions from any restrictions via cgroups are resolved
only by letting the task run using the donors time. So for proxy to
work we have to allow those tasks to effectively break through their
restrictions to release the needed locks.
So I'm not sure, outside of accounting everything onto the donor
(which creates its own oddities, which motivated this patch) one could
be perfectly consistent here.
I still think the model used in this patch feels like the best
balance. But I'm wondering, would accounting the "borrowed time" and
the "donated time" separately maybe help folks interpret things more
clearly?
> Also, by adding sum_exec_runtime to curr, but
> account_group_exec_runtime() on donor, the cputimer information is
> inconsistent.
Ah! Ok, this is a good point. I misinterpreted the
account_group_exec_runtime() as being cgroup related not thread group.
Thanks for pointing this out!
>
> Whatever we do, it should be internally consistent, and this ain't it.
>
I'll take a swing at fixing the account_group_exec_runtime(), but let
me know if you have any more thoughts on how we charge cputime between
donor cgroups and proxy tasks running on the donors time.
> > > I think I've tripped over this before, on how update_curr_common() uses
> > > donor and update_curr() curr. This definitely needs a comment. Because
> > > at first glance they're not the same.
> >
> > I suspect part of the incongruity/dissonance comes from the
> > cfs_rq->curr is actually the rq->donor (where rq->donor and rq->curr
> > are different), as its what the sched-class picked to run.
> >
> > Renaming that I think might clarify things, but I have been hesitant
> > to cause too much naming churn in the series, but maybe it's the right
> > time to do it if it's causing confusion.
> >
> > My other hesitancy there, is around wanting the proxy logic to be
> > focused in the core, so the sched-class "curr" can still be what the
> > class selected to run, its just proxy might pick something else to
> > actually run. But the top level rq->curr not being the cfs_rq->curr is
> > prone to confusion, and we already do have rq->donor references in
> > fair.c so its not like it's perfectly encapsulated and layered.
> >
> > But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
> > you object.
>
> I was more thinking of a comment near here to clarify. Not sure
> cfs_rq->donor makes much sense.
Ok.
/me quietly deletes that patch from his WIP series. :)
Thanks so much again for the review and feedback here! Very much appreciated!
-john
Powered by blists - more mailing lists