Message-ID: <CABk29NtcRUwskBjrvLKkEKQ0hpNPSrdzrGAGZy+bHSfnznOUSg@mail.gmail.com>
Date: Wed, 16 Jun 2021 18:01:59 -0700
From: Josh Don <joshdon@...gle.com>
To: Tejun Heo <tj@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Paul Turner <pjt@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Oleg Rombakh <olegrom@...gle.com>,
Viresh Kumar <viresh.kumar@...aro.org>,
Steve Sistare <steven.sistare@...cle.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH] sched: cgroup SCHED_IDLE support
Hey Tejun,
Thanks for taking a look.
On Wed, Jun 16, 2021 at 8:42 AM Tejun Heo <tj@...nel.org> wrote:
>
> Hello,
>
> On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> > This extends SCHED_IDLE to cgroups.
> >
> > Interface: cgroup/cpu.idle.
> > 0: default behavior
> > 1: SCHED_IDLE
> >
> > Extending SCHED_IDLE to cgroups means that we incorporate the existing
> > aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> > descendant threads towards the idle_h_nr_running count of all of its
> > ancestor cgroups. Thus, sched_idle_rq() will work properly.
> > Additionally, SCHED_IDLE cgroups are configured with minimum weight.
> >
> > There are two key differences between the per-task and per-cgroup
> > SCHED_IDLE interfaces:
> >
> > - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> > maintain their relative weights. The entity that is "idle" is the
> > cgroup, not the tasks themselves.
> >
> > - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> > decision is not made by comparing the current task with the woken task,
> > but rather by comparing their matching sched_entity.
> >
> > A typical use-case for this is a user that creates an idle and a
> > non-idle subtree. The non-idle subtree will dominate competition vs
> > the idle subtree, but the idle subtree will still be high priority
> > vs other users on the system. The latter is accomplished via comparing
> > matching sched_entity in the wakeup preemption path (this could also be
> > improved by making the sched_idle_rq() decision dependent on the
> > perspective of a specific task).
>
> A high-level problem that I see with the proposal is that this would bake
> the current recursive implementation into the interface. The semantics of
> the currently exposed interface, at least the weight based part, is abstract
> and doesn't necessarily dictate how the scheduling is actually performed.
> Adding this would mean that we're now codifying the current behavior of
> fully nested scheduling into the interface.
>
> There are several practical challenges with the current implementation
> caused by the full nesting - e.g. nesting levels are expensive for context
> switch heavy applications, often going over >1% per level, and heuristics
> which assume a global queue may behave unexpectedly - i.e. we can create
> conditions where things like idle-wakeup boost behave very differently
> depending on whether tasks are inside a cgroup or not even when the eventual
> relative weights and past usages are similar.
To be clear, are you talking mainly about the idle_h_nr_running
accounting? The fact that setting cpu.idle propagates changes upwards
to ancestor cgroups?

The goal is to match the SCHED_IDLE semantics here, which requires
this behavior. The end result is no different than a nested cgroup
with SCHED_IDLE tasks. One possible alternative would be to only count
root-level cpu.idle cgroups towards the overall idle_h_nr_running.
I've not opted for that, in order to make this work more similarly to
the task interface. Also note that there isn't additional overhead
from the accounting, since enqueue/dequeue already traverses these
nesting levels.
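To illustrate (a toy userspace model only, not the kernel code; the
struct and function names below are made up for the sketch): the idle
count just rides along the same parent walk that enqueue/dequeue
performs anyway, roughly like

#include <stdio.h>

struct group {
	const char *name;
	struct group *parent;
	int nr_running;        /* runnable entities below this group */
	int idle_h_nr_running; /* of those, how many count as idle */
};

/* Enqueue one entity; 'idle' says whether it counts as SCHED_IDLE. */
static void enqueue_one(struct group *g, int idle)
{
	/* The idle count piggybacks on the walk we already do. */
	for (; g; g = g->parent) {
		g->nr_running++;
		g->idle_h_nr_running += idle;
	}
}

int main(void)
{
	struct group root = { "root", NULL, 0, 0 };
	struct group A    = { "A", &root, 0, 0 };
	struct group idle = { "idle", &A, 0, 0 };

	/* A task under the cpu.idle group counts as idle all the way up. */
	enqueue_one(&idle, 1);

	printf("root: %d runnable, %d idle\n",
	       root.nr_running, root.idle_h_nr_running);
	return 0;
}
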
Could you elaborate a bit more on your statement here?
> Can you please give more details on why this is beneficial? Is the benefit
> mostly around making configuration easy or are there actual scheduling
> behaviors that you can't achieve otherwise?
There are actual scheduling behaviors that cannot be achieved without
the cgroup interface. The task SCHED_IDLE mechanism is a combination
of special SCHED_IDLE handling (idle_h_nr_running,
check_preempt_wakeup), and minimum scheduling weight.
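For reference, the existing per-task interface amounts to a
sched_setscheduler() call (minimal sketch, no real error handling):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* SCHED_IDLE requires a static priority of 0. */
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_IDLE, &param)) {
		perror("sched_setscheduler");
		return 1;
	}
	/* This task now has minimum weight and loses wakeup preemption
	 * against normal tasks. */
	return 0;
}
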
Consider a tree like:

       root
      /    \
     A      C
    / \     |
   B  idle  t4
   |  |  \
  t1  t2  t3
Here, 'idle' is our cpu.idle cgroup. The following properties would
not be possible if we moved t2/t3 into SCHED_IDLE without the cgroup
interface (a rough sketch of this setup follows after the list):
- t1 always preempts t2/t3 on wakeup, but t4 does not
- t2 and t3 have different, non-minimum weights. Technically we could
also achieve this by adding another layer of nested cgroups, but that
starts to make the hierarchy much more complex.
- I've also discussed with Peter a possible extension (vruntime
adjustments) to the current SCHED_IDLE semantics. Similarly to the
first bullet here, we'd need a cgroup idle toggle to achieve certain
scheduling behaviors with this.
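To make that concrete, the hierarchy above could be set up roughly as
below (a sketch only: it assumes cgroup v2 mounted at /sys/fs/cgroup
with the cpu controller enabled in the relevant cgroup.subtree_control
files, and cpu.idle is the knob added by this patch):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	/* root -> A -> {B, idle} and root -> C, as in the tree above. */
	mkdir("/sys/fs/cgroup/A", 0755);
	mkdir("/sys/fs/cgroup/A/B", 0755);
	mkdir("/sys/fs/cgroup/A/idle", 0755);
	mkdir("/sys/fs/cgroup/C", 0755);

	/* Mark the 'idle' subtree SCHED_IDLE; everything else stays default. */
	write_file("/sys/fs/cgroup/A/idle/cpu.idle", "1");

	/*
	 * t1..t4 would then be attached by writing their PIDs into the
	 * respective cgroup.procs files; t2 and t3 keep distinct,
	 * non-minimum weights inside the idle group via their own nice
	 * values.
	 */
	return 0;
}
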
Thanks,
Josh