Message-ID: <20220118173248.amqd3qwyuqc33egk@oracle.com>
Date: Tue, 18 Jan 2022 12:32:48 -0500
From: Daniel Jordan <daniel.m.jordan@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Alexander Duyck <alexanderduyck@...com>,
Alex Williamson <alex.williamson@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Ben Segall <bsegall@...gle.com>,
Cornelia Huck <cohuck@...hat.com>,
Dan Williams <dan.j.williams@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Herbert Xu <herbert@...dor.apana.org.au>,
Ingo Molnar <mingo@...hat.com>,
Jason Gunthorpe <jgg@...dia.com>,
Johannes Weiner <hannes@...xchg.org>,
Josh Triplett <josh@...htriplett.org>,
Michal Hocko <mhocko@...e.com>, Nico Pache <npache@...hat.com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Steffen Klassert <steffen.klassert@...unet.com>,
Steve Sistare <steven.sistare@...cle.com>,
Tejun Heo <tj@...nel.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-mm@...ck.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org
Subject: Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS
bandwidth
On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..3c2d7f245c68 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4655,10 +4655,19 @@ static inline u64 sched_cfs_bandwidth_slice(void)
> >   */
> >  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >  {
> > -	if (unlikely(cfs_b->quota == RUNTIME_INF))
> > +	u64 quota = cfs_b->quota;
> > +	u64 payment;
> > +
> > +	if (unlikely(quota == RUNTIME_INF))
> >  		return;
> >
> > -	cfs_b->runtime += cfs_b->quota;
> > +	if (cfs_b->debt) {
> > +		payment = min(quota, cfs_b->debt);
> > +		cfs_b->debt -= payment;
> > +		quota -= payment;
> > +	}
> > +
> > +	cfs_b->runtime += quota;
> >  	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> >  }
>
> It might be easier to make cfs_bandwidth::runtime an s64 and make it go
> negative.
Yep, nice, no need for a new field in cfs_bandwidth.
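To make sure I understand the suggestion, here is a rough userspace sketch of the refill with a signed runtime. The struct and the names cfs_bw_model, refill_runtime() and charge_remote() are made up for illustration and are not kernel code; the sketch only models the arithmetic, with no locking.

```c
#include <assert.h>
#include <stdint.h>

#define RUNTIME_INF UINT64_MAX

/* Hypothetical stand-in for struct cfs_bandwidth; only the fields used here. */
struct cfs_bw_model {
	uint64_t quota;
	uint64_t burst;
	int64_t runtime;	/* signed: a negative value is outstanding debt */
};

/* Remote charge: simply subtract, letting runtime go negative. */
static void charge_remote(struct cfs_bw_model *b, uint64_t delta)
{
	assert(delta <= (uint64_t)INT64_MAX);
	b->runtime -= (int64_t)delta;
}

/* Periodic refill: debt is repaid implicitly because runtime starts below 0. */
static void refill_runtime(struct cfs_bw_model *b)
{
	int64_t cap;

	if (b->quota == RUNTIME_INF)
		return;

	b->runtime += (int64_t)b->quota;
	cap = (int64_t)(b->quota + b->burst);
	if (b->runtime > cap)
		b->runtime = cap;
}
```

With this shape the separate payment bookkeeping in the refill path disappears, since a negative runtime carries the debt across periods.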
> > @@ -5406,6 +5415,32 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >  	rcu_read_unlock();
> >  }
> >
> > +static void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
> > +			   struct task_group *tg, u64 debt)
> > +{
> > +	if (!cfs_bandwidth_used())
> > +		return;
> > +
> > +	while (tg != &root_task_group) {
> > +		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +
> > +		if (cfs_rq->runtime_enabled) {
> > +			struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> > +			u64 payment;
> > +
> > +			raw_spin_lock(&cfs_b->lock);
> > +
> > +			payment = min(cfs_b->runtime, debt);
> > +			cfs_b->runtime -= payment;
>
> At this point it might hit 0 (or go negative if/when you do the above)
> and you'll need to throttle the group.
I might not be following you, but there could be cfs_rq's with local
runtime_remaining, so even if cfs_bandwidth::runtime hits 0 or goes
negative, the group might still have quota left on those cfs_rq's and
so shouldn't be throttled right away.
I was thinking the throttling would happen as normal, when a cfs_rq runs
out of runtime_remaining and fails to refill it from
cfs_bandwidth::runtime.
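Roughly what I have in mind, as a simplified userspace model. The names pool, rq_model, SLICE, assign_runtime() and account_runtime() are hypothetical and only loosely follow the existing local-refill path; locking and timers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLICE 5	/* arbitrary slice size for the model */

struct pool { int64_t runtime; };	/* models cfs_bandwidth::runtime */
struct rq_model {			/* models one per-CPU cfs_rq */
	int64_t runtime_remaining;
	bool throttled;
};

/* Try to pull up to one slice from the global pool into the local cfs_rq. */
static bool assign_runtime(struct pool *p, struct rq_model *rq)
{
	int64_t amount = 0;

	if (p->runtime > 0) {
		amount = p->runtime < SLICE ? p->runtime : SLICE;
		p->runtime -= amount;
	}
	rq->runtime_remaining += amount;

	return rq->runtime_remaining > 0;
}

/* Charge local execution time; throttle only when the refill fails. */
static void account_runtime(struct pool *p, struct rq_model *rq, int64_t delta)
{
	assert(delta >= 0);
	rq->runtime_remaining -= delta;
	if (rq->runtime_remaining > 0)
		return;
	if (!assign_runtime(p, rq))
		rq->throttled = true;
}
```

In this model a remote charge that drives the pool negative leaves a cfs_rq with local runtime_remaining running until it next needs a refill.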
> > +			cfs_b->debt += debt - payment;
> > +
> > +			raw_spin_unlock(&cfs_b->lock);
> > +		}
> > +
> > +		tg = tg->parent;
> > +	}
> > +}
>
> So part of the problem I have with this is that these external things
> can consume all the bandwidth and basically indefinitely starve the
> group.
>
> This is doubly so if you're going to account things like softirq network
> processing.
Yes. As Tejun points out, I'll make sure remote charging doesn't run
away.
> Also, why does the whole charging API have a task argument? It either is
> current or NULL in case of things like softirq, neither really make
> sense as an argument.
@task distinguishes between NULL for softirq and current for everybody
else.
It's possible to detect this difference internally though, if that's
what you're saying, so @task can go away.
> Also, by virtue of this being a start-stop annotation interface, the
> accrued time might be arbitrarily large and arbitrarily delayed. I'm not
> sure that's sensible.
Yes, that is a risk. With start-stop, users need to be careful to
account often enough and have a "reasonable" upper bound on period
length, however that's defined. Multithreaded jobs are probably the
worst offender since these threads charge a sizable amount at once
compared to the other use cases.
> For tasks it might be better to mark the task and have the tick DTRT
> instead of later trying to 'migrate' the time.
Ok, I'll try that. The start-stop approach keeps remote charging from
adding overhead in the tick for non-remote-charging things, far and away
the common case, but I'll see how expensive the tick-based approach is.
It can be hidden behind a static branch for systems not using the cpu
controller.
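A very rough userspace sketch of the tick-based idea, just to show the shape. Everything here is hypothetical: group, task_model, tick_charge() and the remote_charging_enabled flag are invented for the model, and the plain bool only stands in for a static branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct group { int64_t runtime; };

struct task_model {
	struct group *own;	/* group the task itself belongs to */
	struct group *remote;	/* non-NULL while marked for remote charging */
};

/* Stand-in for a static key; off when no one uses remote charging. */
static bool remote_charging_enabled;

/* Called from the (modeled) tick: charge the marked group directly. */
static void tick_charge(struct task_model *t, int64_t delta)
{
	struct group *g;

	assert(t && t->own);
	g = t->own;
	if (remote_charging_enabled && t->remote)
		g = t->remote;
	g->runtime -= delta;
}
```

The time lands on the right group as it accrues, so nothing has to be migrated later, and the extra check is skipped entirely while the flag is off.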