Message-ID: <CAOBoifh2Ya279_jjFBrSAHuczqz_FM4NGUne2Tk0sBV=gD4D+w@mail.gmail.com>
Date: Thu, 20 Mar 2025 11:40:11 -0700
From: Xi Wang <xii@...gle.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: K Prateek Nayak <kprateek.nayak@....com>, Josh Don <joshdon@...gle.com>,
Aaron Lu <ziqianlu@...edance.com>, Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle
On Thu, Mar 20, 2025 at 1:39 AM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
>
> Hello Prateek,
>
> On 2025/3/20 14:59, K Prateek Nayak wrote:
> > Hello Chengming,
> >
> > On 3/17/2025 8:24 AM, Chengming Zhou wrote:
> >> On 2025/3/16 11:25, Josh Don wrote:
> >>> Hi Aaron,
> >>>
> >>>>  static int tg_throttle_down(struct task_group *tg, void *data)
> >>>>  {
> >>>>          struct rq *rq = data;
> >>>>          struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> >>>> +        struct task_struct *p;
> >>>> +        struct rb_node *node;
> >>>> +
> >>>> +        cfs_rq->throttle_count++;
> >>>> +        if (cfs_rq->throttle_count > 1)
> >>>> +                return 0;
> >>>>
> >>>>          /* group is entering throttled state, stop time */
> >>>> -        if (!cfs_rq->throttle_count) {
> >>>> -                cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> >>>> -                list_del_leaf_cfs_rq(cfs_rq);
> >>>> +        cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> >>>> +        list_del_leaf_cfs_rq(cfs_rq);
> >>>>
> >>>> -                SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> >>>> -                if (cfs_rq->nr_queued)
> >>>> -                        cfs_rq->throttled_clock_self = rq_clock(rq);
> >>>> +        SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> >>>> +        if (cfs_rq->nr_queued)
> >>>> +                cfs_rq->throttled_clock_self = rq_clock(rq);
> >>>> +
> >>>> +        WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> >>>> +        /*
> >>>> +         * rq_lock is held, current is (obviously) executing this in kernelspace.
> >>>> +         *
> >>>> +         * All other tasks enqueued on this rq have their saved PC at the
> >>>> +         * context switch, so they will go through the kernel before returning
> >>>> +         * to userspace. Thus, there are no tasks-in-userspace to handle, just
> >>>> +         * install the task_work on all of them.
> >>>> +         */
> >>>> +        node = rb_first(&cfs_rq->tasks_timeline.rb_root);
> >>>> +        while (node) {
> >>>> +                struct sched_entity *se = __node_2_se(node);
> >>>> +
> >>>> +                if (!entity_is_task(se))
> >>>> +                        goto next;
> >>>> +
> >>>> +                p = task_of(se);
> >>>> +                task_throttle_setup_work(p);
> >>>> +next:
> >>>> +                node = rb_next(node);
> >>>> +        }
> >>>
> >>> I'd like to strongly push back on this approach. This adds quite a lot
> >>> of extra computation to an already expensive path
> >>> (throttle/unthrottle). e.g. this function is part of the cgroup walk
> >>
> >> Actually, with my suggestion in another email that we only mark these
> >> cfs_rqs throttled when the quota is used up, and set up the throttle
> >> task work when tasks under those cfs_rqs get picked, the cost of the
> >> throttle path is much like the dual tree approach.
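
If I'm reading the suggestion right, the pick-time part could look roughly
like the sketch below. The hook and its placement are only my assumptions;
task_throttle_setup_work() is the helper from the patch above, and
throttled_hierarchy()/task_cfs_rq() are the existing fair.c helpers:

    /*
     * Hypothetical sketch: cfs_rqs are only marked throttled when the
     * quota runs out, and the exit-to-user throttle work is armed lazily
     * when a task from a throttled hierarchy is picked, rq lock held.
     */
    static void maybe_setup_throttle_work(struct task_struct *p)
    {
            if (throttled_hierarchy(task_cfs_rq(p)))
                    task_throttle_setup_work(p);
    }

That way the per-task cost is paid one task at a time at pick time instead of
during the cgroup walk.
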
> >>
> >> As for unthrottle path, you're right, it has to iterate each task on
> >> a list to get it queued into the cfs_rq tree, so it needs more thinking.
> >>
> >>> and so it is already O(cgroups) for the number of cgroups in the
> >>> hierarchy being throttled. This gets even worse when you consider that
> >>> we repeat this separately across all the cpus that the
> >>> bandwidth-constrained group is running on. Keep in mind that
> >>> throttle/unthrottle is done with rq lock held and IRQ disabled.
> >>
> >> Maybe we can avoid holding the rq lock when unthrottling? As for
> >> per-task unthrottle, it's much like waking up lots of tasks, so maybe
> >> we can reuse the ttwu path to wake up those tasks, which could use a
> >> remote IPI to avoid holding remote rq locks. I'm not sure, just some
> >> random thinking..
> >>
> >>>
> >>> In K Prateek's last RFC, there was discussion of using context
> >>> tracking; did you consider that approach any further? We could keep
> >>> track of the number of threads within a cgroup hierarchy currently in
> >>> kernel mode (similar to h_nr_runnable), and thus simplify down the
> >>
> >> Yeah, I think Prateek's approach is very creative! The only downside
> >> is that the code becomes much more complex.. on an already complex
> >> codebase. Anyway, it would be great if that approach or this one could
> >> be merged into the mainline kernel.
> >
> > I think the consensus is to move to per-task throttling since it'll
> > simplify the efforts to move to a flat hierarchy sometime in the near
> > future.
>
> Ah, agree! And I'm very much looking forward to seeing a flat hierarchy!
>
> In our clusters, there are often too many levels (3-6) of cgroups, which
> causes too much cost in scheduler activities.
>
> >
> > My original approach was complex since I wanted to seamlessly resume the
> > pick part on unthrottle. In Ben's patch [1] there is a TODO in
> > pick_next_entity() that probably worked with the older vruntime-based
> > CFS algorithm but can break EEVDF guarantees.
> >
> > [1] https://lore.kernel.org/all/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com/
> >
> > Unfortunately keeping a single rbtree meant replicating the stats and
> > that indeed adds to complexity.
>
> Ok, got it.
>
> Thanks!
>
> >
> >>
> >> Lastly, I wonder whether it is possible to implement cgroup-level
> >> bandwidth control instead of doing it in each sched_class? Then SCX
> >> tasks could be controlled too, without implementing it in BPF code...
> >>
> >> Thanks!
> >>
> >>> throttling code here.
> >>>
> >>> Best,
> >>> Josh
> >
I am a bit unsure about the overhead experiment results. Maybe we can add some
counters to check how many cgroups per CPU are actually touched and how many
threads are actually dequeued/enqueued for throttling/unthrottling?
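
Something like the per-CPU counters below would probably be enough. This is
purely illustrative: none of these counters exist today, and the increments
would go into tg_throttle_down() and the task-work / unthrottle paths:

    /* hypothetical debug counters for sizing the throttle paths */
    struct cfsb_debug_stats {
            unsigned long throttled_cfs_rqs;   /* cfs_rqs walked on throttle   */
            unsigned long throttled_tasks;     /* tasks dequeued via task work */
            unsigned long unthrottled_tasks;   /* tasks re-enqueued on refill  */
    };
    DEFINE_PER_CPU(struct cfsb_debug_stats, cfsb_debug);

    /* e.g. __this_cpu_inc(cfsb_debug.throttled_tasks); in the task work */
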
It looks like busy-loop workloads were used for the experiment. With throttling
deferred to exit_to_user_mode, it is only triggered by ticks, so a large
runtime debt can accumulate before the on-CPU threads are actually dequeued.
(Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
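
My mental model of that deferral is roughly the sketch below. The function
name is taken from the patch earlier in the thread, but the body and the
sched_throttle_work field are my assumptions, not necessarily what the
series does:

    #include <linux/task_work.h>

    /* runs when the task next returns to userspace, e.g. after a tick */
    static void throttle_cfs_rq_work(struct callback_head *work)
    {
            /* dequeue current from its throttled cfs_rq here */
    }

    static void task_throttle_setup_work(struct task_struct *p)
    {
            init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
            /* TWA_RESUME: run on the next return to userspace */
            if (task_work_add(p, &p->sched_throttle_work, TWA_RESUME))
                    return; /* task is exiting, nothing to throttle */
    }
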
distribute_cfs_runtime() would finish early if the quota is used up by the
first few CPUs, which would also result in throttling/unthrottling on only a
few runqueues per period. An intermittent workload like hackbench may give us
more information.
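
For reference, the early exit I mean has roughly this shape (heavily
simplified from distribute_cfs_runtime() in kernel/sched/fair.c, with the
locking and accounting elided):

    static void distribute_cfs_runtime_sketch(struct cfs_bandwidth *cfs_b)
    {
            struct cfs_rq *cfs_rq;

            rcu_read_lock();
            list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
                                    throttled_list) {
                    /* hand part of cfs_b->runtime to this cfs_rq ... */

                    if (!cfs_b->runtime)
                            break;  /* pool drained: the rest stay throttled */
            }
            rcu_read_unlock();
    }
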
See slide 10 of my presentation for more info:
https://lpc.events/event/18/contributions/1883/attachments/1420/3040/Priority%20Inheritance%20for%20CFS%20Bandwidth%20Control.pdf
Indeed, we are seeing more CFS bandwidth control scalability problems on larger
servers. Both the IRQ-off time from the cgroup walk and the overhead of
per-task actions can be problematic.
-Xi