linux-kernel - Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250319134257.GA1428417@bytedance>
Date: Wed, 19 Mar 2025 21:43:23 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Josh Don <joshdon@...gle.com>
Cc: Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Chuyi Zhou <zhouchuyi@...edance.com>, Xi Wang <xii@...gle.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
 throttle

Hi Josh,

On Sat, Mar 15, 2025 at 08:25:53PM -0700, Josh Don wrote:
> Hi Aaron,
> 
> >  static int tg_throttle_down(struct task_group *tg, void *data)
> >  {
> >         struct rq *rq = data;
> >         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +       struct task_struct *p;
> > +       struct rb_node *node;
> > +
> > +       cfs_rq->throttle_count++;
> > +       if (cfs_rq->throttle_count > 1)
> > +               return 0;
> >
> >         /* group is entering throttled state, stop time */
> > -       if (!cfs_rq->throttle_count) {
> > -               cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > -               list_del_leaf_cfs_rq(cfs_rq);
> > +       cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > +       list_del_leaf_cfs_rq(cfs_rq);
> >
> > -               SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> > -               if (cfs_rq->nr_queued)
> > -                       cfs_rq->throttled_clock_self = rq_clock(rq);
> > +       SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> > +       if (cfs_rq->nr_queued)
> > +               cfs_rq->throttled_clock_self = rq_clock(rq);
> > +
> > +       WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> > +       /*
> > +        * rq_lock is held, current is (obviously) executing this in kernelspace.
> > +        *
> > +        * All other tasks enqueued on this rq have their saved PC at the
> > +        * context switch, so they will go through the kernel before returning
> > +        * to userspace. Thus, there are no tasks-in-userspace to handle, just
> > +        * install the task_work on all of them.
> > +        */
> > +       node = rb_first(&cfs_rq->tasks_timeline.rb_root);
> > +       while (node) {
> > +               struct sched_entity *se = __node_2_se(node);
> > +
> > +               if (!entity_is_task(se))
> > +                       goto next;
> > +
> > +               p = task_of(se);
> > +               task_throttle_setup_work(p);
> > +next:
> > +               node = rb_next(node);
> > +       }
> 
> I'd like to strongly push back on this approach. This adds quite a lot
> of extra computation to an already expensive path
> (throttle/unthrottle). e.g. this function is part of the cgroup walk
> and so it is already O(cgroups) for the number of cgroups in the
> hierarchy being throttled. This gets even worse when you consider that
> we repeat this separately across all the cpus that the
> bandwidth-constrained group is running on. Keep in mind that
> throttle/unthrottle is done with rq lock held and IRQ disabled.

Agree that it's not good to do this O(nr_task) thing in
throttle/unthrottle path. As Chengming mentioned, throttle path can
avoid this but unthrottle path does not have an easy way to avoid this.

> In K Prateek's last RFC, there was discussion of using context
> tracking; did you consider that approach any further? We could keep

I haven't tried that approach yet.

> track of the number of threads within a cgroup hierarchy currently in
> kernel mode (similar to h_nr_runnable), and thus simplify down the
> throttling code here.

My initial feeling is the implementation looks pretty complex. If it can
be simplified somehow, that would be great.

Best regards,
Aaron