Message-ID: <xhsmhikwwyw8r.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Tue, 23 Jul 2024 17:16:20 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: linux-kernel@...r.kernel.org, rcu@...r.kernel.org, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Phil Auld
<pauld@...hat.com>, Clark Williams <williams@...hat.com>, Tomas Glozar
<tglozar@...hat.com>, "Paul E. McKenney" <paulmck@...nel.org>, Frederic
Weisbecker <frederic@...nel.org>, Neeraj Upadhyay
<neeraj.upadhyay@...nel.org>, Joel Fernandes <joel@...lfernandes.org>,
Josh Triplett <josh@...htriplett.org>, Boqun Feng <boqun.feng@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Lai Jiangshan
<jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>, Alexander
Gordeev <agordeev@...ux.ibm.com>, Catalin Marinas
<catalin.marinas@....com>, Arnd Bergmann <arnd@...db.de>, Guo Ren
<guoren@...nel.org>, Palmer
Dabbelt <palmer@...osinc.com>, Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [RFC PATCH v3 10/10] sched/fair: Throttle CFS tasks on return
to userspace
On 18/07/24 17:25, Benjamin Segall wrote:
> Valentin Schneider <vschneid@...hat.com> writes:
>
>> As reported in [1], CFS bandwidth throttling is a source of headaches in
>> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
>> prevent ksoftirqd from running, which prevents replenishing & unthrottling
>> the cfs_rq of said CFS task.
>>
>> Peter mentioned that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace.
>
> Sorry for taking a while to get to replying to this.
Well it's a pretty big diff, so thanks for taking a look!
>
>> Clocks
>> ======
>>
>> Correctly handling the different cfs_rq->throttled_clock* is tricky, as
>> unlike the current upstream approach where all tasks of a cfs_rq are
>> throttled at the exact same time, here they each get throttled at a
>> per-task, not-known-beforehand time.
>>
>> For instance, for the ->throttled_clock_pelt, ideally we would need a
>> per-task snapshot of when the task gets really throttled in
>> exit_to_user_mode(), rather than a single snapshot of when the cfs_rq runs
>> out of runtime. This isn't implemented here. The ->throttled_clock_pelt is
>> set when the cfs_rq runs out of runtime, which means the "grace period"
>> given to the cfs_rq's tasks on their way to exit_to_user_mode() isn't
>> accounted.
>
> And I haven't checked it yet because I never remember how the whole
> throttled clock thing works. :P
>
>>
>> Notable behaviour changes
>> =========================
>>
>> Once a cfs_rq is ->throttled, its tasks can continue running until they hit
>> exit_to_user_mode(). This means they can keep draining further runtime
>> from their cfs_rq, which can end up draining more than one period's worth
>> of runtime.
>>
>> I've tested a 10ms runtime / 100ms period cgroup with an always running
>> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
>> whereas this gets something more like 40ms runtime every 400ms.
>
> Hmm, this seems a little odd since TWA_RESUME does a kick_process.
I didn't ponder too much on the workload used here, but the point I wanted
to bring up is: if you give a cgroup X amount of runtime, it may still
consume more than that within a single period because execution in
kernelspace isn't immediately stopped/throttled.
It means the "standard" bandwidth control behaviour becomes a bit more
bursty.
>> +static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
>> +{
>> + struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
>> + int cpu = raw_smp_processor_id();
>> +
>> + CLASS(task_rq_lock, rq_guard)(p);
>> + WARN_ON_ONCE(task_cpu(p) != cpu);
>> +
>> + if (task_has_throttle_work(p) && !task_needs_throttling(p))
>> + task_throttle_do_cancel_work(p);
>> +}
>
> I think you can wind up triggering this WARN and in general have some
> weird state, whether or not it matters, basically any time that you
> __task_throttle_cancel(p, a_remote_cpu).
>
> It queues an irq_work and sends an IPI, but doesn't wait for it. If
> handling that IPI is delayed, then we can wind up doing more
> migration/class switch/etc (on this cpu or some third cpu) before that
> irq_work runs.
>
> I think this may be ok and it's just the WARN that's wrong, but I'm not
> sure.
>
That whole thing is indeed nasty, /me goes back through my notes
Right, so you can have:

  task_cpu(p) == CPU0

  CPU0                                   CPU1

  rq_lock();
                                         rq_lock();
  switched_from_fair()
    task_throttle_cancel()
  ...
  rq_unlock();
                                         pull_rt_task()
                                           set_task_cpu(p, CPU1);
                                         rq_unlock();

  task_throttle_cancel_irq_work_fn()
    task_throttle_do_cancel()
      WARN_ON_ONCE(CPU1 != CPU0);
That does mean the task p could start executing on CPU1 before the
task_work is cleared, which is something I want to avoid - while very
unlikely, the worst case is it reaches exit_to_user_mode() before the
task_work gets cleared, which is a no-no.
I could add a sched_class check in throttle_cfs_rq_work(), but this is making
me wonder if irq_work is the right mechanism here.
The one important thing for me is: if a task doesn't need throttling
anymore, then the dequeue_task_fair() in exit_to_user_mode() mustn't happen.
One option I'm seeing is to forget about the irq_work, always re-check the
sched_class in throttle_cfs_rq_work() (we already check cfs_rq->throttle_count,
aka throttled_hierarchy()), and just do a kick_process() to ensure a kernel entry.
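To make that a bit more concrete, a rough sketch of what I have in mind - the
sched_class recheck is the new bit, the rest loosely follows the RFC's
throttle_cfs_rq_work() (the sched_throttle_work field name is an approximation,
don't take it literally):

static void throttle_cfs_rq_work(struct callback_head *work)
{
        struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
        struct cfs_rq *cfs_rq;

        CLASS(task_rq_lock, rq_guard)(p);
        cfs_rq = cfs_rq_of(&p->se);

        /*
         * The task may have left CFS (or been unthrottled) between the
         * task_work being queued and us getting here; in that case the
         * dequeue must not happen and there's nothing left to do.
         */
        if (p->sched_class != &fair_sched_class)
                return;
        if (!throttled_hierarchy(cfs_rq))
                return;

        /* ... dequeue + stick p on its cfs_rq's throttled_limbo_list, as in the patch ... */
}

The places that currently queue the cancel irq_work would then just do a
kick_process(p) (if anything at all) and let the recheck above do the right
thing.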
>> + /*
>> + * Re-enqueue the tasks that have been throttled at this level.
>> + *
>> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
>> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>> + * might happen when a cfs_rq still has some tasks enqueued, either still
>> + * making their way to userspace, or freshly migrated to it.
>> + */
>> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> + struct sched_entity *pse = &p->se;
>> +
>> + list_del_init(&p->throttle_node);
>> +
>> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>> + task_delta++;
>> + idle_task_delta += task_has_idle_policy(p);
>> + }
>
> You know, on our too-large machines with too-many cgroups, we're already
> running into these walk_tg_trees being worryingly slow for holding a spinlock :P
>
Yeah, for N throttled cgroups in a hierarchy, this is pretty much squashing
N rq-lock-held segments into one. Not great.
What I'm thinking is we could have N walks, each one re-queueing the tasks of a
single cfs_rq and updating the hierarchy, releasing the rq-lock between walks -
instead of a single walk re-queueing up to N lists of tasks. More operations,
but smaller rq-lock-held segments.
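In very rough terms (both helpers below are made up, this is just to show the
shape):

        struct rq_flags rf;
        struct cfs_rq *pos;

        do {
                rq_lock_irqsave(rq, &rf);
                /* Pick one still-throttled descendant of the cfs_rq being unthrottled */
                pos = next_throttled_descendant(cfs_rq);
                if (pos) {
                        /* Requeue its limbo tasks, fix up its & its parents' *h_nr_running */
                        requeue_throttled_tasks(pos);
                }
                rq_unlock_irqrestore(rq, &rf);
        } while (pos);

The downside being that the intermediate state has to stay consistent with the
rq lock dropped between iterations (tasks migrating in, further throttles...),
so it's not free.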
>> +
>> + /*
>> + * Account tasks woken up in children; by this point all direct children
>> + * have been visited.
>> + */
>> + task_delta += cfs_rq->unthrottled_h_nr_running;
>> + idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
>> +
>> + cfs_rq->h_nr_running += task_delta;
>> + cfs_rq->idle_h_nr_running += idle_task_delta;
>> +
>> + /*
>> + * unthrottle_cfs_rq() needs a value to up-propagate above the
>> + * freshly unthrottled cfs_rq.
>> + */
>> + cfs_rq->unthrottled_h_nr_running = task_delta;
>> + cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
>
> I think this should have no effect, right?
Hm so my thoughts here are:
The walk_tg_tree_from(tg_unthrottle_up) will update *nr_running counts up
to the cfs_rq->tg->se[cpu_of(rq)]. However if that cfs_rq isn't the root
one, we'll need the for_each_sched_entity() loop further down
unthrottle_cfs_rq() to update the upper part of the hierarchy. The values
that will be up-propagated there are the ones being saved here.
>
>> +
>> + /* Accumulate the delta in the parent's stash. Once all its children
>> + * (i.e. all of this cfs_rq's siblings) have been visited, this value
>> + * will be stable and used for its own count update.
>> + */
>> + pcfs_rq->unthrottled_h_nr_running += task_delta;
>> + pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
>> +
>> + /*
>> + * If the cfs_rq became empty during throttling, then we dequeued
>> + * it. It needs to be put back in the hierarchy if it or any of
>> + * its children have now-unthrottled tasks.
>> + */
>> + if (!se->on_rq && (cfs_rq->h_nr_running || cfs_rq->idle_h_nr_running)) {
>> + enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
>> + } else {
>> + update_load_avg(pcfs_rq, se, UPDATE_TG);
>> + se_update_runnable(se);
>> }
>
> So I think this is trying to combine all the updates, and it's logically
> just the same as if the loop was enqueue_task_fair, like you mentioned
> in a followup for the throttle_one_task and dequeue_task_fair?
>
Good point, that should be a mirror of throttle_one_task() wrt the enqueueing.
>> void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> @@ -5922,25 +6152,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> goto unthrottle_throttle;
>> }
>>
>> - task_delta = cfs_rq->h_nr_running;
>> - idle_task_delta = cfs_rq->idle_h_nr_running;
>> - for_each_sched_entity(se) {
>> - struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> -
>> - if (se->on_rq)
>> - break;
>> - enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
>> -
>> - if (cfs_rq_is_idle(group_cfs_rq(se)))
>> - idle_task_delta = cfs_rq->h_nr_running;
>> + if (cfs_rq->throttle_count)
>> + return;
>>
>> - qcfs_rq->h_nr_running += task_delta;
>> - qcfs_rq->idle_h_nr_running += idle_task_delta;
>> + /*
>> + * cfs_rq's below us may not have been fully emptied out, so we can't rely
>> + * directly on ->h_nr_running. Use the stash instead.
>> + */
>> + task_delta = cfs_rq->unthrottled_h_nr_running;
>> + idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
>>
>> - /* end evaluation on encountering a throttled cfs_rq */
>> - if (cfs_rq_throttled(qcfs_rq))
>> - goto unthrottle_throttle;
>> - }
>> + walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
>>
>> for_each_sched_entity(se) {
>> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> I think this would be simpler by making the first/original
> walk_tg_tree_from be
>
> walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);
>
> (With the clear function being the same, just renamed)
>
> We never have any unthrottled* saved between calls to unthrottle_cfs_rq,
> because if throttled_count is set for us, it's set for all of our
> descendants too, so we're doing nothing but throttle_count stuff in that
> case. Sadly that doesn't let us remove the cfs_rq fields, that would
> need a walk_tg_tree_by_level.
Good point, I think that makes sense. Thanks!
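IIUC that boils down to keeping a single walk at unthrottle:

        walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);

with the clear function renamed into the down callback, and dropping the later

        walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);

walk entirely.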