[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <xhsmhv88u9zsh.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Tue, 19 Dec 2023 12:50:38 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Mel
Gorman <mgorman@...e.de>, Daniel Bristot de Oliveira <bristot@...hat.com>,
Phil Auld <pauld@...hat.com>, Clark Williams <williams@...hat.com>, Tomas
Glozar <tglozar@...hat.com>
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return
to userspace
On 18/12/23 14:34, Benjamin Segall wrote:
> Valentin Schneider <vschneid@...hat.com> writes:
>>
>> So if I'm trying to compare notes:
>>
>> The dual tree:
>> - Can actually throttle cfs_rq's once all tasks are in userspace
>> - Adds (light-ish) overhead to enqueue/dequeue
>> - Doesn't keep *nr_running updated when not fully throttled
>>
>> Whereas dequeue via task_work:
>> - Can never fully throttle a cfs_rq
>> - Adds (not so light) overhead to unthrottle
>> - Keeps *nr_running updated
>
> Yeah, this sounds right. (Though for the task_work version once all the
> tasks are throttled, the cfs_rq is dequeued like normal, so it doesn't
> seem like a big deal to me)
>
In my implementation the cfs_rq's can be dequeued once all of their tasks
have been throttled, but the tasks aren't migrated back onto the cfs_rq's
until the unthrottle (rather than at the "final" throttle).
That is done because if a random task gets migrated to a throttled & empty
cfs_rq, we still need to schedule it 'till it exits the kernel, and we
don't want the now-in-userspace task to suddenly become eligible for
picking.
That adds these O(ntasks) loops in the unthrottle to re-shape the
cfs_rq's. The dual tree approach saves us from doing all that, IIUC
unthrottle_on_enqueue() only puts the se's in the task's direct hierarchy
back onto the second tree for them to be picked.
>>
>> My thoughts right now are that there might be a way to mix both worlds: go
>> with the dual tree, but subtract from h_nr_running at dequeue_kernel()
>> (sort of doing the count update in throttle_cfs_rq() early) and keep track
>> of in-user tasks via handle_kernel_task_prev() to add this back when
>> unthrottling a not-fully-throttled cfs_rq. I need to actually think about
>> it though :-)
>
> Probably, but I had tons of trouble getting the accounting correct as it
> was :P
>
Yeah so did I, cgroups are "fun".
>>
>> I'm trying to grok what the implications of a second tasks_timeline would
>> be on picking - we'd want a RBtree update equivalent to put_prev_entity()'s
>> __enqueue_entity(), but that second tree doesn't follow the same
>> enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
>> __enqueue_entity() to force a rebalance make sense? That'd also take care
>> of updating the min_vruntime.
>
> Yeah, the most sensible thing to do would probably be to just have curr
> not be in the queue, the same as the normal rbtree, and save the "on
> kernel list" as an on_rq-equivalent bool instead of as !list_empty(kernel_node).
Makes sense, I'll have a go at this.
Powered by blists - more mailing lists