linux-kernel - Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <xhsmhfs0maw49.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date:   Fri, 01 Dec 2023 14:30:14 +0100
From:   Valentin Schneider <vschneid@...hat.com>
To:     Benjamin Segall <bsegall@...gle.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Phil Auld <pauld@...hat.com>,
        Clark Williams <williams@...hat.com>,
        Tomas Glozar <tglozar@...hat.com>
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return
 to userspace

On 30/11/23 13:26, Benjamin Segall wrote:
> Valentin Schneider <vschneid@...hat.com> writes:
>
>> As reported in [1], CFS bandwidth throttling is a source of headaches in
>> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
>> prevent ksoftirqd from running, which prevents replenishing & unthrottling
>> the cfs_rq of said CFS task.
>>
>> Peter mentioned that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we wait
>> for the task to be about to return to userspace.
>>
>> I'm not aware of the arguments in favour of this for !PREEMPT_RT, but given [1]
>> I had a look into it. Using this approach means no task can be throttled while
>> holding a kernel lock, which solves the locking dependency issue.
>
> The alternative we've been experimenting with (and still running into
> other issues that have made it hard to tell if they work) is to still
> leave the tasks on their cfs_rqs, but instead have two task_timelines or
> similar per cfs_rq, one of which only has unthrottleable tasks (or
> partially-throttled child cgroups) on it. Then when picking into a
> partially-unthrottled cfs_rq you only look at that alternate task_timeline.
>

IIUC then you don't dequeue the cfs_rq's se, but instead rely on the RB
tree switch to only have unthrottable tasks running.

> This means that we get to skip this per-actually-throttled-task loop:
>
>> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>>  {
>>      struct rq *rq = data;
>>      struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>> +	struct sched_entity *se = tg->se[cpu_of(rq)];
>> +	struct cfs_rq *pcfs_rq = cfs_rq_of(se);
>> +	long task_delta = 0, idle_task_delta = 0;
>> +	struct task_struct *p, *tmp;
>>
>> +	/*
>> +	 * Re-enqueue the tasks that have been throttled at this level.
>> +	 *
>> +	 * The task count is up-propagated via ->unthrottled_*h_nr_running,
>> +	 * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>> +	 * might happen when a cfs_rq still has some tasks enqueued, either still
>> +	 * making their way to userspace, or freshly migrated to it.
>> +	 */
>> +	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> +		struct sched_entity *pse = &p->se;
>> +
>> +		list_del_init(&p->throttle_node);
>> +		enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>> +		task_delta++;
>> +		idle_task_delta += task_has_idle_policy(p);
>> +	}
>
> The downsides are that you instead have extra operations per
> enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
> and that it doesn't do *any* accounting change for a partially dequeued
> cfs_rq.
>

I would assume we want to keep the *nr_running as close to reality as
possible, given their impact on pick_next_task_fair() & load_balance().

> I'm going to try putting together a cleaner variant of our version that
> works via task_work instead of bracketing every relevant entry point.
> (That design came from when we were trying instead to only do it for
> tasks holding actual locks)

Interesting, thank you for sharing! I assume then the motivation for this
is to reduce latencies caused by throttling lock holders?