Date: Mon, 18 Dec 2023 19:07:12 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
 Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Mel
 Gorman <mgorman@...e.de>, Daniel Bristot de Oliveira <bristot@...hat.com>,
 Phil Auld <pauld@...hat.com>, Clark Williams <williams@...hat.com>, Tomas
 Glozar <tglozar@...hat.com>
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return
 to userspace

On 07/12/23 15:47, Benjamin Segall wrote:
> Alright, this took longer than I intended, but here's the basic version
> with the duplicate "runqueue" just being a list rather than actual EEVDF
> or anything sensible.
>
> I also wasn't actually able to adapt it to work via task_work like I
> thought I could; flagging the entry to the relevant schedule() calls is
> too critical.
>

Thanks for sharing this! I have a first set of comments below.

> --------
>
> Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace
>
> The basic idea of this implementation is to maintain duplicate runqueues
> in each cfs_rq that contain duplicate pointers to sched_entities which
> should bypass throttling. Then we can skip throttling cfs_rqs that have
> any such children, and when we pick inside any not-actually-throttled
> cfs_rq, we only look at this duplicated list.
>
> "Which tasks should bypass throttling" here is "all schedule() calls
> that don't set a special flag", but could instead involve the lockdep
> markers (except for the problem of percpu-rwsem and similar) or explicit
> flags around syscalls and faults, or something else.
>
> This approach avoids any O(tasks) loops, but leaves partially-throttled
> cfs_rqs still contributing their full h_nr_running to their parents,
> which might result in worse balancing. Also it adds more (generally
> still small) overhead to the common enqueue/dequeue/pick paths.
>
> The very basic debug test added is to run a cpusoaker and "cat
> /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
> cgroup with a quota < 1 cpu.
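
(Paraphrasing the bookkeeping for my own benefit - the new field declarations
aren't quoted here, so this is just the rough shape, going by the hunks below:)

struct cfs_rq {
        /* existing fields elided */

        /* entities with at least one kernel-mode task in their subtree */
        struct list_head        kernel_children;
        /* number of kernel-mode tasks in this hierarchy */
        unsigned int            h_kernel_running;
};

struct sched_entity {
        /* existing fields elided */

        /* linked into the parent cfs_rq's ->kernel_children while in the kernel */
        struct list_head        kernel_node;
};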

So if I'm trying to compare notes:

The dual tree:
- Can actually throttle cfs_rq's once all tasks are in userspace
- Adds (light-ish) overhead to enqueue/dequeue
- Doesn't keep *nr_running updated when not fully throttled

Whereas dequeue via task_work:
- Can never fully throttle a cfs_rq
- Adds (not so light) overhead to unthrottle
- Keeps *nr_running updated

My thoughts right now are that there might be a way to mix both worlds: go
with the dual tree, but subtract from h_nr_running at dequeue_kernel()
(sort of doing the count update in throttle_cfs_rq() early) and keep track
of in-user tasks via handle_kernel_task_prev() to add this back when
unthrottling a not-fully-throttled cfs_rq. I need to actually think about
it though :-)
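
Roughly, as a completely untested sketch - h_user_hidden is a field I'm making
up for illustration, and the hierarchy walk plus idle_h_nr_running are
handwaved away:

/*
 * Hypothetical helpers, not part of the patch: when a task heads back to
 * userspace inside an over-quota cfs_rq, do the h_nr_running accounting that
 * throttle_cfs_rq() would normally do, and remember how much was removed so a
 * later unthrottle of a not-fully-throttled cfs_rq can add it back.
 */
static void hide_user_task(struct cfs_rq *cfs_rq)
{
        /* would be called from dequeue_kernel() once quota is exhausted */
        cfs_rq->h_nr_running--;
        cfs_rq->h_user_hidden++;        /* made-up field */
}

static void unhide_user_tasks(struct cfs_rq *cfs_rq)
{
        /* would be called from unthrottle_cfs_rq() on a partial throttle */
        cfs_rq->h_nr_running += cfs_rq->h_user_hidden;
        cfs_rq->h_user_hidden = 0;
}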

> @@ -154,11 +154,11 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>       while (ti_work & EXIT_TO_USER_MODE_WORK) {
>
>               local_irq_enable_exit_to_user(ti_work);
>
>               if (ti_work & _TIF_NEED_RESCHED)
> -			schedule();
> +			schedule_usermode(); /* TODO: also all of the arch/* loops that don't use this yet */
                                                                          ^^
                                      Interestingly, -Werror=comment trips on this - the "/*" in
                                      "arch/*" reads as a nested comment opener.
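                                      Dropping the glob avoids it, e.g.:

                                          schedule_usermode(); /* TODO: also all of the arch/ entry loops that don't use this yet */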

>
>               if (ti_work & _TIF_UPROBE)
>                       uprobe_notify_resume(regs);
>
>               if (ti_work & _TIF_PATCH_PENDING)
> @@ -5416,18 +5426,38 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>       se->prev_sum_exec_runtime = se->sum_exec_runtime;
>  }
>
>  /*
>   * Pick the next process, keeping these things in mind, in this order:
> - * 1) keep things fair between processes/task groups
> - * 2) pick the "next" process, since someone really wants that to run
> - * 3) pick the "last" process, for cache locality
> - * 4) do not run the "skip" process, if something else is available
> + * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
> + * 2) keep things fair between processes/task groups
> + * 3) pick the "next" process, since someone really wants that to run
> + * 4) pick the "last" process, for cache locality
> + * 5) do not run the "skip" process, if something else is available
>   */
>  static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq)
> +pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
>  {
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	/*
> +	 * TODO: This might trigger, I'm not sure/don't remember. Regardless,
> +	 * while we do not explicitly handle the case where h_kernel_running
> +	 * goes to 0, we will call account/check_cfs_rq_runtime at worst in
> +	 * entity_tick and notice that we can now properly do the full
> +	 * throttle_cfs_rq.
> +	 */
> +	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));

FWIW this triggers pretty much immediately in my environment (buildroot
over QEMU).
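
If an empty list is expected whenever we're not actually over quota, maybe the
warning just wants to be gated on @throttled, i.e. something like:

        WARN_ON_ONCE(throttled && list_empty(&cfs_rq->kernel_children));

(purely a guess though, I haven't checked whether that's the only way it fires)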

> @@ -5712,16 +5748,26 @@ static int tg_throttle_down(struct task_group *tg, void *data)
>       cfs_rq->throttle_count++;
>
>       return 0;
>  }
>
> +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +
>  static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>       struct rq *rq = rq_of(cfs_rq);
>       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>       struct sched_entity *se;
> -	long task_delta, idle_task_delta, dequeue = 1;
> +	long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
> +
> +	/*
> +	 * We don't actually throttle, though account() will have made sure to
> +	 * resched us so that we pick into a kernel task.
> +	 */
> +	if (cfs_rq->h_kernel_running)
> +		return false;
>

So as long as a cfs_rq has at least one task executing in the kernel, we
keep its *nr_running intact. We do throttle the whole thing once all the
tasks are in userspace (or about to be anyway), and unthrottle_on_enqueue()
undoes that.

> @@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>
>  preempt:
>       resched_curr(rq);
>  }
>
> +static void handle_kernel_task_prev(struct task_struct *prev)
> +{
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct sched_entity *se = &prev->se;
> +	bool new_kernel = is_kernel_task(prev);
> +	bool prev_kernel = !list_empty(&se->kernel_node);
> +	/*
> +	 * These extra loops are bad and against the whole point of the merged
> +	 * PNT, but it's a pain to merge, particularly since we want it to occur
> +	 * before check_cfs_runtime.
> +	 */
> +	if (prev_kernel && !new_kernel) {
> +		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
> +		for_each_sched_entity(se) {
> +			dequeue_kernel(cfs_rq_of(se), se, 1);
> +			if (cfs_rq_throttled(cfs_rq_of(se)))
> +				break;
> +		}
> +	} else if (!prev_kernel && new_kernel && se->on_rq) {
> +		for_each_sched_entity(se) {
> +			enqueue_kernel(cfs_rq_of(se), se, 1);
> +			if (cfs_rq_throttled(cfs_rq_of(se)))
> +				break;
> +		}
> +	}
> +#endif
> +}


I'm trying to grok what the implications of a second tasks_timeline would
be on picking: we'd want an RB-tree update equivalent to put_prev_entity()'s
__enqueue_entity(), but that second tree doesn't follow the same
enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
__enqueue_entity() to force a rebalance make sense? That'd also take care
of updating the min_vruntime.
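
i.e. something along these lines, with entirely made-up helper names for a
hypothetical kernel-only tasks_timeline:

/*
 * Hypothetical counterpart of put_prev_entity()'s __enqueue_entity() for a
 * second, kernel-only tasks_timeline: re-insert prev so its position in that
 * tree reflects its updated vruntime.
 */
static void requeue_kernel_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        __dequeue_kernel_entity(cfs_rq, se);    /* drop the stale position */
        __enqueue_kernel_entity(cfs_rq, se);    /* re-insert at current vruntime */
}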

