Message-ID: <Zq0p154ndOtU9Ypu@slm.duckdns.org>
Date: Fri, 2 Aug 2024 08:47:51 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, David Vernet <void@...ifault.com>,
	Ingo Molnar <mingo@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [GIT PULL] sched_ext: Initial pull request for v6.11

Hello,

On Fri, Aug 02, 2024 at 02:20:34PM +0200, Peter Zijlstra wrote:
> 
> A few more..
> 
> > +static bool scx_switching_all;
> > +DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
> 
> > +	WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
> > +	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
> > +		static_branch_enable(&__scx_switched_all);
> 
> > +	static_branch_disable(&__scx_switched_all);
> > +	WRITE_ONCE(scx_switching_all, false);
> 
> FYI the static_key contains a variable you can read if you want, see
> static_key_count()/static_key_enabled(). No need to mirror the state.

They go off together but they don't go on together in scx_ops_enable().
scx_switching_all goes on first, which makes new tasks go on SCX, then all
the existing tasks are switched over, and only then does __scx_switched_all
go on. I could make scx_switching_all a static_key too, but the only time
it's read is during fork and while switching scheduler classes, so it
doesn't really seem worth it.
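
To illustrate the enable-side ordering (heavily abbreviated, not the actual
scx_ops_enable() code; the task-switching walk in the middle is elided):

	/* new forks start landing on SCX from this point on */
	WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));

	/* ... walk all existing tasks and switch them over to SCX ... */

	/* only now is "every task is on SCX" actually true */
	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
		static_branch_enable(&__scx_switched_all);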

> > +static struct task_struct *
> > +scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead)
> > +{
> > +	struct task_struct *p;
> > +retry:
> > +	scx_task_iter_rq_unlock(iter);
> > +
> > +	while ((p = scx_task_iter_next(iter))) {
> > +		/*
> > +		 * is_idle_task() tests %PF_IDLE which may not be set for CPUs
> > +		 * which haven't yet been onlined. Test sched_class directly.
> > +		 */
> > +		if (p->sched_class != &idle_sched_class)
> > +			break;
> 
> This isn't quite the same; please look at play_idle_precise() in
> drivers/powercap/idle_inject.c.
> 
> That is, there are PF_IDLE tasks that are not idle_sched_class.

Thanks for pointing that out. The scx_task_iter iterators are used to move
tasks into and out of SCX when the BPF scheduler is attached and detached,
respectively. The requirements are:

1. __setscheduler_prio() should be able to determine the sched_class for the
   task.

2. scx_ops_init_task() must be called on all tasks which can potentially
   switch to SCX. This invokes ops.init_task() if implemented.

The swappers are problematic for both 1) and 2). __setscheduler_prio() won't
preserve the idle_sched_class, and the swappers can easily confuse BPF
schedulers as they aren't normal tasks (e.g. they all share PID 0). Besides,
they are never going to be scheduled by SCX anyway.

Looking at idle_inject, those injection threads seem to be regular RT
threads which temporarily set PF_IDLE. AFAICS, this should be completely
fine. The iterator just really needs to avoid the swappers.

I'll add more comments.
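
Something along these lines (just a sketch of the comment, not the final
wording):

	/*
	 * is_idle_task() tests %PF_IDLE which may not be set for CPUs that
	 * haven't been onlined yet and which can also be set temporarily by
	 * idle-injection threads (see play_idle_precise() in
	 * drivers/powercap/idle_inject.c); those are regular RT tasks and
	 * must not be skipped here. Testing sched_class directly filters out
	 * exactly the per-CPU swappers, which can never run on SCX.
	 */
	if (p->sched_class != &idle_sched_class)
		break;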

...
> > +static void update_curr_scx(struct rq *rq)
> > +{
> > +	struct task_struct *curr = rq->curr;
> > +	u64 now = rq_clock_task(rq);
> > +	u64 delta_exec;
> > +
> > +	if (time_before_eq64(now, curr->se.exec_start))
> > +		return;
> > +
> > +	delta_exec = now - curr->se.exec_start;
> > +	curr->se.exec_start = now;
> > +	curr->se.sum_exec_runtime += delta_exec;
> > +	account_group_exec_runtime(curr, delta_exec);
> > +	cgroup_account_cputime(curr, delta_exec);
> 
> Could you please use update_curr_common() here?
> 
> This helps keep the accounting in one place. For instance, see this
> patch:
> 
>   https://lkml.kernel.org/r/20240727105031.053611186@infradead.org  
> 
> That adds a sum_exec_runtime variant that is scaled by DVFS and
> capacity.
> 
> You should be able to make the function:
> 
> 	u64 delta_exec = update_curr_common(rq);

Will do.
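
FWIW, I'd expect update_curr_scx() to collapse into something like the
following (a rough sketch, assuming update_curr_common() returns the
accounted delta and zero/negative when there's nothing to account):

	static void update_curr_scx(struct rq *rq)
	{
		s64 delta_exec;

		/* exec_start, sum_exec_runtime and the group/cgroup runtime
		 * accounting are all handled in one place */
		delta_exec = update_curr_common(rq);
		if (unlikely(delta_exec <= 0))
			return;

		/* SCX-specific accounting based on delta_exec goes here */
	}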

...
> > +	/*
> > +	 * We want to pass scx-specific enq_flags but activate_task() will
> > +	 * truncate the upper 32 bit. As we own @rq, we can pass them through
> > +	 * @rq->scx.extra_enq_flags instead.
> > +	 */
> > +	WARN_ON_ONCE(rq->scx.extra_enq_flags);
> > +	rq->scx.extra_enq_flags = enq_flags;
> 
> eeew.. it's not just activate_task(), its the whole callchain having
> 'int' flags. That said, we should be having plenty free bits there no?

Yeah, this isn't pretty but here are the rationales:

- These extra enqueue flags are only needed by SCX. For the other
  sched_classes, the enqueue flags exist to communicate between the sched
  core and the sched_classes. SCX additionally needs to communicate from the
  SCX core to the BPF scheduler, which is what creates the need for the
  extra flags.

- While possible, it's pretty cumbersome to support backward compatibility
  when flag values change. Unless done proactively, it can silently and
  subtly break forward compatibility.

- As SCX flags aren't interesting to anyone else, it makes sense to put them
  in the upper 32 bits, roughly as sketched below. This way, they don't get
  in the way and the chance of needing to shift them around is lower.
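
Schematically, something like this (names and values for illustration only):

	enum scx_enq_flags {
		/* the low 32 bits mirror the core ENQUEUE_* flags */
		SCX_ENQ_WAKEUP		= ENQUEUE_WAKEUP,
		SCX_ENQ_HEAD		= ENQUEUE_HEAD,

		/* the high 32 bits are SCX-specific and never reach core */
		SCX_ENQ_PREEMPT		= 1LLU << 32,
		/* ... */
	};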

...
> > +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
> > +				struct task_struct *p, struct rq *task_rq)
> > +{
> > +	bool moved = false;
> > +
> > +	lockdep_assert_held(&dsq->lock);	/* released on return */
> > +
> > +	/*
> > +	 * @dsq is locked and @p is on a remote rq. @p is currently protected by
> > +	 * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> > +	 * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> > +	 * rq lock or fail, do a little dancing from our side. See
> > +	 * move_task_to_local_dsq().
> > +	 */
> > +	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> > +	task_unlink_from_dsq(p, dsq);
> > +	dsq_mod_nr(dsq, -1);
> > +	p->scx.holding_cpu = raw_smp_processor_id();
> > +	raw_spin_unlock(&dsq->lock);
> > +
> > +	double_lock_balance(rq, task_rq);
> > +
> > +	moved = move_task_to_local_dsq(rq, p, 0);
> > +
> > +	double_unlock_balance(rq, task_rq);
> > +
> > +	return moved;
> > +}
> 
> I've gotta ask, why are you using the double_lock_balance() pattern
> instead of the one in move_queued_task() that does:
> 
>   lock src
>   dequeue src, task
>   set_task_cpu task, dst
>   unlock src
> 
>   lock dst
>   enqueue dst, task
>   unlock dst

When !CONFIG_PREEMPTION, double_lock_balance() seems cheaper than unlocking
and locking unconditionally. Because SCX schedulers can do a lot more hot
migrations, I thought it'd be better to go that way. I haven't actually
measured anything though, so I could be wrong.
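
For reference, the !CONFIG_PREEMPTION variant of _double_lock_balance() only
pays the unlock+relock when the trylock fails and the lock ordering doesn't
allow nesting, roughly (paraphrasing kernel/sched/sched.h, details elided):

	if (raw_spin_rq_trylock(busiest))
		return 0;			/* kept this_rq locked */
	if (rq_order_less(this_rq, busiest)) {
		raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
		return 0;			/* still kept this_rq locked */
	}
	raw_spin_rq_unlock(this_rq);		/* only now drop and retake */
	double_rq_lock(this_rq, busiest);
	return 1;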

> > +/*
> > + * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p
> > + * can be pulled to @rq.
> > + */
> > +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq)
> > +{
> > +	int cpu = cpu_of(rq);
> > +
> > +	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > +		return false;
> > +	if (unlikely(is_migration_disabled(p)))
> > +		return false;
> > +	if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p)))
> > +		return false;
> > +	if (!scx_rq_online(rq))
> > +		return false;
> > +	return true;
> > +}
> 
> I'm a little confused, is_cpu_allowed() is used for that same purpose
> no?

Yeah, this is a bit subtle. There are two differences:

- IIUC, during migration, is_cpu_allowed() has to say "yes" to the
  destination CPU as that may be the CPU that the task is being migrated to.
  However, task_can_run_on_remote_rq() is asking something different - it's
  asking "can I initiate migrating the task to this rq?", which it shouldn't
  do while migration is disabled.

- The BPF scheduler is bypassed while the rq is offline. This doesn't hinder
  other migrations. It just makes the CPU state testing a non-factor for
  "can the BPF scheduler pull this task to this rq?".

I suppose the function can be rewritten to be something like:

	if (!is_cpu_allowed(p, cpu))
		return false;
	if (unlikely(is_migration_disabled(p)))
		return false;
	if (!scx_rq_online(rq))
		return false;
	return true;

but I'm not sure it's necessarily better. I'll add more comments explaining
what's going on.

Thanks.

-- 
tejun
