Message-ID: <ZqAFtfSijJ-KMVHo@slm.duckdns.org>
Date: Tue, 23 Jul 2024 09:34:13 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, David Vernet <void@...ifault.com>,
	Ingo Molnar <mingo@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [GIT PULL] sched_ext: Initial pull request for v6.11

Hello, Peter.

On Tue, Jul 23, 2024 at 06:33:58PM +0200, Peter Zijlstra wrote:
> >  	/*
> > -	 * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
> > -	 * if there's more than one we need the tick for involuntary
> > -	 * preemption.
> > +	 * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
> > +	 * left. For CFS, if there's more than one we need the tick for
> > +	 * involuntary preemption. For SCX, ask.
> >  	 */
> > -	if (rq->nr_running > 1)
> > +	if (!scx_switched_all() && rq->nr_running > 1)
> > +		return false;
> > +
> > +	if (scx_enabled() && !scx_can_stop_tick(rq))
> >  		return false;
> >  
> 
> Doesn't that boil down to something like:
> 
> 	if (scx_switched_all())
> 		return scx_can_stop_tick(rq);
>
> 	if (rq->nr_running > 1)
> 		return false;

Yeah, given that the rq->nr_running == 1 && partially-enabled case is
unlikely to matter and can easily lead to misdetermination by the BPF
scheduler, the suggested code is better and cleaner. I'll update.

> > @@ -5773,7 +5827,19 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> >  				  struct rq_flags *rf)
> >  {
> >  #ifdef CONFIG_SMP
> > +	const struct sched_class *start_class = prev->sched_class;
> >  	const struct sched_class *class;
> > +
> > +#ifdef CONFIG_SCHED_CLASS_EXT
> > +	/*
> > +	 * SCX requires a balance() call before every pick_next_task() including
> > +	 * when waking up from SCHED_IDLE. If @start_class is below SCX, start
> > +	 * from SCX instead.
> > +	 */
> > +	if (sched_class_above(&ext_sched_class, start_class))
> > +		start_class = &ext_sched_class;
> 
> 	if (scx_enabled() && ...)
> 
> ?

Will add.

> > @@ -5782,7 +5848,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> >  	 * We can terminate the balance pass as soon as we know there is
> >  	 * a runnable task of @class priority or higher.
> >  	 */
> > -	for_class_range(class, prev->sched_class, &idle_sched_class) {
> > +	for_active_class_range(class, start_class, &idle_sched_class) {
> >  		if (class->balance(rq, prev, rf))
> >  			break;
> 
> Don't you need fixing balance_fair() here? It has:
> 
>   if (rq->nr_running)
>     return 1;
> 
> Which would return true and terminate the iteration even if there are
> only scx tasks left.

Right, that will lead to issues in partial mode. I think it hasn't been
noticed because tasks were being consumed through otherwise empty CPUs. Will
fix.
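
Something like the below is what I have in mind (untested, and going from
memory of what fair.c currently looks like, so the exact counter / helper
may differ):

static int
balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	/*
	 * Only report tasks that CFS itself can pick. rq->nr_running also
	 * counts SCX tasks and would terminate the active-class iteration
	 * even when only SCX tasks are left.
	 */
	if (rq->cfs.h_nr_running)
		return 1;

	return sched_balance_newidle(rq, rf) != 0;
}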

...
> > +	for_each_active_class(class) {
> >  		p = class->pick_next_task(rq);
> > -		if (p)
> > +		if (p) {
> > +			const struct sched_class *prev_class = prev->sched_class;
> > +
> > +			if (class != prev_class && prev_class->switch_class)
> > +				prev_class->switch_class(rq, p);
> >  			return p;
> > +		}
> >  	}
> 
> So I really don't like this one.. at the very least it would need a comment
> explaining why it only needs calling here and not every time a put_prev_task()
> and set_next_task() pair cross a class boundary -- which would be the
> consistent thing to do.
> 
> Now, IIRC, you need a class call that indicates you're about to loose the CPU
> so that you can kick the task to another CPU or somesuch. And last time I got
> it backwards and suggested adding an argument to pick_next_task(), but what
> about put_prev_task()?
>
> Can't we universally push put_prev_task() after the pick loop? Then we get
> something like:

Yeah, the problem with put_prev_task() was that it ran before the next
task was picked, so we couldn't know what the next class would be.

> 	next = pick_task();
> 	if (next != prev) {
> 		put_prev_task(rq, prev, next->class != prev->class);
> 		set_next_task(rq, next);
> 	}
> 
> I have patches for most of this for fair (in my eevdf queue), and I
> think the others are doable, after all, this is more or less what we do
> for SCHED_CORE already.
> 
>   /me went off hacking for a bit
> 
> I've done this; find the results at: queue.git sched/prep
> 
> I've also rebased the sched_ext tree on top of that with the below delta, which
> you can find at: queue.git sched/scx

Hmm... took a brief look, but I think I'm missing something. Doesn't
put_prev_task() need to take place before pick_task() so that the previously
running task can be considered when pick_task() is picking the next task to
run?
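
To illustrate with fair (simplified from my reading of fair.c, so correct
me if I'm misreading it): the running entity isn't kept in the rbtree and
it's the put path that re-inserts it, so a pick that runs first can't
consider prev at all.

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	if (prev->on_rq)
		update_curr(cfs_rq);

	if (prev->on_rq) {
		/* put 'current' back into the tree so a pick can see it */
		__enqueue_entity(cfs_rq, prev);
		update_load_avg(cfs_rq, prev, 0);
	}
	cfs_rq->curr = NULL;
}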

> This led me to discover that you're passing the task of the other class into
> the bpf stuff -- I don't think that is a sane thing to do. You get preempted,
> it doesn't matter from which higher class or by which specific task, a policy
> must not care about that. So I kinda bodged it, but I really think this should
> be taken out.

At least for visibility, I think it makes sense. One can attach extra BPF
progs to track the preemption events, but it can be useful for the
scheduler itself to be able to collect when and why it's getting
preempted. Also, in more controlled environments, which RT task is
preempting the task can be a useful, if not always reliable, hint as to
whether it'd be better to bounce to another CPU right away or not. That
said, we can drop it, e.g. if it makes the implementation unnecessarily
complicated.
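
For example, a BPF scheduler's cpu_release handler can do something like
the following (purely illustrative, names made up, using the current
ops.cpu_release() interface):

void BPF_STRUCT_OPS(example_cpu_release, s32 cpu,
		    struct scx_cpu_release_args *args)
{
	/*
	 * args->task is the higher-class task taking over the CPU. The
	 * scheduler can record when and why it got preempted, or use the
	 * identity of the preempting task as a (not always reliable) hint
	 * on whether to bounce its queued work to another CPU right away.
	 */
	bpf_printk("cpu%d preempted (reason=%d) by %s",
		   cpu, args->reason, args->task->comm);

	/* make tasks queued on this CPU's local DSQ dispatchable again */
	scx_bpf_reenqueue_local();
}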

> I also found you have some terrible !SMP hack in there, which I've
> broken, I've disabled it for now. This needs a proper fix, and not
> something ugly like you did.

Yeah, this came up before. On UP, either the core needs to call the
balance callback, as that's where the whole dispatch logic is invoked
(which makes sense as dispatching often involves balancing operations), or
SCX itself needs to call it directly in a matching sequence. Just enabling
the balance callback on UP && SCX would be the cleanest.
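
IOW, something like this in put_prev_task_balance() (illustrative only;
sched_class::balance() would need the matching treatment to exist on UP):

#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT)
	const struct sched_class *start_class = prev->sched_class;
	const struct sched_class *class;

	if (scx_enabled() && sched_class_above(&ext_sched_class, start_class))
		start_class = &ext_sched_class;

	for_active_class_range(class, start_class, &idle_sched_class) {
		if (class->balance(rq, prev, rf))
			break;
	}
#endif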

> Anyway, it seems to build, boot and pass the sched_ext selftest:
> 
> PASSED:  21
> SKIPPED: 0
> FAILED:  0
> 
> (albeit with a fair amount of console noise -- is that expected?)

Yeah, the selftests trigger a lot of error conditions so that's expected.

> Also, why does that thing hard depend on DEBUG_BTF? (at least having
> that option enabled no longer explodes build times like it used to)

That's necessary for building the schedulers, at least as far as I know.
We didn't have that dependency earlier and people were getting confused.

> Also should we /CONFIG_SCHED_CLASS_EXT/CONFIG_SCHED_BPF/ ? Then
> something like: grep BPF foo-build/.config
> more easily shows what's what.

I don't know. Isn't that too inconsistent with the other classes? And down
the line there may be other BPF-related additions to the scheduler.

Thanks.

-- 
tejun
