linux-kernel - Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8734p4ymqj.ffs@tglx>
Date: Sun, 23 Jun 2024 12:31:00 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Tejun Heo <tj@...nel.org>, mingo@...hat.com, peterz@...radead.org,
 juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
 daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
 joshdon@...gle.com, brho@...gle.com, pjt@...gle.com, derkling@...gle.com,
 haoluo@...gle.com, dvernet@...a.com, dschatzberg@...a.com,
 dskarlat@...cmu.edu, riel@...riel.com, changwoo@...lia.com,
 himadrics@...ia.fr, memxor@...il.com, andrea.righi@...onical.com,
 joel@...lfernandes.org, linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
 kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Linus!

On Fri, Jun 21 2024 at 09:34, Linus Torvalds wrote:
> So if I don't see it, please point it out very very explicitly, and
> using small words to make me understand.

I really let the scheduler people bring up their pain points now. There
is no value in me being the mouthpiece when I want to achieve that folks
talk to each other (again). It's not a me against you thing. If they
fail to explain, so be it.

So I just use one trivial example to illustrate the approach and the
message it emits, i.e. the technical and the social problem which makes
all of this so unpleasant.

It struck my eyes a few days ago when I deciphered the fork maze:

struct sched_ext_entity {
       .....
	/* must be the last field, see init_scx_entity() */
	struct list_head	tasks_node;
};

void init_scx_entity(struct sched_ext_entity *scx)
{
	/*
	 * init_idle() calls this function again after fork sequence is
	 * complete. Don't touch ->tasks_node as it's already linked.
	 */
	memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node));
...

I immediately asked myself the obvious question: Why?

It took me less than 10 minutes to figure out that the double invocation of
__sched_fork() is bogus.

fork_idle()
   copy_process()
      sched_fork()
	__sched_fork()
   init_idle()
     __sched_fork()

There is only one other call site of init_idle():

     sched_init()
       init_idle()

to initialize the idle task of the boot CPU.

Another 10 minutes later I had the obvious patch for this booted and
validated. So overall that took me just 20 minutes and that's not because
I'm the deep scheduler expert, it's because I care. 

It's really not the job of the maintainer to point at that or figure it
out especially not with the knowledge that our maintainer resources are
anything else than abundant. Even if I can't figure it out on my own,
then pointing it out and asking would have been the right thing to do.

Instead of working around it and conveying the message "shrug, it's not in
the scope of my project, why should I care?", it would have:

   1) removed technical debt

   2) not added more technical debt

   3) followed what documentation asks people to do

   4) told the scheduler people "hey, we care about working with you"

It's a trivial detail, but it illustrates the tiny bits which contributed
to the overall rift.

Here I really pick RT, not for comparison, but for reference. RT was
certainly exposed to resistance, ignorance and outright denial. We sorted
it out by working with the people, by going the extra mile of mopping up
technical debt. Setting this as expectation is not asked too much.

As I said before: This is both a technical and a social problem.

As the social problem became prevalent over time the technical problem
solving got nowhere.

As a matter of fact both sides contributed to that, and that cannot be
resolved par ordre du mufti.

The result of that would be a even deeper rift which turns 'work it out in
tree' to 'work around each other in tree' or worse.

There is interest from both sides to get this sorted. At least that's what
was conveyed to me. I'll make sure to hear it from the source.

Thanks,

	tglx

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4344,7 +4344,8 @@ int wake_up_state(struct task_struct *p,
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
  *
- * __sched_fork() is basic setup used by init_idle() too:
+ * __sched_fork() is basic setup which is also used by sched_init() to
+ * initialize the boot CPU's idle task.
  */
 static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
@@ -7579,8 +7580,6 @@ void __init init_idle(struct task_struct
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	__sched_fork(0, idle);
-
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
 	raw_spin_rq_lock(rq);
 
@@ -7594,12 +7593,7 @@ void __init init_idle(struct task_struct
 	kthread_set_per_cpu(idle, cpu);
 
 #ifdef CONFIG_SMP
-	/*
-	 * It's possible that init_idle() gets called multiple times on a task,
-	 * in that case do_set_cpus_allowed() will not do the right thing.
-	 *
-	 * And since this is boot we can forgo the serialization.
-	 */
+	/* No validation and serialization required at boot time. */
 	set_cpus_allowed_common(idle, &ac);
 #endif
 	/*
@@ -8407,6 +8401,7 @@ void __init sched_init(void)
 	 * but because we are the idle thread, we just pick up running again
 	 * when this runqueue becomes "idle".
 	 */
+	__sched_fork(0, current);
 	init_idle(current, smp_processor_id());
 
 	calc_load_update = jiffies + LOAD_FREQ;