Date: Thu, 20 Jun 2024 20:47:23 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Tejun Heo <tj@...nel.org>, mingo@...hat.com, peterz@...radead.org,
 juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
 daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
 joshdon@...gle.com, brho@...gle.com, pjt@...gle.com, derkling@...gle.com,
 haoluo@...gle.com, dvernet@...a.com, dschatzberg@...a.com,
 dskarlat@...cmu.edu, riel@...riel.com, changwoo@...lia.com,
 himadrics@...ia.fr, memxor@...il.com, andrea.righi@...onical.com,
 joel@...lfernandes.org, linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
 kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Linus!

On Wed, Jun 19 2024 at 22:07, Linus Torvalds wrote:
> On Wed, 19 Jun 2024 at 19:35, Thomas Gleixner <tglx@...utronix.de> wrote:
>>
>> When I sat there in Richmond with the sched_ext people I gave them very
>> deep technical feedback especially on the way how they integrate it:
>>
>>   Sprinkle hooks and callbacks all over the place until it works by some
>>   definition of works.
>
> Are we even talking about the same thing?

Yes, we are.

> But "sprinkle hooks and callbacks all over the place"?

There are too many places which add scx_***() invocations. That's what I
asked to be cleaned up and generalized so it becomes uniform across the
scheduler classes.

Sure, one could argue that some of these calls sit at places where some
other scheduling class already dropped one, but that reasoning is wrong
because accumulating bad code just creates more technical debt.

If cleaned up, the already existing hacks vanish into proper class
callbacks, which improves the existing code and allows sched_ext to drop
in more naturally at the end.

One example I very explicitly mentioned back then is the dance around
fork().  It took me at least an hour last year to grok the convoluted
logic, and it did not get any faster when I stared at it again today.

fork()
  sched_fork()
    scx_pre_fork()
      percpu_down_read(&scx_fork_rwsem);

    if (dl_prio(p)) {
        ret = -EINVAL;
        goto cancel; /* required to release the semaphore */
    }

  sched_cgroup_fork()
    return scx_fork();

  sched_post_fork()
    scx_post_fork()
      percpu_up_read(&scx_fork_rwsem);

Plus the extra scx_cancel_fork(), which releases scx_fork_rwsem in case
any call after sched_fork() fails.

My head still spins from deciphering this once more.
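
The pairing constraint buried in that trace can be modeled in a few
lines of plain C. This is a toy model with stand-in bodies and a
hypothetical do_fork() driver, not kernel code; it only demonstrates the
invariant that every scx_pre_fork() must be balanced by exactly one
scx_post_fork() or scx_cancel_fork(), spread across three different
functions in the real fork path:

```c
#include <stdbool.h>
#include <errno.h>

/* Toy stand-in for the read-side count of scx_fork_rwsem. */
static int readers;

static void scx_pre_fork(void)    { readers++; }  /* models percpu_down_read() */
static void scx_post_fork(void)   { readers--; }  /* models percpu_up_read()   */
static void scx_cancel_fork(void) { readers--; }  /* error-path release        */

/* Hypothetical stand-in for the work sched_cgroup_fork() forwards. */
static int scx_fork(bool mid_fail) { return mid_fail ? -ENOMEM : 0; }

/* Hypothetical driver mirroring the trace above: every failure after
 * scx_pre_fork() must reach scx_cancel_fork(), or the read side of
 * the semaphore is leaked forever. */
int do_fork(bool dl_task, bool mid_fail)
{
    scx_pre_fork();                /* sched_fork()        */
    if (dl_task) {                 /* the dl_prio() check */
        scx_cancel_fork();
        return -EINVAL;
    }
    if (scx_fork(mid_fail)) {      /* sched_cgroup_fork() */
        scx_cancel_fork();
        return -ENOMEM;
    }
    scx_post_fork();               /* sched_post_fork()   */
    return 0;
}
```

The point is not the model itself, but that the balance condition is
invisible at any single call site, which is exactly why the logic takes
so long to decipher.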

What does scx_fork() have to do with sched_cgroup_fork()? It's completely
non-obvious, and the lack of comments does not help either. The changelog
is handwaving at best.

scx_pre_fork() takes the semaphore unconditionally, independent of the
scheduler class of the forking task, and even when no BPF scheduler is
loaded or active at all. Why? Neither the changelog nor the nonexistent
comments give any hint, which is not a new complaint from me either.

A proper cleanup would just eliminate the unconditional down(), the
dl_prio() cancel logic and the whole if/else if dance including the
#ifdef SCHED_EXT. It's not rocket science to come up with the obvious:

       ret = p->sched_class->pre_fork(p);
       if (ret)
               return ret;

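A minimal sketch of what such a class callback could look like, with
hypothetical names and heavily simplified structures (the real
struct sched_class and task_struct are far larger): classes that need
fork-time work supply pre_fork(), everyone else leaves it NULL, and the
core makes exactly one indirect call with no #ifdefs.

```c
#include <stddef.h>
#include <errno.h>

struct task_struct;

/* Hypothetical, heavily simplified sched_class: only the proposed
 * optional fork-time callback is shown. */
struct sched_class {
    int (*pre_fork)(struct task_struct *p);  /* 0 or -errno; may be NULL */
};

struct task_struct {
    const struct sched_class *sched_class;
    int prio;
};

/* A class that rejects some tasks at fork time, standing in for the
 * dl_prio() rejection in the trace above. */
static int ext_pre_fork(struct task_struct *p)
{
    if (p->prio < 0)      /* stand-in for dl_prio(p) */
        return -EINVAL;
    return 0;             /* per-class locking would live here, too */
}

static const struct sched_class ext_sched_class  = { .pre_fork = ext_pre_fork };
static const struct sched_class fair_sched_class = { .pre_fork = NULL };

/* Core side: the entire if/else if/#ifdef dance collapses into one
 * conditional indirect call. */
int sched_fork_pre(struct task_struct *p)
{
    if (p->sched_class->pre_fork)
        return p->sched_class->pre_fork(p);
    return 0;
}
```

With this shape, a class that has no fork-time business costs the core
nothing but a NULL check, and the locking moves into the one class that
actually wants it.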
That's just proper engineering, which is what some famous programmer has
been telling people to do for a very long time:

 "I’m a huge proponent of designing your code around the data, rather
  than the other way around, ... I will, in fact, claim that the
  difference between a bad programmer and a good one is whether he
  considers his code or his data structures more important. Bad
  programmers worry about the code. Good programmers worry about data
  structures and their relationships."

And that's not only proper engineering, it's also the other approach the
same famous programmer tells people to take:

 "I want them to lull me into a safe and cozy world where the stuff they
  are pushing is actually useful to mainline people _first_."

IOW, give me something which is useful to me _first_ so that you can add
your particular flavor of crazy on top without bothering me.

You obviously can complain now about the crazy people who actually listen
to what that famous programmer is saying. :)

But the above is not only true for that famous programmer personally;
it's equally true for any maintainer who has to live with the result of
a submission for a long time.

I'm still not seeing the general benefit of all this for mainline
people, so I have to trust you that there is one beyond my
comprehension.

That said, I'm more than wary about the hidden locking scheme of that
percpu semaphore in the fork path, but that's a different design
question to be debated on the way.

There are some other technical details which need to be sorted including
the interaction with other pending code, but that's something which can
be solved as we move forward.

Ideally this can be shaped in a way so that the scheduler becomes closer
to being modular, which would be the really useful thing for research,
and not just the advertisement version of it.

But wait a moment, that can't happen as pluggable schedulers have been
rejected in the past:

  "I absolutely *detest* pluggable schedulers."

Guess which famous programmer said that.

> And scx_next_task_picked() isn't pretty - as far as I understand, it's
> because there's only a "class X picked" callback ("pick_next_task()"),
> and no way to tell other classes they weren't picked.
> Could things like that next_active_class() perhaps be done more
> prettily? I'm sure.

Well spotted.
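
One illustrative way (hypothetical names, not the actual sched_ext or
core scheduler code) to close that gap is a per-class "you were not
picked" callback next to pick_next_task(), so the core tells every lower
class it was skipped instead of sched_ext sprinkling
scx_next_task_picked() calls around:

```c
#include <stddef.h>

struct rq;

struct task_struct { int id; };

/* Hypothetical sliver of sched_class: pick_next_task() plus a
 * proposed optional notification that this class was skipped. */
struct sched_class {
    struct task_struct *(*pick_next_task)(struct rq *rq);
    void (*task_not_picked)(struct rq *rq);  /* may be NULL */
};

/* Core-side pick: walk classes in priority order; the first class to
 * return a task wins, and every class below the winner is notified. */
struct task_struct *pick_task(struct rq *rq,
                              const struct sched_class *classes[],
                              int nr_classes)
{
    struct task_struct *p = NULL;

    for (int i = 0; i < nr_classes; i++) {
        if (!p)
            p = classes[i]->pick_next_task(rq);
        else if (classes[i]->task_not_picked)
            classes[i]->task_not_picked(rq);
    }
    return p;
}

/* Two toy classes: "fair" always has a task, "ext" counts its skips. */
static struct task_struct fair_task = { 1 };
static int ext_skips;

static struct task_struct *fair_pick(struct rq *rq) { (void)rq; return &fair_task; }
static struct task_struct *ext_pick(struct rq *rq)  { (void)rq; return NULL; }
static void ext_not_picked(struct rq *rq)           { (void)rq; ext_skips++; }

static const struct sched_class fair_class = { fair_pick, NULL };
static const struct sched_class ext_class  = { ext_pick, ext_not_picked };
```

That's the "refactor the core so the feature falls into place" shape:
the notification becomes a uniform class property instead of a special
case wired into the pick path.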

> But I get the very strong feeling that people wanted to limit the
> amount of changes they made to the core scheduler code.

Which is exactly the point. If the existing code does not let your new
feature fall into place, then refactor it so it does. Working around the
shortcomings somewhere else is patently wrong, and that's not a new
insight either.

Requesting such refactoring is not an undue burden because the people who
maintain the code will have to deal with the result.

Unwrapping this stuff after the fact is the worst thing to do and I
definitely have an expert opinion on this.

None of this is rocket science and could have been done long ago even
without me holding hands.

>> I clearly offered you to try to resolve this amicably within a
>> reasonable time frame.
>>
>> How exactly is that equivalent to "continue to do nothing"?
>
> So if we actually *can* resolve this amicably in three months, then
> that sounds worth it.
>
> But my reaction is "what changed"? Nothing has become more amicable in
> the last nine months. What makes the next three months special?

The difference is:

  1) Peter is back and the capacity problem is less bad than it was

  2) As I explained before, I unfortunately did not have cycles during
     the past _seven_ months, but I can make cycles available now and
     drive this forward. That only works if both sides are willing to
     cooperate, and I'm sure there is enough incentive to do so.

  3) I'm going to focus on the integration and interaction aspect and
     grudgingly leave the cgroup trainwreck out of it.

     If someone from the crowd who caused it actually had the courtesy
     to mop this up, that would be a great signal for everyone and it
     can be done in parallel.

     Not that I'm holding my breath, but I'm still a hopeless optimist.

If you don't trust me on that, then we have a very different problem.

Thanks,

        Thomas
