linux-kernel - Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Message-ID: <612c8f18-21e5-452d-8e9f-583f224d8e54@meta.com>
Date: Mon, 24 Jun 2024 12:42:01 -0400
From: Chris Mason <clm@...a.com>
To: Thomas Gleixner <tglx@...utronix.de>, Tejun Heo <tj@...nel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, mingo@...hat.com,
        peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
        joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
        derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
        dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
        changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com,
        andrea.righi@...onical.com, joel@...lfernandes.org,
        linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 6/23/24 4:14 AM, Thomas Gleixner wrote:
> Chris!
> 
> On Fri, Jun 21 2024 at 17:14, Chris Mason wrote:
>> On 6/21/24 6:46 AM, Thomas Gleixner wrote:
>> I'll be honest, the only clear and consistent communication we've gotten
>> about sched_ext was "no, please go away".  You certainly did engage with
>> face to face discussions, but at the end of the day/week/month the
>> overall message didn't change.
> 
> The only time _I_ really told you "go away" was at OSPM 2023 when you
> approached everyone in the worst possible way. I surely did not even say
> "please" back then.

Looping back to where I jumped into this thread, the context was you
suggesting that if we'd just asked one more time, real collaboration
might have started.  I'm not trying to change the message by snipping
this out of context, so if I've got this wrong, please do correct me.

>>> If you really wanted to get my attention then you exactly know how
>>> to get it like everyone else who is working with me for decades.

I really don't object to the scheduler maintainers disliking sched_ext.
Pluggable scheduling isn't a problem you wanted to solve, and bpf
probably isn't how you would have solved it.  We could have talked every
day for the last 18 months, and by now we'd have a huge library of
sonnets and haikus for each other, but sched_ext still wouldn't be merged.

I do object to rewriting history to claim that if we'd just used the
secret handshake, things would somehow be different than they are now.

> 
> The message people (not only me) perceived was:
> 
>   "The scheduler sucks, sched_ext solves the problems, saves us millions

I joined a conference I'd never been to before and brought a laundry
list of problems to the table.  So, there's definitely truth to the
perception that I came with an agenda and pushed it.

But, if I left anyone with the impression I thought the scheduler
sucked, that wasn't my goal.  Like every part of the kernel, there are
problems the scheduler creates and problems it doesn't solve, and my
goal is/was to invest in discussing and fixing both.

>    and Google is happy to work with us [after dropping upstream scheduler
>    development a decade ago and leaving the opens for others to mop up]."
> 

It's not a surprise that google and meta have a lot of problems in
common.  For us, collaborating with google is really rewarding and
important for a bunch of subsystems, scheduler included.

Google is full of smart people, and carrying private patches is both
expensive and wildly boring, and I'm always interested in why smart
people use different strategies to solve problems we have in common.

It's part of a discussion we get into internally.  Why does our kernel
team exist at all?  Are we just here to stabilize and ship Linus's
kernel?  Or are we here to try and advance our infrastructure by
developing in the kernel?

Those are two pretty different paths, and I know that we optimize for
things others don't care about, like contorting ourselves to make things
easier to ship to production.  But, we also optimize for feedback loops
with workload owners that other distros and kernel developers would
really envy.

It's a mixed bag, but I can say with certainty that adding features and
optimizations to the upstream kernel is one of the least efficient ways
to improve infra.  Some of this is for really good reason, nobody wants
all the tech debt that would come out of upstreaming every bad idea
we've ever had.  But there's a balance.

Before anyone gets upset with me, the upstream kernel can be the best
kernel on the planet and still be a really inefficient way to land
features and optimizations for applications.

> followed by:
> 
>   "You should take it, as it will bring in fresh people to work on the
>    scheduler due to the lower entry barrier [because kernel hacking sucks].
>    This will result in great new ideas which will be contributed back to
>    the scheduler proper."
> 
> That was a really brilliant marketing stunt and I told you so very bluntly.
> 

Yeah, I'd say all of those things again, and I think I repeated some of
it in this email too.  It's one of my favorite topics of conversation so
I won't bore everyone here (even more than I already have), but I'm
always trying to find ways to improve the feedback loops between
workloads and the kernel developers.

sched_ext has been really effective at that so far, both inside meta and
for others in the community.

> It was presumably not your intention, but that's the problem of
> communication between people. Though I haven't seen a useful attempt to
> cure that.
> 
> After that clash, the room got into a lively technical discussion about the
> real underlying problem, i.e. that a big part of scheduling issues comes
> from the fact, that there is not enough information about the requirements
> and properties of an application available. Even you agreed with that, if I
> remember correctly.

I still do!  In a private email a few months ago I promised you that my
one true workload modeling project was just a few months away.  It still
is just a few months away, but I do find the topic really interesting.

But, I disagree that we should stop sched_ext development until we find
the perfect way to model the properties and requirements of
applications.  I'm really glad that eevdf landed with more of an
iterate-in-the-kernel approach.

> 
> sched_ext does not solve that problem. It just works around it by putting
> the requirements and properties of an application into the BPF scheduler
> and the user space portion of it. That works well in a controlled
> environment like yours, but it does not even remotely help to solve the
> underlying general problems. You acknowledged that and told: But we don't
> have it today, though sched_ext is ready and will help with that.

For me, the underlying general problems get solved with frequent
experiments and tight feedback loops.  It's all about iteration and
cooperation with the application developers, and sched_ext absolutely
does provide that.

Quoting from another email of yours in this thread:

"I recently watched a talk about sched ext which explained how to model
an execution pipeline for a specific workload to optimize the scheduling
of the involved threads and how innovative that is. I really had a good
laugh because that's called explicit plan scheduling and has been
described and implemented in the early 2000s by academics already."

This is kind of exactly my point.  We do agree that there are lots of
well understood solutions to well understood problems that are missing
from the kernel.

> 
> The concern that sched_ext will reduce the incentive to work on the
> scheduler proper is not completely unfounded and I've yet to see the
> slightest evidence which proves the contrary.

Linus answered this pretty effectively, and I don't see the need to
expand on his comments.

> 
> Don't tell me that this is impossible because sched_ext is not yet
> upstream. It's used in production successfully as you said, so there
> clearly must be something to learn from which could be shared at least in
> form of data. OSPM24 would have been a great place for that especially as
> the requirements and properties discussion was continued there with a plan.
> 
> At all other occasions, I sat down with people and discussed at a technical
> level, but also clearly asked to resolve the social rift which all of this
> created.
> 
> I thereby surely said several times: "I wish it would just go away and stay
> out of tree", but that's a very different message, no?
> 

No, it's really not a different message.  The kernel tree is where
kernel development happens best.  Linus covered the comparison with RT
as well, but I definitely do understand you've had to carry a few
patches out of tree.

> Quite some of the questions and concerns I voiced, which got also voiced by
> others on the list, have not been sorted out until today. Just to name a
> few from the top of my head:
> 
>     - How is this supposed to work with different applications requiring
>       different sched_ext schedulers?
> 

I'll let Tejun pitch in on this one.

>     - How are distros/users supposed to handle this especially when
>       applications start to come with their own optimized schedulers?
> 

Having worked for two or three distros (I'd count meta, we have
customers too), distros pick and choose what to support based on what
their customers need and pay for, and different distros will make
different choices.  I'd assume we'll have a spectrum:

- sched_ext is unsupported, talk to the vendor
- sched_ext is unsupported, but we'll give debugging a shot
- sched_ext is supported when you're using $supported schedulers

Vendors might provide optimized schedulers, but they have to support a
huge range of distros and gloriously crusty enterprise kernels, so I
can't see anyone making it a requirement.

>     - What's the documented rule for dealing with bugs and regressions on a
>       system where sched_ext is active?
> 
>
> "We'll work it out in tree" is not an answer to that. Ignoring it and let
> the rest of the world deal with the fallout is not a really good answer
> either.

It's not different from any other new kernel component, or old kernel
component for that matter.  What's the documented rule for dealing with
bugs and regressions when a usb nic driver is loaded?  If you're asking
about bpf ABI, that's been covered in many other threads.

> 
> I'm not saying that this is all your and the sched_ext people's fault, the
> other side was not always constructive either. Neither did it help that I
> had to drop the ball.
> 
> For me, Linus telling that he will merge it no matter what, was a wakeup
> call to all involved parties. One side reached out with a clear message to
> sort this out amicably and not making the situation worse.

This last part is where you lost me.  I've only seen a clear message to
delay for any and every reason you can make stick.  I know it's a jerk
thing to say and I'm sorry, but that's how it feels from my end.

> 
>> At any rate, I think sched_ext has a good path forward, and I know we'll
>> keep working together however we can.
> 
> Carefully avoiding the perception trap, may I politely ask what this is
> supposed to tell me?

I was shooting for optimism here...we've all known each other a long
time.  We'll find ways to keep working together.

-chris