Message-ID: <6eb9302e-59c9-4242-bfb1-e473d3e5380e@meta.com>
Date: Tue, 14 May 2024 16:22:25 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com,
andrea.righi@...onical.com, joel@...lfernandes.org,
linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class
On 5/13/24 4:03 AM, Peter Zijlstra wrote:
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
>
>>> You Google/Facebook are touting collaboration, so collaborate on fixing
>>> it instead of re-posting this over and over. After all, your main
>>> motivation for starting this was the cpu-cgroup overhead.
>>
>> The hierarchical scheduling overhead isn't the main motivation for us. We
>> can't use the CPU controller for all workloads and while it'd be nice to
>> improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Didn't previous versions use this very argument to push for all this?
>
> By improving the cgroup mess -- and I very much agree that the cgroup
> thing is not very nice -- this whole argument goes away and we all get a
> better cgroup implementation.
>
>> This view works only if you assume that the entire world contains only a
>> handful of developers who can work on schedulers. The only way that would be
>> the case is if the barrier to entry is raised unreasonably high. Sometimes a
>> high barrier to entry can't be avoided or is beneficial. However, if it's
>> pushed up high enough to leave only a handful of people to work on an area
>> as large as scheduling, something is probably wrong.
>
> I've never really felt there were too few sched patches to stare at on
> any one day (quite the opposite on many days in fact).
>
> There have also always been plenty of out-of-tree scheduler patches --
> although I rarely if ever have time to look at them.
>
> Writing a custom scheduler isn't that hard; simply ripping out
> fair_sched_class and replacing it with something simple really isn't
> *that* hard.
>
> The only really hard requirement is respecting affinities; you'll crash
> and burn real hard if you get that wrong (think of all the per-cpu
> kthreads that rely hard on being per-cpu).
>
> But you can easily ignore cgroups, uclamp and a ton of other stuff and
> still boot and play around.
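
To make those two points concrete: below is a rough sketch of what
"something simple" looks like as a sched_ext BPF scheduler, with the
affinity rule spelled out. This is not code from this series; the ops,
helper, and flag names (scx_bpf_dispatch, SCX_DSQ_GLOBAL, SCX_SLICE_DFL)
follow my recollection of the example schedulers, so treat them as
assumptions rather than the exact v6 API.

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Rough sketch only -- not from this patchset. Names follow the
 * example schedulers as I remember them and may not match v6 exactly.
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/*
 * The one hard rule: never pick a CPU outside p->cpus_ptr. Per-cpu
 * kthreads have exactly one bit set there and break hard otherwise.
 */
s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	/* Stay put when the task is still allowed on prev_cpu. */
	if (bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr))
		return prev_cpu;
	/* Otherwise any allowed CPU will do for a toy scheduler. */
	return bpf_cpumask_first(p->cpus_ptr);
}

/* Global FIFO: every runnable task goes on the shared dispatch queue. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	.select_cpu	= (void *)minimal_select_cpu,
	.enqueue	= (void *)minimal_enqueue,
	.name		= "minimal",
};

The select_cpu() callback is where the affinity rule is enforced; the
rest is just a global FIFO.
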
>
>> I believe we agree that we want more people contributing to the scheduling
>> area.
>
> I think therein lies the rub -- contribution. If we were to do this
> thing, random loadable BPF schedulers, then how do we ensure people will
> contribute back?
>
> That is, from where I am sitting I see $vendor mandate that their
> $enterprise product needs their $BPF scheduler. At which point $vendor
> will have no incentive to ever contribute back.
Especially in the scheduler space, the incentive to contribute back
today is somewhat inverted. As you mention above, it's relatively easy
to make custom things, while it's very difficult to get features and
patches included. The cost of maintaining patches out of tree is
relatively low compared with the cost of working through inclusion, and
the scheduler stands out in terms of how hard it is to land changes.
I think the scheduler balances the needs of a wide variety of workloads
exceptionally well, but based on the volume of out-of-tree scheduler
infrastructure, it feels like the community is struggling to meet its
collaboration needs in the upstream tree.
Just like I can’t imagine one filesystem working for everything, I think
we need to open up the field a little on schedulers. As we develop for
new variations in workloads, power management, and hardware types, I
think sched_ext gives us a way to do more collaboration in the upstream
tree. While I’m not pretending it’s perfect, it’s definitely ready for
expansion and broader use.
I do think that sched_ext developers will keep participating upstream,
and I agree with a lot of the points that Steve makes in his reply.
People are going to keep sending patches in because the kernel community
is just the best place to build and maintain this functionality.
-chris