linux-kernel - Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Message-ID: <Zlit_RUFPparkS3h@slm.duckdns.org>
Date: Thu, 30 May 2024 06:49:01 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, juri.lelli@...hat.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
	daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
	joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
	derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
	dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
	changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com,
	andrea.righi@...onical.com, joel@...lfernandes.org,
	linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
	kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Hello,

It has been a couple of weeks, so I take it that you aren't intending to
respond. I think it'd be useful to summarize the arguments against sched_ext
and list the counter-points.

(1) Merging sched_ext will weaken the incentive to contribute.

While this may partially be true, it isn't looking at the whole picture.
This argument looks at the costs of sched_ext while ignoring the benefits,
and it ignores the costs of funneling all scheduler work through one
codebase.

If you look at the whole picture, I think you’ll see that:

- The problem space of CPU scheduling is too big for a single code base to
  be effective. Hardware has changed a lot and so have the workloads. There
  are many areas that we haven't mapped out. It's difficult to try anything
  radical in a code base which has to satisfy everyone all the time, but
  holding the bar so high that experimentation is suppressed means we will
  all be worse off.

- The bar for contribution is too high, driving away potential contributors.
  Many vendors and users carry internal patches as the upstreaming cost is
  too high. We are already seeing multiple developers who have not
  previously contributed to fair.c actively participating in and driving
  sched_ext schedulers. It’s possible those developers will eventually
  contribute to fair.c, but if sched_ext didn’t exist this would be less
  likely.

The constraint of only one scheduler codebase makes it very difficult to
contribute. You say that this constraint is necessary to force
collaboration, but I think the opposite is happening - many people don't
bother trying to contribute because the bar is too high. If sched_ext is
merged, the scheduler code base may lose some of the enforcement. However,
in the longer term, I believe we will gain more talented and motivated
engineers working in the problem space and some of them will surely find it
worthwhile to contribute to fair.c. It will be the most widely used
scheduler in the world no matter what, and will be attractive for people to
work on.

EEVDF worked out because you have worked on the scheduler for a long time
and have gained a ton of context on what works and doesn't. It also worked
out because you were more confident that it'd get merged. How do we build
confidence in other developers who want to explore whatever comes after
EEVDF without worrying that it is hopeless to try? sched_ext provides an
outlet for people who aren't already established to take a smaller risk
first, which is likely to lead to more people contributing.

(2) Efforts and developments out of the kernel tree are worthless.

I believe this is too narrow a view. Direct contribution is one form of
contribution but there are many others, including research. EEVDF itself is
based on a research paper. Figuring out what works and sharing it seems as
important as anything to me.

One reason cited for the uselessness is that out-of-tree efforts are often
throw-away and don't build up to anything. There is some truth to this but
the main reason is the difficulty of working with out-of-tree kernel
modifications. Rebase is painful and there is no convenient way to
distribute to users. Some still power through but it's near impossible to
build a user base and community for things that are out-of-tree. sched_ext
solves these problems and the umbrella repo serves as the central repository
for the developers to collaborate and learn from each other. This isn’t a
prediction for the future, it is something which is already actively
happening.

Given the right environment, these developers will keep flourishing and
finding new ways to improve scheduling. Many of their ideas won't be
applicable to the built-in scheduler, but some will. It's also likely that,
in the long term, the larger scheduler developer base will directly benefit
the built-in scheduler too.

(3) This will lead to vendor-specific fragmentation.

This is already happening with or without sched_ext whether that's in the
form of out-of-tree scheduler patches or people trying to circumvent the
scheduler with creative uses of the RT class.

sched_ext will introduce a different mode of doing it. There are scenarios
where the situation can become a bit worse but I don't believe the
difference would be drastic. Because all sched_ext schedulers have to be
under the GPL, any vendor shipping a sched_ext scheduler to a customer will
have to publish the code. If there are useful ideas we'll be just as free to
take them as now. Also, users would have the benefit that it's a lot easier
to opt out of the vendor's scheduler.

On balance, yes, sched_ext may lead to more or at least different types of
fragmentation, but that seems like a minor downside compared to the overall
benefits especially given that we have to live with some level of
fragmentation no matter what.

(4) sched_ext is a debug tool and we don't merge debug tools.

I think both parts of the above claim are wrong. sched_ext can be used
purely as a debug tool but it's also performant and flexible enough to
readily enable non-trivial practical use cases. We are using it in
production today, and as stated elsewhere in this thread, there are multiple
other companies in various stages of rolling it out to production. It can be
a debug tool, a temporary bridge to field early ideas while working on
something more permanent, a proper solution to specific problems which don't
quite fit the general scheduler (an extreme example would be
standard-dictated scheduling for avionics), and so on.

Also, we merge debug tools all the time. Lockdep is a debug tool. The code
base is full of debug features and components. Why wouldn't we merge
something if it makes the lives of the developers and users better by making
it easier to understand and debug problems? We don't merge printks someone
sprinkled over the code base to debug one particular problem. We do and
should merge tools and frameworks which improve visibility and debugging.


To reiterate our proposition: Let’s please open it up. Scheduling doesn’t
have to be this closed. Many open subsystems survive fine and often thrive
thanks to their openness. sched_ext hooks into core scheduling but the
contact surface is limited, and, if the hooks ever get in the way, we’ll do
our best to resolve the problem. The balance in the trade-offs seems pretty
obvious.

Thanks.

--
tejun