Date: Mon, 27 May 2024 22:25:40 +0100
From: Qais Yousef <qyousef@...alina.io>
To: David Vernet <void@...ifault.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
	Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>,
	torvalds@...ux-foundation.org, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, bsegall@...gle.com, mgorman@...e.de,
	bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
	daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
	joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
	derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
	dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
	changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com,
	andrea.righi@...onical.com, joel@...lfernandes.org,
	linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
	kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On 05/14/24 16:34, David Vernet wrote:
> On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:
> 
> [...]
> 
> > > > 
> > > > How does this BPF muck translate into better quality patches for me?
> > > 
> > > Here's how we will be using it (we will likely be porting sched_ext to
> > > ChromeOS regardless of its acceptance).
> > > 
> > > Doing testing of scheduler changes in the field is extremely time
> > > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > > 5.15 (as that is the kernel version we are using on the chromebooks we
> > > were testing on), and then we needed to add a user space "switch" to
> > > change the scheduler. Note, adding these changes also risks introducing
> > > a bug. Then we push the kernel out and start our experiment, which
> > > enables our feature for a small percentage of users and slowly
> > > increases that number until we have enough for a statistically
> > > significant result.
> > > 
> > > What sched_ext would give us is an easy way to try different scheduling
> > > algorithms and get feedback much quicker. Once we determine a solution
> > > that improves things, we would then spend the time to implement it in
> > > the scheduler, and yes, send it upstream.
> > > 
> > > To me, sched_ext should never be the final solution, but it can be
> > > extremely useful for testing various changes quickly in the field, which
> > > to me would encourage more contributions.
> 
> Hello Qais,
> 
> [...]
> 
> > I really don't buy the rapid development aspect either. The scheduler was heavily
> 
> There are already several examples from users who have shown that the rapid
> development and experimentation is extremely useful. Imagine if you're
> iterating on the scheduler to improve p99 frame rates on the Steam Deck, as
> Changwoo described. It's much more efficient to be able to just tweak and load
> a BPF scheduler (that is safe and can't crash the machine) to try some random
> idea out than it is to:
> 
> 1. Tweak and recompile the kernel
> 2. Reinstall the kernel on the Steam Deck
> 3. Reboot the Steam Deck
> 4. Reload a game and let caches rewarm
> 5. Measure FPS
> 
> You're talking about a 5 second compile job + 1 second to reload a safe BPF
> scheduler vs. having to do all of the above steps _and_ potentially making a
> mistake that brings the machine down. These benefits are also extremely useful
> for testing workloads on production servers, etc. Let’s also not forget that
> unlike many other kernel features, you probably can’t get reliable scheduling
> results from running in a VM. The experimentation overhead is very real.
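
(As an aside for readers who haven't looked at the series: the thing being
reloaded is just a small BPF object. A minimal sched_ext scheduler, modeled
loosely on the scx_simple example shipped with the patchset, looks roughly
like the sketch below; helper and macro names may differ between versions of
the series.)

	/* Minimal sketch: dispatch every task to the shared global queue. */
	#include <scx/common.bpf.h>

	char _license[] SEC("license") = "GPL";

	void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
	{
		/* One shared FIFO: no per-CPU queues, no load balancing. */
		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
	}

	SEC(".struct_ops.link")
	struct sched_ext_ops minimal_ops = {
		.enqueue	= (void *)minimal_enqueue,
		.name		= "minimal",
	};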

What I read here is that I can hack my system quickly. Is the intention to
extend the kernel? If so, I can't see how this experimentation is actually
valid unless it is implemented in the kernel first, taking into account the
real constraints that you have to deal with sooner or later.

> 
> [...]
> 
> > influenced by the early contributors, who came from the server market with
> > (a few) very specific workloads they needed to optimize for, and throughput
> > carried a heavier weight vs latency. Fast forward to now, things are
> > different. Even in the server market, latency/responsiveness has become more
> > important. Power and thermal are important on a larger class of systems now
> > too, I'd dare say even in the server market. How do you know when it's okay
> > for an app/task to consume a lot of power and when it is not? Hint hint, you
> > can't unless someone in userspace tells you. Similarly for latency vs
> > throughput. What is the correct way to write an application to provide this
> > info? Then we can ask what is missing in the scheduler to enable this.
> 
> Hmm, you seem to be arguing that the way forward here is to have our one
> general purpose scheduler be entirely driven by user space hinting. Assuming
> I’m not misunderstanding you, I strongly disagree with this sentiment.  User
> space hinting can be powerful, but I think we need to have a general purpose
> scheduler that's completely agnostic to whatever is running in user space.
> We’ve also been able to get strong results from sched_ext schedulers that don’t
> use any user space hinting.

I'm curious. If you believe in a general purpose scheduler, what work was done
to improve the current one? What debugging and analysis was done to improve the
current situation? It seems you reached the conclusion that we need something
different, but without giving the reasons why.

Is the problem with the default behavior of the system? Or are your problems
focused on corner cases where things seem to fail?

> 
> Also, even if this ended up being the way forward, I don’t see it being
> practical to implement. Wouldn’t it require us to update all of user space

People swear by Apple's GCD by the way. It'd be really great if someone could
create something similar that works properly on Linux. I have never tried the
libdispatch port to see how well it does.

And have you seen this?

	https://developer.android.com/stories/games/mediatek-adpf

> globally just to update how it interfaces with the scheduler?

I think you're confusing default scheduler behavior with dealing with corner
cases that are impossible for the scheduler to resolve. These corner cases are
where help is needed. Note that the thermal API is actually info from the
system to the app. If the app decides to listen, it can help reduce the thermal
impact without causing throttling. If it decides not to listen, then the best
the system can do is throttle everything hard to protect from damage. And under
bad thermal pressure, the scheduler can know which tasks to prioritize for
performance if it has explicit knowledge/hints.

If the default behavior is not working for you, could you provide more details
on what goes wrong? It's unlikely that a new algorithm is the solution; more
likely there's a bug somewhere or some configuration problem.

And if someone wants to optimize for best perf, power and thermal, they need to
do the work. There's only so much you can do on their behalf that is actually
scalable.

System designers want apps (all types of apps) to take the best advantage of
the hardware they built.

App writers want to write portable software that gives the desired experience
on all types of systems without special optimization.

> 
> [...]
> 
> > Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> > default for throughput by the way (server market bias). You can manipulate
> > those and get better latencies.
> 
> Those knobs aren't available anymore in EEVDF.

I generalized my statement as I didn't expect many to have moved to 6.6 LTS,
which is the only LTS kernel that has EEVDF.

EEVDF has base_slice_ns. What value do you read on your system? What is your
TICK value and how many CPUs do you have?
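
For reference, assuming debugfs is mounted at /sys/kernel/debug and
SCHED_DEBUG is enabled, a quick check like the one below is what I have in
mind (CONFIG_HZ still has to be read from the kernel config):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		FILE *f = fopen("/sys/kernel/debug/sched/base_slice_ns", "r");

		if (f && fgets(buf, sizeof(buf), f))
			printf("base_slice_ns: %s", buf);
		if (f)
			fclose(f);

		printf("online CPUs:   %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
		/* TICK: check CONFIG_HZ= in /proc/config.gz or /boot/config-$(uname -r). */
		return 0;
	}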

>  
> [...]
> 
> > point IMO, not the scheduler algorithm. If the latter need to change, it needs
> > to be as the result of this friction - which what EEVDF came about from to my
> > understanding. To enable implementing a latency interface easier. But Vincent
> > had a working implementation with CFS too which I think would have worked fine
> > by the way.
> 
> This friction is nothing new. It's why we already find ourselves in the
> unfortunate position of having a large corpus of out of tree scheduler patches.
> If there is a lot of performance being left on the table, vendors are going to
> find a way to get that performance. Corporations don't need our consent to ship
> kernels with custom schedulers on their devices. They've already been doing it
> for years, and it's ultimately the users who suffer.

I think everyone agrees on the need to improve. But...

> 
> I genuinely believe that the fair.c scheduler will benefit from being able to
> apply ideas conceived in a sched_ext scheduler which end up working well for
> general use cases. For example, in scx_rusty, we’re able to get very good
> interactivity [0] by determining a task’s deadline as a function of its average

I am really failing to see why you jumped to the conclusion that we need a new
scheduler. And you'll find a lot of skepticism about the validity of your
results. We have no clue what kind of unknown constraints you've left out of
these tests, and how limited your environment is.

> runtime (along with some other great ideas that Changwoo first added to
> scx_lavd) rather than from its eligibility + slice as with what EEVDF does.

You'll soon find that the concept of runtime is hard. Generally there's a big
soup of tasks running in the system, most of which have no real deadline; only
a few do. And the corner cases are the hard part of any situation where you
have more tasks that need to run immediately than you have CPUs to distribute
them on. I don't think we can figure this out automatically based on runtime.
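
For reference, the two notions of deadline being contrasted above look roughly
like this. This is only a sketch in my own notation to make the discussion
concrete; it is not the actual fair.c or scx_rusty code, and the clamping
constants are invented:

	#include <stdint.h>

	typedef uint64_t u64;

	/* EEVDF-style: the virtual deadline follows from the task's vruntime
	 * (eligibility) plus its slice (weight scaling omitted here). */
	static u64 eevdf_deadline(u64 vruntime, u64 slice_ns)
	{
		return vruntime + slice_ns;
	}

	/* Average-runtime-style (the scx_lavd/scx_rusty idea as I understand
	 * it): tasks that historically run in short bursts get earlier
	 * deadlines, so interactive work is preferred over batch work. */
	static u64 avg_runtime_deadline(u64 vruntime, u64 avg_runtime_ns)
	{
		u64 boost = avg_runtime_ns;

		if (boost < 100000)		/* 100us floor */
			boost = 100000;
		if (boost > 20000000)		/* 20ms ceiling */
			boost = 20000000;

		return vruntime + boost;
	}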

> Over the course of a day or two, I tried way more ideas that didn’t work than
> would have been possible in that time frame with a recompile-reboot cycle,
> and ended up finding one that seems to work very well. It would be awesome if
> these ideas were added to EEVDF so that everyone can benefit.

Why do you think they're applicable? And how do you know you're not working
around different problems, or haven't missed constraints in your testing that,
once applied, will make the whole results invalid?

Too many unknowns IMHO. I am not against a different scheduler algorithm if it
proves to be a more generic default. But first you'll find you have to explain
what has failed in the current one and what kind of analysis made you reach
this conclusion. And then you'll find you need to actually do it in the kernel,
taking into account all the constraints you must handle, to prove it is still
as valid as you initially thought.

I can only share my experience, but I don't think the algorithm itself is the
bottleneck here. The devil is in the corner cases, and these are hard to deal
with without explicit hints.
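
To be concrete about what I mean by explicit hints: util clamp is one such
interface that already exists. A latency/performance critical thread can raise
its minimum utilization clamp via sched_setattr(), and a background thread can
cap itself. The sketch below defines the struct and flags locally (values from
include/uapi/linux/sched.h) rather than relying on a libc wrapper; the clamp
value chosen is just an example.

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime, sched_deadline, sched_period;
		uint32_t sched_util_min, sched_util_max;
	};

	#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy and params */
	#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

	int main(void)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_flags = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN;
		attr.sched_util_min = 512;	/* "treat me as needing ~half a big CPU" */

		if (syscall(SYS_sched_setattr, 0, &attr, 0))
			perror("sched_setattr");
		return 0;
	}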

The biggest issue I generally see with the default behavior is that
traditionally it has been biased towards throughput, because those folks are
the ones who keep reporting regressions when anything changes on the list.

Please add your voice and report problems when you notice things don't work
for you. That's the best way to ensure these issues get visibility. It seems to
me you're hitting problems in areas that people expect to just work, but I have
no clue what problems you have. I am not sure if this was reported somewhere
else, but it seems not.


Thanks!

--
Qais Yousef
