Message-ID: <20240805014414.6t3puvhudklwbhaw@airbuntu>
Date: Mon, 5 Aug 2024 02:44:14 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Tejun Heo <tj@...nel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
linux-kernel@...r.kernel.org, David Vernet <void@...ifault.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Alexei Starovoitov <ast@...nel.org>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [GIT PULL] sched_ext: Initial pull request for v6.11

On 08/01/24 06:36, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 01, 2024 at 02:17:35PM +0100, Qais Yousef wrote:
> > > You made the same point in another thread, so let's discuss it there but
> >
> > Don't you think it's a bit rushed to include this part in the pull request?
>
> Not really. It seems pretty straightforward to me.
>
> > > it's not changing the relationship between schedutil and sched class.
> > > schedutil collects utility signals from sched classes and then translates
> > > that to cpufreq operations. For SCX scheds, the only way to get such util
> > > signals is asking the BPF scheduler. Nobody else knows. It's loading a
> > > completely new scheduler after all.
> >
> > But you're effectively making schedutil a userspace governor. If SCX wants to
> > define its own util signal, wouldn't it be more appropriate to pair it with
> > user space governor instead? It makes more sense to pair userspace scheduler
> > with userspace governor than alter schedutil behavior.
>
> The *scheduler* itself is defined from userspace. I have a hard time
> following why utilization signal coming from that scheduler is all that
> surprising. If user or the scheduler implementation want to pair it up with
> userspace governor, they can do that. I don't want to make that decision for
> developers who are implementing their own schedulers.

But schedutil is based on the PELT signal. Capacity values, RT pressure, IRQ
pressure, and DL bandwidth are all based on it, and adding them together only
works because they're all the same signal. I don't see how mixing and matching
different signals is compatible with that.

And we already have uclamp for user space to influence the decisions based on
PELT. I don't see the need for another way to influence them.

Is it not desirable to reuse the util signal as-is? Or is there a problem that
prevents you from reusing it?
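
To make the summation argument concrete, here is a rough userspace sketch of
how per-class utilization could be combined into one value for frequency
selection. It is a simplified model, not the kernel's code: the struct, the
function name, and the sample numbers are all hypothetical, and it assumes
every input is expressed on the same 0..max capacity scale. That shared scale
is the only reason the plain addition is meaningful.

```c
/*
 * Simplified model only. cpu_signals and aggregate_util() are hypothetical
 * names; the real schedutil path has more cases (uclamp, DL bandwidth, ...).
 */
struct cpu_signals {
	unsigned long util_cfs;	/* fair-class utilization */
	unsigned long util_rt;	/* RT pressure */
	unsigned long util_dl;	/* deadline-class utilization */
	unsigned long util_irq;	/* IRQ pressure */
};

unsigned long aggregate_util(const struct cpu_signals *s, unsigned long max)
{
	unsigned long util;

	/* IRQ time alone saturates the CPU. */
	if (s->util_irq >= max)
		return max;

	/* Valid only because all three terms share one scale. */
	util = s->util_cfs + s->util_rt + s->util_dl;
	if (util >= max)
		return max;

	/* Scale task utilization to the time left after IRQ, then add IRQ. */
	util = util * (max - s->util_irq) / max;
	util += s->util_irq;

	return util < max ? util : max;
}
```

If one of the inputs came from a source with a different meaning or scale,
the sum and the IRQ scaling step would no longer describe a fraction of CPU
capacity.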
>
> ...
> > That's not how I read it. It's supposed to be for things that alter the kernel
> > spec/functionality and make it not trustworthy. We already have a taint flag
> > for overriding ACPI tables. Out of tree modules can have lots of power to alter
> > things in a way that makes the kernel generally not trustworthy. Given how
> > intrusively the scheduler behavior can be altered with no control, I think
> > a taint flag to showcase it is important. Not only for us, but also for app
> > developers as you don't know what people will decide to do that can end up
> > causing apps to misbehave weirdly on some systems that load specific scheduler
> > extensions. I think both of us (kernel and app developers) want to know that
> > something in the kernel that can impact this misbehavior was loaded.
>
> We of course want to make sure that developers and users can tell what
> they're running on. However, this doesn't really align with why taint flags
> were added and how they are usually used, and it's unclear how the use of a
> taint flag would improve the situation on top of the existing visibility
> mechanisms (in the sysfs and oops messages). Does this mean loading any BPF
> program should taint the kernel? How about changing sysctls?

The difference here is that you're overriding decisions, not just hooking. It's
like live patching the kernel and using fault injection. There's a very visible
side effect.

A BPF program can qualify for a taint when it changes the control flow. The
particular case here is not a passive observer but an active overrider, and of
critical functionality. That's why I think it should be treated like an
external module.

sysctls don't change the control flow in a way that is decided outside of the
kernel.

The schedutil problem is an example of how there's a visible side effect. What
if the loaded scheduler decides to ignore uclamp hints for task placement, or
potentially any other hint/sched_attr, new or existing? Or if the system is
HMP and there's a loaded Energy Model but the loaded scheduler doesn't have
Energy Aware Scheduling support?

IIUC one of the goals of sched_ext is not to have to keep everything happy,
in favour of optimizing for a specific system and workloads without being
dragged down by all the other things that can get in the way. So the inherent
breakage, AFAIU, is by design.

And once this is in we will lose all control over what people will do with it.
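
For reference, the existing taints mentioned in this thread (ACPI table
override, out-of-tree module, live patch) are single bits in the mask exposed
at /proc/sys/kernel/tainted. The sketch below decodes just those three. The
bit positions and letters follow the kernel's tainted-kernels documentation;
the struct, table, and function names are hypothetical, and no sched_ext taint
is shown because none exists in the discussion above.

```c
#include <stddef.h>

/* Illustrative subset of the documented taint bits. */
struct taint_desc {
	unsigned int bit;
	char letter;
	const char *reason;
};

static const struct taint_desc taints[] = {
	{  8, 'A', "ACPI table overridden by user" },
	{ 12, 'O', "out-of-tree module loaded" },
	{ 15, 'K', "kernel has been live patched" },
};

/*
 * Write the letters of the set flags (from the subset above) into buf as a
 * NUL-terminated string. Returns the number of letters written.
 */
size_t decode_taint(unsigned long mask, char *buf, size_t len)
{
	size_t n = 0;
	size_t i;

	for (i = 0; i < sizeof(taints) / sizeof(taints[0]); i++) {
		if ((mask & (1UL << taints[i].bit)) && n + 1 < len)
			buf[n++] = taints[i].letter;
	}
	if (len)
		buf[n] = '\0';

	return n;
}
```

A bug report that carries such a mask tells the reader immediately that
something able to change kernel behaviour was loaded, without having to ask
the reporter.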
>
> > > It's the same as other BPF hooks. We don't want to break willy-nilly but we
> > > can definitely break backward compatibility if necessary. This has been
> > > discussed to death and I don't think we can add much by litigating the case
> > > again.
> >
> > Was this discussion on the list? I haven't seen it. Assuming the details were
> > discussed with the maintainers and Linus and there's an agreement in place,
> > that's good to know. If not, then a clarity before-the-fact is better than
> > after-the-fact. I think the boundaries are very hazy and regressions are one of
> > the major reasons that holds up the speed of scheduler development. It is very
> > easy to break some configuration/workload/system unwittingly. Adding more
> > constraints that are actually harder to deal with to the mix will make our life
> > exponentially more difficult.
>
> I wasn't a first party in the discussions and don't have good pointers.
> However, I know that the subject has been discussed to the moon and back a
> few times and the conclusion is pretty clear at this point - after all, the
> multiple ecosystems around BPF have been operating this way for quite a
> while now. Maybe BPF folks have better pointers?

Fair enough. I just want to highlight that the fragility extends to failure to
load, as well as to things suddenly no longer behaving as intended after a
kernel upgrade. If we all agree that this wouldn't constitute a regression that
can impact in-kernel development, and Linus is on board with that, then it's
all good.

Thanks!

--
Qais Yousef