linux-kernel - Re: [PATCH rfc 0/6] Scheduler BPF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YV3v3RkxOB6g/O+8@carbon.lan>
Date:   Wed, 6 Oct 2021 11:50:05 -0700
From:   Roman Gushchin <guro@...com>
To:     Qais Yousef <qais.yousef@....com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        <bpf@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH rfc 0/6] Scheduler BPF

On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 09/16/21 09:24, Roman Gushchin wrote:
> > There is a long history of distro people, system administrators, and
> > application owners tuning the CFS settings in /proc/sys, which are now
> > in debugfs. Looking at what these settings actually did, it ended up
> > boiling down to changing the likelihood of task preemption, or
> > disabling it by setting the wakeup_granularity_ns to more than half of
> > the latency_ns. The other settings didn't really do much for
> > performance.
> > 
> > In other words, some our workloads benefit by having long running tasks
> > preempted by tasks handling short running requests, and some workloads
> > that run only short term requests which benefit from never being preempted.
> 
> We had discussion about introducing latency-nice hint; but that discussion
> didn't end up producing any new API. Your use case seem similar to Android's;
> we want some tasks to run ASAP. There's an out of tree patch that puts these
> tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> which seem okay for its purpose. Having a more generic solution in mainline
> would be nice.
> 
> https://lwn.net/Articles/820659/

Hello Qais!

Thank you for the link, I like it!

> 
> > 
> > This leads to a few observations and ideas:
> > - Different workloads want different policies. Being able to configure
> >   the policy per workload could be useful.
> > - A workload that benefits from not being preempted itself could still
> >   benefit from preempting (low priority) background system tasks.
> 
> You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> these tasks; but assuming they're background and there's idle time in the
> system that should be fine.
> 
> https://lwn.net/Articles/805317/
> 
> That of course assuming you can classify these background tasks..
> 
> If you can do the classification, you can also use cpu.shares to reduce how
> much cpu time they get. Or CFS bandwidth controller
> 
> https://lwn.net/Articles/844976/

The cfs cgroup controller is that it's getting quite expensive quickly with the
increasing depth of the cgroup tree. This is why we had to disable it for some
of our primary workloads.

Still being able to control latencies on per-cgroup level is one of the goals
of this patchset.

> 
> I like Androd's model of classifying tasks. I think we need this classification
> done by other non-android systems too.
> 
> > - It would be useful to quickly (and safely) experiment with different
> >   policies in production, without having to shut down applications or reboot
> >   systems, to determine what the policies for different workloads should be.
> 
> Userspace should have the knobs that allows them to tune that without reboot.
> If you're doing kernel development; then it's part of the job spec I'd say :-)

The problem here occurs because there is no comprehensive way to test any
scheduler change rather than run it on many machines (sometimes 1000's) running
different production-alike workloads.

If I'm able to test an idea by loading a bpf program (and btw have some sort of
safety guarantees: maybe the performance will be hurt, but at least no panics),
it can speed up the development process significantly. The alternative is way
more complex from the infrastructure's point of view: releasing a custom kernel,
test it for safety, reboot certain machines to it, pin the kernel from being
automatically updated etc.

> 
> I think one can still go with the workflow you suggest for development without
> the hooks. You'd need to un-inline the function you're interested in; then you
> can use kprobes to hook into it and force an early return. That should produce
> the same effect, no?

Basically it's exactly what I'm suggesting. My patchset just provides a
convenient way to define these hooks and some basic useful helper functions.

> 
> > - Only a few workloads are large and sensitive enough to merit their own
> >   policy tweaks. CFS by itself should be good enough for everything else,
> >   and we probably do not want policy tweaks to be a replacement for anything
> >   CFS does.
> > 
> > This leads to BPF hooks, which have been successfully used in various
> > kernel subsystems to provide a way for external code to (safely)
> > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > and the people deploying BPF scripts are already quite used to updating them
> > for new kernel versions.
> 
> I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> gets heavily modified by vendors and OEMs. We try very hard to understand the
> problems they face and get the right set of solutions in mainline. Which would
> ultimately help towards the goal of having a single Generic kernel Image [1]
> that gives you what you'd expect out of the platform without any need for
> additional cherries on top.

Wouldn't it make your life easier had they provide a set of bpf programs instead
of custom patches?

> 
> So my worry is that this will open the gate for these hooks to get more than
> just micro-optimization done in a platform specific way. And that it will
> discourage having the right discussion to fix real problems in the scheduler
> because the easy path is to do whatever you want in userspace. I am not sure we
> can control how these hooks are used.

I totally understand your worry. I think we need to find a right balance between
allowing to implement custom policies and keeping the core functionality
working well enough for everybody without a need to tweak anything.

It seems like an alternative to this "let's allow cfs customization via bpf"
approach is to completely move the scheduler code into userspace/bpf, something
that Google's ghOSt is aiming to do.

> 
> The question is: why can't we fix any issues in the scheduler/make it better
> and must have these hooks instead?

Of course, if it's possible to implement an idea in a form which is suitable
for everybody and upstream it, this is the best outcome. The problem is that
not every idea is like that. A bpf program can leverage a priori knowledge
of a workload and its needs, something the generic scheduler code lacks
by the definition.

Thanks!