Message-ID: <Zpbp02N6bAE8mNXb@slm.duckdns.org>
Date: Tue, 16 Jul 2024 11:44:51 -1000
From: Tejun Heo <tj@...nel.org>
To: Vishal Chourasia <vishalc@...ux.ibm.com>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org
Subject: Re: sched_ext/for-6.11: cpu validity check in ops_cpu_valid

Hello, Vishal.

On Tue, Jul 16, 2024 at 12:19:16PM +0530, Vishal Chourasia wrote:
...
> However, the case of the BPF scheduler is different; we shouldn't need
> to handle corner cases but instead immediately flag such cases.

I'm not convinced of this. There's a tension here and I don't think either
end of the spectrum is the right solution. Please see below.

> Consider this: if a BPF scheduler is returning a non-present CPU in
> select_cpu, the corresponding task will get scheduled on a CPU (using
> the fallback mechanism) that may not be the best placement, causing
> inconsistent behavior. And there will be no red flags reported making it
> difficult to catch. My point is that sched_ext should be much stricter
> towards the BPF scheduler.

While flagging any deviation as failure and aborting sounds simple and clean
on the surface, I don't think it's that clear cut. There already are edge
conditions where ext or core scheduler code overrides sched_class decisions,
and it's not straightforward for the BPF scheduler to get synchronization
against e.g. CPU hotplug watertight. So, we could end up aborting a
scheduler once in a blue moon for a condition which can only occur during
hotplug and could easily be worked around without any noticeable impact. I
don't think that's what we want.

That's not to say that the current situation is great because, as you
pointed out, it's possible for a scheduler to be systematically buggy and
fly under the radar, although I have to say I've never seen this particular
part be a problem, but YMMV.

Currently, error handling is binary: either everything is okay or the
scheduler dies. But I think things like select_cpu() returning an offline
CPU likely need a bit more nuance. i.e. if it happens once around CPU
hotplug, who cares? But if a scheduler is consistently returning an invalid
CPU, that certainly is a problem and it may not be easy to notice. One way
to go about it could be collecting stats for these events and letting the
BPF scheduler decide what to do about them.

Thanks.

-- 
tejun
