Message-ID: <Zyvo7lFcnAddB9RT@slm.duckdns.org>
Date: Wed, 6 Nov 2024 12:08:46 -1000
From: Tejun Heo <tj@...nel.org>
To: Doug Anderson <dianders@...omium.org>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org,
	kernel-team@...a.com, sched-ext@...a.com,
	Andrea Righi <arighi@...dia.com>,
	Changwoo Min <multics69@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops
 breather and eject BPF scheduler on softlockup

Hello, Doug.

On Wed, Nov 06, 2024 at 01:32:40PM -0800, Doug Anderson wrote:
...
> 1. It doesn't feel right to add knowledge of "sched-ext" to the
> softlockup detector. You're calling from a generic part of the kernel
> to a specific part and it just feels unexpected, like there should be
> some better boundaries between the two.

I suppose we can create a notifier-like infrastructure if calling directly
is what's bothersome, but that's likely overkill at this point. The second
point is probably more important to discuss.

> 2. You're relying on a debug feature to enforce correctness. The
> softlockup detector isn't designed to _fix_ softlockups. It's designed
> to detect and report softlockups and then possibly reboot the machine.
> Someone would not expect that turning on this debug feature would
> cause the system to take the action of kicking out a BPF scheduler.

Softlockups can trigger a panic and thus a system reset, which is arguably
also a remedial action.

> It feels like sched-ext should fix its own watchdog so it detects and
> fixes the problem before the softlockup detector does.

sched_ext has its own watchdog with a configurable timeout, and softlockups
would eventually trigger that too. However, that watchdog detects stalls by
looking at the time between a task waking up and running, and its
(configurable) timeout is usually longer than the softlockup detection
threshold, which makes sense given the different failure modes the two are
looking at.

If sched_ext were to expand its watchdog to monitor softlockup-like
conditions, it would essentially look just like the softlockup watchdog, and
we would still have the same problem of coordinating detection thresholds.

Having a notification mechanism that fires when the watchdog is about to
trigger, and that can take a drastic action, doesn't sound too odd to me. If
I make it use a notification chain so that the mechanism is more generic,
would that make it more acceptable to you?
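
To make the notification-chain idea concrete, here is a rough userspace
sketch in C. It is not the kernel's notifier API and not code from the patch:
every name in it (wd_notifier, wd_notify_pre_lockup, scx_pre_lockup, and so
on) is hypothetical and only models the shape of the idea. The detector walks
a list of registered callbacks shortly before it would report a lockup, and
knows nothing about who is subscribed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a notifier chain. None of these names
 * come from the kernel or from the patch under discussion. */
enum { WD_NOTIFY_DONE = 0, WD_NOTIFY_STOP = 1 };

struct wd_notifier {
	int (*call)(struct wd_notifier *nb, unsigned int stall_secs);
	int priority;			/* higher priority runs first */
	struct wd_notifier *next;
};

static struct wd_notifier *wd_chain;

/* Insert nb into the chain, keeping it sorted by descending priority. */
static void wd_notifier_register(struct wd_notifier *nb)
{
	struct wd_notifier **pos = &wd_chain;

	assert(nb != NULL && nb->call != NULL);
	while (*pos && (*pos)->priority >= nb->priority)
		pos = &(*pos)->next;
	nb->next = *pos;
	*pos = nb;
}

/* Remove nb from the chain if it is present. */
static void wd_notifier_unregister(struct wd_notifier *nb)
{
	struct wd_notifier **pos = &wd_chain;

	while (*pos && *pos != nb)
		pos = &(*pos)->next;
	if (*pos)
		*pos = nb->next;
}

/* Called by the generic detector just before it would report a lockup.
 * Returns the number of callbacks that were invoked. */
static int wd_notify_pre_lockup(unsigned int stall_secs)
{
	int called = 0;

	for (struct wd_notifier *nb = wd_chain; nb; nb = nb->next) {
		called++;
		if (nb->call(nb, stall_secs) == WD_NOTIFY_STOP)
			break;
	}
	return called;
}

/* Example subscriber standing in for a pluggable scheduler. */
static int scx_ejected;

static int scx_pre_lockup(struct wd_notifier *nb, unsigned int stall_secs)
{
	(void)nb;
	(void)stall_secs;
	scx_ejected = 1;	/* stand-in for ejecting the BPF scheduler */
	return WD_NOTIFY_DONE;
}

static struct wd_notifier scx_nb = { .call = scx_pre_lockup, .priority = 0 };
```

In this sketch the subscriber would call wd_notifier_register(&scx_nb) when
a scheduler is loaded and wd_notifier_unregister(&scx_nb) when it is
unloaded, so the generic side only ever sees the chain. The question of how
long before the actual report the chain should fire is left open here, as it
is in the discussion above.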

Thanks.

-- 
tejun
