Message-ID: <CAD=FV=UTh=JGUDZxO74+ZRQbF+yzcWnBo-f=oie0msmBn2_00g@mail.gmail.com>
Date: Wed, 6 Nov 2024 15:02:35 -0800
From: Doug Anderson <dianders@...omium.org>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org, kernel-team@...a.com, 
	sched-ext@...a.com, Andrea Righi <arighi@...dia.com>, Changwoo Min <multics69@...il.com>, 
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather
 and eject BPF scheduler on softlockup

Hi,

On Wed, Nov 6, 2024 at 2:08 PM Tejun Heo <tj@...nel.org> wrote:
>
> Hello, Doug.
>
> On Wed, Nov 06, 2024 at 01:32:40PM -0800, Doug Anderson wrote:
> ...
> > 1. It doesn't feel right to add knowledge of "sched-ext" to the
> > softlockup detector. You're calling from a generic part of the kernel
> > to a specific part and it just feels unexpected, like there should be
> > some better boundaries between the two.
>
> I suppose we can create a notifier like infrastructure if directly calling
> is what's bothersome but it's likely an overkill at this point. The second
> point probably is more important to discuss.
>
> > 2. You're relying on a debug feature to enforce correctness. The
> > softlockup detector isn't designed to _fix_ softlockups. It's designed
> > to detect and report softlockups and then possibly reboot the machine.
> > Someone would not expect that turning on this debug feature would
> > cause the system to take the action of kicking out a BPF scheduler.
>
> Softlockups can trigger panic and thus system reset, which is arguably also
> a remediative action.

Sort of, though it doesn't feel to me like quite the same thing.


> > It feels like sched-ext should fix its own watchdog so it detects and
> > fixes the problem before the softlockup detector does.
>
> sched_ext has its own watchdog with configurable timeout and softlockups
> would eventually trigger that too. However, that's looking at the time
> between tasks waking up and running to detect stalls and the (configurable)
> time duration is usually longer than softlockup detection threshold, which
> makes sense given the different failure modes they're looking at.
>
> If sched_ext is to expand its watchdog to monitor softlockup like
> conditions, it would essentially look just like softirq watchdog and we
> would still have the same problem of coordinating detection thresholds.
>
> Having a notification mechanism which triggers when watchdog is about to
> trigger which can take a drastic action doesn't sound too odd to me. If I
> make it use a notification chain so that the mechanism is more generic,
> would that make it more acceptable to you?

Honestly, it would feel better to me if the softlockup detector didn't
tell sched_ext to kill things. Instead, could we make a special
exception for sched_ext tasks and exclude them from the softlockup
detector, since they're already being watched by their own watchdog?
Would that be possible? Then tweaking the softlockup timeouts
wouldn't implicitly change how long sched_ext things can run.

-Doug
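
The exclusion proposed above can be sketched as a toy model. This is
plain Python, not kernel code, and every name in it (ToyCpu,
curr_is_scx, softlockup_should_report, scx_watchdog_should_eject) is
hypothetical and made up for illustration. It only shows the intended
decoupling: when sched_ext-managed tasks are exempt from the softlockup
check, changing the softlockup threshold does not change how long they
may run, and only the sched_ext watchdog timeout does.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    """Hypothetical per-CPU state; these are not real kernel fields."""
    last_touch: float   # time (s) the softlockup watchdog was last touched
    curr_is_scx: bool   # assumed flag: current task is run by a BPF scheduler


def softlockup_should_report(cpu: ToyCpu, now: float,
                             softlockup_thresh: float) -> bool:
    """Toy softlockup check that exempts sched_ext-managed tasks."""
    if cpu.curr_is_scx:
        # Left to sched_ext's own watchdog, so softlockup_thresh is ignored.
        return False
    return now - cpu.last_touch > softlockup_thresh


def scx_watchdog_should_eject(runnable_since: float, now: float,
                              scx_timeout: float) -> bool:
    """Toy sched_ext watchdog: a runnable task has waited too long to run."""
    return now - runnable_since > scx_timeout


if __name__ == "__main__":
    scx_cpu = ToyCpu(last_touch=0.0, curr_is_scx=True)
    # Shrinking the softlockup threshold does not affect the sched_ext task.
    for thresh in (20.0, 5.0):
        print(thresh, softlockup_should_report(scx_cpu, now=25.0,
                                               softlockup_thresh=thresh))
    # Only the sched_ext timeout decides when the BPF scheduler is ejected.
    print(scx_watchdog_should_eject(runnable_since=0.0, now=31.0,
                                    scx_timeout=30.0))
```

In this model the two timeouts are independent knobs, which is the
property the exclusion is meant to give. Whether the real softlockup
detector can cheaply tell that a CPU is running a sched_ext task is the
open question, and the sketch does not address it.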
