Message-ID: <CAKPOu+_zurvzehn+Wp=gbQxafHP9jBEPM4NcrDzb6Kd2C0MmaA@mail.gmail.com>
Date: Sun, 4 May 2025 08:36:23 +0200
From: Max Kellermann <max.kellermann@...os.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: song@...nel.org, joel.granados@...nel.org, dianders@...omium.org, 
	cminyard@...sta.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count

On Sun, May 4, 2025 at 4:47 AM Andrew Morton <akpm@...ux-foundation.org> wrote:
> Documenation/, please?

Do you mean Documentation/ABI/testing/ (like
Documentation/ABI/testing/sysfs-kernel-oops_count)?
I'll add that; I was confused by the directory name "testing" and
didn't expect to find actual documentation there.

> >  Having this is useful for monitoring tools.
>
> Useful how?  Use cases?  Examples?

To detect whether the machine is healthy. If the kernel has
experienced a soft lockup, it's probably due to a kernel bug, and I'd
like to detect that quickly and easily. Currently, the only ways to
detect that are parsing dmesg or observing indirect effects such as
certain tasks not responding, and the latter means I need to observe
all tasks. I'd rather be able to detect the primary cause easily, just
like some people decided that they wanted to observe an oops counter
and a warning counter.
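
A minimal sketch of what such a monitoring check could look like,
assuming the new files live at /sys/kernel/softlockup_count and
/sys/kernel/hardlockup_count (as in the patch subject) and, like the
existing oops_count and warn_count, contain a single decimal number.
The helper names read_counters() and increased() are hypothetical and
only serve as illustration:

```python
from pathlib import Path

# softlockup_count/hardlockup_count are the files proposed by this patch
# and exist only on kernels that carry it (assumption: one decimal
# integer per file, like oops_count and warn_count).
COUNTERS = ("softlockup_count", "hardlockup_count", "oops_count", "warn_count")


def read_counters(sysfs_dir="/sys/kernel"):
    """Return {name: value} for every counter file that can be read."""
    result = {}
    for name in COUNTERS:
        try:
            result[name] = int((Path(sysfs_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file (kernel without the counter), permission
            # problem, or unexpected content: skip this counter.
            continue
    return result


def increased(previous, current):
    """Return the counters that grew since the previous sample, with deltas."""
    return {
        name: value - previous.get(name, 0)
        for name, value in current.items()
        if value > previous.get(name, 0)
    }
```

A monitoring agent would call read_counters() periodically, keep the
previous sample, and raise an alert whenever increased() returns a
non-empty result. No dmesg parsing is involved.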

We always run the latest stable kernel on our production servers, and
this has brought great sorrow for the last year (I think the big netfs
drama began in 6.9 or so when the pgpriv2 refactoring began). There
have been numerous netfs/NFS/Ceph regressions and just as many
production outages on our side. The maintainers wouldn't respond to my
bug reports, so I had to figure it all out myself.
The latest regression that quickly took down our servers was a
"stable" backport of a performance optimization for epoll in 6.14.4,
leading to soft lockups in ep_poll(), see
https://lore.kernel.org/lkml/20250429185827.3564438-1-max.kellermann@ionos.com/
- but we observed it only after everything had already fallen apart.
Since our main process has switched from epoll to io_uring, only
second-order processes were falling apart. Had we had a soft lockup
counter, we could have noticed it earlier.

> A proposal to permanently extend Linux's userspace API requires better
> justification than an unsubstantiated assertion, surely?

The commits that added warn_count/oops_count literally only said "is a
fairly interesting signal". See commits 9db89b411170 ("exit: Expose
"oops_count" to sysfs") and 8b05aa263361 ("panic: Expose "warn_count"
to sysfs"). That's quite an unsubstantiated assertion, too, isn't it?

I agree with you, but I thought the point for a soft lockup counter
was trivial enough to see, and I didn't think you needed more
justification than the other counters.

Max
