linux-kernel - Re: [PATCH 0/2] hung_task: add detect count for hung tasks

Message-Id: <20241023212815.240844bdf83e4dc17b66b88c@linux-foundation.org>
Date: Wed, 23 Oct 2024 21:28:15 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Lance Yang <ioworker0@...il.com>
Cc: cunhuang@...cent.com, leonylgao@...cent.com, j.granados@...sung.com,
 jsiddle@...hat.com, kent.overstreet@...ux.dev, 21cnbao@...il.com,
 ryan.roberts@....com, david@...hat.com, ziy@...dia.com,
 libang.li@...group.com, baolin.wang@...ux.alibaba.com,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 0/2] hung_task: add detect count for hung tasks

On Thu, 24 Oct 2024 11:28:01 +0800 Lance Yang <ioworker0@...il.com> wrote:

> Hi Andrew,
> 
> Thanks a lot for paying attention!
> 
> On Thu, Oct 24, 2024 at 10:05 AM Andrew Morton
> <akpm@...ux-foundation.org> wrote:
> >
> > On Tue, 22 Oct 2024 19:47:34 +0800 Lance Yang <ioworker0@...il.com> wrote:
> >
> > > Hi all,
> > >
> > > This patchset adds a counter, hung_task_detect_count, to track the number of
> > > times hung tasks are detected. This counter provides a straightforward way
> > > to monitor hung task events without manually checking dmesg logs.
> > >
> > > With this counter in place, system issues can be spotted quickly, allowing
> > > admins to step in promptly before system load spikes occur, even if the
> > > hung_task_warnings value has been decreased to 0 well before.
> > >
> > > Recently, we encountered a situation where warnings about hung tasks were
> > > buried in dmesg logs during load spikes. Introducing this counter could
> > > have helped us detect such issues earlier and improve our analysis efficiency.
> > >
> >
> > Isn't the answer to this problem "write a better parser"?  I mean,
> 
> Yeah, I certainly agree that having a good parser is important, and I'm
> working on that as well ;)
> 
> > we're providing userspace with information which is already available.
> 
> IMHO, there are two reasons why this counter remains valuable:
> 
> 1) It allows us to easily detect hung tasks in time before load spikes occur,
> using simple and common monitoring tools like Prometheus.

But the new sysctl_hung_task_detect_count counter gets incremented a
microsecond before the printk comes out.  I don't understand the
difference.

> 2) It ensures that we remain aware of hung tasks even when the
> hung_task_warnings value has already been decreased to 0 well before.

That makes sense, I guess.  But fleshing this out with a real
operational scenario would help persuade reviewers of the benefit of
this change.

So please describe the utility with full details - sell it to us!
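
For illustration, the monitoring use discussed above (points 1 and 2) could
look roughly like the following from userspace. This is a minimal sketch, not
part of the patchset: it assumes the counter is exposed as
/proc/sys/kernel/hung_task_detect_count (the thread names the counter but not
its path), and the helper names read_counter() and new_detections() are
hypothetical.

```python
from pathlib import Path

# ASSUMPTION: the sysctl path below is a guess based on the counter name
# mentioned in the thread; adjust it to wherever the counter is exposed.
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def read_counter(path=COUNTER_PATH):
    """Return the current counter value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (kernel without the counter) or unexpected content.
        return None


def new_detections(previous, current):
    """Return how many hung-task detections occurred between two samples."""
    if previous is None or current is None:
        return 0
    if current < previous:
        # Counter went backwards, e.g. after a reboot; count from zero.
        return current
    return current - previous


if __name__ == "__main__":
    import time

    last = read_counter()
    time.sleep(60)  # hypothetical sampling interval
    now = read_counter()
    delta = new_detections(last, now)
    if delta:
        print(f"{delta} new hung task detection(s) since last sample")
```

Because the check compares counter samples rather than scanning dmesg, it
does not depend on whether hung_task_warnings has already dropped to 0.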