Message-ID: <CAAFQd5BeJnYXZt06WVFBWu8cvCmXWTe_tH8Ly3ywTNRCjxXCMA@mail.gmail.com>
Date: Wed, 30 Apr 2025 17:42:51 +0900
From: Tomasz Figa <tfiga@...omium.org>
To: Petr Mladek <pmladek@...e.com>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
John Ogness <john.ogness@...utronix.de>, Steven Rostedt <rostedt@...dmis.org>,
Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
On Sat, Apr 26, 2025 at 12:32 AM Petr Mladek <pmladek@...e.com> wrote:
>
> On Fri 2025-04-25 15:58:46, Tomasz Figa wrote:
> > Hi Petr,
> >
> > On Thu, Apr 24, 2025 at 7:59 PM Petr Mladek <pmladek@...e.com> wrote:
> > >
> > > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > > Currently, hung-task watchdog uses two different loglevels
> > > > to report hung-tasks: a) KERN_INFO for all the important task
> > > > information (e.g. sched_show_task()) and b) KERN_ERR for the
> > > > rest.
> > >
> > > IMHO, the two different loglevels make sense. The KERN_ERR
> > > message seems to inform that a task has been blocked for too long.
> > > And KERN_INFO is used for extra debug information.
> > >
> >
> > I agree that two different levels make sense, but I think that
> > KERN_INFO is not necessarily the best one to use, because we have
> > quite a lot of usual/expected things logged with that level, and
> > what we're logging here is clearly not a usual/expected event.
> >
> > My preference would be KERN_NOTICE.
>
> Sigh, this is the problem with loglevels. Different people have
> different feelings about them.
>
> A solution would be to add an extra log level. But the full 0-7
> (3 bit) range is already taken.
>
> > > > This makes it a little inconvenient, especially for
> > > > automated kernel logs parsing.
> > >
> > > Anyway, what is the exact problem, please?
> > > Are the KERN_INFO messages filtered because of console_loglevel?
> > > Or is it a problem to match all the related lines?
> >
> > The problem is that when we're looking at hundreds of reports with
> > various problems from the production fleet, we want to be able to
> > filter out some of the usual/expected logs. The easiest way to do
> > that is by using the message log level. However, if we set the
> > filters to anything more severe than KERN_INFO, we lose the task
> > dumps and we need to go and fetch the entire unfiltered log, which
> > is tedious.
>
> Good to know.
>
> This might be an argument for using the same log level for the entire
> report. But it might create new problems. It would be more complicated
> to filter out known problems. I mean that a single known
> warning/error/emergency message can be filtered easily. But
> creating a filter for the entire to-be-ignored backtrace is more
> complicated.
>
Yeah, but since this filtering would be happening in whatever code
reads those logs, outside of the kernel, we first need to get the
candidate backtraces from the kernel, so having them at a log level
appropriate for an unusual event would definitely help.
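To make that concrete, below is a minimal sketch of the kind of
per-record filtering I mean (purely illustrative, not our actual
pipeline); it assumes console-style lines that carry the printk level
as a leading <N> prefix, like the ones further down in this mail:

  import re
  import sys

  # Lower printk/syslog level = more severe:
  # 0=EMERG, 1=ALERT, 2=CRIT, 3=ERR, 4=WARNING, 5=NOTICE, 6=INFO, 7=DEBUG
  KEEP_AT_MOST = 5  # keep NOTICE and anything more severe, drop INFO/DEBUG

  PRIO_RE = re.compile(r'^<(\d+)>')

  def keep(line):
      m = PRIO_RE.match(line)
      if not m:
          return True  # no priority prefix, keep it to be safe
      # /dev/kmsg encodes the facility in the upper bits, so mask them off.
      return (int(m.group(1)) & 7) <= KEEP_AT_MOST

  for line in sys.stdin:
      if keep(line):
          sys.stdout.write(line)

With a filter like that, everything the hung-task detector prints at
KERN_INFO disappears together with the ordinary info-level noise, which
is exactly the problem.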
>
> > (FWIW, we're also developing an automated analysis tool and it would
> > make the implementation much easier if we could simply use the log
> > level to filter out expected vs unexpected events from the logs - and
> > most of the time that already works, the case Sergey's patch is
> > addressing is just one of the small number of exceptions.)
>
> It might be interesting to see the list of exceptions. Maybe, we
> could find some common pattern...
>
> It would be nice to handle all the reports of critical situations
> similar way. It would help everyone. This is why I am not happy with
> a hung-stask-detector-specific setting.
>
I was convinced that the soft/hard_lockup detector also does the same,
but I double checked and it seems like it [1] just calls dump_stack(),
which uses KERN_DEFAULT [2]. In our case that defaults to 4
(KERN_WARNING), which is more severe than the 6 (KERN_INFO) we want to
filter out, so I guess making the hung task watchdog behave the same
way would also work for us.
[1] https://elixir.bootlin.com/linux/v6.14.4/source/lib/nmi_backtrace.c#L94
[2] https://elixir.bootlin.com/linux/v6.14.4/source/lib/dump_stack.c#L127
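For clarity, since the numbering reads backwards at first glance: lower
loglevel numbers are more severe, so a threshold that keeps anything
numerically below KERN_INFO retains the dump_stack() output while
dropping the ordinary info-level messages. A rough illustration (the
numbers are the kernel's loglevels 0 (EMERG) through 7 (DEBUG)):

  # Lower value = more severe.
  KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR = 0, 1, 2, 3
  KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG = 4, 5, 6, 7

  def kept(level, threshold=KERN_INFO):
      return level < threshold

  print(kept(KERN_WARNING))  # True:  dump_stack() output survives
  print(kept(KERN_INFO))     # False: ordinary info messages are dropped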
Besides that, I don't actually have any other cases at hand. I guess
we'll find out once we look at more crash reports.
> > > If the problem is matching all the related lines, then a solution
> > > would be printing some help lines around the report, similar
> > > to
> > >
> > > ------------[ cut here ]------------
> > >
> > > in include/asm-generic/bug.h
> > >
> > > Plus, it would be needed to filter out messages from other CPUs.
> > > CONFIG_PRINTK_CALLER should help with this.
> >
> > I'm not really in love with that idea - it would make things so much
> > more complicated, despite us already having the right tool to
> > differentiate the importance of various logs - after all, the
> > log level is exactly that.
>
> Honestly, the more I think about it the more I like the prefix/postfix
> lines + the caller_id. I am afraid that manipulating log levels is a
> lost fight because different people might have different opinions
> about how important various messages are.
The problem with the special lines is that it completely breaks any
line-based processing in a data pipeline. For a piece of
infrastructure that needs to deal with thousands of reports on an
on-demand basis, that would mean quite a bit of sequential work
instead of parallel processing, and much more time to answer
users' queries.
That could be worked around, though, if we could prefix each line
separately with some special tag in addition to the log level,
timestamp and caller. Borrowing from Sergey's earlier example:
<3>[ 125.297687][ T140][E] INFO: task zsh:470 blocked for more than 61 seconds.
<3>[ 125.302321][ T140][E] Not tainted 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
<3>[ 125.309333][ T140][E] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 125.315040][ T140][E] task:zsh state:D stack:0 pid:470 tgid:470 ppid:430 task_flags:0x400100 flags:0x00004002
<6>[ 125.320594][ T140][E] Call Trace:
<6>[ 125.322327][ T140][E] <TASK>
<6>[ 125.323852][ T140][E] __schedule+0x13b4/0x2120
<6>[ 125.325459][ T140][E] ? schedule+0xdc/0x280
<6>[ 125.327100][ T140][E] schedule+0xdc/0x280
<6>[ 125.328590][ T140][E] schedule_preempt_disabled+0x10/0x20
<6>[ 125.330589][ T140][E] __mutex_lock+0x698/0x1200
<6>[ 125.332291][ T140][E] ? __mutex_lock+0x485/0x1200
<6>[ 125.334074][ T140][E] mutex_lock+0x81/0x90
<6>[ 125.335113][ T140][E] drop_caches_sysctl_handler+0x3e/0x140
<6>[ 125.336665][ T140][E] proc_sys_call_handler+0x327/0x4f0
<6>[ 125.338069][ T140][E] vfs_write+0x794/0xb60
<6>[ 125.339216][ T140][E] ? proc_sys_read+0x10/0x10
<6>[ 125.340568][ T140][E] ksys_write+0xb8/0x170
<6>[ 125.341701][ T140][E] do_syscall_64+0xd0/0x1a0
<6>[ 125.343009][ T140][E] ? arch_exit_to_user_mode_prepare+0x11/0x60
<6>[ 125.344612][ T140][E] ? irqentry_exit_to_user_mode+0x7e/0xa0
<6>[ 125.346260][ T140][E] entry_SYSCALL_64_after_hwframe+0x4b/0x53
where [E] would mean an "emergency" message, rather than something
usual, regardless of the loglevel.
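Consuming that on the log-processing side would then stay a purely
per-line affair. A minimal sketch, with the record format and the "[E]"
tag taken from the (hypothetical) example above:

  import re
  import sys

  # Hypothetical format from the example above:
  #   <level>[ timestamp][ caller][E] message
  # The "[E]" marker (name, position and spelling) is just the proposal
  # being discussed here, not something the kernel emits today.
  LINE_RE = re.compile(
      r'^<(?P<level>\d+)>'          # printk loglevel
      r'\[\s*(?P<ts>[\d.]+)\]'      # timestamp
      r'\[\s*(?P<caller>[^\]]+)\]'  # caller id (CONFIG_PRINTK_CALLER)
      r'(?P<emerg>\[E\])?'          # optional per-line "emergency" tag
      r'\s*(?P<msg>.*)$'
  )

  def is_unusual(line):
      """True if the line carries the [E] tag, regardless of its loglevel."""
      m = LINE_RE.match(line)
      return bool(m and m.group('emerg'))

  # Stateless per-line check, so records can be sharded and processed in
  # parallel without re-assembling multi-line reports first.
  for line in sys.stdin:
      if is_unusual(line):
          sys.stdout.write(line)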
Best,
Tomasz