linux-kernel - Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aAuie7Pgm3_9MsXK@pathway.suse.cz>
Date: Fri, 25 Apr 2025 16:55:55 +0200
From: Petr Mladek <pmladek@...e.com>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Tomasz Figa <tfiga@...omium.org>,
	John Ogness <john.ogness@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel

On Fri 2025-04-25 13:49:13, Sergey Senozhatsky wrote:
> Hi Petr,
> 
> On (25/04/24 12:58), Petr Mladek wrote:
> > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > Currently, hung-task watchdog uses two different loglevels
> > > to report hung-tasks: a) KERN_INFO for all the important task
> > > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > > rest.
> > 
> > IMHO, the two different loglevels make sense. The KERN_ERR
> > message seems to inform about that a task gets blocked for too long.
> > And KERN_INFO is used for an extra debug information.
> > 
> > > This makes it a little inconvenient, especially for
> > > automated kernel logs parsing.
> > 
> > Anyway, what is the exact problem, please?
> > Are the KERN_INFO messages filtered because of console_loglevel?
> > Or is it a problem to match all the related lines?
> 
> The latter one.  A made up example, just to demonstrate what we are
> getting now:
> 

Note: I do not have strong opinion. I am rather thinking loudly.

When I look at it. A prefix line:

<6>[  125.297687][  T140] ------------[ cut here ]------------

> <3>[  125.297687][  T140] INFO: task zsh:470 blocked for more than 61 seconds.
> <3>[  125.302321][  T140]       Not tainted 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
> <3>[  125.309333][  T140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>[  125.315040][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> <6>[  125.320594][  T140] Call Trace:
> <6>[  125.322327][  T140]  <TASK>
> <6>[  125.323852][  T140]  __schedule+0x13b4/0x2120
> <6>[  125.325459][  T140]  ? schedule+0xdc/0x280
> <6>[  125.327100][  T140]  schedule+0xdc/0x280
> <6>[  125.328590][  T140]  schedule_preempt_disabled+0x10/0x20
> <6>[  125.330589][  T140]  __mutex_lock+0x698/0x1200
> <6>[  125.332291][  T140]  ? __mutex_lock+0x485/0x1200
> <6>[  125.334074][  T140]  mutex_lock+0x81/0x90
> <6>[  125.335113][  T140]  drop_caches_sysctl_handler+0x3e/0x140
> <6>[  125.336665][  T140]  proc_sys_call_handler+0x327/0x4f0
> <6>[  125.338069][  T140]  vfs_write+0x794/0xb60
> <6>[  125.339216][  T140]  ? proc_sys_read+0x10/0x10
> <6>[  125.340568][  T140]  ksys_write+0xb8/0x170
> <6>[  125.341701][  T140]  do_syscall_64+0xd0/0x1a0
> <6>[  125.343009][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
> <6>[  125.344612][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
> <6>[  125.346260][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> <6>[  125.347772][  T140] RIP: 0033:0x7fa4bd8be687
> <6>[  125.348958][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> <6>[  125.351161][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
> <6>[  125.353221][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
> <6>[  125.355338][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
> <6>[  125.357424][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> <6>[  125.359677][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
> <6>[  125.361551][  T140]  </TASK>
> <3>[  125.362363][  T140] INFO: task zsh:470 is blocked on a mutex likely owned by task zsh:470.
> <6>[  125.364467][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> <6>[  125.367493][  T140] Call Trace:
> <6>[  125.368359][  T140]  <TASK>
> <6>[  125.369180][  T140]  __schedule+0x13b4/0x2120
> <6>[  125.370364][  T140]  ? schedule+0xdc/0x280
> <6>[  125.371486][  T140]  schedule+0xdc/0x280
> <6>[  125.372518][  T140]  schedule_preempt_disabled+0x10/0x20
> <6>[  125.374049][  T140]  __mutex_lock+0x698/0x1200
> <6>[  125.375326][  T140]  ? __mutex_lock+0x485/0x1200
> <6>[  125.376572][  T140]  mutex_lock+0x81/0x90
> <6>[  125.377773][  T140]  drop_caches_sysctl_handler+0x3e/0x140
> <6>[  125.379391][  T140]  proc_sys_call_handler+0x327/0x4f0
> <6>[  125.380715][  T140]  vfs_write+0x794/0xb60
> <6>[  125.381951][  T140]  ? proc_sys_read+0x10/0x10
> <6>[  125.383083][  T140]  ksys_write+0xb8/0x170
> <6>[  125.384329][  T140]  do_syscall_64+0xd0/0x1a0
> <6>[  125.385461][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
> <6>[  125.387110][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
> <6>[  125.388566][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> <6>[  125.390194][  T140] RIP: 0033:0x7fa4bd8be687
> <6>[  125.391268][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> <6>[  125.393367][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
> <6>[  125.395506][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
> <6>[  125.397494][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
> <6>[  125.399701][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> <6>[  125.401663][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
> <6>[  125.403579][  T140]  </TASK>

and a suffix line

<6>[  125.403579][  T140] ------------[ cut here ]------------

might be helpful for both automated monitors and humans => win win solution.

> So there are <3> and then <6> lines and we can't easily tell when to
> stop parsing and what to consider part of the automated error report
> and what not to, there also can be some other <6> lines in between.

All the related information should be printed from the same context.
It is "[  T140]" in this case. The only exception would be backtraces
from other CPUs.

> > > Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
> > > becomes configurable.
> > 
> > I am not sure if adding hung-task-specific config option is
> > the right solution. I guess that other watchdogs or other
> > similar reports have the same problem.
> >
> > It seems that several other reports, for example,
> > watchdog_hardlockup_check(), or __die(), are using KERN_DEFAULT
> > which is configurable via CONFIG_MESSAGE_LOGLEVEL_DEFAULT.
> > 
> > A solution might be using KERN_DEFAULT for sched_show_task()
> > in hung_tasks detector as well.
> 
> So my worry with CONFIG_MESSAGE_LOGLEVEL_DEFAULT was that, if we
> set it, say, to 3 (KERN_ERR), then how many "false positives" we
> will get?  There are many printk("") calls in the kernel that
> default to MESSAGE_LOGLEVEL_DEFAULT, as far as I understand it:
> 	git grep "printk(\"" | grep -v TP_printk | wc -l
> 	9001

Just for record, this still counts also many other printk() wrappers,
e.g. bpf_printk(), srm_printk(), dprintk().

It should be more precise with -w option:

	$> git grep -w "printk(\"" | wc -l
	2830

That said, I agree that it might still mean a lot of false
positives.

> But maybe that is the solution.  Do you want to switch sched_show_task()
> to KERN_DEFAULT for all or would it be better to still introduce
> sched_show_task_log_lvl() can call it with KERN_DEFAULT only from
> kernel/hung_task.c?

If you want to go this way then I would introduce sched_show_task_log_lvl().
It would allow to fix the various reports incrementally.

For example, I see sched_show_task() used in show_cpu_pool_hog() in
kernel/workqueue.c and some other messages are KERN_INFO. So, using
another log level in sched_show_task() would cause mess here.

That said, you probably want to fix workqueue stall report
as well. I believe that you have the workqueue watchdog enabled
and it has the same problem as the hung_task detector.


My opinion:

I think that using the prefix/postfix "cut here" line is a win-win
solution. And I would personally go this way.

But I understand this might have other problems, like the need
of the full log (the other reply from Thomasz Figa). You really
might want to use the same loglevel for the entire report.

If you want the same loglevel then I really suggest to introduce
a generic CONFIG option, e.g. CONFIG_CONSISTENT_REPORT_LOGLEVEL
and use is everywhere to force the consistent loglevel,
for example, hung taskdetector, workqueue watchdog, soft lockup
detector, Oops, ...


Idea:

Some people complained about a non-consistent loglevel when pr_cont()
lines were not connected because another context (CPU, task,
interrupt) added a message in between.

I had an idea to store the last used loglevel in struct task_struct.
But it did not look worth it just for continues lines.

This might be another use-case. So we could add something like:

struct printk_context {
	u8	deferred;	/* printk_safe/deferred nesting	counter. */
	u8      emergency;	/* nbcon_emergency nesting counter */
	u8	level:3;	/* last used or forced log level */
	u8	flags:5;
}

and one of the flags might be

enum printk_context_flag {
	pr_ctxt_force_level = 0x0,	/* Force the same loglevel for	all messages */
};

and then we could create an API:

void printk_force_loglevel_enter(const char *lvl)
{
	struct printk_context *pr_ctxt;

	pr_ctxt = printk_get_context();
	/* FIXME: make sure the it returns a valid level 0..7 */
	pr_ctxt.level = printk_get_level(lvl);
	pt_ctxt.flags |= pr_ctct_force_level;
}

void printk_force_loglevel_exit(void)
{
	struct printk_context *pr_ctxt;

	pr_ctxt = printk_get_context();
	pt_ctxt.flags &= ~pr_ctct_force_level;
}

, where printk_get_context() would return the right struct
printk_context either from struct task_struct for task context or
a static per-CPU variable for IRQ or NMI context.

FIXME: We should integrate the above with the existing:

	  + printk_context used by printk_safe_enter()/exit()
	  + nbcon_pcpu_emergency_nesting
	  + ??? (Did I miss any other printk context counter/flag?)


Also we could even create an API for consistent reports:

static atomic_t printk_report_count = ATOMIC_INIT(0);
static bool printk_force_report_loglevel = IS_ENABLED(CONFIG_PRINTK_FORCE_REPORT_LOGLEVEL);

int printk_report_start(const char *lvl)
{
	int report_id = atomic_inc_return(&printk_report_count);

	if (printk_force_report_loglevel)
		printk_force_loglevel_enter(lvl);

	printk("%s---------- Report No. %d start --------\n", lvl, report_id);

	return report_id;
}

void printk_report_end(const char *lvl, int report_id)
{
	printk("%s---------- Report No. %d end --------\n", lvl, report_id);

	if (printk_force_report_loglevel)
		printk_force_loglevel_exit();
}

This API would help to standardize the reports from various watchdogs:

  + The "start/end" lines would help both humans and bots.

  + The optional loglevel forcing would help to filter the messages depending
    on whether people want to see all details using the same loglevel
    or not.


My problem with the log level manipulation is that I do not know what
loglevel right. The loglevel might be important with slow consoles which
need not be suitable for all the details.

Best Regards,
Petr