Message-ID: <Y4TYNF8jRSkGii/U@alley>
Date:   Mon, 28 Nov 2022 16:48:04 +0100
From:   Petr Mladek <pmladek@...e.com>
To:     akpm@...ux-foundation.org, peterz@...radead.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] kernel/hung_task: print real_parent->comm, pid in
 check_hung_task

On Thu 2022-11-24 19:25:26, Tio Zhang wrote:
> We can avoid a hung task by fixing the process that causes it.
> But sometimes it is difficult to find out which service
> the bad process belongs to by only knowing its pid and comm.
> Since userspace tools to learn who launched the bad process
> do not always work when we get a hung task,
> it is helpful for the kernel to print the parent.

Could you please be more specific about how the information about
the parent helped to debug the problem?

Was it really important who started the process?
Was it related to some cgroup limits or permissions?

> Signed-off-by: Tio Zhang <tiozhang@...iglobal.com>
> ---
>  kernel/hung_task.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index c71889f3f3fc..33543d27bd5c 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -89,6 +89,7 @@ static struct notifier_block panic_block = {
>  
>  static void check_hung_task(struct task_struct *t, unsigned long timeout)
>  {
> +	struct task_struct *p = t->real_parent;

IMHO, this should be read using rcu_dereference(t->real_parent).

Note that check_hung_task() is already called under
rcu_read_lock() from check_hung_uninterruptible_tasks().
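
To illustrate what rcu_dereference() buys here, below is a minimal
userspace analogue built on C11 atomics, not on kernel RCU. The names
struct task, struct parent_info, read_parent() and set_parent() are
hypothetical. The acquire load is a stronger, portable stand-in for the
dependency ordering that rcu_dereference() provides, and the release
store plays the role of rcu_assign_pointer(). The sketch does not model
the grace period that keeps the parent's task_struct alive; in the
kernel, that is the job of the rcu_read_lock() mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical userspace stand-in for task_struct: only the fields
 * the patch touches are modeled. */
struct task {
	char comm[16];
	int pid;
	/* Published pointer, analogous to t->real_parent. */
	struct task *_Atomic real_parent;
};

/* Snapshot of the parent's identity. */
struct parent_info {
	char comm[16];
	int pid;
};

/*
 * Analogue of "p = rcu_dereference(t->real_parent)": one acquire load
 * of the published pointer, then copy the fields we want to print.
 * Unlike kernel RCU, this does NOT keep the parent alive; the caller
 * must guarantee that, as rcu_read_lock() does in the kernel.
 */
static struct parent_info read_parent(struct task *t)
{
	struct parent_info info = { "", 0 };
	struct task *p;

	assert(t != NULL);
	p = atomic_load_explicit(&t->real_parent, memory_order_acquire);
	if (p) {
		memcpy(info.comm, p->comm, sizeof(info.comm));
		info.comm[sizeof(info.comm) - 1] = '\0';
		info.pid = p->pid;
	}
	return info;
}

/* Writer side: publish a new parent. The release store pairs with the
 * acquire load in read_parent(), in the spirit of rcu_assign_pointer(). */
static void set_parent(struct task *t, struct task *parent)
{
	atomic_store_explicit(&t->real_parent, parent, memory_order_release);
}
```

Because the pointer is loaded once and both fields are copied from that
single load, the printed comm and pid always belong to the same parent,
even if set_parent() runs concurrently (provided the old parent object
stays valid, which is exactly what RCU guarantees in the kernel).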

>  	unsigned long switch_count = t->nvcsw + t->nivcsw;
>  
>  	/*
> @@ -129,8 +130,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>  	if (sysctl_hung_task_warnings) {
>  		if (sysctl_hung_task_warnings > 0)
>  			sysctl_hung_task_warnings--;
> -		pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
> -		       t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
> +		pr_err("INFO: task %s:%d, parent %s:%d blocked for more than %ld seconds.\n",
> +		       t->comm, t->pid, p->comm, p->pid, (jiffies - t->last_switch_time) / HZ);

IMHO, this is the wrong place. The wording does more harm than
good. It might lead people to think that both processes are blocked,
or give the impression that the parent somehow caused the deadlock.

But if I understand correctly, the information about the parent is
needed only in special situations where only a particular parent
triggers the lockup.

>  		pr_err("      %s %s %.*s\n",
>  			print_tainted(), init_utsname()->release,
>  			(int)strcspn(init_utsname()->version, " "),

An alternative solution would be to print the parent in
sched_show_task(), which is called here as well.

sched_show_task() prints a lot of information that might be
useful for debugging, and the parent is just one more piece of
information that might be useful.

Also, sched_show_task() is called in more situations where
this information might be useful as well.
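
A small userspace mock-up of that split: the headline keeps the existing
wording from the patch's context lines and names only the blocked task,
while the parent becomes one more field on a separate per-task detail
line. All names (struct task, format_headline(), format_task_details())
and the detail-line format are invented for illustration; this is not
the actual sched_show_task() output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the task fields being printed. */
struct task {
	const char *comm;
	int pid;
	const struct task *parent;	/* NULL if unknown */
};

/* Headline: names only the blocked task, as the current message does. */
static int format_headline(char *buf, size_t len, const struct task *t,
			   long secs)
{
	assert(t != NULL);
	return snprintf(buf, len,
			"INFO: task %s:%d blocked for more than %ld seconds.",
			t->comm, t->pid, secs);
}

/* Per-task detail line, in the spirit of sched_show_task(): the parent
 * is just another neutral field, so it does not read as "also blocked". */
static int format_task_details(char *buf, size_t len, const struct task *t)
{
	assert(t != NULL);
	if (t->parent)
		return snprintf(buf, len, "task:%s pid:%d parent:%s:%d",
				t->comm, t->pid, t->parent->comm,
				t->parent->pid);
	return snprintf(buf, len, "task:%s pid:%d parent:?", t->comm, t->pid);
}
```

With a task "worker" (pid 200) whose parent is "sshd" (pid 100), the
headline mentions only worker:200, and the detail line carries
parent:sshd:100.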

Best Regards,
Petr
