linux-kernel - Re: [PATCH] kernel/hung_task.c: Monitor killed tasks.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190516115758.6v7oitg3vbkfhh5j@pathway.suse.cz>
Date:   Thu, 16 May 2019 13:57:58 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.ibm.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Liu Chuansheng <chuansheng.liu@...el.com>,
        Valdis Kletnieks <valdis.kletnieks@...edu>,
        linux-kernel@...r.kernel.org, Dmitry Vyukov <dvyukov@...gle.com>,
        Stephen Rothwell <sfr@...b.auug.org.au>
Subject: Re: [PATCH] kernel/hung_task.c: Monitor killed tasks.

CCed Stephen to discuss linux-next related question at the bottom
of the mail.

On Thu 2019-05-16 17:19:12, Tetsuo Handa wrote:
> On 2019/05/15 19:55, Petr Mladek wrote:
> >> +	if (!stamp) {
> >> +		stamp = jiffies;
> >> +		if (!stamp)
> >> +			stamp++;
> >> +		t->killed_time = stamp;
> >> +		return;
> >> +	}
> > 
> > I might be too dumb but the above code looks pretty tricky to me.
> > It would deserve a comment. Or better, I would remove
> > trick to handle overflow. If it happens, we would just
> > lose one check period.
> 
> We can use
> 
>   static inline unsigned long jiffies_nonzero(void)
>   {
>       const unsigned long stamp = jiffies;
> 
>       return stamp ? stamp : -1;
>   }
> 
> or even shortcut "jiffies | 1" because difference by one jiffie
> is an measurement error for multiple HZ of timeout.

I would just ignore the overflow. We would just start measuring
the timeout in the next check_hung_task() call. It is not
a big deal and removes few lines of a tricky code.

> >> +	if (time_is_after_jiffies(stamp + timeout * HZ))
> >> +		return;
> >> +	trace_sched_process_hang(t);
> >> +	if (sysctl_hung_task_panic) {
> >> +		console_verbose();
> >> +		hung_task_call_panic = true;
> > 
> > IMHO, the delayed task exit is much less fatal than sleeping
> > in an uninterruptible state.
> > 
> > Anyway, the check is much less reliable. In case of hung_task,
> > it is enough when the task gets scheduled. In the new check,
> > the task has to do some amount of work until the signal
> > gets handled and do_exit() is called.
> > 
> > The panic should either get enabled separately or we should
> > never panic in this case.
> 
> OK, we should not share existing sysctl settings.
> 
> But in the context of syzbot's testing where there are only 2 CPUs
> in the target VM (which means that only small number of threads and
> not so much memory) and threads get SIGKILL after 5 seconds from fork(),
> being unable to reach do_exit() within 10 seconds is likely a sign of
> something went wrong. For example, 6 out of 7 trials of a reproducer for
> https://syzkaller.appspot.com/bug?id=835a0b9e75b14b55112661cbc61ca8b8f0edf767
> resulted in "no output from test machine" rather than "task hung".
> This patch is revealing that such killed threads are failing to reach
> do_exit() because they are trapped at unkillable retry loop due to a
> race bug.
> 
> Therefore, I would like to try this patch in linux-next.git for feasibility
> testing whether this patch helps finding more bugs and reproducers for such
> bugs, by bringing "unable to terminate threads" reports out of "no output from
> test machine" reports. We can add sysctl settings before sending to linux.git.

In this case, the watchdog should get enabled on with
CONFIG_DEBUG_AID_FOR_SYZBOT

Also we should ask/inform Stephen about this. I am not sure
if he is willing to resolve eventual conflicts for these
syzboot-specific patches that are not upstream candidates.

A solution might be to create sysbot-specific for-next branch
that Stephen might simply ignore when there are conflicts.
And you would be responsible for updating it.

Best Regards,
Petr