linux-kernel - Re: [PATCH] kernel/hung_task.c: allow to set period separately from timeout

Message-ID: <CACT4Y+adbYJdmgzuLVV+iy+Q+tii_cnKm=VwCfdkr8uKWbww+Q@mail.gmail.com>
Date:   Mon, 11 Jun 2018 13:16:18 +0200
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Paul McKenney <paulmck@...ux.vnet.ibm.com>,
        LKML <linux-kernel@...r.kernel.org>,
        syzkaller <syzkaller@...glegroups.com>
Subject: Re: [PATCH] kernel/hung_task.c: allow to set period separately from timeout

On Sat, Jun 9, 2018 at 9:00 AM, Tetsuo Handa
<penguin-kernel@...ove.sakura.ne.jp> wrote:
> On 2018/06/09 6:58, Andrew Morton wrote:
>> On Fri,  8 Jun 2018 15:30:43 +0200 Dmitry Vyukov <dvyukov@...gle.com> wrote:
>>
>>> Currently the hung task checking period is equal to the timeout;
>>> as a result, a hang is detected anywhere between timeout and 2*timeout.
>>> This is fine for most interactive environments, but this hurts automated
>>> testing setups (syzbot). In an automated setup we need to strictly order
>>> CPU lockup < RCU stall < workqueue lockup < task hung < silent loss,
>>> so that RCU stall is not detected as task hung and task hung is not
>>> detected as silent machine loss. The large variance in task hung
>>> detection timeout requires setting silent machine loss timeout to
>>> a very large value (e.g. if task hung is 3 mins, then silent loss
>>> needs to be set to ~7 mins). The additional 3 minutes significantly
>>> reduce testing efficiency because usually we crash the kernel within
>>> a minute, and this can add hours to the bug localization process as it
>>> needs to do dozens of tests.
>>>
>>> Allow setting checking period separately from timeout.
>>> This allows setting the timeout to, say, 3 minutes, but the period to 10 secs.
>>>
>>> The period is controlled via a new hung_task_period_secs sysctl,
>>> similar to the existing hung_task_timeout_secs sysctl.
>>> The default value of 0 results in the current behavior.
>>
>> I'm rather struggling to understand the difference between "period" and
>> "timeout".  We would benefit from a clear description of what these two
>> things do.  An appropriate place for this description is
>> Documentation/sysctl/kernel.txt, which this patch forgot to update.
>
> My understanding is that "period" is "how frequently we should check"
> and "timeout" is "how long a thread remained uninterruptible". Maybe
> hung_task_check_interval_secs would be better than hung_task_period_secs.

Hi Tetsuo, Andrew,

I've just mailed v2:

    Changes since v1:
     - add entry to Documentation/sysctl/kernel.txt
     - rename hung_task_period_secs sysctl to hung_task_check_interval_secs

Hopefully the difference between the two, and what each one does, is clearer now.
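
To make the timeout/interval distinction concrete, here is a small
user-space model (not the kernel code). It assumes the checker wakes up at
fixed multiples of the check interval and reports a task once it has been
blocked for at least the timeout. The function names are made up for
illustration, and an interval of 0 is taken to mean "same as timeout", as
in the patch description.

```python
def detection_time(block_start, timeout, interval=0):
    """Time of the first periodic check that sees the task as hung.

    Simplified model: checks run at t = interval, 2*interval, ...
    and a task is reported once it has been blocked >= timeout seconds.
    interval == 0 means "use the timeout as the interval".
    """
    if interval == 0:
        interval = timeout
    deadline = block_start + timeout
    # Smallest k >= 1 with k * interval >= deadline (integer ceiling).
    k = max(1, -(-deadline // interval))
    return k * interval


def detection_delay(block_start, timeout, interval=0):
    """Seconds between the task blocking and the hang being reported."""
    return detection_time(block_start, timeout, interval) - block_start


if __name__ == "__main__":
    # Task blocks 1 second after a check, timeout is 3 minutes.
    print(detection_delay(1, 180))      # interval == timeout
    print(detection_delay(1, 180, 10))  # interval == 10 secs
```

In this model the delay always falls in [timeout, timeout + interval), so
a 10 second interval bounds detection at under 190 seconds instead of
under 360.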



> timeout = 60 and period = 1 would allow a hung task to be reported as soon
> as it has remained uninterruptible for 60 seconds. That makes it easier for me
> to narrow down the relevant kernel messages and syzbot program.
>
> Well, showing exact slept time, along with all threads which slept more
> than some threshold (e.g. timeout / 2), might be helpful.

You mean if we report any task, then scan all tasks a second time and
additionally report tasks that have been blocked for (timeout/2 : timeout)?

Should we do this when hung_task_show_lock is set? Or only when
sysctl_hung_task_panic is set? Or under some other condition?
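
For reference, my reading of the two-pass idea as a user-space sketch (not
kernel code). The task table, the function name and the open interval
(timeout/2 : timeout) are assumptions for illustration only.

```python
def scan_tasks(blocked_secs, timeout):
    """Two-pass scan over a {task_name: seconds_blocked} mapping.

    Pass 1 collects tasks blocked for at least `timeout` seconds.
    Pass 2 runs only if pass 1 found something, and collects tasks
    blocked for more than timeout/2 but less than timeout.
    Returns (hung, nearly_hung) as lists of (name, seconds) pairs.
    """
    hung = [(name, secs) for name, secs in blocked_secs.items()
            if secs >= timeout]
    if not hung:
        # Nothing to report, so skip the second pass entirely.
        return [], []
    nearly_hung = [(name, secs) for name, secs in blocked_secs.items()
                   if timeout / 2 < secs < timeout]
    return hung, nearly_hung


if __name__ == "__main__":
    tasks = {"a": 200, "b": 100, "c": 30}
    hung, nearly = scan_tasks(tasks, timeout=180)
    for name, secs in hung:
        print(f"hung: {name} blocked for {secs} secs")
    for name, secs in nearly:
        print(f"also blocked: {name} for {secs} secs")
```

Printing the seconds blocked next to each task would also cover the
"showing exact slept time" part of the suggestion.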