linux-kernel - Re: [PATCH v2] watchdog/mm: Allow dumping memory info in pretimeout


Message-ID: <041dcfd3d4e45c387fa1f6f49f53ccb59967b104.camel@axis.com>
Date:   Wed, 14 Jun 2023 07:42:46 +0000
From:   Vincent Whitchurch <Vincent.Whitchurch@...s.com>
To:     "linux@...ck-us.net" <linux@...ck-us.net>,
        "wim@...ux-watchdog.org" <wim@...ux-watchdog.org>,
        Vincent Whitchurch <Vincent.Whitchurch@...s.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
CC:     "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        kernel <kernel@...s.com>,
        "linux-watchdog@...r.kernel.org" <linux-watchdog@...r.kernel.org>
Subject: Re: [PATCH v2] watchdog/mm: Allow dumping memory info in pretimeout

On Mon, 2023-06-12 at 07:53 -0700, Guenter Roeck wrote:
> On 6/12/23 00:26, Vincent Whitchurch wrote:
> > On my (embedded) systems, the most common cause of hitting the watchdog
> > (pre)timeout is due to thrashing.  Diagnosing these problems is hard
> > without knowing the memory state at the point of the watchdog hit.  In
> > order to make this information available, add a module parameter to the
> > watchdog pretimeout panic governor to ask it to dump memory info and the
> > OOM task list (using a new helper in the OOM code) before triggering the
> > panic.
> 
> Personally I don't think this is the right way of approaching this problem.
> First, the userspace task controlling the watchdog should run as realtime
> task, forced to be in memory, and not be affected by thrashing.

That may not be appropriate in all cases since you may want the watchdog
to hit when the system as a whole really is unusable.
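
For reference, the setup suggested in the quoted text (a realtime watchdog
task locked into memory) could look roughly like the sketch below. It is not
part of the patch. The helper name pin_watchdog_task() and the choice of
SCHED_FIFO are assumptions made for illustration, and both calls normally
need CAP_SYS_NICE / CAP_IPC_LOCK or suitable rlimits.

```c
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: lock the calling process into RAM and switch it to
 * the SCHED_FIFO realtime class. Returns 0 on success, -1 on failure
 * (errno is left as set by the failing call).
 */
static int pin_watchdog_task(int rt_priority)
{
	struct sched_param sp;

	/* Keep current and future mappings resident so they are never paged out. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -1;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = rt_priority;

	/* pid 0 means "the calling thread". */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		return -1;

	return 0;
}
```

A daemon set up this way keeps petting the watchdog even while the rest of
the system is thrashing, which is exactly the case where the watchdog would
then never fire.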

> Second, the problem should be observable well before the watchdog fires.

Yes, there are ways to try to detect it earlier (e.g. PSI) and attempt
recovery, even if the kernel's OOM killer itself is very slow to react.
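
As an illustration of the PSI approach, a userspace monitor can poll
/proc/pressure/memory and act when the "full" average gets high. The
following is only a minimal sketch. The names psi_parse_line() and
psi_full_avg10() are made up for this example, and any threshold to act on
is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a PSI file such as /proc/pressure/memory. */
struct psi_line {
	char kind[8];               /* "some" or "full" */
	double avg10, avg60, avg300; /* stall percentages over 10s/60s/300s */
	unsigned long long total_us; /* cumulative stall time in microseconds */
};

/*
 * Parse a line like
 *   "full avg10=1.23 avg60=0.50 avg300=0.10 total=12345"
 * Returns 0 on success, -1 if the line does not have that shape.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);

	return n == 5 ? 0 : -1;
}

/*
 * Return the "full" avg10 value from the given PSI file, or -1.0 if the
 * file cannot be read or has no "full" line.
 */
static double psi_full_avg10(const char *path)
{
	char buf[256];
	struct psi_line l;
	double result = -1.0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1.0;

	while (fgets(buf, sizeof(buf), f)) {
		if (psi_parse_line(buf, &l) == 0 && strcmp(l.kind, "full") == 0) {
			result = l.avg10;
			break;
		}
	}

	fclose(f);
	return result;
}
```

A monitor would call psi_full_avg10("/proc/pressure/memory") periodically
and start recovery once the value stays above a threshold of its choosing.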

But if those attempts fail for whatever reason and we actually do end up
hitting the watchdog, something like this patch provides information
which is invaluable for diagnosing the problem.

> Last but not least, I don't think it is appropriate to intertwine
> watchdog code with oom handling code as suggested here.

The show_mem() function is in lib/ so that's outside of the OOM
handling.  The oom_dump_tasks() function could perhaps be refactored and
moved to a neutral location so then we would avoid the intertwining.