Message-ID: <041dcfd3d4e45c387fa1f6f49f53ccb59967b104.camel@axis.com>
Date:   Wed, 14 Jun 2023 07:42:46 +0000
From:   Vincent Whitchurch <Vincent.Whitchurch@...s.com>
To:     "linux@...ck-us.net" <linux@...ck-us.net>,
        "wim@...ux-watchdog.org" <wim@...ux-watchdog.org>,
        Vincent Whitchurch <Vincent.Whitchurch@...s.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
CC:     "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        kernel <kernel@...s.com>,
        "linux-watchdog@...r.kernel.org" <linux-watchdog@...r.kernel.org>
Subject: Re: [PATCH v2] watchdog/mm: Allow dumping memory info in pretimeout

On Mon, 2023-06-12 at 07:53 -0700, Guenter Roeck wrote:
> On 6/12/23 00:26, Vincent Whitchurch wrote:
> > On my (embedded) systems, the most common cause of hitting the watchdog
> > (pre)timeout is thrashing.  Diagnosing these problems is hard
> > without knowing the memory state at the point of the watchdog hit.  In
> > order to make this information available, add a module parameter to the
> > watchdog pretimeout panic governor to ask it to dump memory info and the
> > OOM task list (using a new helper in the OOM code) before triggering the
> > panic.
> 
> Personally I don't think this is the right way of approaching this problem.
> First, the userspace task controlling the watchdog should run as a realtime
> task, forced to be in memory, and not be affected by thrashing.

That may not be appropriate in all cases since you may want the watchdog
to hit when the system as a whole really is unusable.

> Second, the problem should be observable well before the watchdog fires.

Yes, there are ways to try to detect it earlier (e.g. PSI) and attempt
recovery, even if the kernel's OOM killer itself is very slow to react.

But if those attempts fail for whatever reason and we actually do end up
hitting the watchdog, something like this patch provides information
which is invaluable for diagnosing the problem.

> Last but not least, I don't think it is appropriate to intertwine
> watchdog code with oom handling code as suggested here.

The show_mem() function is in lib/, so it is already outside of the OOM
handling.  The oom_dump_tasks() function could perhaps be refactored and
moved to a neutral location, which would avoid the intertwining.
