Message-ID: <20180727202236.GB12399@cmpxchg.org>
Date:   Fri, 27 Jul 2018 16:22:36 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Daniel Drake <drake@...lessm.com>
Cc:     mhocko@...nel.org, linux-mm@...ck.org, linux@...lessm.com,
        linux-kernel@...r.kernel.org
Subject: Re: Making direct reclaim fail when thrashing

On Fri, Jul 27, 2018 at 11:21:43AM -0500, Daniel Drake wrote:
> Split from the thread
>   [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
> where we were discussing if/how to make the direct reclaim codepath
> fail if we're excessively thrashing, so that the OOM killer might
> step in. This is potentially desirable when the thrashing is so bad
> that the UI stops responding, causing the user to pull the plug.
> 
> On Tue, Jul 17, 2018 at 7:23 AM, Michal Hocko <mhocko@...nel.org> wrote:
> > mm/workingset.c allows for tracking when an actual page got evicted.
> > workingset_refault tells us whether a given filemap fault is a recent
> > refault and activates the page if that is the case. So what you need is
> > to note how many refaulted pages we have on the active LRU list. If that
> > is a large part of the list and if the inactive list is really small
> > then we know we are thrashing. This all sounds much easier than it will
> > eventually turn out to be of course but I didn't really get to play with
> > this much.

I've mentioned it in the other thread, but whether refaults are a
performance/latency problem depends 99% on your available IO capacity
and the IO patterns. On a highly contended IO device, refaults of a
single unfortunately located page can lead to multi-second stalls. On
an idle SSD, thousands of refaults might not be noticeable to the user.

Without measuring how much time these events take out of your day, you
can't really tell if they're a problem or not. The event rate or the
proportion between pages and refaults doesn't carry that signal.
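
To make the point concrete, here is a minimal sketch with made-up
numbers. The function name, the latencies, and the 10-second window are
all hypothetical, and it assumes a single task whose refault stalls do
not overlap. It only illustrates that time stalled, not the number of
refaults, is what the user feels:

```python
def stall_fraction(refault_latencies_s, window_s):
    """Fraction of an observation window spent waiting on refaults.

    Assumes one task and non-overlapping stalls, so the latencies can
    simply be summed. The result is capped at 1.0.
    """
    return min(sum(refault_latencies_s) / window_s, 1.0)


# Hypothetical workloads observed over the same 10-second window.
window_s = 10.0
contended_disk = [2.5]          # one refault that stalls for 2.5 s
idle_ssd = [0.0001] * 5000      # 5000 refaults at 100 us each

for name, latencies in (("contended disk", contended_disk),
                        ("idle SSD", idle_ssd)):
    print(f"{name}: {len(latencies)} refaults, "
          f"{stall_fraction(latencies, window_s):.1%} of the window stalled")
```

With these assumed numbers, the single refault on the contended disk
costs several times more wall-clock time than the 5000 refaults on the
idle SSD, so a refault counter alone would rank the two cases the wrong
way around.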
