Date:	Tue, 23 Feb 2016 16:36:49 -0800
From:	Johannes Weiner <hannes@...xchg.org>
To:	David Rientjes <rientjes@...gle.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	kernel-team@...com
Subject: Re: [PATCH v2] mm: scale kswapd watermarks in proportion to memory

On Mon, Feb 22, 2016 at 06:23:19PM -0800, David Rientjes wrote:
> On Mon, 22 Feb 2016, Johannes Weiner wrote:
> 
> > In machines with 140G of memory and enterprise flash storage, we have
> > seen read and write bursts routinely exceed the kswapd watermarks and
> > cause thundering herds in direct reclaim. Unfortunately, the only way
> > to tune kswapd aggressiveness is through adjusting min_free_kbytes -
> > the system's emergency reserves - which is entirely unrelated to the
> > system's latency requirements. In order to get kswapd to maintain a
> > 250M buffer of free memory, the emergency reserves need to be set to
> > 1G. That is a lot of memory wasted for no good reason.
> > 
> > On the other hand, it's reasonable to assume that allocation bursts
> > and overall allocation concurrency scale with memory capacity, so it
> > makes sense to make kswapd aggressiveness a function of that as well.
> > 
> > Change the kswapd watermark scale factor from the currently fixed 25%
> > of the tunable emergency reserve to a tunable 0.1% of memory.
> > 
> 
> Making this tunable independent of min_free_kbytes is great.
> 
> I'm wondering how 0.1% was chosen as the default?  One of my
> workstations currently has step sizes of about 0.05% so this will be
> doubling the steps from min to low and low to high.  I'm not objecting to
> that since it's definitely in the right direction (more free memory) but I 
> wonder if it will make a difference for some users.

I wish it were a bit more scientific, but I basically picked an order
of magnitude that sounds like a reasonable balance between wasted
memory and expected allocation bursts before kswapd can ramp up.

On a 10G machine, a 10M latency buffer sounds adequate, whereas 1M
might get overwhelmed and 100M is almost certainly a waste of RAM.
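
To make that arithmetic concrete, here is a rough sketch (not code from the patch; the helper name is made up for illustration) of how a scale factor expressed in fractions of 10,000 maps to a buffer size:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: size in bytes of one
 * watermark step for a given amount of memory, with the scale factor
 * given in fractions of 10,000.
 */
static uint64_t wmark_step_bytes(uint64_t mem_bytes, unsigned int scale_factor)
{
	/* Multiply first so small scale factors do not lose precision. */
	return mem_bytes * scale_factor / 10000;
}
```

With 10G of memory, a factor of 10 gives roughly 10M, while factors of 1 and 100 give roughly 1M and 100M, the two extremes mentioned above.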

> > Beyond 1G of memory, this will produce bigger watermark steps than the
> > current formula in default settings. Ensure that the new formula never
> > chooses steps smaller than that, i.e. 25% of the emergency reserve.
> > 
> > On a 140G machine, this raises the default watermark steps - the
> > distance between min and low, and low and high - from 16M to 143M.
> > 
> > Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> > Acked-by: Mel Gorman <mgorman@...e.de>
> > ---
> >  Documentation/sysctl/vm.txt | 18 ++++++++++++++++++
> >  include/linux/mm.h          |  1 +
> >  include/linux/mmzone.h      |  2 ++
> >  kernel/sysctl.c             | 10 ++++++++++
> >  mm/page_alloc.c             | 29 +++++++++++++++++++++++++++--
> >  5 files changed, 58 insertions(+), 2 deletions(-)
> > 
> > v2: Ensure 25% of emergency reserves as a minimum on small machines -Rik
> > 
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 89a887c..b02d940 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -803,6 +803,24 @@ performance impact. Reclaim code needs to take various locks to find freeable
> >  directory and inode objects. With vfs_cache_pressure=1000, it will look for
> >  ten times more freeable objects than there are.
> >  
> > +=============================================================
> > +
> > +watermark_scale_factor:
> > +
> > +This factor controls the aggressiveness of kswapd. It defines the
> > +amount of memory left in a node/system before kswapd is woken up and
> > +how much memory needs to be free before kswapd goes back to sleep.
> > +
> > +The unit is in fractions of 10,000. The default value of 10 means the
> > +distances between watermarks are 0.1% of the available memory in the
> > +node/system. The maximum value is 1000, or 10% of memory.
> > +
> 
> The effective maximum value can be different than the tunable, though,
> correct?  It seems like you'd want to document why watermark_scale_factor
> and the actual watermarks in /proc/zoneinfo may be different on some
> systems.

You mean because of the enforced minimum? I wondered about that, but
it seems more like an implementation detail than part of the API. I
doubt that in practice anybody would intentionally set the scale
factor low enough for the kernel minimum to kick in.
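
For reference, the enforced minimum amounts to taking the larger of the two candidate step sizes. A minimal sketch, assuming 4K pages and using illustrative names rather than the kernel's own:

```c
#include <stdint.h>

/*
 * Sketch only; names are illustrative, not the kernel's.
 *   min_pages:     the min watermark in pages (from min_free_kbytes)
 *   managed_pages: pages managed by the zone/node
 *   scale_factor:  watermark_scale_factor, in fractions of 10,000
 */
static uint64_t wmark_step_pages(uint64_t min_pages, uint64_t managed_pages,
				 unsigned int scale_factor)
{
	/* Old behaviour: a fixed 25% of the emergency reserve. */
	uint64_t floor_step = min_pages / 4;
	/* New behaviour: a tunable fraction of memory. */
	uint64_t scaled_step = managed_pages * scale_factor / 10000;

	/* Never choose a step smaller than the old formula would. */
	return scaled_step > floor_step ? scaled_step : floor_step;
}
```

Under those assumptions, a 140G machine with a 64M reserve (16384 pages) and the default factor of 10 gets steps of 36700 pages, about 143M, instead of the 4096 pages (16M) from the old formula. The floor only wins when the scaled step falls below a quarter of the reserve, which takes either a small machine or a deliberately tiny scale factor.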
