Date:	Tue, 14 May 2013 13:18:48 -0700 (PDT)
From:	Dan Magenheimer <dan.magenheimer@...cle.com>
To:	Seth Jennings <sjenning@...ux.vnet.ibm.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Nitin Gupta <ngupta@...are.org>,
	Minchan Kim <minchan@...nel.org>,
	Konrad Wilk <konrad.wilk@...cle.com>,
	Robert Jennings <rcj@...ux.vnet.ibm.com>,
	Jenifer Hopper <jhopper@...ibm.com>,
	Mel Gorman <mgorman@...e.de>,
	Johannes Weiner <jweiner@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	Dave Hansen <dave@...1.net>, Joe Perches <joe@...ches.com>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Cody P Schafer <cody@...ux.vnet.ibm.com>,
	Hugh Dickins <hughd@...gle.com>,
	Paul Mackerras <paulus@...ba.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, devel@...verdev.osuosl.org
Subject: RE: [PATCHv11 3/4] zswap: add to mm/

> From: Seth Jennings [mailto:sjenning@...ux.vnet.ibm.com]
> Subject: Re: [PATCHv11 3/4] zswap: add to mm/
> 
> <snip>
>
> > > +/* The maximum percentage of memory that the compressed pool can occupy */
> > > +static unsigned int zswap_max_pool_percent = 20;
> > > +module_param_named(max_pool_percent,
> > > +			zswap_max_pool_percent, uint, 0644);
> 
> <snip>
>
> > This limit, along with the code that enforces it (by calling reclaim
> > when the limit is reached), is IMHO questionable.  Is there any
> > other kernel memory allocation that is constrained by a percentage
> > of total memory rather than dynamically according to current
> > system conditions?  As Mel pointed out (approx.), if this limit
> > is reached by a zswap-storm and filled with pages of long-running,
> > rarely-used processes, 20% of RAM (by default here) becomes forever
> > clogged.
> 
> So there are two comments here: 1) the dynamic pool limit, and 2) writeback
> of pages in zswap that won't be faulted in or forced out by pressure.
> 
> Comment 1 comes from the point of view that compressed pages should just be
> another type of memory managed by the core MM.  While ideal, that is very
> hard to implement in practice.  We are starting to realize that even the
> policy governing the active vs inactive lists is very hard to get right.
> Then shrinkers add more complexity to the policy problem.  Throwing another
> memory type into the mix would make it that much more complex and harder to
> get right (assuming there even _is_ a "right" policy for everyone in such a
> complex system).
> 
> This max_pool_percent policy is simple, works well, and gives users a
> deterministic limit they can understand.  Users can be assured that a
> dynamic policy heuristic won't go nuts and allow the compressed pool to grow
> unbounded, or reclaim it so aggressively that it offers no value.

Hi Seth --

Hmmm... I'm not sure how to politely say "bullshit". :-)

The default 20% was pulled out of the air long ago for zcache
experiments.  If you can explain why 20% is better than 19% or 21%, or
better than 10% or 30% or even 50%, that would be a start.  If you can then
explain -- in terms an average sysadmin can understand -- under what
circumstances this number should be higher or lower, that would be even
better.  Even an answer in very broad-brush terms like "higher for
embedded" and "lower for server" would be useful.  If the top Linux experts
in compression can't answer these questions (and the default is a random
number, which it is), I don't know how we can expect users to be "assured".
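
To make the concern concrete: the check being defended reduces to a test
of roughly this shape.  This is my paraphrase of the idea, not the code
from the patch, and the helper name is made up:

/*
 * Paraphrased sketch, not the patch code: the pool is "full" once it
 * holds more than a fixed fraction of total RAM, full stop.  Nothing
 * here looks at current memory pressure or workload behavior.
 */
static bool pool_over_limit(unsigned long pool_pages,
			    unsigned long total_ram_pages,
			    unsigned int max_pool_percent)
{
	return pool_pages > total_ram_pages * max_pool_percent / 100;
}

Whether that fraction is 20 or 50 changes nothing structurally, which is
exactly why the number needs a justification.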

What you mean is "works well"... on the two benchmarks you've tried it
on.  You say it's too hard to do dynamically... even though every other
significant RAM user in the kernel has to do it dynamically.
Workloads are dynamic, and heavy users of RAM need to deal with that.
You don't see a limit on the number of anonymous pages in the MM subsystem,
and you don't see a limit on the number of inodes in btrfs.  Linus
would rightfully barf all over those limits, and (if he were paying
attention to this discussion) he would barf on this limit too.
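
For the record, here is the sort of thing I mean by "dynamically":
answer the VM's shrinker callbacks instead of policing a fixed
percentage.  This is only a sketch: the callback signatures are
approximate (the shrinker interface details vary by kernel version),
and zswap_stored_pages and zswap_writeback_entries() are stand-in
names, not code from the series.

/*
 * Illustrative only: if zswap sized itself the way other big RAM users
 * do, it would register a shrinker and give pages back under pressure.
 */
static unsigned long zswap_shrink_count(struct shrinker *shrinker,
					struct shrink_control *sc)
{
	/* how many compressed pages we could write back if asked */
	return zswap_stored_pages;
}

static unsigned long zswap_shrink_scan(struct shrinker *shrinker,
				       struct shrink_control *sc)
{
	/* push the coldest entries out to the backing swap device */
	return zswap_writeback_entries(sc->nr_to_scan);
}

static struct shrinker zswap_shrinker = {
	.count_objects	= zswap_shrink_count,
	.scan_objects	= zswap_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};
/* registered at init time with register_shrinker(&zswap_shrinker) */

The point isn't this particular mechanism; it's that the rest of the
kernel already produces a pressure signal to respond to, instead of a
magic percentage.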

It's unfortunate that my proposed topic for LSFMM was pre-empted
by the zsmalloc-vs-zbud and zswap-vs-zcache discussions, because
I think the real challenge of zswap (or zcache), and its value to
distros and end users, lies in getting this right BEFORE users
start filing bugs about performance weirdness.  Otherwise most
users and distros will simply default to 0% (i.e. turn zswap off)
because zswap unpredictably sometimes sucks.

<flame off> sorry...

> Comment 2 I agree is an issue. I already have patches for a "periodic
> writeback" functionality that starts to shrink the zswap pool via
> writeback if zswap goes idle for a period of time.  This addresses
> the issue with long-lived, never-accessed pages getting stuck in
> zswap forever.

Pulling the call out of zswap_frontswap_store() (and ensuring that doing so
doesn't introduce any new races) would be a good start.  But this is just a
mechanism; you haven't said anything about the policy or how you intend to
enforce it.  Which just gets us back to Comment 1...
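
Just so we're arguing about the same thing: the mechanism side of
"periodic writeback" is close to trivial; the sketch below is the shape
I have in mind (the helper names, the interval, and the batch size are
all invented here).  Every interesting decision is hiding behind the
policy questions in the comments.

/*
 * Sketch of a periodic-writeback mechanism using a delayed work item.
 * zswap_has_been_idle() and zswap_writeback_entries() are invented
 * names; the policy behind them is the open question.
 */
#define ZSWAP_IDLE_INTERVAL	(60 * HZ)
#define ZSWAP_IDLE_BATCH	32

static void zswap_idle_writeback(struct work_struct *work);
static DECLARE_DELAYED_WORK(zswap_idle_work, zswap_idle_writeback);

static void zswap_idle_writeback(struct work_struct *work)
{
	if (zswap_has_been_idle())	/* policy: what counts as idle? */
		zswap_writeback_entries(ZSWAP_IDLE_BATCH); /* policy: how many? */

	/* re-arm; the interval itself is yet another policy knob */
	schedule_delayed_work(&zswap_idle_work, ZSWAP_IDLE_INTERVAL);
}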

So Comment 1 and Comment 2 are really the same:  How do we appropriately
manage the number of pages in the system that are used for storing
compressed pages?