linux-kernel - Re: [PATCH 7/8] zswap: add to mm/

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Tue, 01 Jan 2013 11:52:21 -0600
From:	Seth Jennings <sjenning@...ux.vnet.ibm.com>
To:	Dan Magenheimer <dan.magenheimer@...cle.com>
CC:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Nitin Gupta <ngupta@...are.org>,
	Minchan Kim <minchan@...nel.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Robert Jennings <rcj@...ux.vnet.ibm.com>,
	Jenifer Hopper <jhopper@...ibm.com>,
	Mel Gorman <mgorman@...e.de>,
	Johannes Weiner <jweiner@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, devel@...verdev.osuosl.org
Subject: Re: [PATCH 7/8] zswap: add to mm/

On 12/31/2012 05:06 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@...ux.vnet.ibm.com]
>> Subject: [PATCH 7/8] zswap: add to mm/
>>
>> zswap is a thin compression backend for frontswap. It receives
>> pages from frontswap and attempts to store them in a compressed
>> memory pool, resulting in an effective partial memory reclaim and
>> dramatically reduced swap device I/O.
> 
> Hi Seth --
> 
> Happy (almost) New Year!

You too :)  Thanks for taking a look at the code.

> I am eagerly studying one of the details of your zswap "flush"
> code in this patch to see how you solved a problem or two that
> I was struggling with for the similar mechanism RFC'ed for zcache
> (see https://lkml.org/lkml/2012/10/3/558).  I like the way
> that you force the newly-uncompressed to-be-flushed page immediately
> into a swap bio in zswap_flush_entry via the call to swap_writepage,
> though I'm not entirely convinced that there aren't some race
> conditions there.  However, won't swap_writepage simply call
> frontswap_store instead and re-compress the page back into zswap?

I break part of swap_writepage() into a bottom half called
__swap_writepage() that doesn't include the call to frontswap_store().
__swap_writepage() is what is called from zswap_flush_entry().  That
is how I avoid flushed pages recycling back into zswap and the
potential recursion mentioned.

> A second related issue that concerns me is that, although you
> are now, like zcache2, using an LRU queue for compressed pages
> (aka "zpages"), there is no relationship between that queue and
> physical pageframes.  In other words, you may free up 100 zpages
> out of zswap via zswap_flush_entries, but not free up a single
> pageframe.  This seems like a significant design issue.  Or am
> I misunderstanding the code?

You understand correctly.  There is room for optimization here and it
is something I'm working on right now.

What I'm looking to do is give zswap a little insight into zsmalloc
internals, namely the ability figure out what class size a particular
allocation is in and, in the event the store can't be satisfied, flush
an entry from that exact class size so that we can be assured the
store will succeed with minimal flushing work.  In this solution,
there would be an LRU list per zsmalloc class size tracked in zswap.
The result is LRU-ish flushing overall with class size being the first
flush selection criteria and LRU as the second.

> A third concern is about scalability... the locking seems very
> coarse-grained.  In zcache, you personally observed and fixed
> hashbucket contention (see https://lkml.org/lkml/2011/9/29/215).
> Doesn't zswap's tree_lock essentially use a single tree (per
> swaptype), i.e. no scalability?

The reason the coarse lock isn't a problem for zswap like the hash
bucket locks where in zcache is that the lock is not held for long
periods time as it is in zcache.  It is only held while operating on
the tree, not during compression/decompression and larger memory
operations.

Also, I've done some lockstat checks and the zswap tree lock is way
down on the list contributing <1% of the lock contention wait time on
a 4-core system.  The anon_vma lock is the primary bottleneck.

Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/