Message-ID: <20111102160201.GB18879@redhat.com>
Date: Wed, 2 Nov 2011 17:02:01 +0100
From: Andrea Arcangeli <aarcange@...hat.com>
To: Avi Kivity <avi@...hat.com>
Cc: James Bottomley <James.Bottomley@...senPartnership.com>,
Dan Magenheimer <dan.magenheimer@...cle.com>,
Pekka Enberg <penberg@...nel.org>,
Cyclonus J <cyclonusj@...il.com>,
Sasha Levin <levinsasha928@...il.com>,
Christoph Hellwig <hch@...radead.org>,
David Rientjes <rientjes@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Konrad Wilk <konrad.wilk@...cle.com>,
Jeremy Fitzhardinge <jeremy@...p.org>,
Seth Jennings <sjenning@...ux.vnet.ibm.com>, ngupta@...are.org,
Chris Mason <chris.mason@...cle.com>, JBeulich@...ell.com,
Dave Hansen <dave@...ux.vnet.ibm.com>,
Jonathan Corbet <corbet@....net>
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
Hi Avi,
On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote:
> If you look at cleancache, then it addresses this concern - it extends
> pagecache through host memory. When dropping a page from the tail of
> the LRU it first goes into tmem, and when reading in a page from disk
> you first try to read it from tmem. However in many workloads,
> cleancache is actually detrimental. If you have a lot of cache misses,
> then every one of them causes a pointless vmexit; considering that
> servers today can chew hundreds of megabytes per second, this adds up.
> On the other side, if you have a use-once workload, then every page that
> falls off the tail of the LRU causes a vmexit and a pointless page copy.
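(For reference, a rough sketch of the flow being described; the
cleancache_put_page()/cleancache_get_page() names follow the in-tree
cleancache hooks, while evict_clean_page() and readpage_slow_path() are
hypothetical stand-ins for the real call sites in the page cache and
readpage paths:)

#include <linux/cleancache.h>
#include <linux/errno.h>
#include <linux/mm.h>

static void evict_clean_page(struct page *page)
{
        /* page falls off the LRU tail: offer it to tmem (vmexit + copy) */
        cleancache_put_page(page);
        /* ... then drop it from the page cache as usual ... */
}

static int readpage_slow_path(struct page *page)
{
        /* ask tmem before touching the disk (a miss is a wasted vmexit) */
        if (cleancache_get_page(page) == 0)
                return 0;       /* tmem hit, no disk I/O needed */
        /* tmem miss: do the real read from the backing device */
        return -EIO;
}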
I also think it's a bad design for virt usage, but hey, without this
they can't even run with cache=writeback/writethrough and are forced
to cache=off, and then they claim SPECvirt is marketing, so for Xen I
guess it's better than nothing.
I'm trying right now to evaluate it as a pure zcache host-side
optimization. If it can drive us in the right long-term direction and
we're free to modify it as we wish, also to boost swapping I/O using
compressed data, then it may be viable. Otherwise it's better if they
add some Xen-specific hook and leave the zcache infrastructure free to
be modified as the VM needs, not as Xen needs. I currently don't know
exactly where the Xen ABI starts and the kernel stops in tmem, so it's
hard to tell how hackable it is, and whether trying to hide things
away from the VM is actually a complication or not. Certainly the
highly advertised automatic dynamic sizing of the tmem pools is an OOM
timebomb without proper VM control over it, so it just can't stay too
far away from the VM. Currently it's unlikely to be safe in all
workloads (e.g. an mlockall'd process growing fast).
Whatever happens in tmem, it must still be "owned by the kernel" so
that it can be written out to disk with bios. That doesn't need to
happen immediately and doesn't need to be perfect, but it must
definitely be possible to add later without the Xen folks complaining
about whatever change we make in tmem.
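(To make "written out to disk with bios" concrete, a minimal sketch of
what a later writeback hook could look like; zcache_writeback_page()
and zcache_wb_end_io() are made-up names and the details would differ:)

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/* made-up completion handler: just drop the bio reference */
static void zcache_wb_end_io(struct bio *bio, int error)
{
        bio_put(bio);
}

/*
 * Because the (decompressed) page is still kernel-owned, the VM can
 * push it to disk with an ordinary bio whenever it decides to.
 */
static int zcache_writeback_page(struct block_device *bdev,
                                 sector_t sector, struct page *page)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);

        if (!bio)
                return -ENOMEM;
        bio->bi_bdev = bdev;
        bio->bi_sector = sector;
        bio->bi_end_io = zcache_wb_end_io;
        bio_add_page(bio, page, PAGE_SIZE, 0);
        submit_bio(WRITE, bio);
        return 0;
}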
The fact that not a line of Xen code was written over the last two
years doesn't mean there are no dependencies on the code; maybe those
just never broke, so Xen never needed to be modified either, because
they kept the tmem ABI/API fixed while adding the other backends of
tmem (zcache etc.). The mere fact that I read the word "ABI" in those
emails signals something is wrong. There can't be any ABI there, only
an API, and even the API is a kernel-internal one, so it must be
allowed to break freely, or we can't innovate. Again, if we can't
change whatever ABI/API without first talking with the Xen folks, I
think it's better they split the two projects and just submit the Xen
hooks separately. That wouldn't remove value from tmem (assuming it's
the way to go, which I'm not entirely convinced of yet).
In any case, starting to fix up the zcache layer sounds good to me.
The first things that come to mind are to document with a comment why
it disables irqs and which exact code, running from irqs or softirqs,
races with the compression, to fix the casts in tmem_put, to rename
tmem_put to tmem_store, etc. Then we'll see if the Xen side complains
about just those small, needed cleanups.
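(Just to illustrate the kind of rename and comment meant here; the
prototype below is made up, the real one lives in the staging tmem
code and differs:)

#include <linux/irqflags.h>
#include <linux/mm.h>
#include <linux/types.h>

struct tmem_pool;
struct tmem_oid;

/* was: tmem_put() */
static int tmem_store(struct tmem_pool *pool, struct tmem_oid *oid,
                      uint32_t index, struct page *page)
{
        unsigned long flags;
        int ret;

        /*
         * Comment to be written: spell out exactly which irq/softirq
         * path races with the compression, e.g. "the per-cpu
         * compression buffers are also used from <path>, so ...".
         */
        local_irq_save(flags);
        ret = -1;       /* ... compress the page and insert it into the pool ... */
        local_irq_restore(flags);
        return ret;
}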
Ideally the API should also be stackable, so you can do ramster on
top of zcache on top of cleancache/frontswap. Then we could write a
swap driver for zcache and do swapper -> zcache -> frontswap, and we
could even write compressed pagecache to disk that way.
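(Something along these lines, purely as a sketch; tmem_backend_ops and
the chaining are made-up names:)

#include <linux/types.h>

struct page;

/*
 * Each layer (ramster, zcache, frontswap/cleancache, a disk writer...)
 * implements the same ops and points to the layer below it, so a store
 * that one layer cannot or does not want to keep is passed down.
 */
struct tmem_backend_ops {
        int  (*store)(void *priv, pgoff_t index, struct page *page);
        int  (*load)(void *priv, pgoff_t index, struct page *page);
        void (*invalidate)(void *priv, pgoff_t index);
        struct tmem_backend_ops *lower;         /* next layer down */
        void *priv;
};

static int tmem_chain_store(struct tmem_backend_ops *ops,
                            pgoff_t index, struct page *page)
{
        for (; ops; ops = ops->lower)
                if (ops->store(ops->priv, index, page) == 0)
                        return 0;       /* some layer accepted the page */
        return -1;                      /* nobody took it, caller falls back */
}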
And the whole thing should handle all allocation failures with a
fallback all the way up to the top layer: for swap that means going to
the regular swapout path if an OOM happens inside those calls, and for
pagecache it means really freeing the page rather than compressing it
into some tmem memory (see the sketch below). That is a design that
may be good. I haven't had a huge amount of time to think about it,
but if you remove virt from the equation it looks less bad.
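(Caller side of that rule, building on the stacking sketch above;
swap_writepage_normal() is a made-up stand-in for the regular swapout
path:)

/* made-up stand-in for the regular swapout path */
int swap_writepage_normal(struct page *page);

static int swap_out_page(struct tmem_backend_ops *chain,
                         pgoff_t index, struct page *page)
{
        /*
         * Any failure inside the chain (e.g. -ENOMEM while compressing)
         * means the page takes the normal path, as if tmem didn't exist.
         */
        if (chain && tmem_chain_store(chain, index, page) == 0)
                return 0;       /* kept compressed, no I/O for now */

        return swap_writepage_normal(page);
}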