Message-ID: <20111101180702.GL3466@redhat.com>
Date: Tue, 1 Nov 2011 19:07:02 +0100
From: Andrea Arcangeli <aarcange@...hat.com>
To: Dan Magenheimer <dan.magenheimer@...cle.com>
Cc: Pekka Enberg <penberg@...nel.org>,
Cyclonus J <cyclonusj@...il.com>,
Sasha Levin <levinsasha928@...il.com>,
Christoph Hellwig <hch@...radead.org>,
David Rientjes <rientjes@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Konrad Wilk <konrad.wilk@...cle.com>,
Jeremy Fitzhardinge <jeremy@...p.org>,
Seth Jennings <sjenning@...ux.vnet.ibm.com>, ngupta@...are.org,
Chris Mason <chris.mason@...cle.com>, JBeulich@...ell.com,
Dave Hansen <dave@...ux.vnet.ibm.com>,
Jonathan Corbet <corbet@....net>
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
On Tue, Nov 01, 2011 at 09:41:38AM -0700, Dan Magenheimer wrote:
> I suppose this documentation (note, it is in drivers/staging/zcache,
> not in the proposed frontswap patchset) could be misleading. It is
Yep, I got that from the comment in tmem.c in staging, and the LWN
article I read before reading the tmem_put comment also only mentioned
tmem_put doing a copy. So I erroneously assumed that all memory passing
through tmem was copied and that the reference to the "struct page" was
lost when it entered zcache.
But instead there is this obscure cast of a "struct page *" to a
"char *", which is cast back from a "char *" to a "struct page *" in
the zcache code, where kmap() then runs on the page, to avoid the
unnecessary copy.
So far so good; now the question is why you have that cast at all.
It's hard to be convinced of the sanity of an API that requires the
caller to cast a "struct page *" to a "char *" to run zerocopy. And
that is the very core tmem_put API I'm talking about.
I assume the explanation for the cast is: before, page_address(page)
was being passed to tmem, but that breaks with highmem because a
highmem page requires kmap(page). So then you cast the page pointer
instead.
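Spelled out, the workaround I'm describing looks roughly like this (my
paraphrase of the pattern, not the literal staging code; the tmem_put
arguments just follow the current signature):

	/* caller side: the page pointer is smuggled through the
	 * char * argument of tmem_put() */
	ret = tmem_put(pool, oidp, index, (char *)page, PAGE_SIZE,
		       raw, ephemeral);

	/* backend side: cast back and kmap(), because
	 * page_address(page) can return NULL for a highmem page
	 * that has no permanent kernel mapping */
	struct page *page = (struct page *)data;
	void *virt = kmap(page);
	/* ... read/compress PAGE_SIZE bytes starting at virt ... */
	kunmap(page);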
This basically proves the API must be fixed. In the kernel we work
with _pages_, not char *, exactly for this reason, and tmem_put must be
fixed to take a page structure. (In fact an array of pages with
start/end ranges for each entry would be even better, but hey, at least
a page+len would be sane.) A char * is flawed, and the cast of the page
to char * and back to struct page kind of proves it. So I think that
must be fixed in tmem_put. Unfortunately it's already merged into the
upstream kernel with this cast back and forth.
As for the rest of zcache, I think it's interesting, but because it
works inside tmem I'm unsure how we're going to write it to disk.
As for the local_irq_save, it would be nice to understand why it's
needed for frontswap but not for pagecache. None of that VM code ever
runs from irqs, so it's hard to see how the irq disabling is relevant.
A big fat comment in the zcache code (already in staging) on why
local_irq_save is needed would be helpful. Maybe it's tmem that can run
from irq? The only things running from irqs are the TLB flush and the
I/O completion handlers; everything else in the VM isn't irq/softirq
driven, so we never have to disable irqs.
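For reference, the pattern I'm questioning is roughly this (paraphrased
from memory of the staging zcache put path, so the details may be off):

	local_irq_save(flags);
	/* the whole-page compression happens in here, i.e. the
	 * ~20000 cycles mentioned below run with irqs disabled */
	ret = tmem_put(pool, oidp, index, (char *)page, PAGE_SIZE,
		       raw, ephemeral);
	local_irq_restore(flags);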
My feeling is that zcache should be based on a memory pool abstraction
that we can write to disk with a bio, working with "pages".
I'm also not sure how you balance the pressure in the tmem pool: when
do you fail the allocation and swap to disk, and when do you keep
moving pages to compressed swap?
> This is a known problem: zcache is currently not very
> good for high-response RT environments because it currently
> compresses a page of data with interrupts disabled, which
> takes (IIRC) about 20000 cycles. (I suspect though, without proof,
> that this is not the worst irq-disabled path in the kernel.)
That's certainly more than the usual irq latency, so it's probably
something the RT folks don't want, and yes, they should keep in mind
not to use frontswap+zcache in embedded RT environments.

Besides, there was no benchmark comparing zram performance to zcache
performance, so latency aside we're missing a lot of information.
> As noted earlier, this is fixable at the cost of the extra copy
> which could be implemented as an option later if needed.
> Or, as always, the RT folks can just not enable zcache.
> Or maybe smarter developers than me will find a solution
> that will work even better.
And what is the exact reason for the local_irq_save when doing it
zerocopy?
> Yeah, remember zcache was merged before either cleancache or
> frontswap, so this ugliness was necessary to get around the
> chicken-and-egg problem. Zcache will definitely need some
> work before it is ready to move out of staging, and your
> feedback here is useful for that, but I don't see that as
> condemning frontswap, do you?
What I'd like is a mechanism where you (sketched in code after this
list):

1) add swapcache to zcache (falling back to swap immediately if the
   zcache allocation fails)

2) when some threshold is hit or a zcache allocation fails, write the
   compressed data in a compact way to swap (freeing zcache memory),
   or write the swapcache directly to swap if no zcache is present

3) add newly arriving swapcache to zcache (the old zcache having been
   written to the swap device compressed and freed)
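A rough sketch of the store path I have in mind; every name here
(compressed_swap_store, compress_page, pool_over_threshold,
writeback_oldest_compressed, pool_insert) is made up purely for
illustration, none of it is existing code:

/* hypothetical store path: compress in RAM, write back compressed */
static int compressed_swap_store(swp_entry_t entry, struct page *page)
{
	struct zentry *z = compress_page(page);	/* lzo or similar */

	if (!z)
		return -ENOMEM;	/* 1) fall back: caller swaps to disk */

	if (pool_over_threshold())
		/* 2) free room by writing the oldest *compressed*
		 * entries to the swap device in compact form, not by
		 * decompressing them back into swapcache */
		writeback_oldest_compressed();

	/* 3) the newly added swapcache stays compressed in RAM */
	pool_insert(entry, z);
	return 0;
}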
Once we've already done the compression it's silly to write the
uncompressed data to disk. OK, initially that's acceptable because
compacting the data on disk is super tricky, but we want a design that
will allow writing the zcache to disk and adding new swapcache to
zcache, instead of the current way of swapping the new swapcache to
disk uncompressed and not being able to write out the compressed
zcache.
If nobody has called zcache_get and decompressed it, it's probably
less likely to be used than the newly added swapcache that wants to be
compressed.
I'm afraid adding frontswap in this form will still get us stuck in
the wrong model, and most of it will have to be dropped and rewritten
to do just the three points I described above for proper swap
compression.
Also, I'm skeptical that we need to pass through tmem at all to do
that. Done right, swap compression could be a feature enabled across
the board without needing tmem at all. Then if you want to add RAMster,
just add frontswap on the already compressed swapcache... before it
goes to the real swap device.
The final swap design must also include the pre-swapout from Avi,
writing data to swapcache in advance and relying on the dirty bit to
rewrite it, and the pre-swapin as well (original idea from Con). The
pre-swapout would need to stop before compressing; the pre-swapin
should stop before decompressing.
I see huge potential for improvement in the swap space; I just guess
most people are busy with more pressing issues. Like James said, most
data centers don't use swap, the desktop is irrelevant, and Android (as
relevant as the data centers) doesn't use swap.
But your frontswap improvement doesn't look like the right direction
if you really want to improve swap for the long term. It may be better
than nothing, but I don't see it going the way it should go, and I'd
prefer to remove the tmem dependency from zcache altogether. Zcache
alone would be way more interesting.
And tmem_put must be fixed to take a page; that cast of a page to
char * to avoid crashing on highmem is not allowed.
Of course I didn't have time to read 100% of the code, so please
correct me again if I misunderstood something.
> This is the "fix highmem" bug fix from Seth Jennings. The file
> tmem.c in zcache is an attempt to separate out the core tmem
> functionality and data structures so that it can (eventually)
> be in the lib/ directory and be used by multiple backends.
> (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> my "highmem-blindness" in that a single pointer is assumed to
> be able to address the "PAMPD" (as opposed to a struct page *
> and an offset, necessary for a 32-bit highmem system). Seth
> cleverly discovered this ugly two-line fix that (at least for now)
> avoided major mods to tmem.c.
Well, you need to do the major mods; it's not OK to do that cast,
passing pages is correct instead. Let's fix the tmem_put API before
people can use it wrong. Maybe then I'll dislike passing through tmem
less? Dunno.
int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
-	     char *data, size_t size, bool raw, bool ephemeral)
+	     struct page *page, size_t size, bool raw, bool ephemeral)
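With a page-based signature the highmem handling can live inside the
backend, something like this (illustrative only, not a tested patch):

	/* inside the tmem/zcache put implementation */
	void *virt = kmap(page);	/* valid for lowmem and highmem */
	/* ... compress or copy 'size' bytes starting at virt ... */
	kunmap(page);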
> First ignoring frontswap, there is currently no way to move a
> page of swap data from one swap device to another swap device
> except by moving it first into RAM (in the swap cache), right?
Yes.
> Frontswap doesn't solve that problem either, though it would
> be cool if it could. The "partial swapoff" functionality
> in the patch, added so that it can be called from frontswap_shrink,
> enables pages to be pulled out of frontswap into swap cache
> so that they can be moved if desired/necessary onto a real
> swap device.
The whole logic for deciding the size of the frontswap zcache is going
to be messy. But to do the real swapout you should not pull the memory
out of the frontswap zcache; you should write it to disk compacted and
still compressed, as it was inserted into frontswap... That would be
the ideal.
> The selfballooning code in drivers/xen calls frontswap_shrink
> to pull swap pages out of the Xen hypervisor when memory pressure
> is reduced. Frontswap_shrink is not yet called from zcache.
So I wonder how zcache deals with the dynamic size. Or does it have a
fixed size? How do you pull pages out of zcache to maximize the real
RAM availability?
> Note, however, that unlike swap-disks, compressed pages in
> frontswap CAN be silently moved to another "device". This is
> the foundation of RAMster, which moves those compressed pages
> to the RAM of another machine. The device _could_ be some
> special type of real-swap-disk, I suppose.
Yeah, with frontswap+zcache you can do RAMster, but not write the
zcache to disk into the swap device. Writing to disk doesn't require
new allocations; migrating to another node does, and you must deal with
OOM conditions there or it'll deadlock. So the basics should be to
write the compressed data to disk (which at least can be done reliably
for swapcache, unlike RAMster which has the same issues as NFS
swapping, NBD swapping and iSCSI swapping) before wondering whether to
send it to another node.
> Yes, this is a good example of the most important feature of
> tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> the tmem backend chooses, entirely dynamically. Not only is it true
> that hardware can't handle this well, but the Linux block I/O subsystem
> can't handle it either. I've suggested in the frontswap documentation
> that this is also a key to allowing "mixed RAM + phase-change RAM"
> systems to be useful.
Yes, what is not clear is how the size of the zcache is chosen.
> Also I think this is also why many linux vm/vfs/fs/bio developers
> "don't like it much" (where "it" is cleancache or frontswap).
> They are not used to losing control of data to some other
> non-kernel-controlled entity and not used to being told "NO"
> when they are trying to move data somewhere. IOW, they are
> control freaks and tmem is out of their control so it must
> be defeated ;-)
Either tmem works on something that is a core MM structure and is
compatible with all the bios and operations we may want to do on
memory, or I have a hard time seeing the point of trying to make the
memory it handles not-kernel-controlled.
This non-kernel-controlled approach looks to me exactly like a
requirement coming from Xen, not really something useful.
There is no reason why a kernel abstraction should stay away from
using kernel data structures like "struct page", only to cast back from
char * to struct page * when it needs to handle highmem in zcache.
Something seriously wrong is going on there in API terms, so you can
start by fixing that bit.
> I hope the earlier explanation about frontswap_shrink helps.
> It's also good to note that the only other successful Linux
> implementation of swap compression is zram, and zram's
> creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
>
> So where are we now? Are you now supportive of merging
> frontswap? If not, can you suggest any concrete steps
> that will gain your support?
My problem is that this is like zram: as mentioned, it only solves the
compression. There is no way it can store the compressed data on disk.
And this is way more complex than zram, and all it adds is that the
pool size isn't fixed at swapon time... so a very, very small gain with
huge complexity added (again compared to zram). zram in fact required
absolutely zero changes to the VM. So it's hard to see how this is
overall better than zram. If we take on that amount of complexity we
should at least be a little better than zram at runtime, while this is
the same.