linux-kernel - Re: [GIT PULL] mm: frontswap (for 3.2 window)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20111031223717.GI3466@redhat.com>
Date:	Mon, 31 Oct 2011 23:37:17 +0100
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Dan Magenheimer <dan.magenheimer@...cle.com>
Cc:	Pekka Enberg <penberg@...nel.org>,
	Cyclonus J <cyclonusj@...il.com>,
	Sasha Levin <levinsasha928@...il.com>,
	Christoph Hellwig <hch@...radead.org>,
	David Rientjes <rientjes@...gle.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Konrad Wilk <konrad.wilk@...cle.com>,
	Jeremy Fitzhardinge <jeremy@...p.org>,
	Seth Jennings <sjenning@...ux.vnet.ibm.com>, ngupta@...are.org,
	Chris Mason <chris.mason@...cle.com>, JBeulich@...ell.com,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Jonathan Corbet <corbet@....net>
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote:
> Hmmm... not sure I understand this one.  It IS copy-based
> so is not zerocopy; the page of data is actually moving out

copy-based is my main problem, being synchronous is no big deal I
agree.

I mean, I don't see why you have to make one copy before you start
compressing and then you write to disk the output of the compression
algorithm. To me it looks like this API forces on zcache one more copy
than necessary.

I can't see why this copy is necessary and why zcache isn't working on
"struct page" on core kernel structures instead of moving the memory
off to a memory object invisible to the core VM.

> TRUE.  Tell me again why a vmexit/vmenter per 4K page is
> "impossible"?  Again you are assuming (1) the CPU had some

It's sure not impossible, it's just impossible we want it as it'd be
too slow.

> real work to do instead and (2) that vmexit/vmenter is horribly

Sure the CPU has another 1000 VM to schedule. This is like saying
virtio-blk isn't needed on desktop virt becauase the desktop isn't
doing much I/O. Absurd argument, there are another 1000 desktops doing
I/O at the same time of course.

> slow.  Even if vmexit/vmenter is thousands of cycles, it is still
> orders of magnitude faster than a disk access.  And vmexit/vmenter

I fully agree tmem is faster for Xen than no tmem. That's not the
point, we don't need such an articulate hack hiding pages from the
guest OS in order to share pagecache, our hypervisor is just a bit
more powerful and has a function called file_read_actor that does what
your tmem copy does...

> is about the same order of magnitude as page copy, and much
> faster than compression/decompression, both of which still
> result in a nice win.

Saying it's a small overhead, is not like saying it is _needed_. Why
not add a udelay(1) in it too? Sure it won't be noticeable.

> You are also assuming that frontswap puts/gets are highly
> frequent.  By definition they are not, because they are
> replacing single-page disk reads/writes due to swapping.

They'll be as frequent as the highmem bounce buffers...

> That said, the API/ABI is very extensible, so if it were
> proven that batching was sufficiently valuable, it could
> be added later... but I don't see it as a showstopper.
> Really do you?

That's fine with me... but like ->writepages it'll take ages for the
fs to switch from writepage to writepages. Considering this is a new
API I don't think it's unreasonable to ask at least it to handle
immediately zerocopy behavior. So showing the userland mapping to the
tmem layer so it can avoid the copy and read from the userland
address. Xen will badly choke if ever tries to do that, but zcache
should be ok with that.

Now there may be algorithms where the page must be stable, but others
will be perfectly fine even if the page is changing under the
compression, and in that case the page won't be discarded and it'll be
marked dirty again. So even if a wrong data goes on disk, we'll
rewrite later. I see no reason why there has always to be a copy
before starting any compression/encryption as long as the algorithm
will not crash its input data isn't changing under it.

The ideal API would be to send down page pointers (and handling
compound pages too), not to copy. Maybe with a flag where you can also
specify offsets so you can send down partial pages too down to a byte
granularity. The "copy input data before anything else can happen"
looks flawed to me. It is not flawed for Xen because Xen has no
knowledge of the guest "struct page" but her I'm talking about the
not-virt usages.

> So, please, all the other parts necessary for tmem are
> already in-tree, why all the resistance about frontswap?

Well my comments are generic not specific to frontswap.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/