linux-kernel - Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

Message-ID: <20140402183113.GL1500@redhat.com>
Date:	Wed, 2 Apr 2014 20:31:13 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	John Stultz <john.stultz@...aro.org>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Android Kernel Team <kernel-team@...roid.com>,
	Robert Love <rlove@...gle.com>, Mel Gorman <mel@....ul.ie>,
	Hugh Dickins <hughd@...gle.com>, Dave Hansen <dave@...1.net>,
	Rik van Riel <riel@...hat.com>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	Neil Brown <neilb@...e.de>, Mike Hommey <mh@...ndium.org>,
	Taras Glek <tglek@...illa.com>, Jan Kara <jack@...e.cz>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Michel Lespinasse <walken@...gle.com>,
	Minchan Kim <minchan@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

Hi everyone,

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

I actually thought the way of being notified with a page fault (sigbus
or whatever) was the most efficient way of using volatile ranges.

Why should you have to call a syscall to know if you can still access
the volatile range, if there was no VM pressure before the access?
Syscalls are expensive, accessing the memory directly is not. Only if
the page was actually missing and a page fault fired would you take
the slowpath.
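
To make the fast path/slow path split concrete, here is a toy
user-space model. It is only a sketch: VolatileCache and decode_tile
are made-up names, the dict lookup stands in for a direct memory
access, and the KeyError stands in for the fault (SIGBUS or whatever)
on a purged page. Nothing here calls the proposed vrange interface.

```python
class VolatileCache:
    """Toy model of a volatile range: pages may vanish under 'VM pressure'."""

    def __init__(self, regenerate):
        self.regenerate = regenerate  # callback that rebuilds one page
        self.pages = {}               # page index -> bytes currently resident
        self.slow_path_hits = 0       # how often the slow path was taken

    def purge(self, index):
        # Stand-in for the kernel discarding a volatile page.
        self.pages.pop(index, None)

    def read(self, index):
        try:
            # Fast path: plain access, no up-front "is it still there?" check.
            return self.pages[index]
        except KeyError:
            # Slow path: models the fault handler regenerating the data.
            self.slow_path_hits += 1
            self.pages[index] = self.regenerate(index)
            return self.pages[index]


def decode_tile(index):
    # Hypothetical regeneration step, e.g. re-decoding part of an image.
    return bytes([index % 256]) * 4096
```

With cache = VolatileCache(decode_tile), repeated cache.read(3) calls
only pay for regeneration on the first access and again after a
cache.purge(3); every other read is the plain lookup.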

The usages I see for this are plenty, like maintaining caches in
memory that may be big and would be nice to discard if there's VM
pressure. Uncompressed JPEG images sound like a candidate too: the
browser size would shrink if there's VM pressure, instead of ending up
swapping out uncompressed image data that can be regenerated more
quickly with the CPU than with swapins.

> Now... once you've chosen SIGBUS semantics, there will be folks who will
> try to exploit the fact that we get SIGBUS on purged page access (at
> least on the user-space side) and will try to access pages that are
> volatile until they are purged and try to then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for
syscalls accessing missing volatile pages. You'd never see a -EFAULT,
because a syscall would wait for the page to be filled in instead of
returning when it encounters a page in the volatile range that was
dropped under VM pressure.

It's called userfaultfd. You call sys_userfaultfd(flags) and it
connects the current mm to a pseudo file descriptor. The file
descriptor works similarly to eventfd but with a different protocol.

You need a thread that will never access the userfault area with the
CPU and that is responsible for polling the userfaultfd and talking
the userfaultfd protocol to fill in missing pages. After a POLLIN
event, the userfault thread reads the virtual addresses of the faults
that must have happened on some other thread of the same mm. Once the
page (or pages, if multiple) has been regenerated and mapped in with
sys_remap_anon_pages(), mremap or an equivalent atomic pagetable page
swap, it writes a "handled" virtual range back into the fd. Depending
on the "handled" range written back into the fd, the kernel then wakes
up the thread or threads that were waiting in kernel mode on that
virtual range and retries the fault without ever exiting kernel mode.
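
The handshake described above can be modelled in plain user space.
This is a sketch of the protocol only, not of the kernel interface:
ToyUserfault, access() and handle_one() are hypothetical names, a
queue stands in for reading fault addresses from the userfaultfd, a
dict stands in for the page tables, and setting an Event stands in for
writing the "handled" range back into the fd.

```python
import queue
import threading

PAGE_SIZE = 4096


class ToyUserfault:
    """User-space model of the fault/handled handshake."""

    def __init__(self):
        self.pages = {}              # page address -> contents ("mapped" pages)
        self.faults = queue.Queue()  # stands in for reading the userfaultfd
        self.waiters = {}            # page address -> Event for blocked threads
        self.lock = threading.Lock()

    def access(self, addr):
        """Called by ordinary threads; blocks until the page is present."""
        page = addr - addr % PAGE_SIZE
        while True:
            with self.lock:
                if page in self.pages:
                    return self.pages[page][addr - page]
                event = self.waiters.get(page)
                if event is None:
                    # First fault on this page: report it to the handler.
                    event = self.waiters[page] = threading.Event()
                    self.faults.put(page)
            event.wait()  # sleep as if waiting in kernel mode, then retry

    def handle_one(self, regenerate):
        """One iteration of the userfault thread's loop."""
        page = self.faults.get()            # read the faulting address
        data = regenerate(page)             # rebuild the page outside the lock
        with self.lock:
            self.pages[page] = data         # "map in" the regenerated page
            event = self.waiters.pop(page)  # mark the range as handled
        event.set()                         # wake every thread waiting on it
```

The thread running handle_one() never calls access(), mirroring the
rule that the userfault thread must not touch the userfault area
itself. Faulting threads loop and retry the access after the wakeup
instead of getting an error back.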

We need this in KVM for running the guest on memory that is on other
nodes or other processes (postcopy live migration is the most common
use case but there are others like memory externalization and
cross-node KSM in the cloud, to keep a single copy of memory across
multiple nodes and externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features, so that
you would then depend on MADV_USERFAULT and userfaultfd to deliver to
userland the "faults" happening on the volatile pages that have been
purged as a result of VM pressure.

I'm just saying this after Johannes mentioned the issue with syscalls
returning -EFAULT. Because that is the very issue that the userfaultfd
is going to solve for the KVM migration thread.

What I'm thinking now would be to also mark the volatile range
MADV_USERFAULT and then call userfaultfd. Instead of having the cache
regeneration "slow path" inside the SIGBUS handler, you would run it in
the userfault thread that polls the userfaultfd. Then you could write
the volatile ranges to disk with a write() syscall (or use any other
syscall on the volatile ranges) without having to worry about -EFAULT
being returned because one page was discarded. And if MADV_USERFAULT
is not called in combination with the vrange syscalls, it'd still
work without the userfault, with the vrange syscalls only.

In short, the idea would be to let the userfault code solve the fault
delivery to userland for you, and make the vrange syscalls focus only
on the page purging problem, without having to worry about what
happens when something accesses a missing page.

But if you don't intend to solve the syscall -EFAULT problem, well,
then the overlap is probably still as thin as I thought it was before
(as also mentioned in the link below).

Thanks,
Andrea

PS. my last email about this from a more KVM centric point of view:

http://www.spinics.net/lists/kvm/msg101449.html