linux-kernel - Re: [PATCH 10/17] mm: rmap preparation for remap_anon

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20141007141913.GC2342@redhat.com>
Date:	Tue, 7 Oct 2014 16:19:13 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
	qemu-devel@...gnu.org, KVM list <kvm@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	Linux API <linux-api@...r.kernel.org>,
	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Dave Hansen <dave@...1.net>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Sasha Levin <sasha.levin@...cle.com>,
	Hugh Dickins <hughd@...gle.com>,
	Peter Feiner <pfeiner@...gle.com>,
	Christopher Covington <cov@...eaurora.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Android Kernel Team <kernel-team@...roid.com>,
	Robert Love <rlove@...gle.com>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	Neil Brown <neilb@...e.de>, Mike Hommey <mh@...ndium.org>,
	Taras Glek <tglek@...illa.com>, Jan Kara <jack@...e.cz>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Michel Lespinasse <walken@...gle.com>,
	Minchan Kim <minchan@...nel.org>,
	Keith Packard <keithp@...thp.com>,
	"Huangpeng (Peter)" <peter.huangpeng@...wei.com>,
	Isaku Yamahata <yamahata@...inux.co.jp>,
	Anthony Liguori <anthony@...emonkey.ws>,
	Stefan Hajnoczi <stefanha@...il.com>,
	Wenchao Xia <wenchaoqemu@...il.com>,
	Andrew Jones <drjones@...hat.com>,
	Juan Quintela <quintela@...hat.com>
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@...hat.com> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
> 
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only cons I see is that we couldn't use then
recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
PAGE_SIZE, ..)  and retain the zerocopy behavior. Or how could we?
There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

Ideally if we could prevent the page data coming from the network to
ever become visible in the kernel we could avoid the TLB flush and
also be zerocopy but I can't see how we could achieve that.

The page data could come through a ssh pipe or anything (qemu supports
all kind of network transports for live migration), this is why
leaving the network protocol into userland is preferable.

As things stands now, I'm afraid with a write() syscall we couldn't do
it zerocopy. We'd still need to receive the memory in a temporary page
and then copy it to a kernel page (invisible to userland while we
write to it) to later map into the userfault address.

If it wasn't for the TLB flush of the old page, the remap_anon_pages
variant would be more optimal than doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Now another idea that comes to mind to be able to add the ability to
switch between copy and TLB flush is using a RAP_FORCE_COPY flag, that
would then do a copy inside remap_anon_pages and leave the original
page mapped in place... (and such flag would also disable the -EBUSY
error if page_mapcount is > 1).

So then if the RAP_FORCE_COPY flag is set remap_anon_pages would
behave like you suggested (but with a mremap-like interface, instead
of a write syscall) and we could benchmark the difference between copy
and TLB flush too. We could even periodically benchmark it at runtime
and switch over the faster method (the more CPUs there are in the host
and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course in terms of API I could implement the exact same mechanism
as described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len multiple of PAGE_SIZE) to
know the address where the page should be mapped into and if to behave
zerocopy or if to skip the TLB flush and copy.

Because the copy vs TLB flush trade off is possible to achieve with
both interfaces, I think it really boils down to choosing between a
mremap like interface, or file+commands protocol interface. I tend to
like mremap more, that's why I opted for a remap_anon_pages syscall
kept orthogonal to the userfaultfd functionality (remap_anon_pages
could be also used standalone as an accelerated mremap in some
circumstances) but nothing prevents to just embed the same mechanism
inside userfaultfd if a file+commands API is preferable. Or we could
add a different syscall (separated from userfaultfd) that creates
another pseudofd to write a command plus the page data into it. Just I
wouldn't see the point of creating a pseudofd just to copy a page
atomically, the write() syscall would look more ideal if the
userfaultfd is already open for other reasons and the pseudofd
overhead is required anyway.

Last thing to keep in mind is that if using userfaults with SIGBUS and
without userfaultfd, remap_anon_pages would have been still useful, so
if we retain the SIGBUS behavior for volatile pages and we don't force
the usage for userfaultfd, it may be cleaner not to use userfaultfd
but a separate pseudofd to do the write() syscall though. Otherwise
the app would need to open the userfaultfd to resolve the fault even
though it's not using the userfaultfd protocol which doesn't look an
intuitive interface to me.

Comments welcome.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/