lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 06 Mar 2015 08:39:30 -0700
From:	Eric Blake <eblake@...hat.com>
To:	Andrea Arcangeli <aarcange@...hat.com>, qemu-devel@...gnu.org,
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, linux-api@...r.kernel.org,
	Android Kernel Team <kernel-team@...roid.com>
CC:	Robert Love <rlove@...gle.com>, Dave Hansen <dave@...1.net>,
	Jan Kara <jack@...e.cz>, Neil Brown <neilb@...e.de>,
	Stefan Hajnoczi <stefanha@...il.com>,
	Andrew Jones <drjones@...hat.com>,
	Sanidhya Kashyap <sanidhya.gatech@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Michel Lespinasse <walken@...gle.com>,
	Taras Glek <tglek@...illa.com>, zhang.zhanghailiang@...wei.com,
	Pavel Emelyanov <xemul@...allels.com>,
	Hugh Dickins <hughd@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@...wei.com>,
	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Christopher Covington <cov@...eaurora.org>,
	Anthony Liguori <anthony@...emonkey.ws>,
	Paolo Bonzini <pbonzini@...hat.com>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Keith Packard <keithp@...thp.com>,
	Wenchao Xia <wenchaoqemu@...il.com>,
	Juan Quintela <quintela@...hat.com>,
	Andy Lutomirski <luto@...capital.net>,
	Minchan Kim <minchan@...nel.org>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Mike Hommey <mh@...ndium.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Feiner <pfeiner@...gle.com>
Subject: Re: [Qemu-devel] [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt

On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Add documentation.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
> ---
>  Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/vm/userfaultfd.txt

Just a grammar review (no analysis of technical correctness)

> 
> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
> new file mode 100644
> index 0000000..2ec296c
> --- /dev/null
> +++ b/Documentation/vm/userfaultfd.txt
> @@ -0,0 +1,97 @@
> += Userfaultfd =
> +
> +== Objective ==
> +
> +Userfaults allow to implement on demand paging from userland and more

s/to implement/the implementation of/
and maybe: s/on demand/on-demand/

> +generally they allow userland to take control various memory page
> +faults, something otherwise only the kernel code could do.
> +
> +For example userfaults allows a proper and more optimal implementation
> +of the PROT_NONE+SIGSEGV trick.
> +
> +== Design ==
> +
> +Userfaults are delivered and resolved through the userfaultfd syscall.
> +
> +The userfaultfd (aside from registering and unregistering virtual
> +memory ranges) provides for two primary functionalities:

s/provides for/provides/

> +
> +1) read/POLLIN protocol to notify an userland thread of the faults

s/an userland/a userland/ (remember, 'a unicorn gets an umbrella' - if
the 'u' is pronounced 'you' the correct article is 'a')

> +   happening
> +
> +2) various UFFDIO_* ioctls that can mangle over the virtual memory
> +   regions registered in the userfaultfd that allows userland to
> +   efficiently resolve the userfaults it receives via 1) or to mangle
> +   the virtual memory in the background

maybe: s/mangle/manage/2

> +
> +The real advantage of userfaults if compared to regular virtual memory
> +management of mremap/mprotect is that the userfaults in all their
> +operations never involve heavyweight structures like vmas (in fact the
> +userfaultfd runtime load never takes the mmap_sem for writing).
> +
> +Vmas are not suitable for page(or hugepage)-granular fault tracking

s/page(or hugepage)-granular/page- (or hugepage-) granular/

> +when dealing with virtual address spaces that could span
> +Terabytes. Too many vmas would be needed for that.
> +
> +The userfaultfd once opened by invoking the syscall, can also be
> +passed using unix domain sockets to a manager process, so the same
> +manager process could handle the userfaults of a multitude of
> +different process without them being aware about what is going on

s/process/processes/

> +(well of course unless they later try to use the userfaultfd themself

s/themself/themselves/

> +on the same region the manager is already tracking, which is a corner
> +case that would currently return -EBUSY).
> +
> +== API ==
> +
> +When first opened the userfaultfd must be enabled invoking the
> +UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API

s/an uffdio/a uffdio/

> +which will specify the read/POLLIN protocol userland intends to speak
> +on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
> +uffdio_api.api is spoken also by the running kernel), will return into
> +uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
> +respectively the activated feature bits below PAGE_SHIFT in the
> +userfault addresses returned by read(2) and the generic ioctl
> +available.
> +
> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
> +register a memory range in the userfaultfd by setting the
> +uffdio_register structure accordingly. The uffdio_register.mode
> +bitmask will specify to the kernel which kind of faults to track for
> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
> +pages). The UFFDIO_REGISTER ioctl will return the
> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
> +userfaults on the range reigstered. Not all ioctls will necessarily be

s/reigstered/registered/

> +supported for all memory types depending on the underlying virtual
> +memory backend (anonymous memory vs tmpfs vs real filebacked
> +mappings).
> +
> +Userland can use the uffdio_register.ioctls to mangle the virtual

maybe s/mangle/manage/

> +address space in the background (to add or potentially also remove
> +memory from the userfaultfd registered range). This means an userfault

s/an/a/

> +could be triggering just before userland maps in the background the
> +user-faulted page. To avoid POLLIN resulting in an unexpected blocking
> +read (if the UFFD is not opened in nonblocking mode in the first
> +place), we don't allow the background thread to wake userfaults that
> +haven't been read by userland yet. If we would do that likely the
> +UFFDIO_WAKE ioctl could be dropped. This may change in the future
> +(with a UFFD_API protocol bumb combined with the removal of the

s/bumb/bump/

> +UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
> +optimization and worthy to force userland to use the UFFD always in
> +nonblocking mode if combined with POLLIN.
> +
> +userfaultfd is also a generic enough feature, that it allows KVM to
> +implement postcopy live migration (one form of memory externalization
> +consisting of a virtual machine running with part or all of its memory
> +residing on a different node in the cloud) without having to modify a
> +single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
> +and all other GUP features works just fine in combination with
> +userfaults (userfaults trigger async page faults in the guest
> +scheduler so those guest processes that aren't waiting for userfaults
> +can keep running in the guest vcpus).
> +
> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
> +atomically copies a page into the userfault registered range and wakes
> +up the blocked userfaults (unless uffdio_copy.mode &
> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
> +UFFDIO_COPY.
> 
> 
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Download attachment "signature.asc" of type "application/pgp-signature" (605 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ