[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1425575884-2574-3-git-send-email-aarcange@redhat.com>
Date: Thu, 5 Mar 2015 18:17:45 +0100
From: Andrea Arcangeli <aarcange@...hat.com>
To: qemu-devel@...gnu.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-api@...r.kernel.org,
Android Kernel Team <kernel-team@...roid.com>
Cc: "Kirill A. Shutemov" <kirill@...temov.name>,
Pavel Emelyanov <xemul@...allels.com>,
Sanidhya Kashyap <sanidhya.gatech@...il.com>,
zhang.zhanghailiang@...wei.com,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andres Lagar-Cavilla <andreslc@...gle.com>,
Dave Hansen <dave@...1.net>,
Paolo Bonzini <pbonzini@...hat.com>,
Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
Andy Lutomirski <luto@...capital.net>,
Andrew Morton <akpm@...ux-foundation.org>,
Sasha Levin <sasha.levin@...cle.com>,
Hugh Dickins <hughd@...gle.com>,
Peter Feiner <pfeiner@...gle.com>,
"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
Christopher Covington <cov@...eaurora.org>,
Johannes Weiner <hannes@...xchg.org>,
Robert Love <rlove@...gle.com>,
Dmitry Adamushko <dmitry.adamushko@...il.com>,
Neil Brown <neilb@...e.de>, Mike Hommey <mh@...ndium.org>,
Taras Glek <tglek@...illa.com>, Jan Kara <jack@...e.cz>,
KOSAKI Motohiro <kosaki.motohiro@...il.com>,
Michel Lespinasse <walken@...gle.com>,
Minchan Kim <minchan@...nel.org>,
Keith Packard <keithp@...thp.com>,
"Huangpeng (Peter)" <peter.huangpeng@...wei.com>,
Anthony Liguori <anthony@...emonkey.ws>,
Stefan Hajnoczi <stefanha@...il.com>,
Wenchao Xia <wenchaoqemu@...il.com>,
Andrew Jones <drjones@...hat.com>,
Juan Quintela <quintela@...hat.com>
Subject: [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt
Add documentation.
Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
---
Documentation/vm/userfaultfd.txt | 97 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 97 insertions(+)
create mode 100644 Documentation/vm/userfaultfd.txt
diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow to implement on demand paging from userland and more
+generally they allow userland to take control various memory page
+faults, something otherwise only the kernel code could do.
+
+For example userfaults allows a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides for two primary functionalities:
+
+1) read/POLLIN protocol to notify an userland thread of the faults
+ happening
+
+2) various UFFDIO_* ioctls that can mangle over the virtual memory
+ regions registered in the userfaultfd that allows userland to
+ efficiently resolve the userfaults it receives via 1) or to mangle
+ the virtual memory in the background
+
+The real advantage of userfaults if compared to regular virtual memory
+management of mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page(or hugepage)-granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different process without them being aware about what is going on
+(well of course unless they later try to use the userfaultfd themself
+on the same region the manager is already tracking, which is a corner
+case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled invoking the
+UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API
+which will specify the read/POLLIN protocol userland intends to speak
+on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
+uffdio_api.api is spoken also by the running kernel), will return into
+uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
+respectively the activated feature bits below PAGE_SHIFT in the
+userfault addresses returned by read(2) and the generic ioctl
+available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range reigstered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to mangle the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means an userfault
+could be triggering just before userland maps in the background the
+user-faulted page. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we would do that likely the
+UFFDIO_WAKE ioctl could be dropped. This may change in the future
+(with a UFFD_API protocol bumb combined with the removal of the
+UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
+optimization and worthy to force userland to use the UFFD always in
+nonblocking mode if combined with POLLIN.
+
+userfaultfd is also a generic enough feature, that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features works just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. That
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
+UFFDIO_COPY.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists