linux-kernel - Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

Message-ID: <5661B62B.2020409@gmail.com>
Date:	Fri, 04 Dec 2015 16:50:03 +0100
From:	"Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:	Andrea Arcangeli <aarcange@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	qemu-devel@...gnu.org, kvm@...r.kernel.org,
	linux-api@...r.kernel.org
CC:	mtk.manpages@...il.com, Pavel Emelyanov <xemul@...allels.com>,
	Sanidhya Kashyap <sanidhya.gatech@...il.com>,
	zhang.zhanghailiang@...wei.com,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Andres Lagar-Cavilla <andreslc@...gle.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andy Lutomirski <luto@...capital.net>,
	Hugh Dickins <hughd@...gle.com>,
	Peter Feiner <pfeiner@...gle.com>,
	"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	"Huangpeng (Peter)" <peter.huangpeng@...wei.com>
Subject: Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

Hi Andrea,

On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
>> Add documentation.
> 
> Hi Andrea,
> 
> I do not recall... Did you write a man page also for this new system call?

No response to my last mail, so I'll try again... Did you 
write any man page for this interface?

Thanks,

Michael


>> Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
>> ---
>>  Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 140 insertions(+)
>>  create mode 100644 Documentation/vm/userfaultfd.txt
>>
>> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
>> new file mode 100644
>> index 0000000..c2f5145
>> --- /dev/null
>> +++ b/Documentation/vm/userfaultfd.txt
>> @@ -0,0 +1,140 @@
>> += Userfaultfd =
>> +
>> +== Objective ==
>> +
>> +Userfaults allow the implementation of on-demand paging from userland
>> +and more generally they allow userland to take control of various
>> +memory page faults, something otherwise only the kernel code could do.
>> +
>> +For example, userfaults allow a proper and more optimal implementation
>> +of the PROT_NONE+SIGSEGV trick.
>> +
>> +== Design ==
>> +
>> +Userfaults are delivered and resolved through the userfaultfd syscall.
>> +
>> +The userfaultfd (aside from registering and unregistering virtual
>> +memory ranges) provides two primary functionalities:
>> +
>> +1) read/POLLIN protocol to notify a userland thread of the faults
>> +   happening
>> +
>> +2) various UFFDIO_* ioctls that manage the virtual memory regions
>> +   registered in the userfaultfd and that allow userland to efficiently
>> +   resolve the userfaults it receives via 1) or to manage the virtual
>> +   memory in the background
>> +
>> +The real advantage of userfaults, compared to regular virtual memory
>> +management with mremap/mprotect, is that userfaults in all their
>> +operations never involve heavyweight structures like vmas (in fact the
>> +userfaultfd runtime load never takes the mmap_sem for writing).
>> +
>> +Vmas are not suitable for page- (or hugepage) granular fault tracking
>> +when dealing with virtual address spaces that could span
>> +Terabytes. Too many vmas would be needed for that.
>> +
>> +The userfaultfd, once opened by invoking the syscall, can also be
>> +passed using unix domain sockets to a manager process, so the same
>> +manager process could handle the userfaults of a multitude of
>> +different processes without them being aware of what is going on
>> +(unless of course they later try to use the userfaultfd
>> +themselves on the same region the manager is already tracking, which
>> +is a corner case that would currently return -EBUSY).
>> +
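
A minimal sketch of the fd passing described above, using SCM_RIGHTS
over a unix domain socket. The helper names send_fd() and recv_fd() are
made up for illustration, and nothing here is userfaultfd-specific: any
open descriptor is handed to another process the same way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over the connected unix socket sock. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* one dummy data byte carries the ancillary data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock. Returns the new descriptor or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The monitored process would call send_fd() with its userfaultfd once
and the manager would keep the descriptor returned by recv_fd().
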
>> +== API ==
>> +
>> +When first opened, the userfaultfd must be enabled by invoking the
>> +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
>> +a later API version) which will specify the read/POLLIN protocol
>> +userland intends to speak on the UFFD. The UFFDIO_API ioctl, if
>> +successful (i.e. if the requested uffdio_api.api is spoken also by the
>> +running kernel), will return into uffdio_api.features and
>> +uffdio_api.ioctls two 64-bit bitmasks of, respectively, the activated
>> +features of the read(2) protocol and the generic ioctls available.
>> +
>> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
>> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
>> +register a memory range in the userfaultfd by setting the
>> +uffdio_register structure accordingly. The uffdio_register.mode
>> +bitmask will specify to the kernel which kind of faults to track for
>> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
>> +pages). The UFFDIO_REGISTER ioctl will return the
>> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
>> +userfaults on the range registered. Not all ioctls will necessarily be
>> +supported for all memory types depending on the underlying virtual
>> +memory backend (anonymous memory vs tmpfs vs real filebacked
>> +mappings).
>> +
>> +Userland can use the uffdio_register.ioctls to manage the virtual
>> +address space in the background (to add or potentially also remove
>> +memory from the userfaultfd registered range). This means a userfault
>> +could trigger just before userland maps the user-faulted page in the
>> +background.
>> +
>> +The primary ioctl to resolve userfaults is UFFDIO_COPY. It
>> +atomically copies a page into the userfault registered range and wakes
>> +up the blocked userfaults (unless uffdio_copy.mode &
>> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
>> +UFFDIO_COPY.
>> +
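
The API sequence above (UFFDIO_API, then UFFDIO_REGISTER, then
UFFDIO_COPY) could look roughly like the sketch below. It assumes the
structure layouts and the _UFFDIO_* bit numbers from
<linux/userfaultfd.h> as found in mainline kernels; the helper names
are hypothetical. The page is filled in the background, before any
fault happens, so no handler thread is needed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a userfaultfd and negotiate the API. Returns the fd or -1. */
static int uffd_open(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	/* Newer kernels may refuse unprivileged callers unless the
	 * UFFD_USER_MODE_ONLY flag (not part of this patch) is passed. */
	int fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (ioctl(fd, UFFDIO_API, &api) < 0 ||
	    !(api.ioctls & (1ULL << _UFFDIO_REGISTER))) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Track missing pages in [addr, addr + len). Returns 0 or -1. */
static int uffd_register_missing(int fd, void *addr, size_t len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (ioctl(fd, UFFDIO_REGISTER, &reg) < 0)
		return -1;
	/* reg.ioctls tells which ioctls can resolve faults on this range */
	if (!(reg.ioctls & (1ULL << _UFFDIO_COPY))) {
		errno = ENOTSUP;
		return -1;
	}
	return 0;
}

/* Atomically copy len bytes from src into the registered range at dst. */
static int uffd_copy_page(int fd, void *dst, const void *src, size_t len)
{
	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = len,
		.mode = 0,	/* wake any blocked faulting threads */
	};

	return ioctl(fd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}

/* Returns 1 on success, 0 if userfaultfd is unavailable, -1 on error. */
static int uffd_demo(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *dst, *src;
	int fd, ok;

	assert(psz > 0);
	fd = uffd_open();
	if (fd < 0)
		return 0;	/* ENOSYS, EPERM, ...: nothing to show here */

	dst = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED || src == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memset(src, 'x', psz);

	/* dst is never touched before the copy, so it is still missing */
	ok = uffd_register_missing(fd, dst, psz) == 0 &&
	     uffd_copy_page(fd, dst, src, psz) == 0 &&
	     dst[0] == 'x' && dst[psz - 1] == 'x';

	munmap(src, psz);
	munmap(dst, psz);
	close(fd);
	return ok ? 1 : -1;
}
```
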
>> +== QEMU/KVM ==
>> +
>> +QEMU/KVM is using the userfaultfd syscall to implement postcopy live
>> +migration. Postcopy live migration is one form of memory
>> +externalization consisting of a virtual machine running with part or
>> +all of its memory residing on a different node in the cloud. The
>> +userfaultfd abstraction is generic enough that not a single line of
>> +KVM kernel code had to be modified in order to add postcopy live
>> +migration to QEMU.
>> +
>> +Guest async page faults, FOLL_NOWAIT and all other GUP features work
>> +just fine in combination with userfaults. Userfaults trigger async
>> +page faults in the guest scheduler so those guest processes that
>> +aren't waiting for userfaults (e.g. network bound) can keep running in
>> +the guest vcpus.
>> +
>> +It is generally beneficial to run one pass of precopy live migration
>> +just before starting postcopy live migration, in order to avoid
>> +generating userfaults for readonly guest regions.
>> +
>> +The implementation of postcopy live migration currently uses one
>> +single bidirectional socket but in the future two different sockets
>> +will be used (to reduce the latency of the userfaults to the minimum
>> +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
>> +
>> +The QEMU in the source node writes all pages that it knows are missing
>> +in the destination node, into the socket, and the migration thread of
>> +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
>> +ioctls on the userfaultfd in order to map the received pages into the
>> +guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
>> +
>> +A different postcopy thread in the destination node listens with
>> +poll() to the userfaultfd in parallel. When a POLLIN event is
>> +generated after a userfault triggers, the postcopy thread read()s from
>> +the userfaultfd and receives the fault address (or -EAGAIN in case the
>> +userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
>> +by the parallel QEMU migration thread).
>> +
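
A rough sketch of the poll()/read() step described above. The text does
not spell out the message format, so this assumes the struct uffd_msg
layout and UFFD_EVENT_PAGEFAULT from <linux/userfaultfd.h> in mainline
kernels; uffd_wait_fault() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for a fault message on fd.
 * Returns 1 and stores the faulting address in *addr,
 * 0 if there is nothing to handle (timeout, or EAGAIN because the
 * fault was already resolved by someone else), -1 on error.
 */
static int uffd_wait_fault(int fd, int timeout_ms, unsigned long long *addr)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;

	assert(addr != NULL);
	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;
	if (n == 0 || !(pfd.revents & POLLIN))
		return 0;

	n = read(fd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg) || msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

A postcopy-style thread would call this in a loop and forward each
returned address to the source side.
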
>> +After the QEMU postcopy thread (running in the destination node) gets
>> +the userfault address it writes the information about the missing page
>> +into the socket. The QEMU source node receives the information and
>> +roughly "seeks" to that page address and continues sending all
>> +remaining missing pages from that new page offset. Soon after that
>> +(just the time to flush the tcp_wmem queue through the network) the
>> +migration thread in the QEMU running in the destination node will
>> +receive the page that triggered the userfault and it'll map it as
>> +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
>> +was spontaneously sent by the source or if it was an urgent page
>> +requested through a userfault).
>> +
>> +By the time the userfaults start, the QEMU in the destination node
>> +doesn't need to keep any per-page state bitmap relative to the live
>> +migration around; only a single per-page bitmap has to be maintained in
>> +the QEMU running in the source node to know which pages are still
>> +missing in the destination node. The bitmap in the source node is
>> +checked to find which missing pages to send in round robin and we seek
>> +over it when receiving incoming userfaults. After sending each page,
>> +the bitmap is of course updated accordingly. It's also useful to avoid
>> +sending the same page twice (in case the userfault is read by the
>> +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
>> +thread).
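
The source-side bitmap with round-robin sending and seek-on-userfault
could be sketched as below. This is only an illustration of the idea,
not QEMU's code: NPAGES, struct missing_map and the map_*() helpers are
all invented names, and the guest size is a toy value.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NPAGES 64	/* hypothetical guest size in pages */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS ((NPAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct missing_map {
	unsigned long bits[NWORDS];	/* bit set: page still missing */
	size_t cursor;			/* next page the stream considers */
};

/* Mark every page as missing on the destination. */
static void map_init(struct missing_map *m)
{
	size_t i;

	for (i = 0; i < NWORDS; i++)
		m->bits[i] = 0;
	for (i = 0; i < NPAGES; i++)
		m->bits[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
	m->cursor = 0;
}

static int map_is_missing(const struct missing_map *m, size_t page)
{
	assert(page < NPAGES);
	return (m->bits[page / BITS_PER_WORD] >> (page % BITS_PER_WORD)) & 1UL;
}

/* A userfault for this page arrived: continue sending from here. */
static void map_seek(struct missing_map *m, size_t page)
{
	assert(page < NPAGES);
	m->cursor = page;
}

/*
 * Pick the next missing page at or after the cursor, wrapping around,
 * and clear its bit so it is never sent twice.
 * Returns the page index, or -1 when nothing is left to send.
 */
static long map_next(struct missing_map *m)
{
	size_t i;

	for (i = 0; i < NPAGES; i++) {
		size_t page = (m->cursor + i) % NPAGES;

		if (map_is_missing(m, page)) {
			m->bits[page / BITS_PER_WORD] &=
				~(1UL << (page % BITS_PER_WORD));
			m->cursor = (page + 1) % NPAGES;
			return (long)page;
		}
	}
	return -1;
}
```

If a userfault arrives for a page that was already sent, map_seek()
followed by map_next() simply skips ahead to the next page that is
still missing, which is how the duplicate send is avoided.
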
> 
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/