Message-ID: <20210209000058.GA78818@xz-x1>
Date:   Mon, 8 Feb 2021 19:00:58 -0500
From:   Peter Xu <peterx@...hat.com>
To:     Axel Rasmussen <axelrasmussen@...gle.com>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Alexey Dobriyan <adobriyan@...il.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Catalin Marinas <catalin.marinas@....com>,
        Chinwen Chang <chinwen.chang@...iatek.com>,
        Huang Ying <ying.huang@...el.com>,
        Ingo Molnar <mingo@...hat.com>, Jann Horn <jannh@...gle.com>,
        Jerome Glisse <jglisse@...hat.com>,
        Lokesh Gidra <lokeshgidra@...gle.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Michal Koutný <mkoutny@...e.com>,
        Michel Lespinasse <walken@...gle.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Nicholas Piggin <npiggin@...il.com>, Shaohua Li <shli@...com>,
        Shawn Anastasio <shawn@...stas.io>,
        Steven Rostedt <rostedt@...dmis.org>,
        Steven Price <steven.price@....com>,
        Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        Adam Ruprecht <ruprecht@...gle.com>,
        Cannon Matthews <cannonmatthews@...gle.com>,
        "Dr . David Alan Gilbert" <dgilbert@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Oliver Upton <oupton@...gle.com>
Subject: Re: [PATCH v4 05/10] userfaultfd: add minor fault registration mode

On Thu, Feb 04, 2021 at 10:34:28AM -0800, Axel Rasmussen wrote:
> This feature allows userspace to intercept "minor" faults. By "minor"
> faults, I mean the following situation:
> 
> Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have
> already been allocated & filled with some contents. The UFFD mapping
> has not yet been faulted in; when it is touched for the first time,
> this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.
> 
> This commit adds the new registration mode, and sets the relevant flag
> on the VMAs being registered. In the hugetlb fault path, if we find
> that we have huge_pte_none(), but find_lock_page() does indeed find an
> existing page, then we have a "minor" fault, and if the VMA has the
> userfaultfd registration flag, we call into userfaultfd to handle it.
> 
> Why add a new registration mode, as opposed to adding a feature to
> MISSING registration, like UFFD_FEATURE_SIGBUS?
> 
> - The semantics are significantly different. UFFDIO_COPY or
>   UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
>   would instead just memset() or memcpy() or whatever via the non-UFFD
>   mapping. Unlike MISSING registration, MINOR registration only makes
>   sense for hugetlbfs (or, in the future, shmem), as this is the only
>   way to get two VMAs to a single set of underlying pages.
> 
> - Doing so would make handle_userfault()'s "reason" argument confusing.
>   We'd pass in "MISSING" even if the pages weren't really missing.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@...gle.com>
> ---
>  fs/proc/task_mmu.c               |  1 +
>  fs/userfaultfd.c                 | 81 ++++++++++++++++++++------------
>  include/linux/mm.h               |  1 +
>  include/linux/userfaultfd_k.h    | 15 +++++-
>  include/trace/events/mmflags.h   |  1 +
>  include/uapi/linux/userfaultfd.h | 15 +++++-
>  mm/hugetlb.c                     | 32 +++++++++++++
>  7 files changed, 112 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..94e951ea3e03 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  		[ilog2(VM_MTE)]		= "mt",
>  		[ilog2(VM_MTE_ALLOWED)]	= "",
>  #endif
> +		[ilog2(VM_UFFD_MINOR)]	= "ui",
>  #ifdef CONFIG_ARCH_HAS_PKEYS
>  		/* These come out via ProtectionKey: */
>  		[ilog2(VM_PKEY_BIT0)]	= "",
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index a0f66e12026b..c643cf13d957 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -197,24 +197,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
>  	msg_init(&msg);
>  	msg.event = UFFD_EVENT_PAGEFAULT;
>  	msg.arg.pagefault.address = address;
> +	/*
> +	 * These flags indicate why the userfault occurred:
> +	 * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
> +	 * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
> +	 * - Neither of these flags being set indicates a MISSING fault.
> +	 *
> +	 * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
> +	 * fault. Otherwise, it was a read fault.
> +	 */
>  	if (flags & FAULT_FLAG_WRITE)
> -		/*
> -		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
> -		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
> -		 * was a read fault, otherwise if set it means it's
> -		 * a write fault.
> -		 */
>  		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
>  	if (reason & VM_UFFD_WP)
> -		/*
> -		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
> -		 * not set in a UFFD_EVENT_PAGEFAULT, it means it was
> -		 * a missing fault, otherwise if set it means it's a
> -		 * write protect fault.
> -		 */
>  		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
> +	if (reason & VM_UFFD_MINOR)
> +		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
>  	if (features & UFFD_FEATURE_THREAD_ID)
>  		msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
>  	return msg;
> @@ -401,8 +398,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	BUG_ON(ctx->mm != mm);
>  
> -	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> -	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +	/* Any unrecognized flag is a bug. */
> +	VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
> +	/* 0 or > 1 flags set is a bug; we expect exactly 1. */
> +	VM_BUG_ON(!reason || !!(reason & (reason - 1)));
>  
>  	if (ctx->features & UFFD_FEATURE_SIGBUS)
>  		goto out;
> @@ -612,7 +611,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>  		for (vma = mm->mmap; vma; vma = vma->vm_next)
>  			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
>  				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -				vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +				vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  			}
>  		mmap_write_unlock(mm);
>  
> @@ -644,7 +643,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>  	octx = vma->vm_userfaultfd_ctx.ctx;
>  	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  		return 0;
>  	}
>  
> @@ -726,7 +725,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  	} else {
>  		/* Drop uffd context if remap feature not enabled */
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  	}
>  }
>  
> @@ -867,12 +866,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  		cond_resched();
>  		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> -		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
>  		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
>  			prev = vma;
>  			continue;
>  		}
> -		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
>  		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
>  				 new_flags, vma->anon_vma,
>  				 vma->vm_file, vma->vm_pgoff,
> @@ -1305,9 +1304,29 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
>  				     unsigned long vm_flags)
>  {
>  	/* FIXME: add WP support to hugetlbfs and shmem */
> -	return vma_is_anonymous(vma) ||
> -		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> -		 !(vm_flags & VM_UFFD_WP));
> +	if (vm_flags & VM_UFFD_WP) {
> +		if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> +			return false;
> +	}
> +
> +	if (vm_flags & VM_UFFD_MINOR) {
> +		/*
> +		 * The use case for minor registration (intercepting minor
> +		 * faults) is to handle the case where a page is present, but
> +		 * needs to be modified before it can be used. This only makes
> +		 * sense when you have two mappings to the same underlying
> +		 * pages (one UFFD registered, one not), but the memory doesn't
> +		 * have to be shared (consider one process mapping a hugetlbfs
> +		 * file with MAP_SHARED, and then a second process doing
> +		 * MAP_PRIVATE).

No strong opinion, but I'd drop this whole comment block:

  - "What is a minor fault" should already be covered in the documentation file.

  - "Two mappings" seems slightly superfluous too, since minor faults could
    perhaps still be used with TRUNCATE+UFFDIO_COPY, if we wanted to.

  - "Memory doesn't have to be shared" also reads a bit oddly when no code
    here checks against "shared" at all, I'd say. :)

The FIXME below it is fine.

If you agree with the above, feel free to add my r-b after dropping the chunk:

Reviewed-by: Peter Xu <peterx@...hat.com>
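
As an aside on the handle_userfault() hunk quoted above: the new
"exactly one reason flag" assertion relies on the usual x & (x - 1)
trick, which clears the lowest set bit and so is non-zero only when more
than one bit is set. Below is a minimal userspace sketch of that check.
The DEMO_* macros and demo_* helpers are hypothetical stand-ins invented
for illustration; the bit values are not the kernel's VM_UFFD_* values,
and assert() is only a rough analogue of VM_BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VM_UFFD_* bits; values are illustrative only. */
#define DEMO_UFFD_MISSING	0x1UL
#define DEMO_UFFD_WP		0x2UL
#define DEMO_UFFD_MINOR		0x4UL
#define DEMO_UFFD_FLAGS \
	(DEMO_UFFD_MISSING | DEMO_UFFD_WP | DEMO_UFFD_MINOR)

/* Returns true only if exactly one recognized flag is set in reason. */
static inline bool demo_reason_is_valid(unsigned long reason)
{
	/* Any unrecognized flag is invalid. */
	if (reason & ~DEMO_UFFD_FLAGS)
		return false;
	/*
	 * reason & (reason - 1) clears the lowest set bit, so it is
	 * non-zero only when two or more bits are set. Zero bits set is
	 * rejected separately.
	 */
	if (!reason || (reason & (reason - 1)))
		return false;
	return true;
}

/* Userspace analogue of the two VM_BUG_ON() lines: abort on a bad reason. */
static inline void demo_check_reason(unsigned long reason)
{
	assert(demo_reason_is_valid(reason));
}
```

With these definitions, demo_reason_is_valid(DEMO_UFFD_MINOR) is true,
while 0 and DEMO_UFFD_MISSING | DEMO_UFFD_WP are both rejected.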

Thanks,

-- 
Peter Xu
