Date:   Tue, 6 Jul 2021 15:40:42 +1000
From:   Alistair Popple <apopple@...dia.com>
To:     Peter Xu <peterx@...hat.com>
CC:     <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Hugh Dickins <hughd@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Jerome Glisse <jglisse@...hat.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

> > > > > > >  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > > >  			     pte_t pte);
> > > > > > >  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > > > --- a/include/linux/mm_inline.h
> > > > > > > +++ b/include/linux/mm_inline.h
> > > > > > > @@ -4,6 +4,8 @@
> > > > > > >  
> > > > > > >  #include <linux/huge_mm.h>
> > > > > > >  #include <linux/swap.h>
> > > > > > > +#include <linux/userfaultfd_k.h>
> > > > > > > +#include <linux/swapops.h>
> > > > > > >  
> > > > > > >  /**
> > > > > > >   * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > > >  	update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > > >  			-thp_nr_pages(page));
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > > > + * replace a none pte.  NOTE!  This should only be called when *pte is already
> > > > > > > + * cleared so we will never accidentally replace something valuable.  Meanwhile
> > > > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > > > + * even better.
> > > > > > > + *
> > > > > > > + * Must be called with pgtable lock held.
> > > > > > > + */
> > > > > > > +static inline void
> > > > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > +			      pte_t *pte, pte_t pteval)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > > > +	bool arm_uffd_pte = false;
> > > > > > > +
> > > > > > > +	/* The current status of the pte should be "cleared" before calling */
> > > > > > > +	WARN_ON_ONCE(!pte_none(*pte));
> > > > > > > +
> > > > > > > +	if (vma_is_anonymous(vma))
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	/* A uffd-wp wr-protected normal pte */
> > > > > > > +	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > > > +		arm_uffd_pte = true;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * A uffd-wp wr-protected swap pte.  Note: this should even work for
> > > > > > > +	 * pte_swp_uffd_wp_special() too.
> > > > > > > +	 */
> > > > > > 
> > > > > > I'm probably missing something but when can we actually have this case and why
> > > > > > would we want to leave a special pte behind? From what I can tell this is
> > > > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > > > when not skipping swap pages.
> > > > > 
> > > > > Yes this is a good question..
> > > > > 
> > > > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > > > contains both swap and present ptes; imho that's pretty safe.  However for
> > > > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > > > they should reside in shmem page cache indeed.  The only missing piece seems to
> > > > > be the device private entries as you also spotted below.
> > > > 
> > > > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > > > here ...
> > > > 
> > > > > > > +	if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > > > 
> > > > ... however if this can never happen would a WARN_ON() be better? It would also
> > > > mean you could remove arm_uffd_pte.
> > > 
> > > Hmm, after a second thought I think we can't make it a WARN_ON_ONCE().. this
> > > can still be useful for private mapping of shmem files: in that case we'll have
> > > swap entry stored in pte not page cache, so after page reclaim it will contain
> > > a valid swap entry, while it's still "!anonymous".
> > 
> > There's something (probably obvious) I must still be missing here. During
> > reclaim won't a private shmem mapping still have a present pteval here?
> > Therefore it won't trigger this case - the uffd wp bit is set when the swap
> > entry is established further down in try_to_unmap_one() right?
> 
> > I agree if it's at the point when it gets reclaimed, however what if we zap a
> > pte of a page that already got reclaimed?  It should have the swap pte installed,
> > imho, which will have "is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)"==true.

Apologies for the delay getting back to this, I hope to find some more time
to look at this again this week.

I guess what I am missing is why we care about a swap pte for a reclaimed page
getting zapped. I thought that would imply the mapping was getting torn down,
although I suppose in that case you still want the uffd-wp to apply in case a
new mapping appears there?
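
To check my reading of the patch, here is a toy userspace model of the
decision made by zap_install_uffd_wp_if_needed() and
pte_install_uffd_wp_if_needed() as quoted in this thread. It is only a sketch:
the toy_* names are hypothetical stand-ins, not kernel types, and it ignores
locking and TLB handling entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_SWAP };

struct toy_pte {
	enum toy_pte_kind kind;
	bool uffd_wp;	/* stands in for pte_uffd_wp() / pte_swp_uffd_wp() */
};

/*
 * Returns true when the zap path would leave a special uffd-wp marker behind.
 * "slot" is the pte slot after it was cleared, "old" is the cleared value.
 */
static bool toy_should_arm_marker(bool vma_anonymous, bool drop_file_uffd_wp,
				  struct toy_pte slot, struct toy_pte old)
{
	/* Stands in for zap_drop_file_uffd_wp(details). */
	if (drop_file_uffd_wp)
		return false;

	/* The kernel only warns here (WARN_ON_ONCE); the toy aborts instead. */
	assert(slot.kind == TOY_PTE_NONE);

	/* Anonymous vmas never get the marker. */
	if (vma_anonymous)
		return false;

	/* A uffd-wp wr-protected present pte, or a uffd-wp swap pte. */
	return (old.kind == TOY_PTE_PRESENT || old.kind == TOY_PTE_SWAP) &&
	       old.uffd_wp;
}
```

In this model the case I am asking about is a !anonymous vma with
old.kind == TOY_PTE_SWAP and old.uffd_wp set: a marker is left behind unless
the caller asks to drop it.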

> > 
> > > > 
> > > > > > > +		arm_uffd_pte = true;
> > > > > > > +
> > > > > > > +	if (unlikely(arm_uffd_pte))
> > > > > > > +		set_pte_at(vma->vm_mm, addr, pte,
> > > > > > > +			   pte_swp_mkuffd_wp_special(vma));
> > > > > > > +#endif
> > > > > > > +}
> > > > > > > +
> > > > > > >  #endif
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 319552efc782..3453b8ae5f4f 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -73,6 +73,7 @@
> > > > > > >  #include <linux/perf_event.h>
> > > > > > >  #include <linux/ptrace.h>
> > > > > > >  #include <linux/vmalloc.h>
> > > > > > > +#include <linux/mm_inline.h>
> > > > > > >  
> > > > > > >  #include <trace/events/kmem.h>
> > > > > > >  
> > > > > > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > > > >  	return ret;
> > > > > > >  }
> > > > > > >  
> > > > > > > +/*
> > > > > > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > > > > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > > > > > + */
> > > > > > > +static inline void
> > > > > > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > > > > > +			      unsigned long addr, pte_t *pte,
> > > > > > > +			      struct zap_details *details, pte_t pteval)
> > > > > > > +{
> > > > > > > +	if (zap_drop_file_uffd_wp(details))
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > > > > > +}
> > > > > > > +
> > > > > > >  static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > >  				struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > >  				unsigned long addr, unsigned long end,
> > > > > > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > >  			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > > > >  							tlb->fullmm);
> > > > > > >  			tlb_remove_tlb_entry(tlb, pte, addr);
> > > > > > > +			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > > +						      ptent);
> > > > > > >  			if (unlikely(!page))
> > > > > > >  				continue;
> > > > > > >  
> > > > > > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > >  			continue;
> > > > > > >  		}
> > > > > > >  
> > > > > > > +		/*
> > > > > > > +		 * If this is a special uffd-wp marker pte... Drop it only if
> > > > > > > +		 * enforced to do so.
> > > > > > > +		 */
> > > > > > > +		if (unlikely(is_swap_special_pte(ptent))) {
> > > > > > > +			WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> > > > > > 
> > > > > > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> > > > > > 
> > > > > > > +			/*
> > > > > > > +			 * If this is a common unmap of ptes, keep this as is.
> > > > > > > +			 * Drop it only if this is a whole-vma destruction.
> > > > > > > +			 */
> > > > > > > +			if (zap_drop_file_uffd_wp(details))
> > > > > > > +				ptep_get_and_clear_full(mm, addr, pte,
> > > > > > > +							tlb->fullmm);
> > > > > > > +			continue;
> > > > > > > +		}
> > > > > > > +
> > > > > > >  		entry = pte_to_swp_entry(ptent);
> > > > > > >  		if (is_device_private_entry(entry) ||
> > > > > > >  		    is_device_exclusive_entry(entry)) {
> > > > > > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > >  				page_remove_rmap(page, false);
> > > > > > >  
> > > > > > >  			put_page(page);
> > > > > > > +			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > > +						      ptent);
> > > > > > 
> > > > > > Device entries only support anonymous vmas at present so should we drop this?
> > > > > > I guess I'm also a little confused by this because I'm not sure in what
> > > > > > scenarios you would want to zap swap entries but leave special swap ptes behind
> > > > > > (see also my earlier question above as well).
> > > > > 
> > > > > If that's the case, maybe indeed this is not needed, and I can use a
> > > > > > WARN_ON_ONCE here instead, just in case some facts change. E.g., would it be
> > > > > possible one day to have !anonymous support for device private entries?
> > > > > Frankly I have no solid idea on how device private is used, so some more
> > > > > context would be nice too; since I think you should know much better than me,
> > > > > so maybe it's a good chance to learn more about it. :)
> > > > 
> > > > Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
> > > > support for !anonymous device private entries at some point.
> > > > 
> > > > There's nothing too special about device private entries. They exist to store
> > > > some state and look up a device driver callback that gets called when the CPU
> > > > tries to access the page. For example see how do_swap_page() handles them:
> > > > 
> > > >                 } else if (is_device_private_entry(entry)) {
> > > >                         vmf->page = pfn_swap_entry_to_page(entry);
> > > >                         ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > > > 
> > > > Normally a device driver provides the implementation of migrate_to_ram() which
> > > > will copy the page back to CPU addressable memory and restore the PTE to a
> > > > normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
> > > > used to allow migration of a page to memory that is not directly CPU addressable
> > > > (eg. GPU memory). Hopefully that goes some way to explaining what they are, but
> > > > if you have more questions let me know!
> > > 
> > > > Thanks for offering these details!  So one thing I'm still uncertain about is
> > > > what exact type of memory is allowed to be mapped to device private.  E.g.,
> > > > would "anonymous shared" be allowed as "anonymous"?  I saw there seems to be
> > > > one ioctl defined that's used to bind these things:
> > > 
> > > 	DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),
> > > 
> > > > Then nouveau_dmem_migrate_chunk() will initiate the device private entries, am
> > > > I right?  Then to ask my previous question in another form: if the vaddr range
> > > > is coming from a userspace extension driver, would it be allowed to pass in
> > > > some vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?
> > 
> > I should have been more specific - device private pages currently only support
> > non-file/shmem backed pages. In other words the migrate_vma_*() calls will fail
> > for MAP_ANONYMOUS | MAP_SHARED when the target page is a device private page.
> > 
> > For a present page this is enforced in migrate_vma_pages() when trying to
> > migrate to a device private page:
> > 
> >                 mapping = page_mapping(page);
> > 
> >                 if (is_zone_device_page(newpage)) {
> >                         if (is_device_private_page(newpage)) {
> >                                 /*
> >                                  * For now only support private anonymous when
> >                                  * migrating to un-addressable device memory.
> >                                  */
> >                                 if (mapping) {
> >                                         migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> >                                         continue;
> >                                 }
> 
> Ah fair enough. :)
> 
> When I looked again, I did also see that there's vma_is_anonymous() check right
> at the entry of migrate_vma_insert_page() too.
> 
> I'll convert this device private call to a WARN_ON_ONCE() then, with proper
> comments explaining why.
> 
> Thanks,
> 
> 



