lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191121093941.GA18190@quack2.suse.cz>
Date:   Thu, 21 Nov 2019 10:39:41 +0100
From:   Jan Kara <jack@...e.cz>
To:     John Hubbard <jhubbard@...dia.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Alex Williamson <alex.williamson@...hat.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Christoph Hellwig <hch@...radead.org>,
        Dan Williams <dan.j.williams@...el.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Dave Chinner <david@...morbit.com>,
        David Airlie <airlied@...ux.ie>,
        "David S . Miller" <davem@...emloft.net>,
        Ira Weiny <ira.weiny@...el.com>, Jan Kara <jack@...e.cz>,
        Jason Gunthorpe <jgg@...pe.ca>, Jens Axboe <axboe@...nel.dk>,
        Jonathan Corbet <corbet@....net>,
        Jérôme Glisse <jglisse@...hat.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Michal Hocko <mhocko@...e.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Paul Mackerras <paulus@...ba.org>,
        Shuah Khan <shuah@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>, bpf@...r.kernel.org,
        dri-devel@...ts.freedesktop.org, kvm@...r.kernel.org,
        linux-block@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kselftest@...r.kernel.org,
        linux-media@...r.kernel.org, linux-rdma@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org, netdev@...r.kernel.org,
        linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v7 17/24] mm/gup: track FOLL_PIN pages

On Wed 20-11-19 23:13:47, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
>     https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
>     https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
>     https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara <jack@...e.cz>
> Suggested-by: Jérôme Glisse <jglisse@...hat.com>
> Signed-off-by: John Hubbard <jhubbard@...dia.com>

Thanks for the patch! We are mostly getting there. Some smaller comments
below.

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:	pointer to page to be marked
> + * @Return:	true for success, false for failure
> + */
> +__must_check bool try_pin_compound_head(struct page *page, int refs)
> +{
> +	page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS * refs);
> +	if (!page)
> +		return false;
> +
> +	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> +	return true;
> +}
> +
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +static bool __put_devmap_managed_user_page(struct page *page)

Probably call this __unpin_devmap_managed_user_page()? To match the later
conversion of put_user_page() to unpin_user_page()?

> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {
> +		int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> +
> +		__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> +		/*
> +		 * devmap page refcounts are 1-based, rather than 0-based: if
> +		 * refcount is 1, then the page is free and the refcount is
> +		 * stable because nobody holds a reference on the page.
> +		 */
> +		if (count == 1)
> +			free_devmap_managed_page(page);
> +		else if (!count)
> +			__put_page(page);
> +	}
> +
> +	return is_devmap;
> +}
> +#else
> +static bool __put_devmap_managed_user_page(struct page *page)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
> +
> +/**
> + * put_user_page() - release a dma-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via pin_user_pages*() must be released via either
> + * put_user_page(), or one of the put_user_pages*() routines. This is so that
> + * such pages can be separately tracked and uniquely handled. In particular,
> + * interactions with RDMA and filesystems need special handling.
> + */
> +void put_user_page(struct page *page)
> +{
> +	page = compound_head(page);
> +
> +	/*
> +	 * For devmap managed pages we need to catch refcount transition from
> +	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> +	 * page is free and we need to inform the device driver through
> +	 * callback. See include/linux/memremap.h and HMM for details.
> +	 */
> +	if (__put_devmap_managed_user_page(page))
> +		return;
> +
> +	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> +		__put_page(page);
> +
> +	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -237,10 +327,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  
>  	page = vm_normal_page(vma, address, pte);
> -	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> +	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>  		/*
> -		 * Only return device mapping pages in the FOLL_GET case since
> -		 * they are only valid while holding the pgmap reference.
> +		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
> +		 * case since they are only valid while holding the pgmap
> +		 * reference.
>  		 */
>  		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
>  		if (*pgmap)
> @@ -283,6 +374,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  			page = ERR_PTR(-ENOMEM);
>  			goto out;
>  		}
> +	} else if (flags & FOLL_PIN) {
> +		if (unlikely(!try_pin_page(page))) {
> +			page = ERR_PTR(-ENOMEM);
> +			goto out;
> +		}

Use grab_page() here?

>  	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
> @@ -1890,9 +2000,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  
> -		head = try_get_compound_head(page, 1);
> -		if (!head)
> -			goto pte_unmap;
> +		if (flags & FOLL_PIN) {
> +			head = page;
> +			if (unlikely(!try_pin_page(head)))
> +				goto pte_unmap;
> +		} else {
> +			head = try_get_compound_head(page, 1);
> +			if (!head)
> +				goto pte_unmap;
> +		}

Why don't you use grab_page() here? Also you seem to loose the head =
compound_head(page) indirection here for the FOLL_PIN case?

>  
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			put_page(head);
> @@ -1946,12 +2062,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  
>  		pgmap = get_dev_pagemap(pfn, pgmap);
>  		if (unlikely(!pgmap)) {
> -			undo_dev_pagemap(nr, nr_start, pages);
> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>  			return 0;
>  		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
> -		get_page(page);
> +
> +		if (flags & FOLL_PIN) {
> +			if (unlikely(!try_pin_page(page))) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				return 0;
> +			}
> +		} else
> +			get_page(page);
> +

Use grab_page() here?

>  		(*nr)++;
>  		pfn++;
>  	} while (addr += PAGE_SIZE, addr != end);
...
> @@ -2025,12 +2149,31 @@ static int __record_subpages(struct page *page, unsigned long addr,
>  	return nr;
>  }
>  
> -static void put_compound_head(struct page *page, int refs)
> +static bool grab_compound_head(struct page *head, int refs, unsigned int flags)
>  {
> +	if (flags & FOLL_PIN) {
> +		if (unlikely(!try_pin_compound_head(head, refs)))
> +			return false;
> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> +{
> +	struct page *head = compound_head(page);
> +
> +	if (flags & FOLL_PIN)
> +		refs *= GUP_PIN_COUNTING_BIAS;
> +
>  	/* Do a get_page() first, in case refs == page->_refcount */
> -	get_page(page);
> -	page_ref_sub(page, refs);
> -	put_page(page);
> +	get_page(head);
> +	page_ref_sub(head, refs);
> +	put_page(head);
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
> @@ -2064,14 +2207,13 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  
>  	head = pte_page(pte);
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> -	refs = __record_subpages(page, addr, end, pages + *nr);
> +	refs = record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> +	if (!grab_compound_head(head, refs, flags))

Are you sure this is correct? Historically we seem to have always had logic
like:

	head = compound_head(pte_page / pmd_page / ... (orig))

in this code. And you removed this now. Looking at the code I'm not sure
whether the compound_head() indirection is really needed or not. We seem to
have already huge page head in the page table but maybe there's some subtle
case I'm missing. So I'd be calmer if we left the head=compound_head(...)
in the code but if you really want to remove it, I'd like to see Ack from
someone actually familiar with huge pages - e.g. Kirill Shutemov...

And even if we find out that compound_head() indirection isn't really
needed, that is big enough change in the logic that it would deserve to be
done in a separate patch (if only for debugging by bisection purposes).

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..981a9ea0b83f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
...
> @@ -5034,8 +5052,20 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	pte = huge_ptep_get((pte_t *)pmd);
>  	if (pte_present(pte)) {
>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
>  		if (flags & FOLL_GET)
>  			get_page(page);
> +		else if (flags & FOLL_PIN) {
> +			/*
> +			 * try_pin_page() is not actually expected to fail
> +			 * here because we hold the ptl.
> +			 */
> +			if (unlikely(!try_pin_page(page))) {
> +				WARN_ON_ONCE(1);
> +				page = NULL;
> +				goto out;
> +			}
> +		}

Use grab_page() here?

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ