linux-kernel - Re: [PATCH v5 17/24] mm/gup: track FOLL

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8424f891-271d-5c34-8f7c-ebf3e3aa6664@nvidia.com>
Date:   Mon, 18 Nov 2019 16:22:43 -0800
From:   John Hubbard <jhubbard@...dia.com>
To:     Jan Kara <jack@...e.cz>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Alex Williamson <alex.williamson@...hat.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        Christoph Hellwig <hch@...radead.org>,
        Dan Williams <dan.j.williams@...el.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Dave Chinner <david@...morbit.com>,
        David Airlie <airlied@...ux.ie>,
        "David S . Miller" <davem@...emloft.net>,
        Ira Weiny <ira.weiny@...el.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Jens Axboe <axboe@...nel.dk>,
        Jonathan Corbet <corbet@....net>,
        Jérôme Glisse <jglisse@...hat.com>,
        Magnus Karlsson <magnus.karlsson@...el.com>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Michal Hocko <mhocko@...e.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Paul Mackerras <paulus@...ba.org>,
        Shuah Khan <shuah@...nel.org>,
        Vlastimil Babka <vbabka@...e.cz>, <bpf@...r.kernel.org>,
        <dri-devel@...ts.freedesktop.org>, <kvm@...r.kernel.org>,
        <linux-block@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        <linux-fsdevel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
        <linux-media@...r.kernel.org>, <linux-rdma@...r.kernel.org>,
        <linuxppc-dev@...ts.ozlabs.org>, <netdev@...r.kernel.org>,
        <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages

On 11/18/19 3:58 AM, Jan Kara wrote:
> On Thu 14-11-19 21:53:33, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>>    bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
> 						^^ missing this reference
> in the changelog...

I'll add that. 

> 
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> Suggested-by: Jan Kara <jack@...e.cz>
>> Suggested-by: Jérôme Glisse <jglisse@...hat.com>
>> Signed-off-by: John Hubbard <jhubbard@...dia.com>
>> ---
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6588d2e02628..db872766480f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>>  	return true;
>>  }
>>  
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
>>  static inline void put_page(struct page *page)
>>  {
>>  	page = compound_head(page);
>> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>>  		__put_page(page);
>>  }
>>  
>> -/**
>> - * put_user_page() - release a gup-pinned page
>> - * @page:            pointer to page to be released
>> +/*
>> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
>> + * the page's refcount so that two separate items are tracked: the original page
>> + * reference count, and also a new count of how many get_user_pages() calls were
> 							^^ pin_user_pages()
> 
>> + * made against the page. ("gup-pinned" is another term for the latter).
>> + *
>> + * With this scheme, get_user_pages() becomes special: such pages are marked
> 			^^^ pin_user_pages()
> 
>> + * as distinct from normal pages. As such, the put_user_page() call (and its
>> + * variants) must be used in order to release gup-pinned pages.
>> + *
>> + * Choice of value:
>>   *
>> - * Pages that were pinned via pin_user_pages*() must be released via either
>> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
>> - * eventually such pages can be separately tracked and uniquely handled. In
>> - * particular, interactions with RDMA and filesystems need special handling.
>> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
>> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
> 				^^^ pin_user_pages()
> 

Yes.

>> + * due to the fact that adding an even power of two to the page refcount has
>> + * the effect of using only the upper N bits, for the code that counts up using
>> + * the bias value. This means that the lower bits are left for the exclusive
>> + * use of the original code that increments and decrements by one (or at least,
>> + * by much smaller values than the bias value).
>>   *
>> - * put_user_page() and put_page() are not interchangeable, despite this early
>> - * implementation that makes them look the same. put_user_page() calls must
>> - * be perfectly matched up with pin*() calls.
>> + * Of course, once the lower bits overflow into the upper bits (and this is
>> + * OK, because subtraction recovers the original values), then visual inspection
>> + * no longer suffices to directly view the separate counts. However, for normal
>> + * applications that don't have huge page reference counts, this won't be an
>> + * issue.
>> + *
>> + * Locking: the lockless algorithm described in page_cache_get_speculative()
>> + * and page_cache_gup_pin_speculative() provides safe operation for
>> + * get_user_pages and page_mkclean and other calls that race to set up page
>> + * table entries.
>>   */
> ...
>> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>>  	refs = __record_subpages(page, addr, end, pages + *nr);
>>  
>> -	head = try_get_compound_head(head, refs);
>> -	if (!head)
>> -		return 0;
>> +	if (flags & FOLL_PIN) {
>> +		head = page;
>> +		if (unlikely(!user_page_ref_inc(head)))
>> +			return 0;
>> +		head = page;
> 
> Why do you assign 'head' twice? Also the refcounting logic is repeated
> several times so perhaps you can factor it out in to a helper function or
> even move it to __record_subpages()?

OK.

> 
>> +	} else {
>> +		head = try_get_compound_head(head, refs);
>> +		if (!head)
>> +			return 0;
>> +	}
>>  
>>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>  		put_compound_head(head, refs);
> 
> So this will do the wrong thing for FOLL_PIN. We took just one "pin"
> reference there but here we'll release 'refs' normal references AFAICT.
> Also the fact that you take just one pin reference for each huge page
> substantially changes how GUP refcounting works in the huge page case.
> Currently, FOLL_GET users can be completely agnostic of huge pages. So you
> can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
> drop page references from each bio completion function. With your new
> FOLL_PIN behavior you cannot do that and I believe it will be a problem for
> some users. So I think you have to maintain the behavior that you increase
> the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.
> 

Yes, completely agreed, this was a (big) oversight. I went through the same
reasoning and reached your conclusions, in __gup_device_huge(), but then
did it wrong in these functions. Will fix.

thanks,
-- 
John Hubbard
NVIDIA