linux-kernel - Re: [PATCH 1/2] mm: introduce put_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181219110856.GA18345@quack2.suse.cz>
Date:   Wed, 19 Dec 2018 12:08:56 +0100
From:   Jan Kara <jack@...e.cz>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     John Hubbard <jhubbard@...dia.com>, Jan Kara <jack@...e.cz>,
        Matthew Wilcox <willy@...radead.org>,
        Dave Chinner <david@...morbit.com>,
        Dan Williams <dan.j.williams@...el.com>,
        John Hubbard <john.hubbard@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux MM <linux-mm@...ck.org>, tom@...pey.com,
        Al Viro <viro@...iv.linux.org.uk>, benve@...co.com,
        Christoph Hellwig <hch@...radead.org>,
        Christopher Lameter <cl@...ux.com>,
        "Dalessandro, Dennis" <dennis.dalessandro@...el.com>,
        Doug Ledford <dledford@...hat.com>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Michal Hocko <mhocko@...nel.org>, mike.marciniszyn@...el.com,
        rcampbell@...dia.com,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > solution for that.  
> > 
> > So as I understand it, this would use page->_mapcount to store both the real
> > mapcount, and the dma pinned count (simply added together), but only do so for
> > file-backed (non-anonymous) pages:
> > 
> > 
> > __get_user_pages()
> > {
> > 	...
> > 	get_page(page);
> > 
> > 	if (!PageAnon)
> > 		atomic_inc(page->_mapcount);
> > 	...
> > }
> > 
> > put_user_page(struct page *page)
> > {
> > 	...
> > 	if (!PageAnon)
> > 		atomic_dec(&page->_mapcount);
> > 
> > 	put_page(page);
> > 	...
> > }
> > 
> > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > had in mind?
> 
> Mostly, with the extra two observations:
>     [1] We only need to know the pin count when a write back kicks in
>     [2] We need to protect GUP code with wait_for_write_back() in case
>         GUP is racing with a write back that might not the see the
>         elevated mapcount in time.
> 
> So for [2]
> 
> __get_user_pages()
> {
>     get_page(page);
> 
>     if (!PageAnon) {
>         atomic_inc(page->_mapcount);
> +       if (PageWriteback(page)) {
> +           // Assume we are racing and curent write back will not see
> +           // the elevated mapcount so wait for current write back and
> +           // force page fault
> +           wait_on_page_writeback(page);
> +           // force slow path that will fault again
> +       }
>     }
> }

This is not needed AFAICT. __get_user_pages() gets page reference (and it
should also increment page->_mapcount) under PTE lock. So at that point we
are sure we have writeable PTE nobody can change. So page_mkclean() has to
block on PTE lock to make PTE read-only and only after going through all
PTEs like this, it can check page->_mapcount. So the PTE lock provides
enough synchronization.

> For [1] only needing pin count during write back turns page_mkclean into
> the perfect spot to check for that so:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);
> 
>     // Big fat comment to explain what is going on
> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> +       SetPageDMAPined(page);
> +   } else {
> +       ClearPageDMAPined(page);
> +   }

This is the detail I'm not sure about: Why cannot rmap_walk_file() race
with e.g. zap_pte_range() which decrements page->_mapcount and thus the
check we do in page_mkclean() is wrong?

> 
>     // Maybe we want to leverage the int nature of return value so that
>     // we can express more than cleaned/truncated and express cleaned/
>     // truncated/pinned for benefit of caller and that way we do not
>     // even need one bit as page flags above.
> 
>     return cleaned;
> }
> 
> You do not want to change page_mapped() i do not see a need for that.
> 
> Then the whole discussion between Jan and Dave seems to indicate that
> the bounce mechanism will need to be in the fs layer and that we can
> not reuse the bio bounce mechanism. This means that more work is needed
> at the fs level for that (so that fs do not freak on bounce page).
> 
> Note that they are few gotcha where we need to preserve the pin count
> ie mostly in truncate code path that can remove page from page cache
> and overwrite the mapcount in the process, this would need to be fixed
> to not overwrite mapcount so that put_user_page does not set the map
> count to an invalid value turning the page into a bad state that will
> at one point trigger kernel BUG_ON();
>
> I am not saying block truncate, i am saying make sure it does not
> erase pin count and keep truncating happily. The how to handle truncate
> is a per existing GUP user discussion to see what they want to do for
> that.
> 
> Obviously a bit deeper analysis of all spot that use mapcount is needed
> to check that we are not breaking anything but from the top of my head
> i can not think of anything bad (migrate will abort and other things will
> assume the page is mapped even it is only in hardware page table, ...).

Hum, grepping for page_mapped() and page_mapcount(), this is actually going
to be non-trivial to get right AFAICT.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR