[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPcyv4iDdOGh6wCug9sZsrPdby1Sv1jG5aRUA5PjL0dDW7eNNA@mail.gmail.com>
Date: Thu, 20 Dec 2018 08:57:22 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: Jerome Glisse <jglisse@...hat.com>
Cc: John Hubbard <jhubbard@...dia.com>, Jan Kara <jack@...e.cz>,
Matthew Wilcox <willy@...radead.org>,
Dave Chinner <david@...morbit.com>,
John Hubbard <john.hubbard@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux MM <linux-mm@...ck.org>, tom@...pey.com,
Al Viro <viro@...iv.linux.org.uk>, benve@...co.com,
Christoph Hellwig <hch@...radead.org>,
Christopher Lameter <cl@...ux.com>,
"Dalessandro, Dennis" <dennis.dalessandro@...el.com>,
Doug Ledford <dledford@...hat.com>,
Jason Gunthorpe <jgg@...pe.ca>,
Michal Hocko <mhocko@...nel.org>,
Mike Marciniszyn <mike.marciniszyn@...el.com>,
rcampbell@...dia.com,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Thu, Dec 20, 2018 at 8:50 AM Jerome Glisse <jglisse@...hat.com> wrote:
>
> On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> > On 12/19/18 3:08 AM, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > >>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > >>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> > >>> solution for that.
> > >>>
> > >>> So as I understand it, this would use page->_mapcount to store both the real
> > >>> mapcount, and the dma pinned count (simply added together), but only do so for
> > >>> file-backed (non-anonymous) pages:
> > >>>
> > >>>
> > >>> __get_user_pages()
> > >>> {
> > >>> ...
> > >>> get_page(page);
> > >>>
> > >>> if (!PageAnon)
> > >>> atomic_inc(page->_mapcount);
> > >>> ...
> > >>> }
> > >>>
> > >>> put_user_page(struct page *page)
> > >>> {
> > >>> ...
> > >>> if (!PageAnon)
> > >>> atomic_dec(&page->_mapcount);
> > >>>
> > >>> put_page(page);
> > >>> ...
> > >>> }
> > >>>
> > >>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > >>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you
> > >>> had in mind?
> > >>
> > >> Mostly, with the extra two observations:
> > >> [1] We only need to know the pin count when a write back kicks in
> > >> [2] We need to protect GUP code with wait_for_write_back() in case
> > >> GUP is racing with a write back that might not the see the
> > >> elevated mapcount in time.
> > >>
> > >> So for [2]
> > >>
> > >> __get_user_pages()
> > >> {
> > >> get_page(page);
> > >>
> > >> if (!PageAnon) {
> > >> atomic_inc(page->_mapcount);
> > >> + if (PageWriteback(page)) {
> > >> + // Assume we are racing and curent write back will not see
> > >> + // the elevated mapcount so wait for current write back and
> > >> + // force page fault
> > >> + wait_on_page_writeback(page);
> > >> + // force slow path that will fault again
> > >> + }
> > >> }
> > >> }
> > >
> > > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > > should also increment page->_mapcount) under PTE lock. So at that point we
> > > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > > block on PTE lock to make PTE read-only and only after going through all
> > > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > > enough synchronization.
> > >
> > >> For [1] only needing pin count during write back turns page_mkclean into
> > >> the perfect spot to check for that so:
> > >>
> > >> int page_mkclean(struct page *page)
> > >> {
> > >> int cleaned = 0;
> > >> + int real_mapcount = 0;
> > >> struct address_space *mapping;
> > >> struct rmap_walk_control rwc = {
> > >> .arg = (void *)&cleaned,
> > >> .rmap_one = page_mkclean_one,
> > >> .invalid_vma = invalid_mkclean_vma,
> > >> + .mapcount = &real_mapcount,
> > >> };
> > >>
> > >> BUG_ON(!PageLocked(page));
> > >>
> > >> if (!page_mapped(page))
> > >> return 0;
> > >>
> > >> mapping = page_mapping(page);
> > >> if (!mapping)
> > >> return 0;
> > >>
> > >> // rmap_walk need to change to count mapping and return value
> > >> // in .mapcount easy one
> > >> rmap_walk(page, &rwc);
> > >>
> > >> // Big fat comment to explain what is going on
> > >> + if ((page_mapcount(page) - real_mapcount) > 0) {
> > >> + SetPageDMAPined(page);
> > >> + } else {
> > >> + ClearPageDMAPined(page);
> > >> + }
> > >
> > > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > > check we do in page_mkclean() is wrong?
> >
> > Right. This looks like a dead end, after all. We can't lock a whole chunk
> > of "all these are mapped, hold still while we count you" pages. It's not
> > designed to allow that at all.
> >
> > IMHO, we are now back to something like dynamic_page, which provides an
> > independent dma pinned count.
>
> I will keep looking because allocating a structure for every GUP is
> insane to me they are user out there that are GUPin GigaBytes of data
This is not the common case.
> and it gonna waste tons of memory just to fix crappy hardware.
This is the common case.
Please refrain from the hyperbolic assessments.
Powered by blists - more mailing lists