Message-ID: <20211221010312.GC1432915@nvidia.com>
Date: Mon, 20 Dec 2021 21:03:12 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Nadav Amit <namit@...are.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
John Hubbard <jhubbard@...dia.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Yang Shi <shy828301@...il.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Michal Hocko <mhocko@...nel.org>,
Rik van Riel <riel@...riel.com>,
Roman Gushchin <guro@...com>,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
Donald Dutile <ddutile@...hat.com>,
Christoph Hellwig <hch@....de>,
Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
Linux-MM <linux-mm@...ck.org>,
"open list:KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@...r.kernel.org>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
> be set if page_count() == 1 and we sync against fork(). (and that's the
> problem for gup-fast-only that I'm investigating right now, because it
> would then always have to fallback to the slow variant if the bit isn't
> already set)

I'm having a hard time imagining how gup_fast can maintain any sort of
bit - it lacks all forms of locks, so how can we do an atomic
test-and-set between two pieces of data?

I think the point of Linus's plan really is that the new bit is
derived from page_count: we get to set the new bit when we observe
page_count == 1 in various situations, and we clear the new bit
whenever we write protect with the intent to copy.
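
Something like this, purely as a sketch - the PG_cow_disable bit and
its accessors below are made up names, nothing like them exists today:

static inline void page_mark_exclusive(struct page *page)
{
	/*
	 * Caller holds the PTL, so fork() and GUP-slow cannot take a
	 * new reference between the refcount test and setting the bit.
	 */
	if (PageAnon(page) && page_count(page) == 1)
		SetPageCowDisable(page);	/* hypothetical accessor */
}

static inline void page_clear_exclusive(struct page *page)
{
	/* Any write protect done with the intent to copy later. */
	ClearPageCowDisable(page);		/* hypothetical accessor */
}
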
GUP doesn't interact with this bit. A writable page would still be the
second way you can say "you clearly have full ownership of the page",
so GUP just checks writability and bumps the refcount. The challenge
of GUP remains to sanely sequence it with things that are doing WP.

The elevated GUP refcount prevents the page from getting the bit set
again, regardless of what happens to it.
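
On the gup side that is basically the following (a rough sketch, not
the real gup_pte_range() flow):

	/* A writable PTE already proves exclusive ownership. */
	if ((flags & FOLL_WRITE) && !pte_write(pte))
		return 0;	/* punt to the slow path / fault */
	if (!try_get_page(page))
		return 0;
	/*
	 * page_count() > 1 from here on, so page_mark_exclusive()
	 * above cannot succeed behind our back until the pin drops.
	 */
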
Then on the WP side.. Obviously we clear the bit when applying a WP
for copy. So all the bad GUP cases are halted now, as with a cleared
bit and a != 1 refcount COW must happen.
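
In do_wp_page() terms that is roughly (again just a sketch, reusing
the made up bit from above):

	/* Anonymous page write fault, PTL held */
	if (PageCowDisable(page) || page_count(page) == 1) {
		/* We own the page outright: (re)mark it and reuse. */
		page_mark_exclusive(page);
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
	/* Bit clear and refcount != 1: COW must happen. */
	return wp_page_copy(vmf);
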
Then, it is also the case that most often a read-only page will have
this bit cleared, UFFD WP being the exception.

UFFD WP works fine as it will have the bit set in the cases we care
about and COW will not happen.

If the bit is not set then everything works as it does today and you
get extra COWs. We still have to fix the places that are missing the
extra COWs by making them check the page refcount.

It seems this new bit is acting as a 'COW disable', so the accuracy of
COW vs GUP & speculative pagerefs now relies on setting the bit as
aggressively as possible when it is safe and cheap to do so.

If I got it right, this is why it is not just mapcount reduced to 1
bit. It is quite different, even though "this VM owns the page
outright" sure sounds like "mapcount == 1"..

It seems like an interesting direction - the security properties seem
good as we only have to take care of sites applying WP to decide what
to do with the extra bit, and all places that set the bit to 1 do so
after testing refcount under various locks preventing PTE WP.
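
fork() is the obvious example of such a WP site, ie roughly this in a
copy_present_pte() style path (a sketch, using the hypothetical
helper from above):

	/*
	 * About to write protect the parent's PTE with the intent to
	 * copy later. The parent's mmap_lock is held for write and we
	 * hold the page table locks, so no WP fault can race with us.
	 */
	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
		page_clear_exclusive(page);
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
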
That just leaves the THP splitting.. I suppose we get the PTL, then
compute the current value of the new bit based on refcount and diffuse
it to all tail pages, then update the PMD and release the PTL. Safe
against concurrent WP - don't need DoubleMap horrors because it isn't
a counter.
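
Ie something roughly like this at split time (a sketch of the idea
only, with the made up accessors again, not of how the real split
code handles refcounts):

	/* PTL held, derive one value from the head page */
	bool exclusive = PageCowDisable(head) || page_count(head) == 1;
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		if (exclusive)
			SetPageCowDisable(head + i);
		else
			ClearPageCowDisable(head + i);
	}
	/* ... now rewrite the PMD as a PTE table and drop the PTL */
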
> Not set it stone, just an idea what I'm playing with right now ... and I
> have to tripple-check if
> * page is PTE mapped in the page table I'm walking
> * page_count() == 1
> Really means that "this is the only reference.". I do strongly believe
> so .. :)

AFAIK the only places that can break this are places putting struct
page memory into special PTEs. Which is horrific and is just bugs, but
I think I've seen it from time to time :(

ZONE_DEVICE is also messed up, of course, but that is just one more
reason ZONE_DEVICE refcounting needs fixing and you should ignore it.

Jason