[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54c492d7-ddcd-dcd0-7209-efb2847adf7c@redhat.com>
Date: Fri, 17 Dec 2021 21:17:58 +0100
From: David Hildenbrand <david@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
John Hubbard <jhubbard@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Yang Shi <shy828301@...il.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Michal Hocko <mhocko@...nel.org>,
Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
Roman Gushchin <guro@...com>,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
Donald Dutile <ddutile@...hat.com>,
Christoph Hellwig <hch@....de>,
Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
Linux-MM <linux-mm@...ck.org>,
"open list:KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@...r.kernel.org>,
"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
FAULT_FLAG_UNSHARE (!hugetlb)
On 17.12.21 20:22, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>>
>> If we are doing a COW, we need an *exclusive* access to the page. That
>> is not mapcount, that is the page ref.
>>
>> mapcount is insane, and I think this is making this worse again.
>
> Maybe I'm misreading this, but afaik
>
> - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>
> This just increments the page count, because mapcount == 1.
>
> - fork()
>
> - unmap in the original
>
> - child now has "mapcount == 1" on a page again, but refcount is
> elevated, and child HAS TO COW before writing.
Hi Linus,
This is just GUP before fork(), which is in general
problematic/incompatible with sharing. What we're concerned about in the
context of this series (see the security issue) is GUP after fork(). And
we're not changing GUP before fork() or even the COW logic in the
context of this series.
I agree that GUP before fork() has to be handled differently: during
fork(): don't share something that cannot possibly be shared in a safe
way. Don't allow COW semantics for something that is just broken with COW.
>
> Notice? "mapcount" is complete BS. The number of times a page is
> mapped is irrelevant for COW. All that matters is that we get an
> exclusive access to the page before we can write to it.
We have to be very careful about the two sides of the story: GUP before
fork and GUP after fork.
>
> Anybody who takes mapcount into account at COW time is broken, and it
> worries me how this is all mixing up with the COW logic.
>
> Now, maybe this "unshare" case is sufficiently different from COW that
> it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
> doesn't happen for a real COW.
>
> But honestly, for "unshare", I still don't see that the mapcount
> matters. What does "mapcount == 1" mean? Why is it meaningful?
I'll reply to your first mail in a sec.
GUP is the problem with COW, not ordinary processes mapping a page
(mapcount), where you will only get new sharers during fork() -- in a
very controlled way. So GUP has to take care to unshare *before* taking
a reference, such that we can never reach the point of missed COW. GUP
really is the problematic bit with it all.
Without GUP, we'd be living in a wonderful world in regards to COW.
>
> Because if COW does things right, and always breaks a COW based on
> refcount, then what's the problem with taking a read-only ref to the
> page whether it is mapped multiple times or mapped just once? Anybody
> who already had write access to the page can write to it regardless,
> and any new writers go through COW and get a new page.
Let's just take a look at what refcount does *wrong*. Let's use an
adjusted version of your example above, because it's a perfect fit:
1. mem = mmap(pagesize, MAP_PRIVATE)
-> refcount == 1
2. memset(mem, 0, pagesize); /* Page is mapped R/W */
3. fork() /* Page gets mapped R/O */
-> refcount > 1
4. child quits
-> refcount == 1
5. Take a R/O pin (RDMA, VFIO, ...)
-> refcount > 1
6. memset(mem, 0xff, pagesize);
-> Write fault -> COW
And GUP sees something different than our MM -- and this is perfectly
valid, the R/O pin is just reading page content we might be modifying
afterwrds. Take out 3. and 4. and it works as expected. This wouldn't
happen when relying on the mapcount.
And 3+4 can really be anything that results in a R/O mapping of an
anonymous page, even if it's just swapout followed by read fault that
maps the page R/O.
>
> I must be missing something realyl fundamental here, but to me it
> really reads like "mapcount can fundamentally never be relevant for
> COW, and if it's not relevant for COW, how can it be relevant for a
> read-only copy?"
It really is the right value to use. Only GUP is the problematic bit
that has to trigger unsharing to not mess up COW logic later. Take GUP
out of the equation and COW just works as expected with the mapcount --
as long as we can read an atomic value and synchronize against fork.
(again, still composing the other mail :) )
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists