[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wgi31FKc9AL6m87+pb2B79V2g_QjdhmtJNW8Pnq2ERQ-Q@mail.gmail.com>
Date: Sun, 10 Jan 2021 11:30:57 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linux-MM <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
Peter Xu <peterx@...hat.com>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Minchan Kim <minchan@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Hugh Dickins <hughd@...gle.com>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Matthew Wilcox <willy@...radead.org>,
Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
Kees Cook <keescook@...omium.org>,
John Hubbard <jhubbard@...dia.com>,
Leon Romanovsky <leonro@...dia.com>,
Jason Gunthorpe <jgg@...pe.ca>, Jan Kara <jack@...e.cz>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Nadav Amit <nadav.amit@...il.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> COW is about "I'm about to write to this page, and that means I need
> an _exclusive_ page so that I don't write to a page that somebody else
> is using".
So this kind of fundamentally explains why I hate the games we used to
play wrt page_mapcount(): they were fundamentally fragile. I _much_
prefer just having the rule that we use page_count(), which the above
simple and straightforward single sentence explains 100%.
This gets back to the fact that especially considering how we've had
subtle bugs here (the "wrong-way COW" issue has existed since
literally the first GUP ever, so it goes back decades), I want the
core VM rules to be things that can be explained basically from simple
"first principles".
And the reason I argue for the current direction that I'm pushing, is
exactly that the above is a very simple "first principle" for why COW
exists.
If the rule for COW is simply "I will always COW if it's not clear
that I'm the exclusive user", then COW itself is very simple to think
about.
The other rule I want to stress is that COW is common, and that argues
against the model we used to have of "let's lock the page to make sure
that everything else is stable". That model was garbage anyway, since
page locking doesn't even guarantee any stability wrt exclusive use in
the first place (ie GUP being another example), but it's why I truly
detested the old model that depended so much on the page lock to
serialize things.
So if you start off with the rule that "I will always COW unless I can
trivially see I'm the only owner", then I think we have really made
for a really clear and unambiguous rule.
And remember: COW is only an issue for private mappings. So pretty
much BY DEFINITION, doing a COW is always safe for all normal
circumstances.
Now, this is where it does get subtle: that "all normal circumstances"
part. The one special case is a cache-coherent GUP. It's arguable
whether "pinned" should matter or not, and it would obviously be
better if "pinned" simply didn't matter at all (and the only issue
with any long-term pinning would simply be about resource counting).
The current approach I'm advocating is "coherency means that it must
have been writable", and then the way to solve the whole "Oh, it's
shared with something else" is to simply never accept making it
read-only, because BY DEFINITION it's not _really_ read-only (because
we know we've created that other alias of the virtual address that is
*not* controlled by the page table protection bits).
Notice how this is all both conceptually fairly simple (ie I can
explain the rules in plain English without really making any complex
argument) and it is arguably internally fairly self-consistent (ie the
whole notion of "oh, there's another thing that has write access that
page but doesn't go through the page table, so trying to make it
read-only in the page tables is a nonsensical operation").
Are the end results wrt something like soft-dirty a bit odd? Not
really. If you do soft-dirty, such a GUP-shared page would simply
always show up as dirty. That's still consistent with the rules. If
somebody else may be writing to it because of GUP, that page really
*isn't* clean, and us marking it read-only would be just lying about
things.
I'm admittedly not very happy about mprotect() itself, though. It's
actually ok to do the mprotect(PROT_READ) and turn the page read-only:
that will also disable COW itself (because a page fault will now be a
SIGSEGV, not a COW).
But if you then make it writable again with mprotect(PROT_WRITE), you
*have* lost the WP bit, and you'll COW on a write, and lose the
coherency.
Now, I'm willing to just say: "if you do page pinning, and then do
mprotect(PROT_READ), and then do mprotect(PROT_WRITE) and then write
to the page, you really do get to keep both broken pieces". IOW, I'm
perfectly happy to just say you get what you deserve.
But I'd also be perfectly happy to make the whole "I'm the exclusive
user" logic a bit more extensive. Right now it's basically _purely_
page_count(), and the other part of "I'm the exclusive owner" is that
the RW bit in the page table is simply not clear. That makes things
really easy for COW: it just won't happen in the first place if you
broke the "obviously exclusive" rule with GUP.
But we _could_ do something slightly smarter. But "page_mapcount()" is
not that "slightly smarter" thing, because we already know it's broken
wrt lots of other uses (GUP, page cache, whatever).
Just having a bit in the page flags for "I already made this
exclusive, and fork() will keep it so" is I feel the best option. In a
way, "page is writable" right now _is_ that bit. By definition, if you
have a writable page in an anonymous mapping, that is an exclusive
user.
But because "writable" has these interactions with other operations,
it would be better if it was a harder bit than that "maybe_pinned()",
though. It would be lovely if a regular non-pinning write-GUP just
always set it, for example.
"maybe_pinned()" is good enough for the fork() case, which is the one
that matters for long-term pinning. But it's admittedly not perfect.
Linus
Powered by blists - more mailing lists