linux-kernel - Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wg95CiyT45ZOxtnWQ7cdKmejXcOydEyJcTTNnp5-nd+xg@mail.gmail.com>
Date:   Sat, 18 Dec 2021 11:21:34 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Hildenbrand <david@...hat.com>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Linux-MM <linux-mm@...ck.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
 FAULT_FLAG_UNSHARE (!hugetlb)

[ Cutting down ruthlessly to the core of the issue ]

On Sat, Dec 18, 2021 at 1:58 AM David Hildenbrand <david@...hat.com> wrote:
>
> 1) Missed COW
>
> 2) Unnecessary COW
>
> 3) Wrong COW

> Does that make sense? If we agree on the above, then here is how the
> currently discussed approaches differ:
>
> page_count != 1:
> * 1) cannot happen
> * 2) can happen easily (speculative references due to pagecache,
>      migration, daemon, pagevec, ...)
> * 3) can happen in the current code

I claim that (1) "cannot happen" is a huge mother of a deal. It's
*LITERALLY* the bug you are chasing, and it's the security issue, so
on a bug scale, it's about the worst there is.

I further then claim that (2) "happen easily" is you just making
things up. Yes, it can happen. But no, it's not actually that common,
and since (2) is harmless from a correctness standpoint, it is purely
about performance.

And as mentioned, not using the mapcount actually makes *common*
operations much simpler and faster. You don't need the page lock to
serialize the mapcount.

So (2) is a performance argument, and you haven't actually shown it to
be a problem.

Which really only leaves (3). Which I've already explained what the
fix is: don't ever mark pages that shouldn't be COW'ed as being COW
pages.

(3) is really that simple, although it ended up depending on Jason and
John Hubbard and others doing that FOLL_PIN logic to distinguish "I
just want to see a random page, and I don't care about COW" from "I
want to get a page, and that page needs to be coherent with this VM
and not be COW'ed away"

So I'm not claiming (3) is "trivial", but at the same time it's
certainly not some fundamentally complicated thing, and it's easy to
explain what is going on.

> mapcount > 1:
> * 1) your concern is that this can happen due to concurrent swapin
> * 2) cannot happen.
> * 3) your concern is that this can happen due to concurrent swapin

No, my concern about (1) is that IT IS WRONG.

"mapcount" means nothing for COW. I even gave you an example of
exactly where it means nothing. It's crazy. It's illogical. And it's
complicated as hell.

The fact that only one user maps a page is simply not meaningful. That
page can have other users that you don't know anything about, and that
don't show up in the mapcount.

That page can be swapcached, in which case mapcount can change
radically in ways that you earlier indicated cannot happen. You were
wrong.

But even if you fix that - by taking the page lock in every single
place - there are still *other* users that for all you know may want
the old contents. You don't know.

The only thing that says "no other users" is the page count. Not the mapcount.

In other words, I claim that

 (a) mapcount is fundamentally the wrong thing to test. You can be the
only mapper, without being the "owner" of the page.

 (b) it's *LITERALLY* the direct and present source of that bug in the
testcase you added, where a page with a mapcount of 1 has other
concurrent users and needs to be COW'ed but isn't.

 (c) it's complicated and expensive to calculate (where one big part
of the expense is the page lock synchronization requirements, but
there are others)

And this all happens for that "case (1)", which is the worst adn
scariest of them all.

In contrast to that, your argument that "(2) cannot happen" is a total
non-argument. (2) isn't the problem.

And I claim that (3) can happen because you're testing the wrong
counter, so who knows if the COW is wrong or not?

> I am completely missing how 2) or 3) could *ever* be handled properly
> for page_count != 1. 3) is obviously more important and gives me nightmares.

Ok, so if I tell you how (2) and (3) are handled properly, you will
just admit you were wrong?

Here's how they are handled properly with page counts. I have told you
this before, but I'll summarize:

 (2) is handled semantically properly by definition - it may be
"unnecessary", but it has no semantic meaning

This is an IMPORTANT thing to realize. The fact is, (2) is not in the
same class as (1) or (3).

And honestly - we've been doing this for all the common cases already
since at least 5.9, and your performance argument simply has not
really reared its head.  Which makes the whole argument moot. I claim
that it simplifies lots of common operations and avoids having to
serialize on a lock that has been a real and major problem. You claim
it's extra overhead and can cause extra COW events. Neither of has any
numbers worth anything, but at least I can point to the fact that all
the *normal* VM paths have been doing the thing I advocate for many
releases now, and the sky most definitely is NOT falling.

So that only leaves (3).

Handling (3) really is so conceptually simple that I feel silly for
repeating it: if you don't want a COW to happen, then you mark the
page as being not-COW.

That sounds so simple as to be stupid. But it really is the solution.
It's what that pinning logic does, and keeps that "page may be pinned"
state around, and then operations like fork() that would otherwise
create a COW mapping of it will just not do it.

So that incredibly simple approach does require actual code: it
requires that explicit "fork() needs to copy instead of COW" code, it
requires that "if it's pinned, we don't make a new swapcache entry out
of it". So it's real code, and it's a real issue, but it's
conceptually absolutely trivial, and the code is usualyl really simple
to understand too.

So you have a *trivial* concept, and you have simple code that could
be described to a slightly developmentally challenged waterfowl.  If
you're one of the programmers doing the "explain your code to a rubber
ducky", you can look at code like this:

                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 * Lazyfree page could be freed directly
                 */
                if (PageAnon(page) && PageSwapBacked(page)) {
                        if (!PageSwapCache(page)) {
                                if (!(sc->gfp_mask & __GFP_IO))
                                        goto keep_locked;
                                if (page_maybe_dma_pinned(page))
                                        goto keep_locked;

and you can explain that page_maybe_dma_pinned() test to your rubber
ducky, and that rubber ducky will literally nod its head. It gets it.

To recap:
 (1) is important, and page_count() is the only thing that guarantees
"you get full access to a page only when it's *obviously* exclusively
yours".
 (2) is NOT important, but could be a performance issue, but we have
real data from the past year that it isn't.
 (3) is important, and has a really spectacularly simple conceptual
fix with quite simple code too.

In contrast, with the "mapcount" games you can't even explain why they
should work, and the patches I see are actively buggy because
everything is so subtle.

                  Linus