linux-kernel - Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG


Message-ID: <CAHk-=wh-ETqwd6EC2PR6JJzCFHVxJgdbUcMpW5MS7gCa76EDsQ@mail.gmail.com>
Date:   Sat, 18 Dec 2021 14:53:38 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Nadav Amit <namit@...are.com>
Cc:     Jason Gunthorpe <jgg@...dia.com>,
        David Hildenbrand <david@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Linux-MM <linux-mm@...ck.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
 FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@...are.com> wrote:
>
> Yes, I guess that you pin the pages early for RDMA registration, which
> is also something you may do for IO-uring buffers. This would render
> userfaultfd unusable.

I think this is all on userfaultfd.

That code literally stole two of the bits from the page table layout -
bits that we could have used for better things.

And guess what? Because it required those two bits in the page tables,
and because that's non-portable, it turns out that UFFD_WP can only be
enabled and only works on x86-64 in the first place.

So UFFD_WP is fundamentally non-portable. Don't use it.

Anyway, the good news is that I think that exactly because uffd_wp
stole two bits from the page table layout, it already has all the
knowledge it needs to handle this entirely on its own. It's just too
lazy to do so now.

In particular, it has that special UFFD_WP bit that basically says
"this page is actually writable, but I've made it read-only just to
get the fault for soft-dirty".

And the hint here is that if the page truly *was* writable, then COW
just shouldn't happen, and all that the page fault code should do is
set soft-dirty and return with the page set writable.

And if the page was *not* writable, then UFFD_WP wasn't actually
needed in the first place, but the uffd code just sets it blindly.

Notice? It _should_ be just an operation based purely on the page
table contents, never even looking at the page AT ALL. Not even the
page count, much less some mapcount thing.
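
As an illustration of the argument above, here is a minimal sketch of a
write-fault decision made purely from page table bits. It is a toy
model, not kernel code: the type toy_pte_t, the TOY_PTE_* bit positions
and the function toy_wp_fault() are all hypothetical names invented for
this example, and the bit layout does not match any real architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE representation; NOT the real x86-64 layout. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_PRESENT    (1ull << 0)  /* entry maps something */
#define TOY_PTE_WRITE      (1ull << 1)  /* hardware-writable */
#define TOY_PTE_SOFT_DIRTY (1ull << 2)  /* written since last clear */
#define TOY_PTE_UFFD_WP    (1ull << 3)  /* "really writable, made
                                           read-only to get a fault" */

/*
 * Toy write-fault handler that inspects only the PTE value.
 *
 * Returns true if the fault was resolved from the page table bits
 * alone (the entry is made writable and marked soft-dirty).
 * Returns false if the entry is not present or was not marked with
 * the toy UFFD_WP bit, meaning the caller would have to fall back
 * to the ordinary fault path, which may COW.
 *
 * No page structure, page count or mapcount is consulted here.
 */
static bool toy_wp_fault(toy_pte_t *pte)
{
	if (!(*pte & TOY_PTE_PRESENT))
		return false;

	if (!(*pte & TOY_PTE_UFFD_WP))
		return false;

	/* The page "truly was writable": restore write access. */
	*pte &= ~TOY_PTE_UFFD_WP;
	*pte |= TOY_PTE_WRITE | TOY_PTE_SOFT_DIRTY;
	return true;
}
```

The sketch deliberately leaves out everything a real implementation
would need (locking, TLB flushing, notifying the userfaultfd monitor);
it only shows that the decision itself needs nothing but the entry.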

Done right, that soft-dirty thing could work even with no page backing
at all, I think.

But as far as I know, we've actually never seen a workload that does
all this, so... does anybody even have a test-case?

Because I do think that UFFD_WP should never really look at the
page, and this issue is actually independent of the "page_count() vs
page_mapcount()" discussion.

(Somewhat related aside: Looking up the page is actually one of the
more expensive operations of a page fault and a lot of other page
table manipulation functions - it's where most of the cache misses
happen. That's true on the page fault side, but it's also true for
things like copy_page_range() etc)

                 Linus