linux-kernel - Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush


Message-ID: <X/d2DyLfXZmBIreY@redhat.com>
Date:   Thu, 7 Jan 2021 15:58:55 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>
Subject: Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce
 wrprotect_tlb_flush_pending

Hi Linus,

On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@...hat.com> wrote:
> >
> > However there are two cases that could be wrprotecting exclusive anon
> > pages with only the mmap_read_lock:
> 
> I still think the real fix is "Don't do that then", and just take the
> write lock.
> 
> The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
> normal operation. Same goes for softdirty.
> 
> Why have those become _so_ magical that they can break the VM for
> everybody else?

I see what you mean above and I agree. As said below:

==
In simple terms: the page_count check in do_wp_page makes it
impossible to wrprotect memory, if such memory is under a !FOLL_WRITE
GUP pin.
==
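
To make the above concrete, here is a toy userspace model of that
check. It is only a sketch: struct model_page, gup_pin() and
wp_fault_reuses_page() are made-up names for illustration, not kernel
code, and the real do_wp_page has more conditions than the single
refcount comparison modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct model_page {
    int refcount;   /* models page_count(): mapping refs + GUP pins */
    int mapcount;   /* models page_mapcount() */
};

/* Models a read-mode (!FOLL_WRITE) GUP pin: it only adds a reference. */
static void gup_pin(struct model_page *page)
{
    page->refcount++;
}

/*
 * Models the write-fault decision on a wrprotected exclusive anon page.
 * Returns true if the page is reused in place, false if it is copied.
 * Assumption of this sketch: reuse happens only when refcount == 1.
 */
static bool wp_fault_reuses_page(const struct model_page *page)
{
    assert(page->refcount >= 1 && page->mapcount >= 1);
    return page->refcount == 1;
}
```

With refcount == 1 the fault reuses the page. After gup_pin() the same
fault copies instead, so the pin keeps pointing at the old page while
the process writes to the new one.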

So to simplify let's ignore UFFDIO_WRITEPROTECT here, which is new and
adds no dependency on top of clear_refs in this respect.

So yes, if we drop any code that has to wrprotect memory in place in
the kernel (since all userland memory can be under GUP pin in read
mode) and we make such an operation illegal, then it's fine, but that
means clear_refs has to go or it has to fail if there's a GUP pin
during the wrprotection.

The problem is it's not even possible to detect reliably if there's
really a long term GUP pin because of speculative pagecache lookups.

We would need to declare that any secondary MMU which is supposed to
be VM-neutral through mmu notifiers, like a GPU or an RDMA device,
cannot be used in combination with clear_refs, and that would need to
be enforced fully in userland. Most mmu notifier users drop the GUP
pin during the invalidate for extra safety in case an invalidate goes
missing: they would all need to drop FOLL_GET to be compliant and stop
causing memory corruption if clear_refs shall still be allowed to
happen on mmu notifier capable secondary MMUs. Even then, how does
userland know which devices attach to the memory with an mmu notifier
and never use FOLL_GET, and which don't? It doesn't sound reliable to
enforce this in userland. So I don't see how clear_refs can be saved.

Now let's make another example that still shows at least some
fundamental inefficiency that has nothing to do with clear_refs.

Let's suppose a GUP pin is taken on subpageA by an RDMA device in
process A (parent). Let's now assume subpageB is mapped in process B
(child of process A). Both subpageA and subpageB are exclusive
(mapcount == 1) and read-write, but they share the same page_count
atomic counter (only the page_mapcounts are subpage granular). To
still tame the zygote concern with the page_count check in do_wp_page,
process B, when it forks a child (process C), would forever have to do
an extra superfluous COW, even after process C exits. Is that what we
want on top of the fundamental unsafety added to clear_refs?
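
The same example as a toy userspace model. Again only a sketch under
stated assumptions: struct model_compound_page, NR_SUBPAGES and the
mapped_refs baseline are invented for illustration and do not
correspond to kernel fields. The only point is that a decision keyed on
the shared counter cannot tell a pin on subpageA from a sharer of
subpageB.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SUBPAGES 2   /* hypothetical; index 0 = subpageA, 1 = subpageB */

/* Hypothetical stand-in for a compound page; not a kernel structure. */
struct model_compound_page {
    int refcount;               /* one counter shared by all subpages */
    int mapcount[NR_SUBPAGES];  /* mapcounts are per subpage */
};

/* Exclusive in the mapcount sense: exactly one mapping of this subpage. */
static bool subpage_is_exclusive(const struct model_compound_page *page,
                                 int idx)
{
    assert(idx >= 0 && idx < NR_SUBPAGES);
    return page->mapcount[idx] == 1;
}

/*
 * Models a write fault that decides on the shared counter alone.
 * mapped_refs is the number of references held by the mappings
 * themselves (an assumption of this sketch); anything above that
 * baseline forces a copy, whichever subpage the extra reference is on.
 */
static bool wp_fault_copies(const struct model_compound_page *page,
                            int mapped_refs)
{
    return page->refcount > mapped_refs;
}
```

With both subpages mapped once and no pin, the fault on subpageB does
not copy. One extra reference taken for subpageA makes the fault on
subpageB copy, although subpageB is still exclusive by mapcount.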

Thanks,
Andrea