linux-kernel - Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <X/iEyk0ijxhSvs9T@redhat.com>
Date:   Fri, 8 Jan 2021 11:14:02 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Will Deacon <will@...nel.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce
 wrprotect_tlb_flush_pending

Hello everyone,

On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote:
> On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> > Please. Why is the correct patch not the attached one (apart from the
> > obvious fact that I haven't tested it and maybe just screwed up
> > completely - but you get the idea)?
> 
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

I'm curious, would you also suggest that fixing just the TLB flushing
symptom is enough and we can forget about the ABI break coming from
page_count used in do_wp_page?

One random example: clear_refs will still break all long term GUP
pins, are you ok with that too?

page_count in do_wp_page is a fix for the original security issue from
vmsplice (where the child is fooling the parent in taking the
exclusive page in do_wp_page), that appears worse than the bug itself.

page_count in do_wp_page, instead of isolating as malicious when the
parent is reusing the page queued in the vmsplice pipe, is treating as
malicious also all legit cases that had to reliably reuse the page to
avoid the secondary MMUs to go out of sync.

page_count in do_wp_page is like a credit card provider blocking all
credit cards of all customers, because one credit card may have been
cloned (by vmsplice), but nobody can know which one was it. Of course
this technique will work perfectly as security fix because it will
treat all credit card users as malicious and it'll block them all
("block as in preventing re-use of the anon page").

The problem are those other credit card users that weren't malicious
that get their COW broken too. Those are the very long term GUP pins
if any anon page can be still wrprotected anywhere in the VM.

At the same time the real hanging fruit (vmsplice) that, if taken care
of, would have rendered the bug purely theoretical in security terms
hasn't been fixed yet, despite those unprivileged long term GUP pins
causes more reproducible security issues than just the COW, since they
can still DoS the OOM killer and they bypass at least the mlock
enforcement, even for non compound pages.

Of course just fixing vmsplice to require some privilege won't fix the
bug in full, so it's not suitable long term solution, but it has to
happen orthogonality for other reason, and it'd at least remove the
short term security concern.

In addition you're not experiencing the full fallout of the side
effects of page_count used to decide if to re-use all anon COW pages
because the bug is still there (with enterprise default config options
at least). Not all credit cards are blocked yet with only
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you will
block them all, you will experience all the side effects of replacing
the per-subpage finegrined mapcount with the compound-wide page count.

The two statements above combined, result in my recommendation at this
point to resolve this in userland by rendering the security issue
theoretical by removing vmsplice from the OCI schema allowlist or by
enforcing it fixing in userland by always using execve after drop
privs (as crun always does when it starts the container of course).

For the long term, I can't see how using page_count in do_wp_page is a
tenable proposition, unless we either drop all secondary MMUs from the
kernel or VM features like clear_refs are dropped or unless the
page_count is magically stabilized and the speculative pagecache
lookups are also dropped.

If trying to manage the fallout by enforcing no anon page can ever be
wrprotected in place (i.e. dropping clear_refs feature or rendering it
unreliable by skipping elevated counts caused by spurious pagecache
lookups), it'd still sounds a too fragile design and too prone to
break to rely on that. There's random arch stuff even wrprotecting
memory, even very vm86 does it under the hood (vm86 is unlikely it has
a long term GUP pin on it of course, but still who knows?). I mean the
VM core cannot make assumptions like: "this vm86 case can still
wrprotect without worry because probably vm86 isn't used anymore with
any advanced secondary MMU, so if there's a GUP pin it probably is a
malicious vmsplice and not a RDMA or GPU or Virtualization secondary
MMU".

Then there's the secondary concern of the inefficiency it introduces
with extra unnecessary copies when a single GUP pin will prevent reuse
of 512 or 262144 subpages, in the 512 case also potentially mapped in
different processes. The TLB flushing discussions registers as the
last concern in my view.

Thanks,
Andrea