linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <X+QLr1WmGXMs33Ld@redhat.com>
Date:   Wed, 23 Dec 2020 22:31:59 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Xu <peterx@...hat.com>,
        Nadav Amit <nadav.amit@...il.com>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is if UFFDIO_WRITEPROTECT is very large then there
would be a flood of wrprotect faults, and they'd end up all issuing a
tlb flush during the UFFDIO_WRITEPROTECT itself which again is a
performance concern for certain uffd-wp use cases.

Those use cases would be for example redis and postcopy live
snapshotting, to use it for async snapshots, unprivileged too in the
case of redis if it temporarily uses bounce buffers for the syscall
I/O for the duration of the snapshot. hypervisors tuned profiles need
to manually lift the unprivileged_userfaultfd to 1 unless their jailer
leaves one capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false)
because change_protection doesn't restore the pte_write:

			} else if (uffd_wp_resolve) {
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

When the snapshot is complete userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range which can be very big
again.

Orthogonally I think we should also look to restore the pte_write
above orthogonally in uffd-wp, so it'll get yet an extra boost if
compared to current redis snapshotting fork(), that cannot restore all
pte_write after the snapshot child quit and forces a flood of spurious
wrprotect faults (uffd-wp can solve that too).

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while softdirty is still running
on the mm, won't be caught in userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in. (for VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP can be
removed and it won't make any measurable difference even when
USERFAULTFD=n)

RFC untested below, it's supposed to fix the softdirty testcase too,
even without the incremental fix, since it already does tlb_gather_mmu
before walk_page_range and tlb_finish_mmu after it and that appears
enough to define the inc/dec_tlb_flush_pending.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 	} else {
+		bool in_uffd_wp, in_softdirty;
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
 			goto oom;

+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+		in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+		in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+		in_softdirty = false;
+#endif
+		if ((in_uffd_wp || in_softdirty) &&
+		    mm_tlb_flush_pending(mm))
+			flush_tlb_page(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,