linux-kernel - Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <X/NrdnoDHgFd0Ku1@redhat.com>
Date:   Mon, 4 Jan 2021 14:24:38 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Nadav Amit <nadav.amit@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Nadav Amit <namit@...are.com>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>, Mel Gorman <mgorman@...e.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to
 writeprotect

Hello,

On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> 
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > 
> > cpu0				cpu1			cpu2
> > ----				----			----
> > 							[ Writable PTE
> > 							  cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> > 
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> > 				[ page-fault ]
> > 				...
> > 				wp_page_copy()
> > 				 cow_user_page()
> > 				  [ copy page ]
> > 							[ write to old
> > 							  page ]
> > 				...
> > 				 set_pte_at_notify()
> 
> Yuck!
> 

Note, the above was posted before we figured out the details so it
wasn't showing the real deferred tlb flush that caused problems (the
one showed on the left causes zero issues).

The problematic one not pictured is the one of the wrprotect that has
to be running in another CPU which is also isn't picture above. More
accurate traces are posted later in the thread.

> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
> 
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead

How do intend you wait in PT lock while the writer also has to take PT
lock repeatedly before it can do wake_up_var?

If you release the PT lock before calling wait_tlb_flush_pending it
all falls apart again.

This I guess explains why a local pte/hugepmd smp local invlpg is the
only working solution for this issue, similarly to how it's done in rmap.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>  	 *
>  	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>  	 */
> -	atomic_dec(&mm->tlb_flush_pending);
> +	if (atomic_dec_and_test(&mm->tlb_flush_pending))
> +		wake_up_var(&mm->tlb_flush_pending);
>  }
>  
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>  
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	wait_var_event(&mm->tlb_flush_pending,
> +		       atomic_read(&mm->tlb_flush_pending) == 0);
> +}

I appreciate the effort in not regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for
writing.

At the same time what was posted so far wasn't clean enough but it
wasn't even tested... if we abstract it in some clean way and we mark
all connected points (soft dirty, uffd-wp, the wrprotect page fault),
then I can be optimistic it will remain understandable when we look at
it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.

flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out. But I'd rather keep that review in a
separate thread since any bug in that code has zero connection to this
issue. The basic idea is similar but the methods and logic are
different and our flush here will be granular and it's going to be
only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or
if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide,
unconditional etc.. Pretty much all different.

Thanks,
Andrea