linux-kernel - Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

Message-ID: <73EE9007-65AF-4416-9930-D992C74447A9@vmware.com>
Date:   Mon, 4 Jan 2021 20:39:37 +0000
From:   Nadav Amit <namit@...are.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>, Yu Zhao <yuzhao@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>, Mel Gorman <mgorman@...e.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to
 writeprotect

> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@...hat.com> wrote:
> 
> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@...hat.com> wrote:
>>> 
>>> Hello,
>>> 
>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>> 
>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>> 
>>>>> cpu0				cpu1			cpu2
>>>>> ----				----			----
>>>>> 							[ Writable PTE
>>>>> 							  cached in TLB ]
>>>>> userfaultfd_writeprotect()
>>>>> [ write-*unprotect* ]
>>>>> mwriteprotect_range()
>>>>> mmap_read_lock()
>>>>> change_protection()
>>>>> 
>>>>> change_protection_range()
>>>>> ...
>>>>> change_pte_range()
>>>>> [ *clear* “write”-bit ]
>>>>> [ defer TLB flushes ]
>>>>> 				[ page-fault ]
>>>>> 				...
>>>>> 				wp_page_copy()
>>>>> 				 cow_user_page()
>>>>> 				  [ copy page ]
>>>>> 							[ write to old
>>>>> 							  page ]
>>>>> 				...
>>>>> 				 set_pte_at_notify()
>>>> 
>>>> Yuck!
>>> 
>>> Note, the above was posted before we figured out the details so it
>>> wasn't showing the real deferred tlb flush that caused problems (the
>>> one shown on the left causes zero issues).
>> 
>> Actually it was posted after (note that this is v2). The aforementioned
>> scenario that Peter refers to is the one that I actually encountered (not
>> the second scenario that is “theoretical”). This scenario that Peter refers
>> to is indeed more “stupid” in the sense that we should just not
>> write-protect the PTE on userfaultfd write-unprotect.
>> 
>> Let me know if I made any mistake in the description.
> 
> I didn't say there is a mistake. I said it is not showing the real
> deferred tlb flush that causes problems.
> 
> The issue here is that we have a "defer tlb flush" that runs after
> "write to old page".
> 
> If you look at the above, you're led to think the "defer tlb
> flush" that causes issues is the one in cpu0. It's not. That is
> totally harmless.

I do not understand what you are saying. The deferred TLB flush on cpu0 *is*
the one that causes the problem. The PTE is write-protected (although it is
a userfaultfd unprotect operation), causing cpu1 to encounter a #PF and handle
the page-fault (and copy), while cpu2 keeps writing to the source page. If
cpu0 did not defer the TLB flush, this problem would not happen.
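
To make the interleaving concrete, here is a toy userspace model of it. This
is a sketch and not kernel code: every name in it (toy_mm, toy_wrprotect(),
toy_write_is_lost(), and so on) is made up for illustration, the three CPUs
are replayed sequentially in the order of the diagram, and the "TLB" is just
a cached copy of one PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: one PTE, two pages, and cpu2's cached copy of the PTE. */
struct toy_pte {
	int page;       /* index of the page the entry maps */
	bool writable;  /* the "write" bit */
};

struct toy_mm {
	struct toy_pte pte;       /* the entry in the page table */
	struct toy_pte tlb_cpu2;  /* what cpu2 has cached in its TLB */
	char pages[2][8];         /* page 0: original, page 1: COW copy */
};

/* cpu0: clear the write bit; flush cpu2's TLB now, or defer the flush. */
static void toy_wrprotect(struct toy_mm *mm, bool flush_now)
{
	mm->pte.writable = false;
	if (flush_now)
		mm->tlb_cpu2 = mm->pte;
}

/* cpu1, first half of the fault: copy the old page to the new one. */
static void toy_cow_copy(struct toy_mm *mm)
{
	memcpy(mm->pages[1], mm->pages[0], sizeof(mm->pages[0]));
}

/* cpu1, second half: install the new PTE (modelled as flushing cpu2 too). */
static void toy_cow_install(struct toy_mm *mm)
{
	mm->pte.page = 1;
	mm->pte.writable = true;
	mm->tlb_cpu2 = mm->pte;
}

/* cpu2: write through the cached translation; false means it would fault. */
static bool toy_write(struct toy_mm *mm, char c)
{
	assert(mm->tlb_cpu2.page == 0 || mm->tlb_cpu2.page == 1);
	if (!mm->tlb_cpu2.writable)
		return false;
	mm->pages[mm->tlb_cpu2.page][0] = c;
	return true;
}

/* Replay the diagram; returns true if cpu2's write ends up lost. */
static bool toy_write_is_lost(bool flush_now)
{
	struct toy_mm mm = { .pte = { 0, true }, .tlb_cpu2 = { 0, true } };
	bool wrote;

	mm.pages[0][0] = 'A';
	toy_wrprotect(&mm, flush_now);  /* cpu0: change_pte_range() */
	toy_cow_copy(&mm);              /* cpu1: cow_user_page() */
	wrote = toy_write(&mm, 'B');    /* cpu2: write to old page */
	toy_cow_install(&mm);           /* cpu1: set_pte_at_notify() */

	/* Lost: cpu2's write succeeded but is not in the page now mapped. */
	return wrote && mm.pages[mm.pte.page][0] != 'B';
}
```

In this model toy_write_is_lost(false), the deferred-flush case, reports a
lost write, while toy_write_is_lost(true) does not, because cpu2 would fault
instead of writing through the stale entry.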

>>> The problematic one not pictured is the one of the wrprotect that has
>>> to be running in another CPU which also isn't pictured above. More
>>> accurate traces are posted later in the thread.
>> 
>> I think I included this scenario as well in the commit log (of v2). Let me
>> know if I screwed up and the description is not clear.
> 
> Instead of not showing the real "defer tlb flush" in the trace and
> then fixing it up in the comment, why don't you take the trace showing
> the real problematic "defer tlb flush"? No need to reinvent it.

The scenario you mention is indeed identical to the second scenario I
mention in the commit log. I think the version I included is clearer since
it shows the write that triggers the corruption instead of discussing
“windows”, which might be less clear. Running copy_user_page() with a stale
TLB is by itself not a bug if you detect it later (e.g., using pte_same()).
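
As a rough illustration of the "detect it later" pattern, here is a minimal
userspace sketch. It is not the kernel's implementation: toy_pte_t,
toy_pte_same() and toy_commit_if_unchanged() are hypothetical names, and the
page-table lock under which the kernel does this recheck is only mentioned in
a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a page-table entry value. */
typedef struct {
	uint64_t val;
} toy_pte_t;

/* Same idea as pte_same(): compare two entry values. */
static bool toy_pte_same(toy_pte_t a, toy_pte_t b)
{
	return a.val == b.val;
}

/*
 * The caller took 'snapshot' of *ptep, then did speculative work (for
 * example, copying the page). Commit the new entry only if the entry is
 * still what the caller saw. In the kernel this recheck would run with
 * the page-table lock held; this sketch has no locking at all.
 */
static bool toy_commit_if_unchanged(toy_pte_t *ptep, toy_pte_t snapshot,
				    toy_pte_t new_pte)
{
	if (!toy_pte_same(*ptep, snapshot))
		return false;  /* raced: drop the copy, retry the fault */
	*ptep = new_pte;
	return true;
}
```

Note that a check like this only catches a change to the entry itself. It
does not catch a write that another CPU makes to the page contents through a
stale TLB entry, which is why the scenario shows the write explicitly.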

Note that my second scenario is also consistent in style with the first
scenario.

I am not married to my description and if you (and others) insist I would
copy-paste your version.

> This shows the real deferred tlb flush; your v2 does not include
> it.

Are you talking about the first scenario (write-unprotect), the second one
(write-protect followed by write-unprotect), or both? It seems to me that all
the deferred TLB flushes are mentioned at the point they are deferred. I can
add the point at which they are performed.