linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Message-ID: <0201238b-e716-2a3c-e9ea-d5294ff77525@linux.vnet.ibm.com>
Date:   Tue, 12 Jan 2021 16:47:17 +0100
From:   Laurent Dufour <ldufour@...ux.vnet.ibm.com>
To:     Vinayak Menon <vinmenon@...eaurora.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andy Lutomirski <luto@...nel.org>, Peter Xu <peterx@...hat.com>,
        Nadav Amit <nadav.amit@...il.com>, Yu Zhao <yuzhao@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>, surenb@...gle.com
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On 12/01/2021 at 12:43, Vinayak Menon wrote:
> On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
>> On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
>>
>>> So I think the basic rule is that "if you hold mmap_sem for writing,
>>> you're always safe". And that really should be considered the
>>> "default" locking.
>>>
>>> ANY time you make a modification to the VM layer, you should basically
>>> always treat it as a write operation, and get the mmap_sem for
>>> writing.
>>>
>>> Yeah, yeah, that's a bit simplified, and it ignores various special
>>> cases (and the hardware page table walkers that obviously take no
>>> locks at all), but if you hold the mmap_sem for writing you won't
>>> really race with anything else - not page faults, and not other
>>> "modify this VM".
>>> To a first approximation, everybody that changes the VM should take
>>> the mmap_sem for writing, and the readers should be just about
>>> page fault handling (and I count GUP as "page fault handling" too -
>>> it's kind of the same "look up page" rather than "modify vm" kind of
>>> operation).
>>>
>>> And there are just a _lot_ more page faults than there are things that
>>> modify the page tables and the vma's.
>>>
>>> So having that mental model of "lookup of pages in a VM takes mmap_sem
>>> for reading, any modification of the VM uses it for writing" makes
>>> sense both from a performance angle and a logical standpoint. It's the
>>> correct model.
>>> And it's worth noting that COW is still "lookup of pages", even though
>>> it might modify the page tables in the process. The same way lookup
>>> can modify the page tables to mark things accessed or dirty.
>>>
>>> So COW is still a lookup operation, in ways that "change the
>>> writability of this range" very much is not. COW is "lookup for
>>> write", and the magic we do to copy to make that write valid is still
>>> all about the lookup of the page.
>> (your other email clarified this point; the COW needs to copy while
>> holding the PTL and we need TLBI under PTL if we're to change this)
>>
>>> Which brings up another mental mistake I saw earlier in this thread:
>>> you should not think "mmap_sem is for vma, and the page table lock is
>>> for the page table changes".
>>>
>>> mmap_sem is the primary lock for any modifications to the VM layout,
>>> whether it be in the vma's or in the page tables.
>>>
>>> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
>>> it is partly because
>>>
>>>   (a) we have things that historically walked the page tables _without_
>>> walking the vma's (notably the virtual memory scanning)
>>>
>>>   (b) we do allow concurrent page faults, so we then need a lower-level
>>> lock to serialize the parallelism we _do_ have.
>> And I'm thinking the speculative page fault series steps right into all
>> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>>
>> Which opens it up to exactly these races explored here.
>>
>> The range lock approach does not suffer this, but I'm still worried
>> about the actual performance of that thing.
> 
> 
> Some thoughts on why there may not be an issue with speculative page fault.
> Adding Laurent as well.
> 
> Possibility of race against other PTE modifiers
> 
> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> 2) mprotect - change_protection in mprotect which does the deferred flush is
> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults on those 
> VMAs.
> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> But SPF does not take UFFD faults.
> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> (2) above.
> 5) Concurrent faults - SPF does not handle all faults, only anon page faults.
> Of those, do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> transitions without TLB flush. And I hope do_wp_page with RO->RW is fine as well.
> I could not see a case where speculative path cannot see a PTE update done via
> a fault on another CPU.
> 

Thanks Vinayak,

You explained it fine. Indeed SPF is handling deferred TLB invalidation by 
marking the VMA through vm_write_begin/end(), as for the fork case you 
mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE 
values read are valid.

Cheers,
Laurent.
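
For readers following the thread, the "take the PTL, check the VMA's
seqcount, then trust the PTE" ordering described above can be sketched in
userspace C. This is only a toy model under stated assumptions: toy_vma and
every toy_* function are hypothetical names invented for illustration, the
spinlock merely stands in for the PTL, and none of this is the code from the
SPF series or any kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct toy_vma {
    atomic_uint seq;      /* odd while a writer is changing the range */
    atomic_flag ptl;      /* toy spinlock playing the role of the PTL */
    unsigned long pte;    /* the value the fault path wants to read */
};

static void toy_vma_init(struct toy_vma *vma, unsigned long pte)
{
    atomic_init(&vma->seq, 0);
    atomic_flag_clear(&vma->ptl);
    vma->pte = pte;
}

/* Writer side: bracket a modification, loosely like vm_write_begin/end. */
static void toy_write_begin(struct toy_vma *vma)
{
    unsigned int old = atomic_fetch_add(&vma->seq, 1);
    assert((old & 1) == 0);   /* no nested writers in this toy model */
}

static void toy_write_end(struct toy_vma *vma)
{
    unsigned int old = atomic_fetch_add(&vma->seq, 1);
    assert((old & 1) == 1);   /* must pair with toy_write_begin() */
}

/* Reader side: snapshot the sequence; bail out if a writer is active. */
static bool toy_spf_begin(struct toy_vma *vma, unsigned int *snap)
{
    *snap = atomic_load(&vma->seq);
    return (*snap & 1) == 0;
}

/*
 * Take the lock, re-check the sequence, and only then read the PTE.
 * Returns false (caller should fall back to the slow path) if the
 * sequence moved since toy_spf_begin(); *out is left untouched then.
 */
static bool toy_spf_read_pte(struct toy_vma *vma, unsigned int snap,
                             unsigned long *out)
{
    bool ok;

    while (atomic_flag_test_and_set(&vma->ptl))
        ;                     /* spin until the toy PTL is free */
    ok = atomic_load(&vma->seq) == snap;
    if (ok)
        *out = vma->pte;
    atomic_flag_clear(&vma->ptl);
    return ok;
}
```

A caller would run toy_spf_begin(), do its speculative work, and then call
toy_spf_read_pte(); a false return from either means "retry under the
regular lock". The sketch only shows the ordering of the checks, not the
memory-barrier details a real seqcount needs.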