Message-Id: <E9A1084A-30B3-4328-8B0D-31AD978375AD@gmail.com>
Date: Tue, 22 Dec 2020 12:19:49 -0800
From: Nadav Amit <nadav.amit@...il.com>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: Peter Xu <peterx@...hat.com>, Yu Zhao <yuzhao@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm <linux-mm@...ck.org>,
lkml <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
stable <stable@...r.kernel.org>,
Minchan Kim <minchan@...nel.org>,
Andy Lutomirski <luto@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
> On Dec 22, 2020, at 11:44 AM, Andrea Arcangeli <aarcange@...hat.com> wrote:
>
> On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
>> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
>
> I assume you mean "in" mprotect_fixup, after change_protection.
>
> If you would downgrade the mmap_lock to read there, then it'd severely
> slow down the non-contention case if there's more than one vma that needs
> change_protection.
>
> You'd need to throw away the prev->vm_next info and you'd need to do a
> new find_vma after dropping the mmap_lock for reading and re-taking the
> mmap_lock for writing at every iteration of the loop.
>
> To do less harm to the non-contention case you could perhaps walk
> vma->vm_next and check if it's outside the mprotect range and only
> downgrade in such case. So let's assume we intend to optimize with
> mmap_write_downgrade only the last vma.
…
I read your response in detail and you make great points. To be fair, I did
not think it through and was just trying to make the point that not taking
mmap_lock for write is an unrelated optimization.
You make a great point that it is actually related, and that it can only(?)
benefit uffd and arguably soft-dirty, to which I am going to add
mmap_write_lock().
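
Just to make sure I follow the loop restructuring you describe, here is a
rough, self-contained userspace model (not the real mm/mprotect.c code;
struct vma and the helper below are made-up stand-ins). The point it tries
to capture is that the downgrade is only safe once the vma being changed is
known to be the last one inside the mprotect range, since otherwise the
next iteration would have to re-take the mmap_lock for writing and redo
find_vma.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the kernel's vma: just enough to walk vm_next. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *vm_next;
};

/* Only the last vma inside the mprotect range could downgrade the lock. */
static bool last_vma_in_range(const struct vma *vma, unsigned long end)
{
	return vma->vm_end >= end ||
	       !vma->vm_next || vma->vm_next->vm_start >= end;
}

int main(void)
{
	struct vma c = { 0x3000, 0x4000, NULL };
	struct vma b = { 0x2000, 0x3000, &c };
	struct vma a = { 0x1000, 0x2000, &b };
	const unsigned long end = 0x3000;	/* mprotect range: [0x1000, 0x3000) */

	for (const struct vma *vma = &a; vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		/* change_protection() would run here for this vma. */
		printf("vma [%#lx-%#lx): may downgrade to read: %d\n",
		       vma->vm_start, vma->vm_end,
		       last_vma_in_range(vma, end));
	}
	return 0;
}

Only the second vma in this toy walk would print 1, which matches your
"only the last vma" suggestion.
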
Yet, my confidence in doing the optimization that you suggested (keeping
mmap_read_lock()) as part of the bug fix is very low, and it only got lower
as we discussed it. I do want to eventually avoid the overheads that I am
introducing (sorry), but that requires a systematic solution and some thought.
Perhaps any change to a PTE in a page-table should increase a page-table
generation that we would save in the page-table's page-struct (next to the
PTL), and pte_same() would also compare the original and current "page-table
generation" when it checks whether a PTE changed. That way, a PTE that went
through a not-writable -> writable -> not-writable cycle without a TLB flush
could still be detected. Yet, there is still the question of how to detect
pending TLB flushes at a finer granularity, to avoid frequent unwarranted TLB
flushes while a flush is already pending.
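
To illustrate the idea, here is a self-contained userspace model (not kernel
code; the names and the single-PTE layout are made up): every PTE
modification bumps a generation counter that would live next to the PTL in
the page-table's page-struct, and the pte_same()-style check compares the
generation as well, so the not-writable -> writable -> not-writable cycle is
caught even though the PTE values compare equal.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the page-table page-struct: one PTE plus the generation
 * counter that would sit next to the PTL. */
struct pt_page {
	uint64_t pte;
	uint64_t gen;
};

struct pte_snapshot {
	uint64_t pte;
	uint64_t gen;
};

static struct pte_snapshot pte_read(const struct pt_page *pt)
{
	return (struct pte_snapshot){ .pte = pt->pte, .gen = pt->gen };
}

/* Any PTE change bumps the generation, invalidating older snapshots. */
static void pte_set(struct pt_page *pt, uint64_t newpte)
{
	pt->pte = newpte;
	pt->gen++;
}

/* pte_same() variant that also compares the page-table generation. */
static bool pte_same_gen(struct pte_snapshot old, const struct pt_page *pt)
{
	return old.pte == pt->pte && old.gen == pt->gen;
}

int main(void)
{
	struct pt_page pt = { .pte = 0x1000, .gen = 0 };	/* write-protected */
	struct pte_snapshot snap = pte_read(&pt);

	pte_set(&pt, 0x1002);	/* made writable (bit 1 stands in for the RW bit) */
	pte_set(&pt, 0x1000);	/* write-protected again, no TLB flush in between */

	printf("value-only pte_same():       %d\n", snap.pte == pt.pte);	/* 1: cycle missed */
	printf("generation-aware pte_same(): %d\n", pte_same_gen(snap, &pt));	/* 0: cycle caught */
	return 0;
}
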
It all requires some thought, and the fact that soft-dirty appears to be
broken too indicates that such bugs can go unnoticed for some time.
Sorry for being a chicken, but I would rather be safe than sorry.