Message-Id: <E9A1084A-30B3-4328-8B0D-31AD978375AD@gmail.com>
Date: Tue, 22 Dec 2020 12:19:49 -0800
From: Nadav Amit <nadav.amit@...il.com>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: Peter Xu <peterx@...hat.com>, Yu Zhao <yuzhao@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm <linux-mm@...ck.org>,
lkml <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
stable <stable@...r.kernel.org>,
Minchan Kim <minchan@...nel.org>,
Andy Lutomirski <luto@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
> On Dec 22, 2020, at 11:44 AM, Andrea Arcangeli <aarcange@...hat.com> wrote:
>
> On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
>> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
>
> I assume you mean "in" mprotect_fixup, after change_protection.
>
> If you would downgrade the mmap_lock to read there, then it'd severely
> slow down the non-contention case if there's more than one vma that needs
> change_protection.
>
> You'd need to throw away the prev->vm_next info and you'd need to do a
> new find_vma after dropping the mmap_lock for reading and re-taking the
> mmap_lock for writing at every iteration of the loop.
>
> To do less harm to the non-contention case you could perhaps walk
> vma->vm_next and check if it's outside the mprotect range and only
> downgrade in such case. So let's assume we intend to optimize with
> mmap_write_downgrade only the last vma.
…
I read your response in detail and you make great points. To be fair, I did
not think it through and was just trying to make the point that not taking
mmap_lock for write is an unrelated optimization.
You make a great point that it is actually related, and that it can only(?)
benefit uffd and arguably soft-dirty, to which I am going to add
mmap_write_lock().
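
Just to make sure I follow the loop restructuring you describe, here is a
rough, self-contained userspace model (not the real mm/mprotect.c code;
struct vma and the helper below are made-up stand-ins). The point it tries
to capture is that the downgrade is only safe once the vma being changed is
known to be the last one inside the mprotect range, since otherwise the
next iteration would have to re-take the mmap_lock for writing and redo
find_vma.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the kernel's vma: just enough to walk vm_next. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *vm_next;
};

/* Only the last vma inside the mprotect range could downgrade the lock. */
static bool last_vma_in_range(const struct vma *vma, unsigned long end)
{
	return vma->vm_end >= end ||
	       !vma->vm_next || vma->vm_next->vm_start >= end;
}

int main(void)
{
	struct vma c = { 0x3000, 0x4000, NULL };
	struct vma b = { 0x2000, 0x3000, &c };
	struct vma a = { 0x1000, 0x2000, &b };
	const unsigned long end = 0x3000;	/* mprotect range: [0x1000, 0x3000) */

	for (const struct vma *vma = &a; vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		/* change_protection() would run here for this vma. */
		printf("vma [%#lx-%#lx): may downgrade to read: %d\n",
		       vma->vm_start, vma->vm_end,
		       last_vma_in_range(vma, end));
	}
	return 0;
}

Only the second vma in this toy walk would print 1, which matches your
"only the last vma" suggestion.
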
Yet, my confidence in doing the optimization that you suggested (keeping
mmap_read_lock()) as part of the bug fix is very low, and it only got lower
as we discussed it. I do want to eventually avoid the overheads that I am
introducing (sorry), but that requires a systematic solution and some thought.
Perhaps any change to a PTE in a page-table should increase a page-table
generation that we would save in the page-table's page-struct (next to the
PTL), and pte_same() would also compare the original and current "page-table
generation" when it checks whether a PTE changed. That way, a PTE that went
through a not-writable -> writable -> not-writable cycle without a TLB flush
could still be detected. Yet, there is still the question of how to detect
pending TLB flushes at a finer granularity, to avoid frequent unwarranted TLB
flushes while a flush is already pending.
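
To illustrate the idea, here is a self-contained userspace model (not kernel
code; the names and the single-PTE layout are made up): every PTE
modification bumps a generation counter that would live next to the PTL in
the page-table's page-struct, and the pte_same()-style check compares the
generation as well, so the not-writable -> writable -> not-writable cycle is
caught even though the PTE values compare equal.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the page-table page-struct: one PTE plus the generation
 * counter that would sit next to the PTL. */
struct pt_page {
	uint64_t pte;
	uint64_t gen;
};

struct pte_snapshot {
	uint64_t pte;
	uint64_t gen;
};

static struct pte_snapshot pte_read(const struct pt_page *pt)
{
	return (struct pte_snapshot){ .pte = pt->pte, .gen = pt->gen };
}

/* Any PTE change bumps the generation, invalidating older snapshots. */
static void pte_set(struct pt_page *pt, uint64_t newpte)
{
	pt->pte = newpte;
	pt->gen++;
}

/* pte_same() variant that also compares the page-table generation. */
static bool pte_same_gen(struct pte_snapshot old, const struct pt_page *pt)
{
	return old.pte == pt->pte && old.gen == pt->gen;
}

int main(void)
{
	struct pt_page pt = { .pte = 0x1000, .gen = 0 };	/* write-protected */
	struct pte_snapshot snap = pte_read(&pt);

	pte_set(&pt, 0x1002);	/* made writable (bit 1 stands in for the RW bit) */
	pte_set(&pt, 0x1000);	/* write-protected again, no TLB flush in between */

	printf("value-only pte_same():       %d\n", snap.pte == pt.pte);	/* 1: cycle missed */
	printf("generation-aware pte_same(): %d\n", pte_same_gen(snap, &pt));	/* 0: cycle caught */
	return 0;
}
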
It all requires some thought, and the fact that soft-dirty appears to be
broken too indicates that such bugs can go unnoticed for some time.
Sorry for being a chicken, but I would rather be safe than sorry.