Message-Id: <B8095F3C-81E3-4AF9-A6A5-F597D51264BD@gmail.com>
Date: Mon, 21 Dec 2020 14:55:12 -0800
From: Nadav Amit <nadav.amit@...il.com>
To: Peter Xu <peterx@...hat.com>
Cc: Yu Zhao <yuzhao@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
linux-mm <linux-mm@...ck.org>,
lkml <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
stable <stable@...r.kernel.org>,
Minchan Kim <minchan@...nel.org>,
Andy Lutomirski <luto@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Dec 21, 2020, at 2:30 PM, Peter Xu <peterx@...hat.com> wrote:
>
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
>> BTW: In general, I think that you are right, and that changing of PTEs
>> should not require taking mmap_lock for write. However, I am not sure
>> whether cow_user_page() is the only one that poses a problem or whether a more
>> systematic solution is needed. If cow_user_page() is the only problem, do
>> you think it is possible to do the copying while holding the PTL? It works
>> for normal-pages, but I am not sure whether special-pages pose special
>> problems.
>>
>> Anyhow, this is an enhancement that we can try later.
>
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock. NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

You are correct about NUMA balancing in general. Yet in practice it is not
an issue in our “use-case” since NUMA balancing preserves the write-bit.

> At the pte level, it seems always to be the pgtable lock that serializes things.
>
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
>
> If there's a driver that manipulated the ptes, changed the content of the page,
> restored the ptes to their original values, and all of this happened right after
> wp_page_copy() unlocked the pgtable lock but before wp_page_copy() retook the
> same lock, we may face the same issue: the copied page ends up containing
> corrupted data. Still, I don't know what to blame the driver for, because it
> seems to be following the rules exactly.

The driver would have to do so without flushing the TLB. Having said that,
the driver could have used inc_tlb_flush_pending() and batch flushes.

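
To make the batching idea concrete, here is a rough userspace sketch, not
kernel code. Everything prefixed with fake_ is a hypothetical stand-in that I
made up for illustration: struct fake_mm models only a pending-flush counter,
the "PTEs" are plain unsigned longs, and the "flush" is just a counter
increment. It only shows the shape of the pattern (mark the flush as pending,
change a batch of PTEs, flush once, clear the mark).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the relevant part of struct mm_struct. */
struct fake_mm {
	atomic_int tlb_flush_pending;	/* >0 while a batched flush is outstanding */
	int flushes;			/* simulated TLB flushes issued so far */
};

void fake_inc_tlb_flush_pending(struct fake_mm *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

void fake_dec_tlb_flush_pending(struct fake_mm *mm)
{
	int prev = atomic_fetch_sub(&mm->tlb_flush_pending, 1);

	assert(prev > 0);	/* an unbalanced dec would be a bug */
	(void)prev;
}

bool fake_mm_tlb_flush_pending(struct fake_mm *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 0;
}

/*
 * Clear clear_mask in n fake PTEs and issue a single simulated flush at the
 * end. The pending counter advertises the window in which stale TLB entries
 * may still exist.
 */
void fake_change_range(struct fake_mm *mm, unsigned long *ptes, size_t n,
		       unsigned long clear_mask)
{
	size_t i;

	fake_inc_tlb_flush_pending(mm);
	for (i = 0; i < n; i++)
		ptes[i] &= ~clear_mask;
	mm->flushes++;			/* one flush covers the whole batch */
	fake_dec_tlb_flush_pending(mm);
}
```

In this model, a concurrent fault handler that sees
fake_mm_tlb_flush_pending() return true would know that stale entries may
still be around and could flush on its own before trusting the PTE.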
>
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..

It might be too big of a hammer. But the question that comes to my mind is:
if it is OK to change the PTEs without mmap_lock held for write, why
wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so that
mprotect's PTE changes would not be done under the write lock)? If you did
so, you would have the same problem as the one we encountered (a concurrent
protect-unprotect would allow a concurrent COW #PF to copy the wrong data).

So as an alternative solution, I can do the copying under the PTL after
flushing, which seems to solve the problem. Copying first (without a lock)
and then comparing seems suboptimal to me: I do not think the benefit
(if there is one) of shortening the time during which the lock is held is
worth the additional compare (and the complexity with special pages).

There are 2 problems in doing so:

1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be
   called while holding the PTL, but I am not sure.

2. For special pages we would need 2 TLB flushes: one to ensure the
   write-bit is cleared, and a second one after we clear the PTE. We
   can limit ourselves to soft-dirty/UFFD VMAs, but if we have your
   hypothetical driver, this would not be good enough.
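
For reference, the two alternatives discussed above (copy under the lock
versus copy first and compare later) look roughly like this in a userspace
analogy. This is not kernel code: a pthread mutex stands in for the PTL,
struct fake_page and both function names are hypothetical, and I am assuming
that "compare" means a byte-wise comparison of the copy against the source.
The sketch also assumes that every writer takes the mutex, which is only a
crude model of "no stale writable TLB entries remain after the flush".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define FAKE_PAGE_SIZE 64

struct fake_page {
	pthread_mutex_t ptl;		/* stands in for the PTL */
	char data[FAKE_PAGE_SIZE];	/* stands in for the source page */
};

/* Alternative 1: copy while holding the lock. */
void copy_under_lock(struct fake_page *src, char *dst)
{
	assert(src && dst);
	pthread_mutex_lock(&src->ptl);
	memcpy(dst, src->data, FAKE_PAGE_SIZE);
	pthread_mutex_unlock(&src->ptl);
}

/*
 * Alternative 2: copy without the lock, then take the lock and compare.
 * Returns false if the source changed in between, in which case the caller
 * has to retry. In a real concurrent program the unlocked memcpy() would be
 * a data race; it is kept only to show the shape of the approach.
 */
bool copy_then_compare(struct fake_page *src, char *dst)
{
	bool same;

	assert(src && dst);
	memcpy(dst, src->data, FAKE_PAGE_SIZE);	/* unlocked copy */
	pthread_mutex_lock(&src->ptl);
	same = memcmp(dst, src->data, FAKE_PAGE_SIZE) == 0;
	pthread_mutex_unlock(&src->ptl);
	return same;
}
```

The second variant holds the lock only for the compare, but it reads the
source twice and needs a retry path, which is the extra cost I was referring
to.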