linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <B8095F3C-81E3-4AF9-A6A5-F597D51264BD@gmail.com>
Date:   Mon, 21 Dec 2020 14:55:12 -0800
From:   Nadav Amit <nadav.amit@...il.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     Yu Zhao <yuzhao@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Dec 21, 2020, at 2:30 PM, Peter Xu <peterx@...hat.com> wrote:
> 
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
>> BTW: In general, I think that you are right, and that changing of PTEs
>> should not require taking mmap_lock for write. However, I am not sure
>> cow_user_page() is not the only one that poses a problem and whether a more
>> systematic solution is needed. If cow_user_pages() is the only problem, do
>> you think it is possible to do the copying while holding the PTL? It works
>> for normal-pages, but I am not sure whether special-pages pose special
>> problems.
>> 
>> Anyhow, this is an enhancement that we can try later.
> 
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock.  NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

You are correct about NUMA balancing in general. Yet in practice it is not
an issue in our “use-case” since NUMA balancing preserves the write-bit.

> At the pte level, it seems always to be the pgtable lock that serializes things.
> 
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
> 
> If there's a driver that manipulated the ptes, changed the content of the page,
> recover the ptes to origin, and all these happen right after wp_page_copy()
> unlocked the pgtable lock but before wp_page_copy() retakes the same lock
> again, we may face the same issue finding that the page got copied contains
> corrupted data at last.  While I don't know what to blame on the driver either
> because it seems to be exactly following the rules.

The driver would have to do so without flushing the TLB. Having said that,
the driver could have used inc_tlb_flush_pending() and batch flushes.

> 
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..

It might be too big of a hammer. But the question that comes to my mind is,
if it is ok to change the PTEs without mmap_lock held for write, why
wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
mprotect change of PTE would not be done with the write-lock)? If you did
so, you would have the same problem as the one we encountered (concurrent
protect-unprotect allow concurrent cow-#PF copying the wrong data).

So as an alternative solution, I can do copying under the PTL after
flushing, which seems to solve the problem. First copying (without a lock)
and then comparing seems to me as suboptimal - I do not think the benefit
(if there is one) of shortening the time in which the lock is taken - worth
the additional compare (and the complexity with special pages).

There are 2 problems in doing so:

1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be
called while holding the PTL, but I am not sure.

2. For special pages we would need 2 TLB flushes: one to ensure the
write-bit is cleared, and a second one after we clear the PTE. We
can limit ourselves to soft-dirty/UFFD VMAs, but if we have your
hypothetical driver, this would not be good enough.