linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <X+K7JMrTEC9SpVIB@google.com>
Date:   Tue, 22 Dec 2020 20:36:04 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Xu <peterx@...hat.com>,
        Nadav Amit <nadav.amit@...il.com>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> > We are talking about non-COW anon pages here -- they can't be mapped
> > more than once. So why not just identify them by checking
> > page_mapcount == 1 and then unconditionally reuse them? (This is
> > probably where I've missed things.)
> 
> The problem in depending on page_mapcount to decide if it's COW or
> non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP
> may elevate the count of a COW anon page that become a non-COW anon
> page.
> 
> This is Jann's idea not mine.
> 
> The problem is we have an unprivileged long term GUP like vmsplice
> that facilitates elevating the page count indefinitely, until the
> parent finally writes a secret to it. Theoretically a short term pin
> would do it too so it's not just vmpslice, but the short term pin
> would be incredibly more challenging to become a concern since it'd
> kill a phone battery and flash before it can read any data.
> 
> So what happens with your page_mapcount == 1 check is that it doesn't
> mean non-COW (we thought it did until it didn't for the long term gup
> pin in vmsplice).
> 
> Jann's testcases does fork() and set page_mapcount 2 and page_count to
> 2, vmsplice, take unprivileged infinitely long GUP pin to set
> page_count to 3, queue the page in the pipe with page_count elevated,
> munmap to drop page_count to 2 and page_mapcount to 1.
> 
> page_mapcount is 1, so you'd think the page is non-COW and owned by
> the parent, but the child can still read it so it's very much still
> wp_page_copy material if the parent tries to modify it. Otherwise the
> child can read the content.
> 
> This was supposed to be solvable by just doing the COW in gup(write=0)
> case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
> sure why that didn't fly and it had to be reverted by Peter in
> a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
> happening I was side tracked by urgent issues and I didn't manage to
> look back of how we ended up with the big hammer page_count == 1 check
> instead to decide if to call wp_page_reuse or wp_page_shared.
> 
> So anyway, the only thing that is clear to me is that keeping the
> child from reading the page_mapcount == 1 pages of the parent, is the
> only reason why wp_page_reuse(vmf) will only be called on
> page_count(page) == 1 and not on page_mapcount(page) == 1.
> 
> It's also the reason why your page_mapcount assumption will risk to
> reintroduce the issue, and I only wish we could put back page_mapcount
> == 1 back there.
> 
> Still even if we put back page_mapcount there, it is not ok to leave
> the page fault with stale TLB entries and to rely on the fact
> wp_page_shared won't run. It'd also avoid the problem but I think if
> you leave stale TLB entries in change_protection just like NUMA
> balancing does, it also requires a catcher just like NUMA balancing
> has, or it'd truly work by luck.
> 
> So until we can put a page_mapcount == 1 check back there, the
> page_count will be by definition unreliable because of the speculative
> lookups randomly elevating all non zero page_counts at any time in the
> background on all pages, so you will never be able to tell if a page
> is true COW or if it's just a spurious COW because of a speculative
> lookup. It is impossible to differentiate a speculative lookup from a
> vmsplice ref in a child.

Thanks for the details.

In your patch, do we need to take wrprotect_rwsem in
handle_userfault() as well? Otherwise, it seems userspace would have
to synchronize between its wrprotect ioctl and fault handler? i.e.,
the fault hander needs to be aware that the content of write-
protected pages can actually change before the iotcl returns.