linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect


Message-ID: <X+JMiHv+EktzyZgr@redhat.com>
Date:   Tue, 22 Dec 2020 14:44:08 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Peter Xu <peterx@...hat.com>, Yu Zhao <yuzhao@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so

I assume you mean "in" mprotect_fixup, after change_protection.

If you downgraded the mmap_lock to read there, it would severely
slow down the non-contention case whenever more than one vma needs
change_protection.

You'd need to throw away the prev->vm_next info and do a new
find_vma after dropping the mmap_lock for reading and re-taking the
mmap_lock for writing at every iteration of the loop.

To do less harm to the non-contention case you could perhaps walk
vma->vm_next and check if it's outside the mprotect range and only
downgrade in such case. So let's assume we intend to optimize with
mmap_write_downgrade only the last vma.
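
As a rough user-space illustration of that "only the last vma" check
(not kernel code): struct toy_vma and both toy_* helpers below are
hypothetical stand-ins invented for this sketch, and only the
vm_start/vm_end/vm_next field names mirror the kernel's. It merely shows
that peeking at vm_next yields exactly one downgrade point per range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct (sorted, linked list). */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;		/* exclusive */
	struct toy_vma *vm_next;
};

/*
 * True if no later vma starts below 'end', i.e. 'vma' is the last one
 * the mprotect range [start, end) touches.
 */
static bool toy_is_last_vma_in_range(const struct toy_vma *vma,
				     unsigned long end)
{
	assert(vma != NULL);
	return vma->vm_next == NULL || vma->vm_next->vm_start >= end;
}

/*
 * Walk every vma that starts below 'end', counting how many were
 * visited and at how many of them a downgrade would be allowed.
 */
static int toy_count_downgrades(const struct toy_vma *first,
				unsigned long end, int *visited)
{
	int downgrades = 0;

	*visited = 0;
	for (const struct toy_vma *vma = first;
	     vma != NULL && vma->vm_start < end; vma = vma->vm_next) {
		(*visited)++;
		if (toy_is_last_vma_in_range(vma, end))
			downgrades++;
	}
	return downgrades;
}
```

With three vmas at 0x1000-0x2000, 0x2000-0x3000 and 0x5000-0x6000 and a
range ending at 0x3000, the walk visits two vmas and only the second one
qualifies for the downgrade.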

The problem is that once you have had to take mmap_lock for writing,
you have already stalled for I/O, waited for all concurrent page faults,
and blocked them as well for the vma allocations in split_vma, so the
extra boost in SMP scalability you get is lost in the noise there at best.

And the risk is that at worst that extra locked op of
mmap_write_downgrade() will hurt SMP scalability because it would
increase the locked ops of mprotect on the hottest false-shared
cacheline by 50% and that may outweigh the benefit from unblocking
the page faults half a usec sooner on large systems.
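
The 50% figure can be reproduced with back-of-the-envelope arithmetic,
under the assumption (inferred from the number itself, not measured)
that the uncontended write lock, the unlock, and the downgrade each cost
one locked op on the mmap_lock cacheline. The toy_* names are made up
for this sketch.

```c
#include <assert.h>

/* Assumed cost, in locked ops on the mmap_lock word, of each call. */
enum {
	TOY_OPS_WRITE_LOCK = 1,
	TOY_OPS_DOWNGRADE  = 1,
	TOY_OPS_UNLOCK     = 1,
};

/* Locked ops for one uncontended mprotect, with or without downgrade. */
static int toy_locked_ops(int with_downgrade)
{
	int ops = TOY_OPS_WRITE_LOCK + TOY_OPS_UNLOCK;

	if (with_downgrade)
		ops += TOY_OPS_DOWNGRADE;
	return ops;
}

/* Integer percentage increase going from 'before' to 'after'. */
static int toy_percent_increase(int before, int after)
{
	assert(before > 0);
	return (after - before) * 100 / before;
}
```

Under that assumption the count goes from 2 to 3 locked ops per call,
which is the 50% increase mentioned above.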

But the ultimate reason why mprotect cannot do mmap_write_downgrade()
while userfaultfd_writeprotect can do mmap_read_lock and avoid the
mmap_write_lock altogether, is that mprotect leaves no mark in the
pte/hugepmd that allows detecting when the TLB is stale in order to
redirect the page fault into a dead end (handle_userfault() or
do_numa_page) until after the TLB has been flushed, as happens in
the 4 cases below:

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
	 * can be stale. We cannot allow do_wp_page to proceed or
	 * it'll wrongly assume that nobody can still be writing to
	 * the page if !pte_write.
	 */
	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		/*
		 * STALE_TLB_WARNING: while the uffd_wp bit is set,
		 * the TLB can be stale. We cannot allow wp_huge_pmd()
		 * to proceed or it'll wrongly assume that nobody can
		 * still be writing to the page if !pmd_write.
		 */
		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
	/*
	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
	 * be stale.
	 */
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
			/*
			 * STALE_TLB_WARNING: if the pmd is NUMA
			 * protnone the TLB can be stale.
			 */
			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
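
To make the role of the "mark" concrete, here is a simplified user-space
model of where a write fault ends up depending on the pte bits. It is
only a sketch: struct toy_pte, enum toy_route and
toy_route_write_fault() are hypothetical, and the real fault path has
many more cases than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, drastically reduced view of a pte. */
struct toy_pte {
	bool write;	/* pte_write() */
	bool uffd_wp;	/* mark left by userfaultfd_writeprotect */
	bool protnone;	/* mark left by NUMA hinting */
};

enum toy_route {
	TOY_DO_NUMA_PAGE,	/* dead end while the TLB may be stale */
	TOY_HANDLE_USERFAULT,	/* dead end while the TLB may be stale */
	TOY_DO_WP_PAGE,		/* assumes nobody can still be writing */
	TOY_NO_WP_FAULT,	/* pte already writable */
};

/* Decide where a write fault on 'pte' is routed in this toy model. */
static enum toy_route toy_route_write_fault(struct toy_pte pte,
					    bool vma_accessible)
{
	if (pte.protnone && vma_accessible)
		return TOY_DO_NUMA_PAGE;
	if (!pte.write) {
		if (pte.uffd_wp)
			return TOY_HANDLE_USERFAULT;
		/* No mark: a plain mprotect-style wrprotect lands here. */
		return TOY_DO_WP_PAGE;
	}
	return TOY_NO_WP_FAULT;
}
```

In this model a pte that was write-protected without any mark goes
straight to TOY_DO_WP_PAGE, the path that cannot tolerate a stale TLB,
while the uffd_wp and protnone marks divert the fault elsewhere.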

Thanks,
Andrea