linux-kernel - Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect


Message-ID: <X+JmVmDvBOYuw5Zl@redhat.com>
Date:   Tue, 22 Dec 2020 16:34:14 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Andy Lutomirski <luto@...nel.org>, linux-mm <linux-mm@...ck.org>,
        Peter Xu <peterx@...hat.com>,
        lkml <linux-kernel@...r.kernel.org>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        stable <stable@...r.kernel.org>,
        Minchan Kim <minchan@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

What don't you feel safe about? What's the problem with RO->RO->RO? I
don't get it.

In my view pte_same is perfectly OK without a sequence counter. I have
never seen anything that pte_same would not handle, provided all the
invariants are respected. It's actually a great optimization compared
to any unscalable sequence counter.
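
A rough userspace sketch of the revalidation pattern under discussion
(not kernel code): the fault path snapshots the pte, drops the PT lock
for the slow work, then retakes the lock and commits only if the pte
value is unchanged. The type `model_pte_t`, the flag bit positions and
the `model_*` helper names are assumptions invented for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a pte: just a word of flag bits. */
typedef uint64_t model_pte_t;

/* Illustrative bit positions, not the real architecture layout. */
#define MODEL_PAGE_WRITE   (1ULL << 1)
#define MODEL_PAGE_UFFD_WP (1ULL << 10)

/* Same idea as pte_same(): a plain value comparison, no counter. */
static bool model_pte_same(model_pte_t a, model_pte_t b)
{
    return a == b;
}

/*
 * Commit new_pte only if *ptep still equals the snapshot taken before
 * the lock was dropped. The caller is assumed to hold the PT lock.
 * Returns true if the new value was installed.
 */
static bool model_commit_if_unchanged(model_pte_t *ptep,
                                      model_pte_t orig_pte,
                                      model_pte_t new_pte)
{
    assert(ptep != NULL);
    if (!model_pte_same(*ptep, orig_pte))
        return false;   /* pte changed meanwhile: caller must retry */
    *ptep = new_pte;
    return true;
}
```

A change that is later reverted to the same value compares equal here,
which is the case the quoted text wants a generation to detect.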

The counter would slow everything down: having to increment a counter
every time you change a pte, no matter whether the counter is
per-pgtable, per-vma or per-mm, sounds very bad.

I'd rather take mmap_lock_write across the whole userfaultfd ioctl
than have to deal with a new sequence counter increment for every pte
modification on a heavily contended cacheline.

Also note the counter would have solved nothing for
userfaultfd_writeprotect: it's useless for detecting stale TLB entries.

See how the !pte_write check happens after the counter has already been
incremented:

CPU 0			CPU 1		CPU 2
------			--------	-------
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
false_shared_counter_counter++ 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PT unlock

			do_page_fault FAULT_FLAG_WRITE
					userfaultfd_wrprotect(mode_wp = false)
					PT lock
					ATOMIC clear _PAGE_UFFD_WP <- problem
					/* _PAGE_WRITE not set */
					false_shared_counter_counter++ 
					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
					PT unlock
					XXXXXXXXXXXXXX BUG RACE window opens here

			PT lock
			counter = false_shared_counter_counter
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
			FAULT_FLAG_WRITE is set by CPU
			_PAGE_WRITE is still clear in pte
			PT unlock

			wp_page_copy
			copy_user_page runs with stale TLB

			pte_same(counter, orig_pte, pte) -> PASS
				 ^^^^^^^                    ^^^^
			commit the copy to the pte with the lost writes

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window closes here
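
The interleaving above can be replayed step by step in a small,
single-threaded userspace model. This is only a sketch: `struct
model_state`, `model_replay_race()` and the flag values are invented
here, the page is reduced to a single int, and the write through the
stale TLB entry is assumed to come from some thread whose CPU has not
been flushed yet. It shows that a check extended with the counter still
passes, because both increments happen before the fault path takes its
snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits, not the real architecture layout. */
#define MODEL_PAGE_WRITE   (1u << 1)
#define MODEL_PAGE_UFFD_WP (1u << 10)

struct model_state {
    uint32_t pte;                /* pte flag bits */
    uint32_t counter;            /* hypothetical sequence counter */
    bool     stale_tlb_writable; /* a CPU still caches a writable entry */
    int      page;               /* page contents, reduced to one int */
};

struct model_result {
    bool check_passed;           /* counter + pte comparison succeeded */
    bool write_lost;             /* committed copy misses a write */
};

static struct model_result model_replay_race(void)
{
    struct model_state s = {
        .pte = MODEL_PAGE_WRITE, .counter = 0,
        .stale_tlb_writable = true, .page = 1,
    };

    /* CPU 0: wrprotect(mode_wp = true), TLB flush is deferred. */
    s.pte = (s.pte & ~MODEL_PAGE_WRITE) | MODEL_PAGE_UFFD_WP;
    s.counter++;

    /* CPU 2: wrprotect(mode_wp = false), _PAGE_WRITE stays clear. */
    s.pte &= ~MODEL_PAGE_UFFD_WP;
    s.counter++;

    /* CPU 1: write fault, snapshot taken after both increments. */
    uint32_t orig_pte = s.pte;
    uint32_t orig_counter = s.counter;
    assert(!(orig_pte & MODEL_PAGE_WRITE));

    /* CPU 1: wp_page_copy reads the old page contents. */
    int copy = s.page;

    /* Assumed writer: stores through the stale writable TLB entry. */
    if (s.stale_tlb_writable)
        s.page = 2;

    /* CPU 1: revalidate with the counter-extended comparison. */
    bool passed = (s.pte == orig_pte) && (s.counter == orig_counter);

    /* CPU 0: the deferred TLB flush finally runs, too late. */
    s.stale_tlb_writable = false;

    struct model_result r = {
        .check_passed = passed,
        .write_lost = passed && copy != s.page,
    };
    return r;
}
```

In this model both `check_passed` and `write_lost` come out true:
neither the pte nor the counter changes between the snapshot and the
recheck, so the stale copy gets committed.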