lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20220831083024.37138-1-david@redhat.com>
Date:   Wed, 31 Aug 2022 10:30:24 +0200
From:   David Hildenbrand <david@...hat.com>
To:     linux-kernel@...r.kernel.org
Cc:     linux-mm@...ck.org, David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jason Gunthorpe <jgg@...dia.com>,
        John Hubbard <jhubbard@...dia.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Hugh Dickins <hughd@...gle.com>, Peter Xu <peterx@...hat.com>
Subject: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

The comment is stale, because a TLB flush is no longer sufficient and
required to synchronize against concurrent GUP-fast. This used to be true
in the past, whereby a TLB flush would have implied an IPI on architectures
that support GUP-fast, resulting in GUP-fast that disables local interrupts
from completing before completing the flush.

However, ever since general RCU GUP-fast was introduced in
commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"),
this handling no longer applies in general. RCU primarily prevents page
tables from getting freed while they might still get walked by GUP-fast,
but we can race with GUP-fast after clearing the PTE and flushing the
TLB.

Nowadays, we can see a refcount change from GUP-fast at any point in
time. However, GUP-fast detects concurrent PTE changes by looking up the
PTE, temporarily grabbing a reference, and dropping the reference again if
the PTE changed in the meantime.

An explanation by Jason Gunthorpe regarding how existing memory barriers
should be fine to make sure that concurrent GUP-fast cannot succeed in
grabbing a reference with write permissions after we cleared the PTE and
flushed the TLB can be found at [1].

Note that clearing PageAnonExclusive via page_try_share_anon_rmap()
might still need some explicit memory barriers to not have any possible
races with RCU GUP-fast.

[1] https://lkml.kernel.org/r/Yw5rwIUPm49XtqOB@nvidia.com

Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Jason Gunthorpe <jgg@...dia.com>
Cc: John Hubbard <jhubbard@...dia.com>
Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Hugh Dickins <hughd@...gle.com>
Cc: Peter Xu <peterx@...hat.com>
Signed-off-by: David Hildenbrand <david@...hat.com>
---
 mm/ksm.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e88291f63461 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1072,23 +1072,20 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		swapped = PageSwapCache(page);
 		flush_cache_page(vma, pvmw.address, page_to_pfn(page));
 		/*
-		 * Ok this is tricky, when get_user_pages_fast() run it doesn't
-		 * take any lock, therefore the check that we are going to make
-		 * with the pagecount against the mapcount is racy and
-		 * O_DIRECT can happen right after the check.
-		 * So we clear the pte and flush the tlb before the check
-		 * this assure us that no O_DIRECT can happen after the check
-		 * or in the middle of the check.
-		 *
-		 * No need to notify as we are downgrading page table to read
-		 * only not changing it to point to a new page.
+		 * Especially if we're downgrading protection, make sure to
+		 * flush the TLB now. No need to notify as we are not changing
+		 * the PTE to point at a different page.
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
 		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
+
 		/*
-		 * Check that no O_DIRECT or similar I/O is in progress on the
-		 * page
+		 * Make sure that there are no unexpected references (e.g.,
+		 * concurrent O_DIRECT). Note that while concurrent GUP-fast
+		 * could raise the refcount temporarily to grab a write
+		 * reference, it will observe the changed PTE and drop that
+		 * temporary reference again.
 		 */
 		if (page_mapcount(page) + 1 + swapped != page_count(page)) {
 			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
-- 
2.37.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ