Message-ID: <44bbf475-7a78-95fc-3a2c-83f4878840ad@redhat.com>
Date: Tue, 29 Mar 2022 18:09:27 +0200
From: David Hildenbrand <david@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
David Rientjes <rientjes@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
John Hubbard <jhubbard@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Yang Shi <shy828301@...il.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Michal Hocko <mhocko@...nel.org>,
Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
Roman Gushchin <guro@...com>,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
Donald Dutile <ddutile@...hat.com>,
Christoph Hellwig <hch@....de>,
Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
Liang Zhang <zhangliang5@...wei.com>,
Pedro Gomes <pedrodemargomes@...il.com>,
Oded Gabbay <oded.gabbay@...il.com>, linux-mm@...ck.org,
Khalid Aziz <khalid.aziz@...cle.com>
Subject: Re: [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of
anonymous pages
On 29.03.22 18:04, David Hildenbrand wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on latest linus/master (post v5.17, with relevant core-MM changes for
> v5.18-rc1).
>
> v3 is located at:
> https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v3
>
>
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) is taken
> on an anonymous page and COW logic fails to detect exclusivity of the page,
> replacing the anonymous page by a copy in the page table: the GUP pin
> loses synchronicity with the page mapped into the page tables.
>
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
> 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
>
> page_maybe_dma_pinned() is used to check if a page may be pinned for
> DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
> tolerable, false negatives are problematic: pages that are pinned for
> DMA must not be added to the swapcache. If it happens, the (now pinned)
> page could be faulted back from the swapcache into page tables
> read-only. Future write-access would detect the pinning and COW the
> page, losing synchronicity. For the interested reader, this is nicely
> documented in feb889fb40fa ("mm: don't put pinned pages into the swap
> cache").
>
> Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
> cases and can result in a violation of the documented semantics:
> giving false negatives because of the race.
>
> There are cases where we call it without properly taking a per-process
> sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
> one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
> handle, there is especially one rmap case (shrink_page_list) that's hard
> to fix: in the rmap world, we're not limited to a single process.
>
> The shrink_page_list() issue is really subtle. If we race with
> someone pinning a page, we can trigger the same issue as in the FOLL_GET
> case. See the detail section at the end of this mail for a discussion of how
> badly this can bite us with VFIO or other FOLL_PIN users.
>
> It's harder to reproduce, but I managed to modify the O_DIRECT
> reproducer to use io_uring fixed buffers [15] instead, which ends up
> using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
> similarly trigger a loss of synchronicity and consequently a memory
> corruption.
>
> Again, the root issue is that a write-fault on a page that has
> additional references results in a COW and thereby a loss of
> synchronicity and consequently a memory corruption if two parties
> believe they are referencing the same page.
> "
>
> This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
> especially also taking care of concurrent pinning via GUP-fast,
> for example, also fully fixing an issue recently reported regarding NUMA
> balancing [4]. While doing that, it further reduces "unnecessary
> COWs", especially when we don't fork()/KSM and don't swap out, and fixes the
> COW security issue for hugetlb with FOLL_PIN.
>
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault handler.
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have
> to be marked shared -- which will fail if there are GUP pins on the page.
> GUP is only allowed to take a pin on anonymous pages that are exclusive.
> The PT lock is the primary mechanism to synchronize modifications of
> PG_anon_exclusive. We synchronize against GUP-fast either via the
> src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of
> the relevant page table entry.
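>
> A simplified sketch of what that "marking shared" boils down to (not the
> exact page_try_share_anon_rmap() implementation from this series):
>
>   /*
>    * Sketch only. Called with the PT lock held; GUP-fast is kept away
>    * either via src_mm->write_protect_seq (fork()) or by first
>    * clearing+flushing the relevant page table entry.
>    */
>   static int try_share_anon_page(struct page *page)
>   {
>           if (page_maybe_dma_pinned(page))
>                   return -EBUSY;          /* pinned: must stay exclusive */
>           ClearPageAnonExclusive(page);   /* fork()/KSM may now share it */
>           return 0;
>   }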
>
> Special care has to be taken about swap, migration, and THPs (where a
> PMD mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most of
> the magic. For reliable R/O pins of anonymous pages, we need the
> FAULT_FLAG_UNSHARE logic from our previous approach [2]; however, it's now
> 100% mapcount free and I simplified it a bit further.
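>
> As a sketch of that R/O-pin rule (not the actual GUP code; the helper below
> is made up for illustration), FOLL_PIN without FOLL_WRITE on a non-exclusive
> anonymous page must not take the pin directly but instead trigger a
> FAULT_FLAG_UNSHARE fault to obtain an exclusive copy first:
>
>   /* Sketch only: decide whether an R/O pin has to unshare first. */
>   static bool ro_pin_requires_unshare(struct page *page, unsigned int flags)
>   {
>           return (flags & FOLL_PIN) && !(flags & FOLL_WRITE) &&
>                  PageAnon(page) && !PageAnonExclusive(page);
>   }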
>
> #1 is a fix
> #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
> #11 introduces PG_anon_exclusive
> #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
>    reliable
> #13 is a preparation for reliable R/O pins
> #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins and
>    make R/O pins of anonymous pages reliable
> #16 adds sanity checks when (un)pinning anonymous pages
>
>
> [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
> [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
> [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
> [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
>
>
> v2 -> v3:
> * Note 1: Left the terminology "unshare" in place for now instead of
> switching to "make anon exclusive".
> * Note 2: We might have to tackle undoing the effects of arch_unmap_one() on
>   sparc, to free some tag memory immediately instead of when tearing down
>   the vma/mm; this looks like it needs more care either way, so I'll ignore
>   it for now.
> * Rebased on top of core MM changes for v5.18-rc1 (most conflicts were due
> to folio and ZONE_DEVICE migration rework). No severe changes were
> necessary -- mostly folio conversion and code movement.
> * Retested on aarch64, ppc64, s390x and x86_64
> * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
> -> Missed to convert one instance in restore_exclusive_pte()
> * "mm/rmap: pass rmap flags to hugepage_add_anon_rmap()"
> -> Use "!!(flags & RMAP_EXCLUSIVE)" to avoid sparse warnings
> * "mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()"
> -> Added, as we can trigger that now more frequently
> * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
> -> Use subpage in VM_BUG_ON_PAGE() in try_to_migrate_one()
> -> Move comment from folio_migrate_mapping() to folio_migrate_flags()
> regarding PG_anon_exclusive/PG_mappedtodisk
> -> s/int rmap_flags/rmap_t rmap_flags/ in remove_migration_pmd()
> * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
> exclusive when (un)pinning"
> -> Use IS_ENABLED(CONFIG_DEBUG_VM) instead of ifdef
>
> v1 -> v2:
> * Tested on aarch64, ppc64, s390x and x86_64
> * "mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon()
> pages"
> -> Use PG_mappedtodisk instead of PG_slab (thanks Willy!); this simplifies
> the patch and the necessary handling a lot. Add safety BUG_ONs
> -> Move most documentation to the patch description, to be placed in a
> proper documentation doc in the future, once everything's in place
> * ""mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
> -> Skip check+clearing in page_try_dup_anon_rmap(), otherwise we might
> trigger a wrong VM_BUG_ON() for KSM pages in ClearPageAnonExclusive()
> -> In __split_huge_pmd_locked(), call page_try_share_anon_rmap() only
> for "anon_exclusive", otherwise we might trigger a wrong VM_BUG_ON()
> -> In __split_huge_page_tail(), drop any remaining PG_anon_exclusive on
> tail pages, and document why that is fine
>
> RFC -> v1:
> * Rephrased/extended some patch descriptions+comments
> * Tested on aarch64, ppc64 and x86_64
> * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
> -> Added
> * "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()"
> -> Added
> * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
> -> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily
> dropping the PT lock.
> -> Use "reuse" label in __do_huge_pmd_anonymous_page()
> -> Slightly simplify logic in hugetlb_cow()
> -> In remove_migration_pte(), remove unrelated changes around
> page_remove_rmap()
> * "mm: support GUP-triggered unsharing of anonymous pages"
> -> In handle_pte_fault(), trigger pte_mkdirty() only with
> FAULT_FLAG_WRITE
> -> In __handle_mm_fault(), extend comment regarding anonymous PUDs
> * "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
> anonymous page"
> -> Added unsharing logic to gup_hugepte() and gup_huge_pud()
> -> Changed return logic in __follow_hugetlb_must_fault(), making sure
> that "unshare" is always set
> * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
> exclusive when (un)pinning"
> -> Slightly simplified sanity_check_pinned_pages()
>
> David Hildenbrand (16):
> mm/rmap: fix missing swap_free() in try_to_unmap() after
> arch_unmap_one() failed
> mm/hugetlb: take src_mm->write_protect_seq in
> copy_hugetlb_page_range()
> mm/memory: slightly simplify copy_present_pte()
> mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and
> page_try_dup_anon_rmap()
> mm/rmap: convert RMAP flags to a proper distinct rmap_t type
> mm/rmap: remove do_page_add_anon_rmap()
> mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
> mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
> mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon()
> page exclusively
> mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()
> mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for
> PageAnon() pages
> mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
> mm/gup: disallow follow_page(FOLL_PIN)
> mm: support GUP-triggered unsharing of anonymous pages
> mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
> anonymous page
> mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
> exclusive when (un)pinning
>
> include/linux/mm.h | 46 +++++++-
> include/linux/mm_types.h | 8 ++
> include/linux/page-flags.h | 39 ++++++-
> include/linux/rmap.h | 118 +++++++++++++++++--
> include/linux/swap.h | 15 ++-
> include/linux/swapops.h | 25 ++++
> kernel/events/uprobes.c | 2 +-
> mm/gup.c | 106 ++++++++++++++++-
> mm/huge_memory.c | 127 +++++++++++++++-----
> mm/hugetlb.c | 135 ++++++++++++++-------
> mm/khugepaged.c | 2 +-
> mm/ksm.c | 15 ++-
> mm/memory.c | 234 +++++++++++++++++++++++--------------
> mm/memremap.c | 9 ++
> mm/migrate.c | 18 ++-
> mm/migrate_device.c | 23 +++-
> mm/mprotect.c | 8 +-
> mm/rmap.c | 97 +++++++++++----
> mm/swapfile.c | 8 +-
> mm/userfaultfd.c | 2 +-
> tools/vm/page-types.c | 8 +-
> 21 files changed, 825 insertions(+), 220 deletions(-)
>
>
> base-commit: 1930a6e739c4b4a654a69164dbe39e554d228915
The effective change on v5.17 *before rebasing* compared to v2 is:
diff --git a/mm/gup.c b/mm/gup.c
index 72e39b77da10..e1de0104cb19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -48,7 +48,9 @@ static void hpage_pincount_sub(struct page *page, int refs)
static inline void sanity_check_pinned_pages(struct page **pages,
unsigned long npages)
{
-#ifdef CONFIG_DEBUG_VM
+ if (!IS_ENABLED(CONFIG_DEBUG_VM))
+ return;
+
/*
* We only pin anonymous pages if they are exclusive. Once pinned, we
* can no longer turn them possibly shared and PageAnonExclusive() will
@@ -74,7 +76,6 @@ static inline void sanity_check_pinned_pages(struct page **pages,
VM_BUG_ON_PAGE(!PageAnonExclusive(head) &&
!PageAnonExclusive(page), page);
}
-#endif /* CONFIG_DEBUG_VM */
}
/* Equivalent to calling put_page() @refs times. */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0cc34addd911..c53764b0640f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2316,8 +2316,6 @@ static void unmap_page(struct page *page)
try_to_migrate(page, ttu_flags);
else
try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
-
- VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
}
static void remap_page(struct page *page, unsigned int nr)
@@ -3187,7 +3185,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
if (PageAnon(new)) {
- int rmap_flags = RMAP_COMPOUND;
+ rmap_t rmap_flags = RMAP_COMPOUND;
if (!is_readable_migration_entry(entry))
rmap_flags |= RMAP_EXCLUSIVE;
diff --git a/mm/memory.c b/mm/memory.c
index b0b9d07a2850..1c8b52771799 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -727,7 +727,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
* created when the swap entry was made.
*/
if (PageAnon(page))
- page_add_anon_rmap(page, vma, address, false);
+ page_add_anon_rmap(page, vma, address, RMAP_NONE);
else
/*
* Currently device exclusive access only supports anonymous
diff --git a/mm/migrate.c b/mm/migrate.c
index 7f440d2103ce..013eb4f52fed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -384,10 +384,6 @@ int folio_migrate_mapping(struct address_space *mapping,
/* No turning back from here */
newfolio->index = folio->index;
newfolio->mapping = folio->mapping;
- /*
- * Note: PG_anon_exclusive is always migrated via migration
- * entries.
- */
if (folio_test_swapbacked(folio))
__folio_set_swapbacked(newfolio);
@@ -540,6 +536,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
folio_set_workingset(newfolio);
if (folio_test_checked(folio))
folio_set_checked(newfolio);
+ /*
+ * PG_anon_exclusive (-> PG_mappedtodisk) is always migrated via
+ * migration entries. We can still have PG_anon_exclusive set on an
+ * effectively unmapped and unreferenced first sub-pages of an
+ * anonymous THP: we can simply copy it here via PG_mappedtodisk.
+ */
if (folio_test_mappedtodisk(folio))
folio_set_mappedtodisk(newfolio);
diff --git a/mm/rmap.c b/mm/rmap.c
index 9d2a7e11e8cc..c18f6d7891d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1939,7 +1939,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
break;
}
VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
- !anon_exclusive, page);
+ !anon_exclusive, subpage);
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
set_pte_at(mm, address, pvmw.pte, pteval);
@@ -2476,7 +2476,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
if (first)
__page_set_anon_rmap(page, vma, address,
- flags & RMAP_EXCLUSIVE);
+ !!(flags & RMAP_EXCLUSIVE));
}
void hugepage_add_new_anon_rmap(struct page *page,
--
Thanks,
David / dhildenb