Message-ID: <3c1665df-9367-4d43-8aa1-6726fbb59640@redhat.com>
Date: Thu, 19 Dec 2024 13:55:12 +0100
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Cc: Ritesh Harjani <ritesh.list@...il.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 "Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Zi Yan <ziy@...dia.com>,
 shuah Khan <shuah@...nel.org>, Dev Jain <dev.jain@....com>
Subject: Re: [PATCH] mm: migration :shared anonymous migration test is failing

On 19.12.24 13:47, Donet Tom wrote:
> The migration selftest is currently failing for shared anonymous
> mappings due to a race condition.
> 
> During migration, the source folio's PTE is unmapped by nuking the
> PTE, flushing the TLB, and then marking the page for migration
> (by creating the swap entries). The issue arises when, immediately
> after the PTE is nuked and the TLB is flushed, but before the page
> is marked for migration, another thread accesses the page. This
> triggers a page fault, and the page fault handler invokes
> do_pte_missing() instead of do_swap_page(), as the page is not yet
> marked for migration.
> 
> In the fault handling path, do_pte_missing() calls __do_fault()
> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
> This eventually calls folio_try_get(), incrementing the reference
> count of the folio undergoing migration. The thread then blocks
> on folio_lock(), as the migration path holds the lock. This
> results in the migration failing in __migrate_folio(), which expects
> the folio's reference count to be 2. However, the reference count is
> incremented by the fault handler, leading to the failure.
> 
> The issue arises because, after nuking the PTE and before marking the
> page for migration, the page is accessed. To address this, we have
> updated the logic to first nuke the PTE, then mark the page for
> migration, and only then flush the TLB. With this patch, if the page is
> accessed immediately after nuking the PTE, the TLB entry is still
> valid, so no fault occurs. After marking the page for migration,
> flushing the TLB ensures that the next page fault correctly triggers
> do_swap_page() and waits for the migration to complete.
> 

Does this reproduce with

commit 536ab838a5b37b6ae3f8d53552560b7c51daeb41
Author: Dev Jain <dev.jain@....com>
Date:   Fri Aug 30 10:46:09 2024 +0530

     selftests/mm: relax test to fail after 100 migration failures
     
     It was recently observed at [1] that during the folio unmapping stage of
     migration, when the PTEs are cleared, a racing thread faulting on that
     folio may increase the refcount of the folio, sleep on the folio lock (the
     migration path has the lock), and migration ultimately fails when
     asserting the actual refcount against the expected.  Thereby, the
     migration selftest fails on shared-anon mappings.  The above enforces the
     fact that migration is a best-effort service, therefore, it is wrong to
     fail the test for just a single failure; hence, fail the test after 100
     consecutive failures (where 100 is still a subjective choice).  Note that,
     this has no effect on the execution time of the test since that is
     controlled by a timeout.
     
     [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/
     

part of 6.12?


In that discussion, we also considered alternatives, such as
retrying migration more often internally.
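
For reference, the relaxed check described in the commit message above
(fail only after 100 consecutive migration failures) can be sketched
roughly as below. This is a standalone illustration, not the selftest
code: run_migrations(), migrate_attempt_fn and struct fake_migrator are
hypothetical names, and the loop is bounded by an attempt count, whereas
the real test is bounded by a timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if one migration attempt succeeded. */
typedef bool (*migrate_attempt_fn)(void *ctx);

/*
 * Run up to `attempts` migration attempts. Returns -1 as soon as
 * `max_consecutive_failures` attempts fail in a row, 0 otherwise.
 * Isolated failures are tolerated because migration is best-effort.
 */
static int run_migrations(migrate_attempt_fn attempt, void *ctx,
			  size_t attempts,
			  unsigned int max_consecutive_failures)
{
	unsigned int failures = 0;

	assert(attempt != NULL);
	assert(max_consecutive_failures > 0);

	for (size_t i = 0; i < attempts; i++) {
		if (attempt(ctx)) {
			failures = 0;	/* a success resets the streak */
			continue;
		}
		if (++failures >= max_consecutive_failures)
			return -1;
	}
	return 0;
}

/* Stand-in for a racy migration: fails for a fixed window of calls. */
struct fake_migrator {
	size_t calls;		/* attempts made so far */
	size_t fail_from;	/* index of the first failing attempt */
	size_t fail_count;	/* number of consecutive failing attempts */
};

static bool fake_attempt(void *ctx)
{
	struct fake_migrator *m = ctx;
	size_t n = m->calls++;

	return !(n >= m->fail_from && n < m->fail_from + m->fail_count);
}
```

With a limit of 100, a streak of 99 failed attempts followed by a
success still passes; only a streak of 100 makes the run fail.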

-- 
Cheers,

David / dhildenb

