Message-ID: <c65f7e9c-2e57-4fd5-973f-fc546c8c5827@redhat.com>
Date: Thu, 19 Dec 2024 13:58:06 +0100
From: David Hildenbrand <david@...hat.com>
To: Donet Tom <donettom@...ux.ibm.com>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Cc: Ritesh Harjani <ritesh.list@...il.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 "Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Zi Yan <ziy@...dia.com>,
 Shuah Khan <shuah@...nel.org>, Dev Jain <dev.jain@....com>
Subject: Re: [PATCH] mm: migration :shared anonymous migration test is failing

On 19.12.24 13:47, Donet Tom wrote:
> The migration selftest is currently failing for shared anonymous
> mappings due to a race condition.
> 
> During migration, the source folio's PTE is unmapped by nuking the
> PTE, flushing the TLB, and then marking the page for migration
> (by creating the swap entries). The issue arises when, immediately
> after the PTE is nuked and the TLB is flushed, but before the page
> is marked for migration, another thread accesses the page. This
> triggers a page fault, and the page fault handler invokes
> do_pte_missing() instead of do_swap_page(), as the page is not yet
> marked for migration.
> 
> In the fault handling path, do_pte_missing() calls __do_fault()
> ->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
> This eventually calls folio_try_get(), incrementing the reference
> count of the folio undergoing migration. The thread then blocks
> on folio_lock(), as the migration path holds the lock. This
> results in the migration failing in __migrate_folio(), which expects
> the folio's reference count to be 2. However, the reference count is
> incremented by the fault handler, leading to the failure.
> 
> The issue arises because, after nuking the PTE and before marking the
> page for migration, the page is accessed. To address this, we have
> updated the logic to first nuke the PTE, then mark the page for
> migration, and only then flush the TLB. With this patch, if the page is
> accessed immediately after nuking the PTE, the TLB entry is still
> valid, so no fault occurs.

But what if the translation is not in the TLB yet, and you get an access
from another CPU just after clearing the PTE (but before flushing the TLB)?
The other CPU will still observe PTE=none and trigger a fault.

So I don't think what you propose rules out all cases.
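
To make the window concrete, here is a toy model of the proposed ordering
(clear PTE, install migration entry, flush TLB). It is only a sketch, not
kernel code: the Pte enum, access() and proposed_order() are hypothetical
names invented for illustration, and the TLB is reduced to a single
"is the translation cached on the accessing CPU" flag.

```python
from enum import Enum


class Pte(Enum):
    """Hypothetical, simplified PTE states for this model."""
    PRESENT = "present"
    NONE = "none"
    MIGRATION = "migration entry"


def access(pte, tlb_cached):
    """Return the path a memory access takes in this toy model."""
    if tlb_cached:
        # A cached translation is used without looking at the page table.
        return "tlb hit, no fault"
    if pte is Pte.PRESENT:
        return "page table walk, no fault"
    if pte is Pte.NONE:
        return "fault: do_pte_missing()"
    return "fault: do_swap_page()"


def proposed_order(tlb_cached):
    """Access path observed by another CPU after each step of the
    proposed order, given whether that CPU had the translation cached."""
    observed = []
    pte = Pte.NONE                  # step 1: clear the PTE
    observed.append(access(pte, tlb_cached))
    pte = Pte.MIGRATION             # step 2: install the migration entry
    observed.append(access(pte, tlb_cached))
    tlb_cached = False              # step 3: flush the TLB
    observed.append(access(pte, tlb_cached))
    return observed


if __name__ == "__main__":
    for cached in (True, False):
        print("translation cached:", cached, "->", proposed_order(cached))
```

With the translation cached, the access after step 1 is served from the
TLB, which is the case the patch description relies on. With a cold TLB,
the same access walks the page table, sees PTE=none and ends up in
do_pte_missing(), so the extra folio reference can still be taken.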

-- 
Cheers,

David / dhildenb
