Message-ID: <91dbea57-d5b0-49b7-8920-3a2d252c46b0@redhat.com>
Date: Thu, 16 Oct 2025 22:18:17 +0200
From: David Hildenbrand <david@...hat.com>
To: Lisa Wang <wyihan@...gle.com>, linmiaohe@...wei.com,
nao.horiguchi@...il.com, akpm@...ux-foundation.org, pbonzini@...hat.com,
shuah@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org
Cc: rientjes@...gle.com, seanjc@...gle.com, ackerleytng@...gle.com,
vannapurve@...gle.com, michael.roth@....com, jiaqiyan@...gle.com,
tabba@...gle.com, dave.hansen@...ux.intel.com
Subject: Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED
handling on truncation during failure
On 15.10.25 20:58, Lisa Wang wrote:
> The .error_remove_folio a_ops is used by different filesystems to handle
> folio truncation upon discovery of a memory failure in the memory
> associated with the given folio.
>
> Currently, MF_DELAYED is treated as an error, causing "Failed to punch
> page" to be written to the console. MF_DELAYED is then relayed to the
> caller of truncate_error_folio() as MF_FAILED. This further causes
> memory_failure() to return -EBUSY, which then always causes a SIGBUS.
>
> This also implies that regardless of whether the thread's memory
> corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
> memory failure within guest_memfd memory will always cause a SIGBUS.
>
> Update truncate_error_folio() to return MF_DELAYED to the caller if the
> .error_remove_folio() callback reports MF_DELAYED.
>
> Generalize the comment: MF_DELAYED means memory failure was handled and
> some other part of memory failure will be handled later (e.g. a next
> access will result in the process being killed). Specifically for
> guest_memfd, a next access by the guest will result in an error returned
> to the userspace VMM.
>
> With delayed handling, the filemap continues to hold refcounts on the
> folio. Hence, take that into account when checking for extra refcounts
> in me_pagecache_clean(). This is aligned with the implementation in
> me_swapcache_dirty(), where, if a folio is still in the swap cache,
> extra_pins is set to true.
>
> Signed-off-by: Lisa Wang <wyihan@...gle.com>
> ---
> mm/memory-failure.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index df6ee59527dd..77f665c16a73 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> * by the m-f() handler immediately.
> *
> * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> - * The page is unmapped, and is removed from the LRU or file mapping.
> - * An attempt to access the page again will trigger page fault and the
> - * PF handler will kill the process.
> + * It means memory_failure was handled (e.g. removed from file mapping or the
> + * LRU) and some other part of memory failure will be handled later (e.g. a
> + * next access will result in the process being killed). Specifically for
> + * guest_memfd, a next access by the guest will result in an error returned to
> + * the userspace VMM.
> *
> * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> * The page has been completely isolated, that is, unmapped, taken out of
> @@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
> if (mapping->a_ops->error_remove_folio) {
> int err = mapping->a_ops->error_remove_folio(mapping, folio);
>
> + if (err == MF_DELAYED)
> + return err;
> +
> if (err != 0)
> pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
> else if (!filemap_release_folio(folio, GFP_NOIO))
> @@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> goto out;
> }
>
> - /*
> - * The shmem page is kept in page cache instead of truncating
> - * so is expected to have an extra refcount after error-handling.
> - */
> - extra_pins = shmem_mapping(mapping);
> -
> /*
> * Truncation is a bit tricky. Enable it per file system for now.
> *
> * Open: to take i_rwsem or not for this? Right now we don't.
> */
> ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
> +
> + /*
> + * The shmem page, or any page with MF_DELAYED error handling, is kept in
> + * page cache instead of truncating, so is expected to have an extra
> + * refcount after error-handling.
> + */
> + extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
Well, to do it cleanly shouldn't we let shmem_error_remove_folio() also
return MF_DELAYED and remove this shmem special case?
Or is there a good reason shmem_error_remove_folio() wants to return 0 -- and maybe
guest_memfd would also want to do that?
Just reading the code here, the reason for the inconsistency is unclear.
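To make the suggestion concrete, here is a minimal userspace sketch of the control flow, not kernel code and not a tested patch. It assumes shmem's callback were changed to return MF_DELAYED, so that extra_pins could be derived from the return value alone. Every model_* name is hypothetical and exists only for illustration; the enum mirrors the ordering of the kernel's enum mf_result, and the filemap_release_folio() step is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's enum mf_result (same ordering). */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

/* Hypothetical stand-in for an a_ops->error_remove_folio callback. */
typedef int (*model_error_remove_fn)(void);

/* A filesystem that really truncates the folio reports success (0). */
int model_truncating_remove(void) { return 0; }

/* Assumption: shmem keeps the folio in the page cache and says so. */
int model_shmem_remove(void) { return MF_DELAYED; }

/* guest_memfd-like behaviour as described in the patch: delayed handling. */
int model_gmem_remove(void) { return MF_DELAYED; }

/* A callback that fails outright (-5 stands in for -EIO). */
int model_failing_remove(void) { return -5; }

/*
 * Simplified model of truncate_error_folio() with the patch applied:
 * MF_DELAYED is passed through, any other non-zero value is a failure,
 * and 0 counts as recovered (the buffer-release step is not modelled).
 */
int model_truncate_error_folio(model_error_remove_fn remove_cb)
{
	int err;

	assert(remove_cb != NULL);
	err = remove_cb();

	if (err == MF_DELAYED)
		return err;
	if (err != 0)
		return MF_FAILED;
	return MF_RECOVERED;
}

/* With shmem reporting MF_DELAYED, no shmem_mapping() check is needed. */
bool model_extra_pins(int ret)
{
	return ret == MF_DELAYED;
}
```

In this model, both the shmem-like and the guest_memfd-like callbacks yield MF_DELAYED and therefore an expected extra refcount, while a truncating callback yields MF_RECOVERED and no extra pin. Whether shmem has a reason to keep returning 0 is exactly the open question above.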
--
Cheers
David / dhildenb