Message-Id: <20251027172620.d764b8e0eab34abd427d7945@linux-foundation.org>
Date: Mon, 27 Oct 2025 17:26:20 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: <ankita@...dia.com>
Cc: <aniketa@...dia.com>, <vsethi@...dia.com>, <jgg@...dia.com>,
<mochs@...dia.com>, <skolothumtho@...dia.com>, <linmiaohe@...wei.com>,
<nao.horiguchi@...il.com>, <david@...hat.com>,
<lorenzo.stoakes@...cle.com>, <Liam.Howlett@...cle.com>, <vbabka@...e.cz>,
<rppt@...nel.org>, <surenb@...gle.com>, <mhocko@...e.com>,
<tony.luck@...el.com>, <bp@...en8.de>, <rafael@...nel.org>,
<guohanjun@...wei.com>, <mchehab@...nel.org>, <lenb@...nel.org>,
<kevin.tian@...el.com>, <alex@...zbot.org>, <cjia@...dia.com>,
<kwankhede@...dia.com>, <targupta@...dia.com>, <zhiw@...dia.com>,
<dnigam@...dia.com>, <kjaju@...dia.com>, <linux-kernel@...r.kernel.org>,
<linux-mm@...ck.org>, <linux-edac@...r.kernel.org>,
<Jonathan.Cameron@...wei.com>, <ira.weiny@...el.com>,
<Smita.KoralahalliChannabasappa@....com>, <u.kleine-koenig@...libre.com>,
<peterz@...radead.org>, <linux-acpi@...r.kernel.org>, <kvm@...r.kernel.org>
Subject: Re: [PATCH v4 2/3] mm: handle poisoning of pfn without struct pages
On Sun, 26 Oct 2025 14:19:18 +0000 <ankita@...dia.com> wrote:
> From: Ankit Agrawal <ankita@...dia.com>
>
> Poison (or ECC) errors can be very common on a large cluster.
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. If a memory region is mapped
> using remap_pfn_range(), for example, but not added to the kernel, MM
> will not have associated struct pages. Add a new mechanism to handle
> memory failure on such memory.
>
> Make kernel MM expose a function to allow modules managing the device
> memory to register the device memory SPA and the address space associated
> with it. MM maintains this information as an interval tree. On poison, MM
> can search for the range that the poisoned PFN belongs to and use the
> address_space to determine the mapping VMA.
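As I read it, driver-side usage would be something like this (just a
sketch to check my understanding - base_pfn, nr_pages and vma are
stand-ins, and only the .node range and .mapping members of struct
pfn_address_space are assumed, based on the code further down):

	struct pfn_address_space *ps;
	int ret;

	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
	if (!ps)
		return -ENOMEM;

	/* PFN range covering the device memory the driver maps */
	ps->node.start = base_pfn;
	ps->node.last = base_pfn + nr_pages - 1;
	/* address_space which MM will use to find the mapping VMAs */
	ps->mapping = vma->vm_file->f_mapping;

	ret = register_pfn_address_space(ps);
	...
	/* and on teardown */
	unregister_pfn_address_space(ps);
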
>
> In this implementation, kernel MM follows a sequence that is largely
> similar to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn()
> is executed.
> 2. memory_failure_pfn() collects the processes mapped to the PFN.
> 3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
> faulty PFN using kill_procs().
>
> Note that there is one primary difference versus the handling of
> poison on struct pages, which is to skip unmapping of the faulty PFN.
> This is done to handle the huge PFNMAP support added recently [1] that
> enables VM_PFNMAP vmas to map at PMD or PUD level. A poison to a PFN
> mapped in such a way would need breaking the PMD/PUD mapping into PTEs
> that will get mirrored into the S2. This can greatly increase the cost
> of table walks and have a major performance impact.
>
> ...
>
> @@ -2216,6 +2222,136 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
>  	kill_procs(&tokill, true, pfn, flags);
>
>
> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return -EINVAL;
I suggest this be removed - make register_pfn_address_space(NULL)
illegal and let the punishment be an oops.
> +	scoped_guard(mutex, &pfn_space_lock) {
> +		if (interval_tree_iter_first(&pfn_space_itree,
> +					     pfn_space->node.start,
> +					     pfn_space->node.last))
> +			return -EBUSY;
> +
> +		interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(register_pfn_address_space);
> +
> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	guard(mutex)(&pfn_space_lock);
> +
> +	if (pfn_space &&
> +	    interval_tree_iter_first(&pfn_space_itree,
> +				     pfn_space->node.start,
> +				     pfn_space->node.last))
> +		interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> +}
> +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> +
> +static void add_to_kill_pfn(struct task_struct *tsk,
> +			    struct vm_area_struct *vma,
> +			    struct list_head *to_kill,
> +			    unsigned long pfn)
> +{
> +	struct to_kill *tk;
> +
> +	tk = kmalloc(sizeof(*tk), GFP_ATOMIC);
> +	if (!tk)
> +		return;
This is unfortunate. GFP_ATOMIC is unreliable and we silently behave
as if it worked OK.
> +	/* Check for pgoff not backed by struct page */
> +	tk->addr = vma_address(vma, pfn, 1);
> +	tk->size_shift = PAGE_SHIFT;
> +
> +	if (tk->addr == -EFAULT)
> +		pr_info("Unable to find address %lx in %s\n",
> +			pfn, tsk->comm);
> +
> +	get_task_struct(tsk);
> +	tk->tsk = tsk;
> +	list_add_tail(&tk->nd, to_kill);
> +}
> +
> +/*
> + * Collect processes when the error hit a PFN not backed by struct page.
> + */
> +static void collect_procs_pfn(struct address_space *mapping,
> +			      unsigned long pfn, struct list_head *to_kill)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +
> +	i_mmap_lock_read(mapping);
> +	rcu_read_lock();
> +	for_each_process(tsk) {
> +		struct task_struct *t = tsk;
> +
> +		t = task_early_kill(tsk, true);
> +		if (!t)
> +			continue;
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
> +			if (vma->vm_mm == t->mm)
> +				add_to_kill_pfn(t, vma, to_kill, pfn);
> +		}
> +	}
> +	rcu_read_unlock();
We could play games here to make the GFP_ATOMIC allocation unnecessary,
but nasty. Allocate the to_kill* outside the rcu_read_lock, pass that
pointer into add_to_kill_pfn(). If add_to_kill_pfn()'s
kmalloc(GFP_ATOMIC) failed, add_to_kill_pfn() can then consume the
caller's to_kill*. Then the caller can drop the lock, allocate a new
to_kill* then restart the scan. And teach add_to_kill_pfn() to not
re-add tasks which are already on the list. Ugh.
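Roughly, and entirely untested - the body of collect_procs_pfn() would
become something like the below.  The changed add_to_kill_pfn() calling
convention (returns false once it has had to consume the preallocated
entry) and the task_on_kill_list() helper are made up for illustration:

	struct to_kill *spare;

retry:
	/*
	 * No locks held yet, so GFP_KERNEL is OK.  (A failure here would
	 * still need handling.)
	 */
	spare = kmalloc(sizeof(*spare), GFP_KERNEL);

	i_mmap_lock_read(mapping);
	rcu_read_lock();
	for_each_process(tsk) {
		struct task_struct *t = task_early_kill(tsk, true);

		if (!t)
			continue;
		vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
			if (vma->vm_mm != t->mm)
				continue;
			/* skip tasks collected on an earlier pass */
			if (task_on_kill_list(to_kill, t))
				continue;
			if (!add_to_kill_pfn(t, vma, to_kill, pfn, &spare)) {
				/* spare consumed: drop locks, refill, rescan */
				rcu_read_unlock();
				i_mmap_unlock_read(mapping);
				goto retry;
			}
		}
	}
	rcu_read_unlock();
	i_mmap_unlock_read(mapping);
	kfree(spare);		/* unused spare, may be NULL */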
At the very very least we should tell the user that the kernel goofed
and that one of their processes won't be getting killed.
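ie, something like this in add_to_kill_pfn() (sketch):

	tk = kmalloc(sizeof(*tk), GFP_ATOMIC);
	if (!tk) {
		pr_err("Memory failure: out of memory, %s:%d will not be killed for pfn %#lx\n",
		       tsk->comm, tsk->pid, pfn);
		return;
	}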
> +	i_mmap_unlock_read(mapping);
> +}
> +
> +/**
> + * memory_failure_pfn - Handle memory failure on a page not backed by
> + * struct page.
> + * @pfn: Page Number of the corrupted page
> + * @flags: fine tune action taken
> + *
> + * Return:
> + * 0 - success,
> + * -EBUSY - Page PFN does not belong to any address space mapping.
> + */
> +static int memory_failure_pfn(unsigned long pfn, int flags)
> +{
> +	struct interval_tree_node *node;
> +	LIST_HEAD(tokill);
> +
> +	scoped_guard(mutex, &pfn_space_lock) {
> +		bool mf_handled = false;
> +
> +		/*
> +		 * Modules registers with MM the address space mapping to the device memory they
> +		 * manage. Iterate to identify exactly which address space has mapped to this
> +		 * failing PFN.
> +		 */
We're quite lenient about >80 columns nowadays, but overflowing 80 for
a block comment is rather needless.
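ie,

	/*
	 * Modules register with MM the address space mapping to the
	 * device memory they manage.  Iterate to identify exactly which
	 * address space has mapped to this failing PFN.
	 */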
> +		for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> +		     node = interval_tree_iter_next(node, pfn, pfn)) {
> +			struct pfn_address_space *pfn_space =
> +				container_of(node, struct pfn_address_space, node);
> +
> +			collect_procs_pfn(pfn_space->mapping, pfn, &tokill);
> +
> +			mf_handled = true;
> +		}
> +
> +		if (!mf_handled)
> +			return action_result(pfn, MF_MSG_PFN_MAP, MF_IGNORED);
> +	}
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a different
> +	 * physical page at a given virtual address, so all userspace
> +	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
> +	 * MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +
> +	kill_procs(&tokill, true, pfn, flags);
> +
> +	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
> +
> +