Message-ID: <20251104134831.147584-1-xieyuanbin1@huawei.com>
Date: Tue, 4 Nov 2025 21:48:31 +0800
From: Xie Yuanbin <xieyuanbin1@...wei.com>
To: <david@...nel.org>
CC: <Liam.Howlett@...cle.com>, <akpm@...ux-foundation.org>, <ardb@...nel.org>,
<arnd@...db.de>, <dave@...ilevsky.ca>, <david@...hat.com>,
<ebiggers@...nel.org>, <kees@...nel.org>, <liaohua4@...wei.com>,
<lilinjie8@...wei.com>, <linmiaohe@...wei.com>,
<linux-arm-kernel@...ts.infradead.org>, <linux-kernel@...r.kernel.org>,
<linux-mm@...ck.org>, <linux@...linux.org.uk>, <lorenzo.stoakes@...cle.com>,
<mhocko@...e.com>, <nao.horiguchi@...il.com>, <nathan@...nel.org>,
<peterz@...radead.org>, <rmk+kernel@...linux.org.uk>, <rostedt@...dmis.org>,
<rppt@...nel.org>, <surenb@...gle.com>, <vbabka@...e.cz>, <will@...nel.org>,
<xieyuanbin1@...wei.com>
Subject: Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
On Mon, 3 Nov 2025 17:53:18 +0100, David Hildenbrand wrote:
> Can you go into more details which exact functionality in
> memory-failure.c you would be interested in using?
>
> Only soft-offlining or also the other (possibly architecture-specific)
> handling?
Thanks! Let me describe it in as much detail as possible.
The functions in memory-failure.c are currently used in three ways:
1. When the application is using memory and ECC detects an uncorrectable
error (UE) bit flip in DRAM (the detection is performed by hardware and
is not visible to software), the hardware raises an interrupt to the CPU.
The relevant driver (a third-party module) has already registered a
callback for this interrupt.
Depending on its configuration, the driver either calls
`memory_failure_queue()` inside the callback, or wakes up a kthread that
calls `soft_offline_page()`/`memory_failure()`, to take the affected
memory offline or kill the process.
2. Hardware memory scanning: the hardware periodically performs
read/write tests on some memory. (This is non-standard hardware, so it
is not covered by the ARM spec, and the scanning is not visible to
software.) If a bit flip is detected during the test, an interrupt is
reported to the operating system, which then runs the memory-failure
handling just as described in 1.
3. Software memory scanning: software (such as a kthread or a
workqueue) periodically uses `soft_offline_page()` to isolate some free
memory and performs read/write tests on it. If a bit flip is detected
during the test, the memory is considered faulty and is not recovered.
Otherwise, `unpoison_memory()` is used to recover the memory.
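As a rough illustration only (this is not the driver code, which is not
public), the read/write test in method 3 could look like the user-space
sketch below. The helper names `fill_pattern()`, `count_mismatches()` and
`scan_region()` and the choice of test patterns are hypothetical. In the
kernel flow, the page would first be isolated with `soft_offline_page()`,
and a zero result would be followed by `unpoison_memory()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the same pattern to every word of the region. */
static void fill_pattern(volatile uint64_t *buf, size_t words, uint64_t pattern)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern;
}

/* Hypothetical helper: count the words that no longer hold the pattern. */
static size_t count_mismatches(const volatile uint64_t *buf, size_t words,
                               uint64_t pattern)
{
    size_t bad = 0;

    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern)
            bad++;
    return bad;
}

/*
 * Run every test pattern over the region and return the total number of
 * mismatching words. A non-zero result corresponds to "bit flip detected",
 * i.e. the isolated memory would stay offline; zero means it could be
 * handed back. The pattern set is an assumption, not taken from the driver.
 */
static size_t scan_region(volatile uint64_t *buf, size_t words)
{
    static const uint64_t patterns[] = {
        0x5555555555555555ULL, 0xAAAAAAAAAAAAAAAAULL,
        0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
    };
    size_t bad = 0;

    assert(buf != NULL);
    for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
        fill_pattern(buf, words, patterns[p]);
        bad += count_mismatches(buf, words, patterns[p]);
    }
    return bad;
}
```

The `volatile` qualifier is there so that the compiler does not optimize
away the read-back. A real scanner would also have to deal with CPU caches
so that the test actually reaches DRAM, which this sketch does not attempt.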
Unfortunately, the driver code for these three methods is difficult to
open-source. I have also been thinking about whether there is a
general-purpose feature that could make use of memory-failure, but I
haven't come up with a good idea yet.
> Cheers
>
> David
Thanks!
Xie Yuanbin