Message-ID: <20251104134831.147584-1-xieyuanbin1@huawei.com>
Date: Tue, 4 Nov 2025 21:48:31 +0800
From: Xie Yuanbin <xieyuanbin1@...wei.com>
To: <david@...nel.org>
CC: <Liam.Howlett@...cle.com>, <akpm@...ux-foundation.org>, <ardb@...nel.org>,
	<arnd@...db.de>, <dave@...ilevsky.ca>, <david@...hat.com>,
	<ebiggers@...nel.org>, <kees@...nel.org>, <liaohua4@...wei.com>,
	<lilinjie8@...wei.com>, <linmiaohe@...wei.com>,
	<linux-arm-kernel@...ts.infradead.org>, <linux-kernel@...r.kernel.org>,
	<linux-mm@...ck.org>, <linux@...linux.org.uk>, <lorenzo.stoakes@...cle.com>,
	<mhocko@...e.com>, <nao.horiguchi@...il.com>, <nathan@...nel.org>,
	<peterz@...radead.org>, <rmk+kernel@...linux.org.uk>, <rostedt@...dmis.org>,
	<rppt@...nel.org>, <surenb@...gle.com>, <vbabka@...e.cz>, <will@...nel.org>,
	<xieyuanbin1@...wei.com>
Subject: Re: [RFC PATCH 1/2] ARM: mm: support memory-failure

On Mon, 3 Nov 2025 17:53:18 +0100, David Hildenbrand wrote:
> Can you go into more details which exact functionality in
> memory-failure.c you would be interested in using?
>
> Only soft-offlining or also the other (possibly architecture-specific)
> handling?

Thanks! Let me describe it in as much detail as possible.

The functions in memory-failure.c are currently used in three ways:
1. When an application is using memory and ECC detects a UE
(uncorrectable error) bit flip in DRAM (the detection is performed by
hardware and is invisible to software), the hardware raises an interrupt
to the CPU. The relevant driver (a third-party module) has already
registered a handler for this interrupt.
Depending on the configuration, the driver either calls
`memory_failure_queue()` from inside the handler, or wakes up a kthread
that calls `soft_offline_page()`/`memory_failure()` to take the affected
memory offline or kill the process.

2. Hardware memory scanning: the hardware periodically performs
read/write tests on some memory. (This hardware is non-standard, so it is
not covered by the ARM spec, and the scanning is invisible to software.)
If a bit flip is detected during a test, an interrupt is reported to the
operating system, which then performs memory-failure handling just as
described in case 1.

3. Software memory scanning: software (such as a kthread or a
workqueue) periodically uses `soft_offline_page()` to isolate some free
memory and performs read/write tests on it. If a bit flip is detected
during the test, the memory is considered faulty and is not returned to
service. Otherwise, `unpoison_memory()` is used to recover the memory.
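
As a rough illustration of the flow in case 3 only (this is not the
actual driver code), the isolate/test/recover sequence can be modelled
in userspace C as below. Everything here is hypothetical: `struct
sim_page`, `stuck_mask`, and the `sim_*()` helpers are stand-ins that only
record state, whereas the real `soft_offline_page()` and
`unpoison_memory()` operate on page frame numbers inside the kernel. The
choice of test patterns is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_WORDS 8

enum page_state { PAGE_IN_USE, PAGE_ISOLATED };

/* Hypothetical model of one page; stuck_mask simulates bits stuck at 1. */
struct sim_page {
	enum page_state state;
	uint64_t stuck_mask;
	uint64_t words[SIM_WORDS];
};

/* Hypothetical stand-in for soft_offline_page(): only records isolation. */
static int sim_soft_offline_page(struct sim_page *p)
{
	p->state = PAGE_ISOLATED;
	return 0;
}

/* Hypothetical stand-in for unpoison_memory(): returns page to service. */
static int sim_unpoison_memory(struct sim_page *p)
{
	p->state = PAGE_IN_USE;
	return 0;
}

/* Writes pass through the simulated fault, reads return what was stored. */
static void sim_write(struct sim_page *p, size_t i, uint64_t v)
{
	p->words[i] = v | p->stuck_mask;
}

static uint64_t sim_read(const struct sim_page *p, size_t i)
{
	return p->words[i];
}

/* Write each pattern to every word, then read back and compare. */
static bool pattern_test(struct sim_page *p)
{
	static const uint64_t patterns[] = {
		UINT64_C(0x0000000000000000), UINT64_C(0xFFFFFFFFFFFFFFFF),
		UINT64_C(0xAAAAAAAAAAAAAAAA), UINT64_C(0x5555555555555555),
	};

	for (size_t k = 0; k < sizeof(patterns) / sizeof(patterns[0]); k++) {
		for (size_t i = 0; i < SIM_WORDS; i++)
			sim_write(p, i, patterns[k]);
		for (size_t i = 0; i < SIM_WORDS; i++)
			if (sim_read(p, i) != patterns[k])
				return false;	/* bit flip detected */
	}
	return true;
}

/* Returns true if the page is in service afterwards, false if it stays isolated. */
static bool scan_one_page(struct sim_page *p)
{
	assert(p != NULL);

	if (sim_soft_offline_page(p) != 0)
		return true;		/* could not isolate, page left in use */
	if (!pattern_test(p))
		return false;		/* faulty: keep it isolated */
	sim_unpoison_memory(p);		/* healthy: recover the memory */
	return true;
}
```

In this model, calling `scan_one_page()` on a page with `stuck_mask == 0`
leaves it in `PAGE_IN_USE`, while a page with any stuck bit remains
`PAGE_ISOLATED`.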

Unfortunately, the driver code for these three methods is difficult to
open-source. I have also been thinking about whether there is a
general-purpose feature that could make use of memory-failure, but I
haven't come up with a good idea yet.

> Cheers
>
> David

Thanks!

Xie Yuanbin
