Message-ID: <3e6500dc-723f-4682-9e37-b28bc78a2bdb@redhat.com>
Date: Mon, 13 Oct 2025 11:25:23 +0200
From: David Hildenbrand <david@...hat.com>
To: Lance Yang <lance.yang@...ux.dev>
Cc: Longlong Xia <xialonglong2025@....com>, nao.horiguchi@...il.com,
 akpm@...ux-foundation.org, wangkefeng.wang@...wei.com, xu.xin16@....com.cn,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Longlong Xia <xialonglong@...inos.cn>, lorenzo.stoakes@...cle.com,
 Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com,
 mhocko@...e.com, Miaohe Lin <linmiaohe@...wei.com>, qiuxu.zhuo@...el.com
Subject: Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory
 failures

On 13.10.25 11:15, Lance Yang wrote:
> @David
> 
> Cc: MM CORE folks
> 
> On 2025/10/13 12:42, Lance Yang wrote:
> [...]
> 
> Cool. Hardware error injection with EINJ was the way to go!
> 
> I just ran some tests on the shared zero page (both regular and huge), and
> found a tricky behavior:
> 
> 1) When a hardware error is injected into the zeropage, the process that
> attempts to read from a mapping backed by it is correctly killed with a
> SIGBUS.
> 
> 2) However, even after the error is detected, the kernel continues to
> install the known-poisoned zeropage for new anonymous mappings ...
> 
> 
> For the shared zeropage:
> ```
> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in user-access at 29b8cf5000
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to read_zeropage:13767 due to hardware memory corruption
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action for already poisoned page: Failed
> ```
> And for the shared huge zeropage:
> ```
> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in user-access at 1e1e00000
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to read_huge_zerop:13891 due to hardware memory corruption
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for already poisoned page: Failed
> ```
> 
> Since we've identified an uncorrectable hardware error on such a critical,
> singleton page, should we be doing something more?

I mean, regarding the shared zeropage, we could try walking all page 
tables of all processes and replace it with a fresh shared zeropage.

But then, the page might also be used for other things (I/O etc), the 
shared zeropage is allocated by the architecture, we'd have to make 
is_zero_pfn() succeed on the old+new page etc ...
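
To illustrate only the last point, here is a rough userspace model (not 
kernel code) of a check that keeps succeeding on both the old and the new 
page after a replacement. Every name in it (zero_pfn_set, 
model_is_zero_pfn, model_replace_zero_pfn) is hypothetical and made up 
for this sketch; it ignores locking, architectures with more than one 
zero page, and everything else that makes the real thing hard.

```c
/*
 * Userspace model only, NOT kernel code. All names are hypothetical
 * and exist purely to illustrate the "old + new page" check.
 */
#include <stdbool.h>

/* Hypothetical bookkeeping: the zeropage in use plus one retired one. */
struct zero_pfn_set {
	unsigned long current_pfn;	/* handed out for new mappings */
	unsigned long retired_pfn;	/* poisoned, may still be mapped */
	bool has_retired;		/* false until a replacement happened */
};

/* Model of a check that accepts both the old and the new zeropage. */
static bool model_is_zero_pfn(const struct zero_pfn_set *set,
			      unsigned long pfn)
{
	if (pfn == set->current_pfn)
		return true;
	return set->has_retired && pfn == set->retired_pfn;
}

/* Model of switching to a fresh zeropage after a memory failure. */
static void model_replace_zero_pfn(struct zero_pfn_set *set,
				   unsigned long new_pfn)
{
	set->retired_pfn = set->current_pfn;
	set->has_retired = true;
	set->current_pfn = new_pfn;
}
```

In this model, a PFN that was the zeropage before model_replace_zero_pfn() 
still passes model_is_zero_pfn() afterwards, so existing mappings that have 
not been rewritten yet would still be recognized. It only remembers a single 
retired page; a second failure would need more than that.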

So a lot of work for little benefit I guess? The question is how often 
we would see that in practice. I'd assume we'd see it happen on random 
kernel memory more frequently where we can really just bring down the 
whole machine.

-- 
Cheers

David / dhildenb

