Date: Wed, 20 Dec 2023 08:44:29 +0000
From: "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
To: Naoya Horiguchi <naoya.horiguchi@...ux.dev>
CC: "naoya.horiguchi@....com" <naoya.horiguchi@....com>,
	"linmiaohe@...wei.com" <linmiaohe@...wei.com>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "Luck, Tony" <tony.luck@...el.com>, "Huang,
 Ying" <ying.huang@...el.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Yin, Fengwei"
	<fengwei.yin@...el.com>
Subject: RE: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on
 -EAGAIN

Hi Naoya Horiguchi,

Thanks for the review. 
See the comments below.

> From: Naoya Horiguchi <naoya.horiguchi@...ux.dev>
> Sent: Tuesday, December 19, 2023 10:17 AM
> ...
> > The kernel log (before):
> >   [ 1116.862895] Memory failure: 0x4097fa7: recovery action for
> > unsplit thp: Ignored
> >
> > The kernel log (after):
> >   [  793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp:
> Delayed
> >   [  793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
> 
> I'm unclear about the user-visible benefit of ensuring that the error thp is
> split.
> So could you explain about it?

During our testing, we observed that the hardware-poisoned huge page was mapped
as the victim application's text and was also present in the file cache.
When we tried to restart the application without the thp having been split,
the restart failed. This was because the application's text was remapped to the
hardware-poisoned huge page from the file cache, so the application was quickly
terminated by another MCE.

Once the unsplit thp has been re-split successfully (which drops the text mapping),
the application restarts successfully. I'll also add this description to the commit message in v2.

> I think that the raw error page is not unmapped (with hwpoisoned entry)
> after delayed re-splitting, so recovery action seems not complete even with
> this patch.
> So this patch seems to just convert a hwpoisoned unrecovered thp into a
> hwpoisoned unrecovered raw page.

You're correct. Thanks for catching this.
Instead of creating a new work item just to split the thp, I'll use the existing memory_failure_queue()
to re-split the thp in v2, which should make the recovery action more complete.
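
A rough userspace sketch of that idea, purely for illustration: this is not the
v2 patch and not kernel code. All names here (model_thp, retry_queue,
model_try_split, model_handle_failure, model_run_worker) are hypothetical
stand-ins. The point it tries to show is that when the split returns -EAGAIN,
re-queueing the page makes the whole handler run again later, so the raw error
page also gets unmapped instead of only being split.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8

/* Hypothetical model of one poisoned thp; not a kernel structure. */
struct model_thp {
	unsigned long pfn;
	int busy_refs;	/* extra references that make a split fail with -EAGAIN */
	bool split;
	bool unmapped;
};

/* Small ring buffer standing in for a deferred-work queue. */
struct retry_queue {
	struct model_thp *items[QUEUE_LEN];
	size_t head, tail;
};

static int queue_push(struct retry_queue *q, struct model_thp *t)
{
	if (q->tail - q->head == QUEUE_LEN)
		return -ENOSPC;
	q->items[q->tail++ % QUEUE_LEN] = t;
	return 0;
}

static struct model_thp *queue_pop(struct retry_queue *q)
{
	if (q->head == q->tail)
		return NULL;
	return q->items[q->head++ % QUEUE_LEN];
}

/* Stand-in for the split: fails with -EAGAIN while the page is still pinned. */
static int model_try_split(struct model_thp *t)
{
	if (t->busy_refs > 0)
		return -EAGAIN;
	t->split = true;
	return 0;
}

/*
 * Stand-in for the full handler: split first, then unmap the raw error page.
 * On -EAGAIN the page is re-queued so the whole handler is retried later.
 */
static int model_handle_failure(struct retry_queue *q, struct model_thp *t)
{
	int ret = model_try_split(t);

	if (ret == -EAGAIN) {
		int qret = queue_push(q, t);

		return qret ? qret : -EAGAIN;
	}
	t->unmapped = true;
	return 0;
}

/*
 * Stand-in for the deferred worker. It only drains the entries that were
 * queued before the call, so a page that is re-queued waits for the next run.
 * Returns the number of pages fully handled in this run.
 */
static int model_run_worker(struct retry_queue *q)
{
	size_t pending = q->tail - q->head;
	int handled = 0;

	while (pending-- > 0) {
		struct model_thp *t = queue_pop(q);

		if (t && model_handle_failure(q, t) == 0)
			handled++;
	}
	return handled;
}
```

In this model a page with busy_refs > 0 stays queued across worker runs, and
once the extra reference is gone the next run both splits and unmaps it.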
 
-Qiuxu

