Message-ID: <CY8PR11MB7134D3ADA0BCDAB938E6E2A58961A@CY8PR11MB7134.namprd11.prod.outlook.com>
Date: Tue, 2 Jan 2024 02:41:01 +0000
From: "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
To: Andrew Morton <akpm@...ux-foundation.org>
CC: "naoya.horiguchi@....com" <naoya.horiguchi@....com>,
	"linmiaohe@...wei.com" <linmiaohe@...wei.com>, "Luck, Tony"
	<tony.luck@...el.com>, "Huang, Ying" <ying.huang@...el.com>, "Yin, Fengwei"
	<fengwei.yin@...el.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page
 on -EAGAIN

> From: Andrew Morton <akpm@...ux-foundation.org>

Hi Andrew, 

Happy New Year. 
Thanks for reviewing the patch.
Please see the comments inline.

> ...
> 
> So we're hoping that when the worker runs to split the page, the process and
> its threads have exited.  What guarantees this timing?

Case 1: If the threads of the victim process do not access the new mapping to
the h/w-poisoned huge page (no refcnt increase), the h/w-poisoned huge page
should be successfully split in the process context, so there is no need for
the worker to split it.

Case 2: If the threads of the victim process access the new mapping to the
hardware-poisoned huge page (refcnt increase), causing the failure of splitting
the hardware-poisoned huge page, a new MCE will be re-triggered immediately.
Consequently, the process will be promptly terminated upon re-entering the
code below:

MCE occurs:
  memory_failure()
  {
      ...
      if (TestSetPageHWPoison(p)) {
          ...
          kill_accessing_process(current, pfn, flags);
          ...
      }
      ...
  }

The worker splits the h/w-poisoned huge page in the background with retry
delays of 1ms, 2ms, 4ms, 8ms, ..., 512ms. Before reaching the max 512ms
timeout, the process and its threads should already have exited. So, the retry
delays can guarantee the timing.
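
To make the retry window concrete, here is a minimal user-space sketch of the
doubling schedule described above. The constant and function names are
hypothetical and are not taken from the patch; the sketch only reproduces the
arithmetic of the 1ms ... 512ms delays, not the worker itself.

```c
#include <assert.h>

/* Hypothetical constants mirroring the schedule described above:
 * the first retry waits 1 ms and the delay doubles up to a 512 ms cap. */
#define RESPLIT_DELAY_MIN_MS 1u
#define RESPLIT_DELAY_MAX_MS 512u

/* Delay before retry number `attempt` (0 -> 1 ms, 1 -> 2 ms, ..., 9 -> 512 ms).
 * Attempts past the cap stay at 512 ms. */
static unsigned int resplit_delay_ms(unsigned int attempt)
{
	unsigned int delay = RESPLIT_DELAY_MIN_MS;

	while (attempt-- > 0 && delay < RESPLIT_DELAY_MAX_MS)
		delay *= 2;
	return delay;
}

/* Sum of all delays up to and including the capped one.
 * Optionally reports how many retries the schedule contains. */
static unsigned int resplit_total_window_ms(unsigned int *retries)
{
	unsigned int total = 0, n = 0, delay;

	assert(RESPLIT_DELAY_MIN_MS <= RESPLIT_DELAY_MAX_MS);
	do {
		delay = resplit_delay_ms(n++);
		total += delay;
	} while (delay < RESPLIT_DELAY_MAX_MS);

	if (retries)
		*retries = n;
	return total;
}
```

With these assumed constants, resplit_total_window_ms() returns 1023 over 10
retries, i.e. the cumulative wait across the whole schedule is roughly one
second.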

> And we're hoping that the worker has split the page before userspace
> attempts to restart the process.  What guarantees this timing?

Our experiments showed that an immediate restart of the victim process was
consistently successful. This success could be attributed to the duration between
the process being killed and its subsequent restart being sufficiently long,
allowing the worker enough time to split the hardware-poisoned page.
However, in theory, this timing indeed isn't guaranteed.

> All this reliance upon fortunate timing sounds rather unreliable, doesn't it?

The timing of the victim process exit can be guaranteed.
The timing of the subsequent restart of the process cannot be guaranteed in
theory.

The patch is not perfect, but it still provides the victim process with the
opportunity to be restarted successfully.

Thanks!
-Qiuxu
