[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20231222062706.5221-2-qiuxu.zhuo@intel.com>
Date: Fri, 22 Dec 2023 14:27:06 +0800
From: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
To: naoya.horiguchi@....com
Cc: linmiaohe@...wei.com,
akpm@...ux-foundation.org,
tony.luck@...el.com,
ying.huang@...el.com,
fengwei.yin@...el.com,
qiuxu.zhuo@...el.com,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN
During the process of splitting a hw-poisoned huge page, it is possible
for the reference count of the huge page to be increased by the threads
within the affected process, leading to a failure in splitting the
hw-poisoned huge page with an error code of -EAGAIN.
This issue can be reproduced when doing memory error injection to a
multiple-thread process, and the error occurs within a huge page.
The call path with the returned -EAGAIN during the testing is shown below:
memory_failure()
try_to_split_thp_page()
split_huge_page()
split_huge_page_to_list() {
...
Step A: can_split_folio() - Checked that the thp can be split.
Step B: unmap_folio()
Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
...
}
The testing logs indicated that some huge pages were split successfully
via the call path above (Step C was successful for these huge pages).
However, some huge pages failed to split due to a failure at Step C, and
it was observed that the reference count of the huge page increased between
Step A and Step C.
Testing has shown that after receiving -EAGAIN, simply re-splitting the
hw-poisoned huge page within memory_failure() always results in the same
-EAGAIN. This is possible because memory_failure() is executed in the
currently affected process. Before this process exits memory_failure() and
is terminated, its threads could increase the reference count of the
hw-poisoned page.
Furthermore, if the h/w-poisoned huge page had been mapped for the victim
application's text and was present in the file cache and it was failed to
be split. When attempting to restart the process without splitting the
h/w-poisoned huge page, the application restart failed. This was possible
because its text was remapped to the hardware-poisoned huge page from the
file cache, leading to its swift termination due to another MCE.
To address this issue, leverage memory_failure_queue_delayed() to re-split
the h/w-poisoned huge page by a kernel worker. By the time this worker
begins re-splitting the h/w-poisoned huge page, the affected process has
already been terminated, preventing its threads from increasing the
reference count. Experimental results have consistently shown that this
worker successfully re-splits these h/w-poisoned huge pages on its first
attempt. After that the victim application restarted successfully.
The kernel log (before):
[ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
After that, the victim application may fail to restart.
The kernel log (after):
[ 405.195308] Memory failure: 0x410756b: recovery action for unsplit thp: Delayed
[ 405.200267] Memory failure: 0x410756b: recovery action for clean LRU page: Recovered
After that, the victim application restarted successfully.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
---
v1->v2:
1. Include user-visible benefits in both the commit message and code comments.
[ Address Naoya Horiguchi's comment on the unclear user-visible benefits in v1. ]
2. Leverage memory_failure_queue_delayed() to re-split the h/w-poisoned huge page,
which makes the recovery action more complete.
[ Address Naoya Horiguchi's comment regarding the incomplete recovery in v1.
Address Miaohe's concern about the possibility of being freed when the worker is scheduled in v1. ]
include/linux/mm.h | 1 +
mm/memory-failure.c | 54 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..87bd67d0a046 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3904,6 +3904,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ /* Bits {31:28} are reserved for encoding the THP split retry count. */
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8f2c11863bae..2334d0417258 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2147,6 +2147,28 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
return rc;
}
+static void memory_failure_queue_delayed(unsigned long pfn, int flags, unsigned long delay);
+
+#define MF_RETRY_CNT_MASK GENMASK(31, 28)
+#define MF_RETRY_CNT_SHIFT 28
+#define MF_RETRY_CNT_GET(f) (((f) & MF_RETRY_CNT_MASK) >> MF_RETRY_CNT_SHIFT)
+#define MF_RETRY_CNT_SET(f, c) (((f) & ~MF_RETRY_CNT_MASK) | ((c) << MF_RETRY_CNT_SHIFT))
+
+#define MAX_RETRY_CNT 10
+#define INIT_DELAYED_MS 1
+
+static bool split_thp_delayed(unsigned long pfn, int flags)
+{
+ int cnt = MF_RETRY_CNT_GET(flags);
+
+ if (cnt >= MAX_RETRY_CNT)
+ return false;
+
+ memory_failure_queue_delayed(pfn, MF_RETRY_CNT_SET(flags, cnt + 1),
+ msecs_to_jiffies(INIT_DELAYED_MS << cnt));
+ return true;
+}
+
/**
* memory_failure - Handle memory failure of a page.
* @pfn: Page Number of the corrupted page
@@ -2211,7 +2233,7 @@ int memory_failure(unsigned long pfn, int flags)
if (hugetlb)
goto unlock_mutex;
- if (TestSetPageHWPoison(p)) {
+ if (TestSetPageHWPoison(p) && !MF_RETRY_CNT_GET(flags)) {
pr_err("%#lx: already hardware poisoned\n", pfn);
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
@@ -2275,8 +2297,34 @@ int memory_failure(unsigned long pfn, int flags)
* page is a valid handlable page.
*/
SetPageHasHWPoisoned(hpage);
- if (try_to_split_thp_page(p) < 0) {
- res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+ res = try_to_split_thp_page(p);
+ if (res < 0) {
+ /*
+ * Re-attempting try_to_split_thp_page() here could consistently
+ * yield -EAGAIN, as the threads of the process may increment the
+ * reference count of the huge page before the process exits
+ * memory_failure() and terminates.
+ *
+ * Furthermore, if the h/w-poisoned huge page had been mapped for
+ * the victim application's text and was present in the file cache
+ * and it was failed to be split. When attempting to restart the
+ * application without splitting the h/w-poisoned huge page, the
+ * application restart failed. This was possible because its text
+ * was remapped to the hardware-poisoned huge page from the file
+ * cache, leading to its swift termination due to another MCE.
+ *
+ * Leverage memory_failure_queue_delayed() to re-split the h/w-poisoned
+ * huge page by a kernel worker. By the time this worker initiates
+ * the re-splitting and recovery process, the affected process has
+ * already been terminated, preventing its threads from incrementing
+ * the reference count. After re-splitting the h/w-poisoned huge page
+ * successfully (drop the text mapping), the application restart is
+ * successful.
+ */
+ if (res == -EAGAIN && split_thp_delayed(pfn, flags))
+ res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_DELAYED);
+ else
+ res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
--
2.17.1
Powered by blists - more mailing lists