linux-kernel - Re: [PATCH v5 00/16] HWPOISON: soft offline rework

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200805204354.GA16406@hori.linux.bs1.fc.nec.co.jp>
Date:   Wed, 5 Aug 2020 20:43:55 +0000
From:   HORIGUCHI NAOYA(堀口　直也) 
        <naoya.horiguchi@....com>
To:     Qian Cai <cai@....pw>
CC:     "nao.horiguchi@...il.com" <nao.horiguchi@...il.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "mhocko@...nel.org" <mhocko@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
        "osalvador@...e.de" <osalvador@...e.de>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "david@...hat.com" <david@...hat.com>,
        "aneesh.kumar@...ux.vnet.ibm.com" <aneesh.kumar@...ux.vnet.ibm.com>,
        "zeil@...dex-team.ru" <zeil@...dex-team.ru>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 00/16] HWPOISON: soft offline rework

On Mon, Aug 03, 2020 at 11:19:09AM -0400, Qian Cai wrote:
> On Mon, Aug 03, 2020 at 01:36:58PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hello,
> > 
> > On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@...il.com wrote:
> > > > This patchset is the latest version of soft offline rework patchset
> > > > targetted for v5.9.
> > > > 
> > > > Main focus of this series is to stabilize soft offline.  Historically soft
> > > > offlined pages have suffered from racy conditions because PageHWPoison is
> > > > used to a little too aggressively, which (directly or indirectly) invades
> > > > other mm code which cares little about hwpoison.  This results in unexpected
> > > > behavior or kernel panic, which is very far from soft offline's "do not
> > > > disturb userspace or other kernel component" policy.
> > > > 
> > > > Main point of this change set is to contain target page "via buddy allocator",
> > > > where we first free the target page as we do for normal pages, and remove
> > > > from buddy only when we confirm that it reaches free list. There is surely
> > > > race window of page allocation, but that's fine because someone really want
> > > > that page and the page is still working, so soft offline can happily give up.
> > > > 
> > > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > > seems still work in progress, so I decide to separate it for changes into
> > > > v5.9.  Thank you for your contribution, Oscar.
> > > > 
> > > > The issue reported by Qian Cai is fixed by patch 16/16.
> > > 
> > > I am still getting EIO everywhere on next-20200803 (which includes this v5).
> > > 
> > > # ./random 1
> > > - start: migrate_huge_offline
> > > - use NUMA nodes 0,8.
> > > - mmap and free 8388608 bytes hugepages on node 0
> > > - mmap and free 8388608 bytes hugepages on node 8
> > > madvise: Input/output error
> > > 
> > > From the serial console,
> > > 
> > > [  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> > > [  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> > > [  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> > > [  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> > > [  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> > > [  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> > > [  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> > > [  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> > > [  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> > > [  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> > > [  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> > > [  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000
> > 
> > I might misjudge to skip the following patch, sorry about that.
> > Could you try with it?
> 
> Still getting EIO after applied this patch.
> 
> [ 1215.499030][T88982] soft offline: 0x201bdc20: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.499775][T88982] Soft offlining pfn 0x201bdc20 at process virtual address 0x7fff91a00000
> [ 1215.500189][T88982] Soft offlining pfn 0x201c19c0 at process virtual address 0x7fff91c00000
> [ 1215.500297][T88982] soft offline: 0x201c19c0: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.500982][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91a00000
> [ 1215.501086][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.501237][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91c00000
> [ 1215.501355][T88982] soft offline: 0x1f4520: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.502196][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91a00000
> [ 1215.502573][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91c00000
> [ 1215.502687][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.503245][T88982] Soft offlining pfn 0x201c3cc0 at process virtual address 0x7fff91a00000
> [ 1215.503594][T88982] Soft offlining pfn 0x201c3ce0 at process virtual address 0x7fff91c00000
> [ 1215.503755][T88982] __get_any_page: 0x201c3ce0: unknown zero refcount page type 87fff8000000000

So I lean to cancel patch 3/16 where madvise_inject_error() releases page
refcount before calling soft_offline_page(). That should solve the issue
because refcount of the target page never becomes zero.
Patch 4/16, 8/16 and 9/16 depend on it, so they should be removed too.
I'll resubmit the series later, but I prepare the delta below (based on
next-20200805), so it would be appreciated if you can run random.c with it
to confrim the fix.

Thanks,
Naoya Horiguchi
---
>From 533090c0869aeca88d8ff14d27c3cef8fc060ccd Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@....com>
Date: Thu, 6 Aug 2020 01:29:08 +0900
Subject: [PATCH] (for testing) revert change around removing
 MF_COUNT_INCREASED

revert the following patches
- mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
- mm,madvise: Refactor madvise_inject_error
- mm,hwpoison: remove MF_COUNT_INCREASED
- mm,hwpoison: remove flag argument from soft offline functions

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@....com>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  9 +++++----
 mm/madvise.c          | 36 +++++++++++++++++++-----------------
 mm/memory-failure.c   | 31 ++++++++++++++++++++-----------
 4 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3e6d27c9dff6..4db3c660de83 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn);
+	ret = soft_offline_page(pfn, 0);
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4f12b2465e80..442921a004a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2976,9 +2976,10 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_ACTION_REQUIRED = 1 << 0,
-	MF_MUST_KILL = 1 << 1,
-	MF_SOFT_OFFLINE = 1 << 2,
+	MF_COUNT_INCREASED = 1 << 0,
+	MF_ACTION_REQUIRED = 1 << 1,
+	MF_MUST_KILL = 1 << 2,
+	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
@@ -2988,7 +2989,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn);
+extern int soft_offline_page(unsigned long pfn, int flags);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 843f6fad3b89..5fa5f66468b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -886,15 +886,16 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
+	struct page *page;
 	struct zone *zone;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+
 	for (; start < end; start += size) {
 		unsigned long pfn;
-		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -909,26 +910,27 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
-		/*
-		 * The get_user_pages_fast() is just to get the pfn of the
-		 * given address, and the refcount has nothing to do with
-		 * what we try to test, so it should be released immediately.
-		 * This is racy but it's intended because the real hardware
-		 * errors could happen at any moment and memory error handlers
-		 * must properly handle the race.
-		 */
-		put_page(page);
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = soft_offline_page(pfn);
-		} else {
-			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = memory_failure(pfn, 0);
+					pfn, start);
+
+			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			if (ret)
+				return ret;
+			continue;
 		}
 
+		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				pfn, start);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
+		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a229d4694954..fd50a6f9a60d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1167,7 +1167,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1363,7 +1363,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1392,7 +1392,10 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		if (flags & MF_COUNT_INCREASED)
+			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+		else
+			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1540,7 +1543,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn);
+			soft_offline_page(entry.pfn, entry.flags);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1679,10 +1682,13 @@ EXPORT_SYMBOL(unpoison_memory);
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn)
+static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
+	if (flags & MF_COUNT_INCREASED)
+		return 1;
+
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1709,12 +1715,12 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn)
+static int get_any_page(struct page *page, unsigned long pfn, int flags)
 {
-	int ret = __get_any_page(page, pfn);
+	int ret = __get_any_page(page, pfn, flags);
 
 	if (ret == -EBUSY)
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, flags);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1727,7 +1733,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, 0);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_page(page);
@@ -1869,6 +1875,7 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
+ * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1887,7 +1894,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn)
+int soft_offline_page(unsigned long pfn, int flags)
 {
 	int ret;
 	struct page *page;
@@ -1901,11 +1908,13 @@ int soft_offline_page(unsigned long pfn)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
+		if (flags & MF_COUNT_INCREASED)
+			put_page(page);
 		return 0;
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn);
+	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
 
 	if (ret > 0)
-- 
2.25.1