[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200805204354.GA16406@hori.linux.bs1.fc.nec.co.jp>
Date: Wed, 5 Aug 2020 20:43:55 +0000
From: HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@....com>
To: Qian Cai <cai@....pw>
CC: "nao.horiguchi@...il.com" <nao.horiguchi@...il.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"mhocko@...nel.org" <mhocko@...nel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
"osalvador@...e.de" <osalvador@...e.de>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"david@...hat.com" <david@...hat.com>,
"aneesh.kumar@...ux.vnet.ibm.com" <aneesh.kumar@...ux.vnet.ibm.com>,
"zeil@...dex-team.ru" <zeil@...dex-team.ru>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 00/16] HWPOISON: soft offline rework
On Mon, Aug 03, 2020 at 11:19:09AM -0400, Qian Cai wrote:
> On Mon, Aug 03, 2020 at 01:36:58PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hello,
> >
> > On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@...il.com wrote:
> > > > This patchset is the latest version of soft offline rework patchset
> > > > targetted for v5.9.
> > > >
> > > > Main focus of this series is to stabilize soft offline. Historically soft
> > > > offlined pages have suffered from racy conditions because PageHWPoison is
> > > > used to a little too aggressively, which (directly or indirectly) invades
> > > > other mm code which cares little about hwpoison. This results in unexpected
> > > > behavior or kernel panic, which is very far from soft offline's "do not
> > > > disturb userspace or other kernel component" policy.
> > > >
> > > > Main point of this change set is to contain target page "via buddy allocator",
> > > > where we first free the target page as we do for normal pages, and remove
> > > > from buddy only when we confirm that it reaches free list. There is surely
> > > > race window of page allocation, but that's fine because someone really want
> > > > that page and the page is still working, so soft offline can happily give up.
> > > >
> > > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > > seems still work in progress, so I decide to separate it for changes into
> > > > v5.9. Thank you for your contribution, Oscar.
> > > >
> > > > The issue reported by Qian Cai is fixed by patch 16/16.
> > >
> > > I am still getting EIO everywhere on next-20200803 (which includes this v5).
> > >
> > > # ./random 1
> > > - start: migrate_huge_offline
> > > - use NUMA nodes 0,8.
> > > - mmap and free 8388608 bytes hugepages on node 0
> > > - mmap and free 8388608 bytes hugepages on node 8
> > > madvise: Input/output error
> > >
> > > From the serial console,
> > >
> > > [ 637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [ 637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> > > [ 637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> > > [ 637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> > > [ 637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> > > [ 637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> > > [ 637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> > > [ 637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> > > [ 637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [ 637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> > > [ 637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> > > [ 637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> > > [ 637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> > > [ 637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000
> >
> > I might misjudge to skip the following patch, sorry about that.
> > Could you try with it?
>
> Still getting EIO after applied this patch.
>
> [ 1215.499030][T88982] soft offline: 0x201bdc20: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.499775][T88982] Soft offlining pfn 0x201bdc20 at process virtual address 0x7fff91a00000
> [ 1215.500189][T88982] Soft offlining pfn 0x201c19c0 at process virtual address 0x7fff91c00000
> [ 1215.500297][T88982] soft offline: 0x201c19c0: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.500982][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91a00000
> [ 1215.501086][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.501237][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91c00000
> [ 1215.501355][T88982] soft offline: 0x1f4520: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.502196][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91a00000
> [ 1215.502573][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91c00000
> [ 1215.502687][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.503245][T88982] Soft offlining pfn 0x201c3cc0 at process virtual address 0x7fff91a00000
> [ 1215.503594][T88982] Soft offlining pfn 0x201c3ce0 at process virtual address 0x7fff91c00000
> [ 1215.503755][T88982] __get_any_page: 0x201c3ce0: unknown zero refcount page type 87fff8000000000
So I lean to cancel patch 3/16 where madvise_inject_error() releases page
refcount before calling soft_offline_page(). That should solve the issue
because refcount of the target page never becomes zero.
Patch 4/16, 8/16 and 9/16 depend on it, so they should be removed too.
I'll resubmit the series later, but I prepare the delta below (based on
next-20200805), so it would be appreciated if you can run random.c with it
to confrim the fix.
Thanks,
Naoya Horiguchi
---
>From 533090c0869aeca88d8ff14d27c3cef8fc060ccd Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@....com>
Date: Thu, 6 Aug 2020 01:29:08 +0900
Subject: [PATCH] (for testing) revert change around removing
MF_COUNT_INCREASED
revert the following patches
- mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
- mm,madvise: Refactor madvise_inject_error
- mm,hwpoison: remove MF_COUNT_INCREASED
- mm,hwpoison: remove flag argument from soft offline functions
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@....com>
---
drivers/base/memory.c | 2 +-
include/linux/mm.h | 9 +++++----
mm/madvise.c | 36 +++++++++++++++++++-----------------
mm/memory-failure.c | 31 ++++++++++++++++++++-----------
4 files changed, 45 insertions(+), 33 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3e6d27c9dff6..4db3c660de83 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
if (kstrtoull(buf, 0, &pfn) < 0)
return -EINVAL;
pfn >>= PAGE_SHIFT;
- ret = soft_offline_page(pfn);
+ ret = soft_offline_page(pfn, 0);
return ret == 0 ? count : ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4f12b2465e80..442921a004a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2976,9 +2976,10 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
unsigned long nr_pages);
enum mf_flags {
- MF_ACTION_REQUIRED = 1 << 0,
- MF_MUST_KILL = 1 << 1,
- MF_SOFT_OFFLINE = 1 << 2,
+ MF_COUNT_INCREASED = 1 << 0,
+ MF_ACTION_REQUIRED = 1 << 1,
+ MF_MUST_KILL = 1 << 2,
+ MF_SOFT_OFFLINE = 1 << 3,
};
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
@@ -2988,7 +2989,7 @@ extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p, int access);
extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn);
+extern int soft_offline_page(unsigned long pfn, int flags);
/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 843f6fad3b89..5fa5f66468b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -886,15 +886,16 @@ static long madvise_remove(struct vm_area_struct *vma,
static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
{
+ struct page *page;
struct zone *zone;
unsigned long size;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
+
for (; start < end; start += size) {
unsigned long pfn;
- struct page *page;
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -909,26 +910,27 @@ static int madvise_inject_error(int behavior,
*/
size = page_size(compound_head(page));
- /*
- * The get_user_pages_fast() is just to get the pfn of the
- * given address, and the refcount has nothing to do with
- * what we try to test, so it should be released immediately.
- * This is racy but it's intended because the real hardware
- * errors could happen at any moment and memory error handlers
- * must properly handle the race.
- */
- put_page(page);
-
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
- pfn, start);
- ret = soft_offline_page(pfn);
- } else {
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
- pfn, start);
- ret = memory_failure(pfn, 0);
+ pfn, start);
+
+ ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+ if (ret)
+ return ret;
+ continue;
}
+ pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+ pfn, start);
+
+ /*
+ * Drop the page reference taken by get_user_pages_fast(). In
+ * the absence of MF_COUNT_INCREASED the memory_failure()
+ * routine is responsible for pinning the page to prevent it
+ * from being released back to the page allocator.
+ */
+ put_page(page);
+ ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a229d4694954..fd50a6f9a60d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1167,7 +1167,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
num_poisoned_pages_inc();
- if (!get_hwpoison_page(p)) {
+ if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
/*
* Check "filter hit" and "race with other subpage."
*/
@@ -1363,7 +1363,7 @@ int memory_failure(unsigned long pfn, int flags)
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
- if (!get_hwpoison_page(p)) {
+ if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
return 0;
@@ -1392,7 +1392,10 @@ int memory_failure(unsigned long pfn, int flags)
shake_page(p, 0);
/* shake_page could have turned it free. */
if (!PageLRU(p) && is_free_buddy_page(p)) {
- action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+ if (flags & MF_COUNT_INCREASED)
+ action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+ else
+ action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
return 0;
}
@@ -1540,7 +1543,7 @@ static void memory_failure_work_func(struct work_struct *work)
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
- soft_offline_page(entry.pfn);
+ soft_offline_page(entry.pfn, entry.flags);
else
memory_failure(entry.pfn, entry.flags);
}
@@ -1679,10 +1682,13 @@ EXPORT_SYMBOL(unpoison_memory);
* that is not free, and 1 for any other page type.
* For 1 the page is returned with increased page count, otherwise not.
*/
-static int __get_any_page(struct page *p, unsigned long pfn)
+static int __get_any_page(struct page *p, unsigned long pfn, int flags)
{
int ret;
+ if (flags & MF_COUNT_INCREASED)
+ return 1;
+
/*
* When the target page is a free hugepage, just remove it
* from free hugepage list.
@@ -1709,12 +1715,12 @@ static int __get_any_page(struct page *p, unsigned long pfn)
return ret;
}
-static int get_any_page(struct page *page, unsigned long pfn)
+static int get_any_page(struct page *page, unsigned long pfn, int flags)
{
- int ret = __get_any_page(page, pfn);
+ int ret = __get_any_page(page, pfn, flags);
if (ret == -EBUSY)
- ret = __get_any_page(page, pfn);
+ ret = __get_any_page(page, pfn, flags);
if (ret == 1 && !PageHuge(page) &&
!PageLRU(page) && !__PageMovable(page)) {
@@ -1727,7 +1733,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
/*
* Did it turn free?
*/
- ret = __get_any_page(page, pfn);
+ ret = __get_any_page(page, pfn, 0);
if (ret == 1 && !PageLRU(page)) {
/* Drop page reference which is from __get_any_page() */
put_page(page);
@@ -1869,6 +1875,7 @@ static int soft_offline_free_page(struct page *page)
/**
* soft_offline_page - Soft offline a page.
* @pfn: pfn to soft-offline
+ * @flags: flags. Same as memory_failure().
*
* Returns 0 on success, otherwise negated errno.
*
@@ -1887,7 +1894,7 @@ static int soft_offline_free_page(struct page *page)
* This is not a 100% solution for all memory, but tries to be
* ``good enough'' for the majority of memory.
*/
-int soft_offline_page(unsigned long pfn)
+int soft_offline_page(unsigned long pfn, int flags)
{
int ret;
struct page *page;
@@ -1901,11 +1908,13 @@ int soft_offline_page(unsigned long pfn)
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
+ if (flags & MF_COUNT_INCREASED)
+ put_page(page);
return 0;
}
get_online_mems();
- ret = get_any_page(page, pfn);
+ ret = get_any_page(page, pfn, flags);
put_online_mems();
if (ret > 0)
--
2.25.1
Powered by blists - more mailing lists