Message-ID: <20200615061951.GA26108@hori.linux.bs1.fc.nec.co.jp>
Date: Mon, 15 Jun 2020 06:19:53 +0000
From: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@....com>
To: Dmitry Yakunin <zeil@...dex-team.ru>
CC: "osalvador@...e.de" <osalvador@...e.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"mhocko@...nel.org" <mhocko@...nel.org>,
"mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
"n-horiguchi@...jp.nec.com" <n-horiguchi@...jp.nec.com>,
"max7255@...dex-team.ru" <max7255@...dex-team.ru>
Subject: Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline
Hi Dmitry,
On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
>
> We are facing similar problems with hwpoisoned pages
> on one of our production clusters after a kernel update to stable 4.19.
> An application that does a lot of memory allocations sometimes receives SIGBUS,
> with a message in dmesg about a hardware memory corruption fault.
> In the kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. Those events always happened before the application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
>
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check whether a page is poisoned while
> processing free pages, so as a result valid userspace data gets migrated to bad pages.
> We wrote a simple test:
> - soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL
> by writing the pfn to /sys/devices/system/memory/soft_offline_page
> - force compaction with echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series, after these steps bash becomes unusable
> and every attempt to run any command leads to SIGBUS with a message about
> a hardware memory corruption fault. After applying this series to our kernel
> tree, we can no longer reproduce such SIGBUSes with our test. On upstream kernel 5.7
> this behavior is still reproducible.
>
> So we want to know: why wasn't this patchset merged upstream?
> Are there any problems with this rework of {soft,hard}-offline handling?
No technical reason, it's just because I didn't have enough power to push
this to be merged. Really sorry about that.
> BTW, this patchset should be updated with upstream changes in mm.
I'm working on this now and still need more testing to confirm, but I hope
to update and post this for 5.9.
Thanks,
Naoya Horiguchi
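
For readers who want to try the test quoted above, the two steps could be
sketched roughly as follows. This is only an illustration, not the script
Dmitry's team used. The helper names (pfns_to_offline, soft_offline,
force_compaction, run_test) are hypothetical, and the ZONE_NORMAL pfn range
is assumed to be supplied by the caller (for example, read by hand from
/proc/zoneinfo). The kernel ABI documentation describes soft_offline_page as
taking a physical address, so the sketch multiplies the pfn by the page size
before writing; that conversion is an assumption about what the original
test did. Running run_test() needs root and really does offline pages, so it
belongs on a disposable test machine only.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
COMPACT_MEMORY = "/proc/sys/vm/compact_memory"


def pfns_to_offline(start_pfn, end_pfn, group=64, count=4):
    """Return the first `count` pfns of every `group`-page run in
    [start_pfn, end_pfn). Runs are counted from start_pfn (an assumption;
    the original message does not say how the runs were aligned)."""
    pfns = []
    for base in range(start_pfn, end_pfn, group):
        pfns.extend(range(base, min(base + count, end_pfn)))
    return pfns


def soft_offline(pfn):
    """Ask the kernel to soft-offline the page with this pfn."""
    with open(SOFT_OFFLINE, "w") as f:
        # The sysfs file expects a physical address, written here in hex.
        f.write(hex(pfn * PAGE_SIZE))


def force_compaction():
    """Equivalent of: echo 1 > /proc/sys/vm/compact_memory"""
    with open(COMPACT_MEMORY, "w") as f:
        f.write("1")


def run_test(start_pfn, end_pfn):
    """Destructive: soft-offline the selected pages, then compact."""
    for pfn in pfns_to_offline(start_pfn, end_pfn):
        try:
            soft_offline(pfn)
        except OSError:
            # Some pages cannot be offlined (reserved, busy, ...); skip them.
            pass
    force_compaction()
```

Only pfns_to_offline() is safe to call on a production machine; for example,
pfns_to_offline(0, 128) selects pfns 0-3 and 64-67.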