linux-kernel - Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline

Message-ID: <20200615061951.GA26108@hori.linux.bs1.fc.nec.co.jp>
Date:   Mon, 15 Jun 2020 06:19:53 +0000
From:   HORIGUCHI NAOYA(堀口　直也) 
        <naoya.horiguchi@....com>
To:     Dmitry Yakunin <zeil@...dex-team.ru>
CC:     "osalvador@...e.de" <osalvador@...e.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "mhocko@...nel.org" <mhocko@...nel.org>,
        "mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
        "n-horiguchi@...jp.nec.com" <n-horiguchi@...jp.nec.com>,
        "max7255@...dex-team.ru" <max7255@...dex-team.ru>
Subject: Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline

Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
> 
> We are facing similar problems with hwpoisoned pages
> on one of our production clusters after a kernel update to stable 4.19.
> An application that does a lot of memory allocation sometimes receives SIGBUS,
> with a message in dmesg about a hardware memory corruption fault.
> In the kernel and mce logs we saw messages about soft offlining of pages with
> correctable errors. Those events always happened before the application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
> 
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check whether a page is poisoned while
> processing free pages, so as a result valid userspace data gets migrated to
> bad pages.
> We wrote a simple test:
>   - soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL
>     by writing the pfn to /sys/devices/system/memory/soft_offline_page
>   - force compaction with echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series, after these steps bash became unusable
> and every attempt to run any command led to SIGBUS with a message about a
> hardware memory corruption fault. After applying this series to our kernel
> tree we can no longer reproduce such SIGBUSes with our test. On upstream
> kernel 5.7 this behavior is still reproducible.
> 
> So, we want to know why this patchset wasn't merged upstream.
> Are there any problems with this rework of {soft,hard}-offline handling?

There is no technical reason; I simply didn't have the capacity to push for
this to be merged. I'm really sorry about that.

> BTW, this patchset should be updated with upstream changes in mm.

I'm working on this now and still need more testing to confirm, but I hope
to update and post it for 5.9.

Thanks,
Naoya Horiguchi
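
The reproducer described in the quoted message (soft offline the first 4 pages
of every 64, then force compaction) could be sketched roughly as follows. This
is an illustration, not the script Dmitry used. The function names, the 4 KiB
page size, and the example PFN range are assumptions; a real ZONE_NORMAL range
would have to be read from the running system. The sketch writes physical
addresses (pfn * page size), since the sysfs ABI documentation describes
soft_offline_page as taking a physical address, and by default it only prints
the planned writes.

```python
from typing import Iterator, List, Tuple

PAGE_SIZE = 4096  # assumption: 4 KiB base pages
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"
COMPACT_PATH = "/proc/sys/vm/compact_memory"


def pfns_to_offline(start_pfn: int, end_pfn: int,
                    group: int = 64, count: int = 4) -> Iterator[int]:
    """Yield the first `count` PFNs of every `group` pages in [start_pfn, end_pfn).

    Groups are counted from start_pfn (an assumption; the message does not
    say how the 64-page groups were aligned).
    """
    for base in range(start_pfn, end_pfn, group):
        for pfn in range(base, min(base + count, end_pfn)):
            yield pfn


def plan_writes(start_pfn: int, end_pfn: int) -> List[Tuple[str, str]]:
    """Return (path, value) pairs: soft-offline requests, then one compaction trigger."""
    writes = [(SOFT_OFFLINE_PATH, hex(pfn * PAGE_SIZE))
              for pfn in pfns_to_offline(start_pfn, end_pfn)]
    writes.append((COMPACT_PATH, "1"))
    return writes


def apply_writes(writes: List[Tuple[str, str]]) -> None:
    """Perform the writes; needs root and a kernel with memory-failure support."""
    for path, value in writes:
        try:
            with open(path, "w") as f:
                f.write(value)
        except OSError as err:
            # Individual pages may refuse to go offline; report and continue.
            print(f"write {value} to {path} failed: {err}")


if __name__ == "__main__":
    # Dry run over a hypothetical 128-page range: print the plan, write nothing.
    for path, value in plan_writes(0x100000, 0x100080):
        print(f"echo {value} > {path}")
```

Calling apply_writes(plan_writes(start, end)) as root would actually take the
pages offline, which is not reversible without a reboot, so this is only
suitable for a disposable test machine.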