lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20200611164319.16860-1-zeil@yandex-team.ru>
Date:   Thu, 11 Jun 2020 19:43:19 +0300
From:   Dmitry Yakunin <zeil@...dex-team.ru>
To:     osalvador@...e.de
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        mhocko@...nel.org, mike.kravetz@...cle.com,
        n-horiguchi@...jp.nec.com, max7255@...dex-team.ru
Subject: Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline

Hello!

We are faced with similar problems with hwpoisoned pages
on one of our production clusters after kernel update to stable 4.19.
Application that does a lot of memory allocations sometimes caught SIGBUS signal
with message in dmesg about hardware memory corruption fault.
In kernel and mce logs we saw messages about soft offlining pages with
correctable errors. Those events always had happened before application
was killed. This is not the behavior we expect. We want our application to
continue working on a smaller set of available pages in the system.

This issue is difficult to reproduce, but we suppose that the reason for such
behavior is that compaction does not check for page poisonness while processing
free pages, so as a result valid userspace data gets migrated to bad pages.
We wrote the simple test:
  - soft offline first 4 pages in every 64 continuous pages in ZONE_NORMAL
    through writing pfn to /sys/devices/system/memory/soft_offline_page
  - force compaction by echo 1 >> /proc/sys/vm/compact_memory
Without this patch series after these steps bash became unusable
and every attempt to run any command leads to SIGBUS with message about
hardware memory corruption fault. And after applying this series to our kernel
tree we cannot reproduce such SIGBUSes by our test. On upstream kernel 5.7
this behavior is still reproducible.

So, we want to know, why this patchset wasn't merged to the upstream?
Is there any problems in such rework for {soft,hard}-offline handling?
BTW, this patchset should be updated with upstream changes in mm.

Thanks for you replies.

--
Dmitry Yakunin

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ