Message-ID: <b1673cd8-dd6d-8b50-6c5a-c715f368f12d@redhat.com>
Date:   Thu, 8 Sep 2022 08:45:38 +0200
From:   Renaud Métrich <rmetrich@...hat.com>
To:     Luis Chamberlain <mcgrof@...nel.org>,
        Oleksandr Natalenko <oleksandr@...hat.com>
Cc:     linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, Jonathan Corbet <corbet@....net>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Huang Ying <ying.huang@...el.com>,
        "Jason A . Donenfeld" <Jason@...c4.com>,
        Will Deacon <will@...nel.org>,
        "Guilherme G . Piccoli" <gpiccoli@...lia.com>,
        Laurent Dufour <ldufour@...ux.ibm.com>,
        Stephen Kitt <steve@....org>, Rob Herring <robh@...nel.org>,
        Joel Savitz <jsavitz@...hat.com>,
        "Eric W . Biederman" <ebiederm@...ssion.com>,
        Kees Cook <keescook@...omium.org>,
        Xiaoming Ni <nixiaoming@...wei.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Grzegorz Halat <ghalat@...hat.com>, Qi Guo <qguo@...hat.com>
Subject: Re: [PATCH] core_pattern: add CPU specifier

Hello,

I have been working closely with Oleksandr on a couple of cases where
customers were seeing segfaults in various processes, including basic
tools ("grep", "cut", etc.) that usually don't die.

The coredumps of course showed nothing useful: from userland's
perspective nothing was wrong apart from a bad pointer which couldn't
be explained.

Memory testing (e.g. Memtest86+) and CPU testing (usually provided by
the hardware vendor) never showed any hardware issue either, even
though there was one, probably because triggering it required special
conditions, such as a specific load and/or thermal conditions.

Troubleshooting such cases takes several weeks or even months, until we
have enough evidence that the OS is not at fault, and it is always a
struggle.

Usually, once we start getting kernel crashes, we are happy, because a
kernel crash indicates the CPU the task was running on, and that has
always seemed to be reliable enough information to point to a faulty
CPU. The other cases, where no kernel crash could be observed, get
solved only after asking the customer to replace hardware components,
which is difficult to explain since it usually costs the customer money
and time.

I hope such a feature will be helpful to everybody doing Linux support.

Renaud.

On 9/7/22 at 17:53, Luis Chamberlain wrote:
> On Sat, Sep 03, 2022 at 08:43:30AM +0200, Oleksandr Natalenko wrote:
>> Statistically, in a large deployment regular segfaults may indicate a CPU issue.
> Can you elaborate on this? How common is this observed to be true? Are
> there any public findings or bugs where it showed this?
>
>    Luis
>
