Message-ID: <b1673cd8-dd6d-8b50-6c5a-c715f368f12d@redhat.com>
Date: Thu, 8 Sep 2022 08:45:38 +0200
From: Renaud Métrich <rmetrich@...hat.com>
To: Luis Chamberlain <mcgrof@...nel.org>,
Oleksandr Natalenko <oleksandr@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Jonathan Corbet <corbet@....net>,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Huang Ying <ying.huang@...el.com>,
"Jason A . Donenfeld" <Jason@...c4.com>,
Will Deacon <will@...nel.org>,
"Guilherme G . Piccoli" <gpiccoli@...lia.com>,
Laurent Dufour <ldufour@...ux.ibm.com>,
Stephen Kitt <steve@....org>, Rob Herring <robh@...nel.org>,
Joel Savitz <jsavitz@...hat.com>,
"Eric W . Biederman" <ebiederm@...ssion.com>,
Kees Cook <keescook@...omium.org>,
Xiaoming Ni <nixiaoming@...wei.com>,
Oleg Nesterov <oleg@...hat.com>,
Grzegorz Halat <ghalat@...hat.com>, Qi Guo <qguo@...hat.com>
Subject: Re: [PATCH] core_pattern: add CPU specifier
Hello,
I have been working closely with Oleksandr on a couple of cases where
customers could see segfaults for various processes, including basic
tools ("grep", "cut", etc.) that usually don't die.
The coredumps of course showed nothing useful: from userland's
perspective nothing was wrong except for a bad pointer that couldn't be
explained.
Memory testing (e.g. Memtest86+) and CPU testing (usually with tools
from the hardware vendor) never showed any hardware issue either, even
though there was one, probably because triggering it required special
conditions, such as a specific load and/or thermal conditions.
Troubleshooting such cases takes several weeks or even months, until we
have enough evidence that the OS is not at fault, and it is always a
struggle.
Usually, once we start getting kernel crashes, we are happy, because a
kernel crash indicates the CPU the task was running on, and that
information seems to always be reliable enough to point to a faulty CPU.
The other cases, where no kernel crash could be observed, are solved
only after asking the customer to replace hardware components, which is
difficult to justify since it usually costs the customer money and time.
I hope this feature will be helpful for everybody doing Linux support.
Renaud.
On 9/7/22 at 17:53, Luis Chamberlain wrote:
> On Sat, Sep 03, 2022 at 08:43:30AM +0200, Oleksandr Natalenko wrote:
>> Statistically, in a large deployment regular segfaults may indicate a CPU issue.
> Can you elaborate on this? How common is this observed to be true? Are
> there any public findings or bugs where it showed this?
>
> Luis
>