linux-kernel - Re: [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop wait loop time


Message-ID: <20230112234907.GT4028633@paulmck-ThinkPad-P17-Gen-1>
Date:   Thu, 12 Jan 2023 15:49:07 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Jonas Oberhauser <jonas.oberhauser@...wei.com>
Cc:     "riel@...riel.com" <riel@...riel.com>,
        "davej@...emonkey.org.uk" <davej@...emonkey.org.uk>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop
 wait loop time

On Thu, Jan 12, 2023 at 08:51:04PM +0000, Jonas Oberhauser wrote:
> Hi Paul,
> 
> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck@...nel.org] 
> > We see systems stuck in the queued_spin_lock_slowpath() loop that waits for the lock to become unlocked in the case where the current CPU has set pending state.
> 
> Interesting!
> Do you know if the hangs started with a recent patch? What codepaths are active (virtualization/arch/...)? Does it happen extremely rarely? Do you have any additional information?

As best we can tell right now, we see it about three times per day per
million systems on x86 systems running v5.12 plus backports.  It is
entirely possible that it is a hardware/firmware problem, but normally
that would cause the failure to cluster on a specific piece of hardware
or specific type of hardware, and we are not seeing that.

But we are in very early days investigating this.  In particular,
everything in the previous paragraph is subject to change.  For example,
we have not yet eliminated the possibility that the lockword is being
corrupted by unrelated kernel software, which is part of the motivation
for the patch in my earlier email.
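
For readers following along, here is a minimal userspace sketch of the
general idea: a lock-drop wait loop that gives up after a bound and dumps
the lockword fields.  This is not the patch from the earlier email.  The
helper name wait_for_unlock_diag(), the spin-count bound (rather than a
time bound), and the use of C11 atomics are all assumptions made for
illustration.  The field masks follow the qspinlock layout for
NR_CPUS < 16K (locked byte in bits 0-7, pending byte in bits 8-15, tail
in bits 16-31).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Field masks, assuming the qspinlock layout for NR_CPUS < 16K. */
#define Q_LOCKED_MASK  0x000000ffu  /* bits 0-7: locked byte */
#define Q_PENDING_MASK 0x0000ff00u  /* bits 8-15: pending byte */
#define Q_TAIL_MASK    0xffff0000u  /* bits 16-31: tail index and CPU */

/*
 * Hypothetical helper: spin until the locked byte clears, or give up
 * after max_spins iterations and print the last lockword value seen.
 * Returns true if the lock was observed unlocked, false on giving up.
 * The last observed lockword is stored through last_seen either way.
 */
static bool wait_for_unlock_diag(_Atomic uint32_t *lockword,
                                 uint64_t max_spins, uint32_t *last_seen)
{
    for (uint64_t i = 0; i < max_spins; i++) {
        uint32_t val = atomic_load_explicit(lockword, memory_order_acquire);

        *last_seen = val;
        if (!(val & Q_LOCKED_MASK))
            return true;  /* Holder dropped the lock. */
    }

    /* Bound exceeded: decode the fields to help spot a corrupted word. */
    fprintf(stderr,
            "lock-drop wait exceeded %llu spins: locked=%#x pending=%#x tail=%#x\n",
            (unsigned long long)max_spins,
            (unsigned)(*last_seen & Q_LOCKED_MASK),
            (unsigned)((*last_seen & Q_PENDING_MASK) >> 8),
            (unsigned)((*last_seen & Q_TAIL_MASK) >> 16));
    return false;
}
```

A lockword value that is implausible for the workload (for example, a
tail encoding a CPU that does not exist) would point towards corruption
by unrelated code rather than a lost unlock, which is the distinction
the diagnostics are meant to help draw.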

> I saw a similar situation a few years ago in a proprietary kernel, but it only happened once ever and I gave up on looking for the reason after a few days (including some time combing through the compiler generated assembler).

If it makes you feel better, yesterday I was sure that I had found the
bug by inspection.  But no, just confusion on my part!  ;-)

But thank you very much for the possible corroborating information.
You never know!

							Thanx, Paul