Message-ID: <20230112234907.GT4028633@paulmck-ThinkPad-P17-Gen-1>
Date:   Thu, 12 Jan 2023 15:49:07 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Jonas Oberhauser <jonas.oberhauser@...wei.com>
Cc:     "riel@...riel.com" <riel@...riel.com>,
        "davej@...emonkey.org.uk" <davej@...emonkey.org.uk>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop
 wait loop time

On Thu, Jan 12, 2023 at 08:51:04PM +0000, Jonas Oberhauser wrote:
> Hi Paul,
> 
> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck@...nel.org] 
> > We see systems stuck in the queued_spin_lock_slowpath() loop that waits for the lock to become unlocked in the case where the current CPU has set pending state.
> 
> Interesting!
> Do you know if the hangs started with a recent patch? What codepaths are active (virtualization/arch/...)? Does it happen extremely rarely? Do you have any additional information?

As best we can tell right now, we see it about three times per day per
million systems, on x86 systems running v5.12 plus backports.  It is
entirely possible that it is a hardware/firmware problem, but normally
that would cause the failure to cluster on a specific piece of hardware
or specific type of hardware, and we are not seeing that.

But we are in very early days investigating this.  In particular,
everything in the previous paragraph is subject to change.  For example,
we have not yet eliminated the possibility that the lockword is being
corrupted by unrelated kernel software, which is part of the motivation
for the patch in my earlier email.

> I saw a similar situation a few years ago in a proprietary kernel, but it only happened once ever and I gave up on looking for the reason after a few days (including some time combing through the compiler generated assembler).

If it makes you feel better, yesterday I was sure that I had found the
bug by inspection.  But no, just confusion on my part!  ;-)

But thank you very much for the possible corroborating information.
You never know!

							Thanx, Paul
