lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8ce6c104-d2fb-4fed-a108-224a6113c227@nvidia.com>
Date: Sat, 7 Jun 2025 11:00:50 -0700
From: John Hubbard <jhubbard@...dia.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Jason Gunthorpe <jgg@...pe.ca>
Cc: David Hildenbrand <david@...hat.com>, Michal Hocko <mhocko@...e.com>,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs

On 6/7/25 6:53 AM, Lorenzo Stoakes wrote:
> On Sat, Jun 07, 2025 at 10:42:14AM -0300, Jason Gunthorpe wrote:
>> On Fri, Jun 06, 2025 at 08:03:15PM +0100, Lorenzo Stoakes wrote:
>>> On Fri, Jun 06, 2025 at 07:46:52PM +0100, Lorenzo Stoakes wrote:
>>>> On Fri, Jun 06, 2025 at 03:42:12PM -0300, Jason Gunthorpe wrote:
>>>>> On Fri, Jun 06, 2025 at 08:23:25PM +0200, David Hildenbrand wrote:
...
>>> And I guess we'd not want a new interface for this like WARN_ON_ONCE_STORED()
>>> because that'd be... weird and how would anyone think to use that and nearly all
>>> cases wouldn't.
>>
>> No! Delete WARN_ON_ONCE and say the new global mechanism is good
>> enough to capture the first WARN_ON, everyone always uses it always
>> and then nobody needs to think about this anymore when writing new
>> code.
Interesting. This would cause WARN_ON*() to act similarly to lockdep, in
that the first instance is the only one that works.

That works, but if-and-only-if the system is kept completely WARN-free.
However, I have a mitigation for that, below.

> 
> Well that is simpler :)
> 
> I have encountered situations where I've had more than one and needed 2nd+
> but it is rare as you say.
> 
> My late night incoherent babbling yesterday was perhaps because I
> misunderstood David/John as to what they encountered in the past... maybe
> they can clarify...

I've debugged lots of production systems, often these were large HPC 
clusters and supercomputers. I've seen:

a) Long up-times, with (of course!) relatively small dmesg buffer sizes,
so that early logs are long gone. This means that WARN_ON_ONCE() is
quite often gone (overwritten). This is common.

The worst part is that if you go to reproduce a problem, you don't
see the next warning in the logs!! This is devastating, especially if
the site makes it hard to ask for a system reboot. (If you have
~20,000 nodes in the cluster, a reboot is not a small affair.)

b) WARN_ON() loops that fill up the dmesg buffer immediately, which
of course also overwrites anything leading up to the first WARN_ON.

However, (b) is rare on production kernels--it's more commonly
self-inflicted, if I make a mistake and write WARN_ON() when I should
have written WARN_ON_ONCE(), in a custom build to explore a problem.

c) Other printk-based noise in a loop, occasionally this also wraps
the dmesg buffer. Especially if a customer has done a bit of their
own "special" logging code.


> 
> I do find myself grepping dmesg to find the first warning and it's
> _annoying_ to do so, so this would be handy.
> 
> But I'm not sure it'd justify getting rid of WARN_ON_ONCE() when you are in
> a loop or something and now your dmesg is going to go to hell, still useful
> not to do that, esp. if you know there's no value to further warnings

In all cases, the ideal outcome is a dmesg buffer that includes (let's
fantasize for a moment):

1) the earliest dmesg output (showing any oddness with system
configuration, and what hardware was brought online, etc), plus

2) the first problem that is logged, plus

3) ...sometimes two or even three things happen in order to get to the
problem, so (again, fantasizing!) really the first *three* warnings
would be good to have.

So this is just a 3-element array of global warnings--not another
circular buffer, but a little larger than a TAINT or lockdep type
of capture.

thanks,
-- 
John Hubbard


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ