linux-kernel - Re: [PATCH v1] mm/gup: remove (VM_)BUG

Message-ID: <b387a3c6-1bf9-434f-a255-6e92269e6ba5@suse.cz>
Date: Mon, 9 Jun 2025 11:57:48 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: John Hubbard <jhubbard@...dia.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: David Hildenbrand <david@...hat.com>, Michal Hocko <mhocko@...e.com>,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Mike Rapoport
 <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs

On 6/7/25 8:00 PM, John Hubbard wrote:
> On 6/7/25 6:53 AM, Lorenzo Stoakes wrote:
>>
>> Well that is simpler :)
>>
>> I have encountered situations where I've had more than one and needed
>> 2nd+ but it is rare as you say.
>>
>> My late night incoherent babbling yesterday was perhaps because I
>> misunderstood David/John as to what they encountered in the past... maybe
>> they can clarify...
> 
> I've debugged lots of production systems, often these were large HPC
> clusters and supercomputers. I've seen:
> 
> a) Long up-times, with (of course!) relatively small dmesg buffer sizes,
> so that early logs are long gone. This means that WARN_ON_ONCE() is
> quite often gone (overwritten). This is common.

Is there nothing like journald storing them persistently? I think trying
too hard in the kernel to provide this "recall the first warning" ability
is suboptimal if userspace can handle it. I think there are two main
scenarios:

- the warning is indeed not fatal - userspace can likely save it
- it's (part of) something fatal - the system will crash before it
disappears from the ring buffer

> The worst part is that if you go to reproduce a problem, you don't
> see the next warning in the logs!! This is devastating, especially if
> the site makes it hard to ask for a system reboot. (If you have
> ~20,000 nodes in the cluster, a reboot is not a small affair.)

Assuming you know how to reproduce the problem... I wonder if it would
help if there was a way (sysctl?) to re-arm all the _ONCE warnings. It
shouldn't be that hard hopefully?
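
To illustrate the re-arm idea, here is a minimal userspace sketch, not
kernel code. It assumes each call site owns a static "already warned"
flag and that a global list lets a single reset clear all of them. Every
name below (once_site, WARN_ON_ONCE_SKETCH, rearm_once_warnings,
nr_reports) is hypothetical and made up for the example; a real kernel
implementation would likely look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct once_site {
	bool warned;            /* has this call site already reported? */
	bool linked;            /* is it already on the global list? */
	struct once_site *next;
};

static struct once_site *once_list; /* every site that has fired so far */
static unsigned int nr_reports;     /* number of warnings printed */

/* Print a warning the first time cond is true for this site. */
static bool warn_on_once_impl(struct once_site *site, bool cond,
			      const char *what)
{
	assert(site != NULL);

	if (!cond || site->warned)
		return cond;

	site->warned = true;
	if (!site->linked) {
		/* Remember the site so a later re-arm can find it. */
		site->linked = true;
		site->next = once_list;
		once_list = site;
	}

	nr_reports++;
	fprintf(stderr, "WARNING: %s\n", what);
	return true;
}

/* Re-arm every site that has fired, so the next hit is reported again. */
static void rearm_once_warnings(void)
{
	for (struct once_site *s = once_list; s; s = s->next)
		s->warned = false;
}

/* One static flag per call site, like a _ONCE style warning. */
#define WARN_ON_ONCE_SKETCH(cond)                                  \
	do {                                                       \
		static struct once_site site_;                     \
		warn_on_once_impl(&site_, (cond), #cond);          \
	} while (0)

/* Example call site: complains about a negative reference count. */
static void buggy_path(int refs)
{
	WARN_ON_ONCE_SKETCH(refs < 0);
}
```

In this sketch, calling buggy_path(-1) twice reports only once. After
rearm_once_warnings(), the next buggy_path(-1) reports again, which is
the behaviour you would want right before trying to reproduce a problem
on a long-running system.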