Message-ID: <b387a3c6-1bf9-434f-a255-6e92269e6ba5@suse.cz>
Date: Mon, 9 Jun 2025 11:57:48 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: John Hubbard <jhubbard@...dia.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: David Hildenbrand <david@...hat.com>, Michal Hocko <mhocko@...e.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs
On 6/7/25 8:00 PM, John Hubbard wrote:
> On 6/7/25 6:53 AM, Lorenzo Stoakes wrote:
>>
>> Well that is simpler :)
>>
>> I have encountered situations where I've had more than one and needed
>> 2nd+
>> but it is rare as you say.
>>
>> My late night incoherent babbling yesterday was perhaps because I
>> misunderstood David/John as to what they encountered in the past... maybe
>> they can clarify...
>
> I've debugged lots of production systems, often these were large HPC
> clusters and supercomputers. I've seen:
>
> a) Long up-times, with (of course!) relatively small dmesg buffer sizes,
> so that early logs are long gone. This means that WARN_ON_ONCE() is
> quite often gone (overwritten). This is common.
Is there no userspace logger, e.g. journald, storing them permanently? I
think trying too hard in the kernel to provide this "recall the first
warning" capability, when userspace can handle it, is suboptimal. I think
there are two main scenarios:
- the warning is indeed not fatal - userspace can likely save it
- it's (part of) something fatal - the system will crash before it
disappears from the ring buffer
> The worst part is that if you go to reproduce a problem, you don't
> see the next warning in the logs!! This is devastating, especially if
> the site makes it hard to ask for a system reboot. (If you have
> ~20,000 nodes in the cluster, a reboot is not a small affair.)
Assuming you know how to reproduce the problem... I wonder if it would
help if there was a way (sysctl?) to re-arm all the _ONCE warnings. It
shouldn't be that hard hopefully?
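
To make the re-arm idea concrete, here is a minimal userspace sketch,
assuming every _ONCE call site keeps a "fired" flag in a table that one
reset routine can walk. The names once_site, warn_once() and rearm_all()
are hypothetical and only for illustration; they are not kernel APIs, and
this is not how the kernel macros are implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of a _ONCE call site: a message plus a "fired" flag. */
struct once_site {
	const char *msg;
	bool fired;
};

/* Illustrative table of call sites; the messages are made up. */
static struct once_site sites[] = {
	{ "example warning A", false },
	{ "example warning B", false },
};

#define NR_SITES (sizeof(sites) / sizeof(sites[0]))

/* Emit the warning only the first time; return true if it was printed. */
static bool warn_once(struct once_site *s)
{
	if (s->fired)
		return false;
	s->fired = true;
	fprintf(stderr, "WARNING: %s\n", s->msg);
	return true;
}

/* Re-arm every site so the next hit is reported again. */
static void rearm_all(void)
{
	for (size_t i = 0; i < NR_SITES; i++)
		sites[i].fired = false;
}
```

With something like this, an admin could call the reset right before a
reproduction attempt and the next hit would show up in the log again,
without a reboot.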