Message-ID: <CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com>
Date: Wed, 4 Jun 2025 08:42:30 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, Jason Gunthorpe <jgg@...pe.ca>, John Hubbard <jhubbard@...dia.com>, 
	Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs

On Wed, 4 Jun 2025 at 07:49, Lorenzo Stoakes <lorenzo.stoakes@...cle.com> wrote:
>
> Linus's point of view is that we shouldn't use them _at all_ right? So
> maybe even this situation isn't one where we'd want to use one?

So I think BUG_ON() basically grew from

 (a) laziness. Not a good reason.

 (b) we historically had a model where we'd just kill processes on
fatal errors, particularly page faults

That (b) in particular *used* to work quite well for recovery - a
couple of decades ago.  A kernel bug would print out the
backtrace, and then kill the process (literally with do_exit()) and
try to run something else.

It was wonderfully useful in that you'd get an oops, and the system
would continue, but that *thread* wouldn't continue.

And decades ago, it worked quite well, because the system was much
simpler, and the likelihood that we held any critical locks was
generally pretty low.

But then SMP happened, and at first it wasn't a huge deal: we had one
special lock, and the exit path would just clean *that* lock up, and
life continued to be good.

But that was literally over two decades ago, and none of the above
actually ever used BUG_ON(). The page fault code would literally do

        die("Oops", regs, error_code);

on a fatal page fault. A "BUG_ON()" didn't even exist back then, and
die() looked like this:

        console_verbose();
        spin_lock_irq(&die_lock);
        printk("%s: %04lx\n", str, err & 0xffff);
        show_registers(regs);

        spin_unlock_irq(&die_lock);
        do_exit(SIGSEGV);

which tried to simply serialize the error output, and then kill the process.

When it worked, it worked quite well.

(And yes, page faults are very relevant, because this is what BUG
looked like back then:

    #define BUG() *(int *)0 = 0

so it all depended on that page fault printing out the state and exiting)

But as you can well imagine, it worked increasingly badly with
increasing complexity and locking depth.
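
To make the locking problem concrete, here is a minimal userspace sketch. It is
not kernel code: toy_lock, toy_die(), buggy_task() and run_task() are
hypothetical names invented for this illustration, and longjmp() merely stands
in for "abandon this task right here", roughly what die() followed by do_exit()
did. The point is only that whatever the task held at the time stays held.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical toy lock: just a flag, no real concurrency involved. */
struct toy_lock {
	bool held;
};

/* Where a "killed" task resumes: think "go run something else". */
static jmp_buf scheduler;

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	l->held = false;
}

/* Stand-in for the old die() + do_exit() path: report, then abandon the task. */
static void toy_die(const char *msg)
{
	fprintf(stderr, "Oops: %s\n", msg);
	longjmp(scheduler, 1);	/* never returns to the caller */
}

/* A task that may hit a "bug" while holding the lock. */
static void buggy_task(struct toy_lock *l, bool hit_bug)
{
	if (!toy_trylock(l))
		return;
	if (hit_bug)
		toy_die("inconsistent state");	/* the unlock below is skipped */
	toy_unlock(l);
}

/* Run one task; returns true if the task was "killed" part-way through. */
static bool run_task(struct toy_lock *l, bool hit_bug)
{
	if (setjmp(scheduler))
		return true;
	buggy_task(l, hit_bug);
	return false;
}
```

After run_task(&lock, true) the flag is still set, so every later
toy_trylock(&lock) fails. With one special lock, an exit path could know to
clean that up; with many nested locks there is no single place that can.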

When you come from that kind of "kill the process on errors" and you
then realize that you can't really do that any more, you end up with
BUG_ON().

The BUG_ON() thing was introduced in 2.5.0, and initially came from
debug code in the block layer rewrite.

And in that particular context, it actually made sense: this was new
code that changed the block elevator, and if that code got it wrong,
you were pretty much *guaranteed* disk corruption.

But then it became a pattern. And I think that pattern is basically never good.

I really think that the *ONLY* situation where BUG() is valid is when
you absolutely *know* that corruption will happen, and you cannot
continue.

Very much *not* some kind of "this is problematic, and who knows what
corruption it might cause".  But "I *know* I can't continue without
major system corruption because the hardware is broken sh*t".

In other words, don't use it. Ever. Unless you can explain exactly why
without any handwaving.

Cloud providers or others can do "panic-on-warn" if they want to stop
the machine at the first sign of trouble.
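
A rough userspace sketch of the "warn and carry on" shape this argues for, as
opposed to dying on the spot. Everything here is an assumption made for
illustration: TOY_WARN_ON(), toy_warn(), toy_warn_count, toy_panic_on_warn and
toy_pin_pages() are hypothetical names, not kernel interfaces. The real
kernel macros (WARN_ON() and friends) likewise evaluate to the condition so the
caller can return an error, and stopping on a warning is a policy the admin
opts into.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical policy knob, off by default: stop at the first warning. */
static bool toy_panic_on_warn;

/* Number of warnings reported so far. */
static int toy_warn_count;

/* Report a violated assumption and hand the condition back to the caller. */
static bool toy_warn(bool cond, const char *expr, const char *func)
{
	if (cond) {
		toy_warn_count++;
		fprintf(stderr, "WARNING: %s in %s()\n", expr, func);
		if (toy_panic_on_warn)
			abort();	/* only if the admin asked for it */
	}
	return cond;
}

#define TOY_WARN_ON(cond) toy_warn(!!(cond), #cond, __func__)

/*
 * Hypothetical helper: copy nr page numbers from pages[] to out[].
 * Bad arguments are a caller bug, but a recoverable one: warn, return an
 * error, and let the rest of the system keep running.
 */
static int toy_pin_pages(const long *pages, int nr, long *out)
{
	if (TOY_WARN_ON(nr < 0 || !pages || !out))
		return -1;	/* kernel code would return a real errno here */

	for (int i = 0; i < nr; i++)
		out[i] = pages[i];
	return nr;
}
```

Calling toy_pin_pages(in, -1, out) prints one warning and returns -1; the
caller sees the failure and nothing else is taken down with it unless
toy_panic_on_warn has been set.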

                  Linus