linux-kernel - Re: [PATCH v1 1/3] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d1ceedca-b28e-c47e-aa0d-aa1cb36d12b9@redhat.com>
Date:   Thu, 22 Sep 2022 16:12:05 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Kalle Valo <kvalo@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-doc@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>,
        David Laight <David.Laight@...LAB.COM>,
        Jonathan Corbet <corbet@....net>,
        Andy Whitcroft <apw@...onical.com>,
        Joe Perches <joe@...ches.com>,
        Dwaipayan Ray <dwaipayanray1@...il.com>,
        Lukas Bulwahn <lukas.bulwahn@...il.com>,
        Baoquan He <bhe@...hat.com>, Vivek Goyal <vgoyal@...hat.com>,
        Dave Young <dyoung@...hat.com>,
        Jani Nikula <jani.nikula@...ux.intel.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>,
        Christophe Leroy <christophe.leroy@...roup.eu>
Subject: Re: [PATCH v1 1/3] coding-style.rst: document BUG() and WARN() rules
 ("do not crash the kernel")

On 21.09.22 06:40, Kalle Valo wrote:
> David Hildenbrand <david@...hat.com> writes:
> 
>> Linus notes [1] that the introduction of new code that uses VM_BUG_ON()
>> is just as bad as BUG_ON(), because it will crash the kernel on
>> distributions that enable CONFIG_DEBUG_VM (like Fedora):
>>
>>      VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>>      no different, the only difference is "we can make the code smaller
>>      because these are less important". [2]
>>
>> This resulted in a more generic discussion about usage of BUG() and
>> friends. While there might be corner cases that still deserve a BUG_ON(),
>> most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a
>> recovery path if reasonable:
>>
>>      The only possible case where BUG_ON can validly be used is "I have
>>      some fundamental data corruption and cannot possibly return an
>>      error". [2]
>>
>> As a very good approximation is the general rule:
>>
>>      "absolutely no new BUG_ON() calls _ever_" [2]
>>
>> ... not even if something really shouldn't ever happen and is merely for
>> documenting that an invariant always has to hold. However, there are sill
>> exceptions where BUG_ON() may be used:
>>
>>      If you have a "this is major internal corruption, there's no way we can
>>      continue", then BUG_ON() is appropriate. [3]
>>
>> There is only one good BUG_ON():
>>
>>      Now, that said, there is one very valid sub-form of BUG_ON():
>>      BUILD_BUG_ON() is absolutely 100% fine. [2]
>>
>> While WARN will also crash the machine with panic_on_warn set, that's
>> exactly to be expected:
>>
>>      So we have two very different cases: the "virtual machine with good
>>      logging where a dead machine is fine" - use 'panic_on_warn'. And
>>      the actual real hardware with real drivers, running real loads by
>>      users. [4]
>>
>> The basic idea is that warnings will similarly get reported by users
>> and be found during testing. However, in contrast to a BUG(), there is a
>> way to actually influence the expected behavior (e.g., panic_on_warn)
>> and to eventually keep the machine alive to extract some debug info.
>>
>> Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever
>> expect this code to trigger in any case, recovery code is not really
>> helpful.
>>
>>      I'd prefer to keep all these warnings 'simple' - i.e. no attempted
>>      recovery & control flow, unless we ever expect these to trigger.
>>      [5]
>>
>> There have been different rules floating around that were never properly
>> documented. Let's try to clarify.
>>
>> [1] https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=NKPWHU8AdO3V56BXcCsU97oYJ1EA@mail.gmail.com
>> [2] https://lore.kernel.org/r/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@mail.gmail.com
>> [2] https://lkml.kernel.org/r/CAHk-=wit-DmhMfQErY29JSPjFgebx_Ld+pnerc4J2Ag990WwAA@mail.gmail.com
>> [4] https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@mail.gmail.com
>> [5] https://lore.kernel.org/r/YwIW+mVeZoTOxn%2F4@gmail.com
>>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
> 
> [...]
> 
>> +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
>> +**************************************************
>> +
>> +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it
>> +is common for a given warning condition, if it occurs at all, to occur
>> +multiple times. This can fill up and wrap the kernel log, and can even slow
>> +the system enough that the excessive logging turns into its own, additional
>> +problem.
> 
> FWIW I have had cases where WARN() messages caused a reboot, maybe
> mention that here? In my case the logging was so excessive that the
> watchdog wasn't updated and in the end the device was forcefully
> rebooted.
> 

That should be covered by the last part, no? What would be your suggestion?

-- 
Thanks,

David / dhildenb