linux-kernel - Re: [PATCH RFC] panic: call panic handlers before panic_other_cpus

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20250302063638.7096-1-ryotkkr98@gmail.com>
Date: Sun,  2 Mar 2025 15:36:38 +0900
From: Ryo Takakura <ryotkkr98@...il.com>
To: pmladek@...e.com
Cc: akpm@...ux-foundation.org,
	bhe@...hat.com,
	decui@...rosoft.com,
	gregkh@...uxfoundation.org,
	haiyangz@...rosoft.com,
	hamzamahfooz@...ux.microsoft.com,
	jani.nikula@...el.com,
	jfalempe@...hat.com,
	joel.granados@...nel.org,
	john.ogness@...utronix.de,
	linux-hyperv@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	ryotkkr98@...il.com,
	wei.liu@...nel.org
Subject: Re: [PATCH RFC] panic: call panic handlers before panic_other_cpus_shutdown()

Hi Petr!

On Wed, 26 Feb 2025 16:49:26 +0100, Petr Mladek wrote:
>On Sat 2025-02-22 14:44:05, Ryo Takakura wrote:
>> On Fri, 21 Feb 2025 16:23:07 -0500, Hamza Mahfooz wrote:
>> >On Fri, Feb 21, 2025 at 11:23:28AM +0900, Ryo Takakura wrote:
>> >> On Thu, 20 Feb 2025 17:53:00 -0500, Hamza Mahfooz wrote:
>> >> >Since, the panic handlers may require certain cpus to be online to panic
>> >> >gracefully, we should call them before turning off SMP. Without this
>> >> >re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
>> >> >vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
>> >> >is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
>> >> >crash_smp_send_stop() before the vmbus channel can be deconstructed.
>> >> >
>> >> So maybe panic_other_cpus_shutdown() should be palced after 
>> >> atomic_notifier_call_chain() along with printk_legacy_allow_panic_sync()
>> >> like below?
>> >> 
>> >> ----- BEGIN -----
>> >> diff --git a/kernel/panic.c b/kernel/panic.c
>> >> index d8635d5cecb2..7ac40e85ee27 100644
>> >> --- a/kernel/panic.c
>> >> +++ b/kernel/panic.c
>> >> @@ -372,16 +372,16 @@ void panic(const char *fmt, ...)
>> >>         if (!_crash_kexec_post_notifiers)
>> >>                 __crash_kexec(NULL);
>> >> 
>> >> -       panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
>> >> -
>> >> -       printk_legacy_allow_panic_sync();
>> >> -
>> >>         /*
>> >>          * Run any panic handlers, including those that might need to
>> >>          * add information to the kmsg dump output.
>> >>          */
>> >>         atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
>> >> 
>> >> +       panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
>> >> +
>> >> +       printk_legacy_allow_panic_sync();
>> >> 
>> >>         panic_print_sys_info(false);
>> >> 
>> >>         kmsg_dump_desc(KMSG_DUMP_PANIC, buf);
>> >> ----- END -----
>> >
>> >Ya, that looks fine to me, that's actually how I had it initally, but I
>> >wasn't sure if it had to go before the panic handlers. So, I erred on
>> >the side of caution.
>
>The ordering (stopping CPUs before allowing printk_legacy loop)
>is important from the printk POV. So, keep it, please.

Thanks for the check.

>> I see, sorry that I was only speaking in relation to stored backtraces.
>> It seems that printk_legacy_allow_panic_sync() is placed before
>> atomic_notifier_call_chain() so that it can handle flushing before calling
>> any panic handlers as described [0].
>
>> [0] https://lore.kernel.org/lkml/ZeHSgZs9I3Ihvpye@alley/
>
>> I'm not really familar with the problems associated with panic handlers
>> so I hope maybe John and Petr can help on this matter...
>
>Honestly, I do not have much experience with failures of the panic
>notifiers. But I saw a patchset which tried to add filtering of
>some problematic ones, see
>https://lore.kernel.org/lkml/20220108153451.195121-1-gpiccoli@igalia.com/
>
>I did not like the way of ad-hoc filtering. The right solution was to
>fix the problematic notifiers.
>
>Anyway, it went out that the situation was not that easy. The notifiers
>do various things. Some of them just printing extra information. Others
>stopped or suspended some devices or services. Some should be called
>before and some after crash_dump.
>
>The outcome was a monster-patchset which tried to fix some problematic
>notifiers and split them into more notifier chains, see
>https://lore.kernel.org/all/20220427224924.592546-1-gpiccoli@igalia.com/
>
>Some of the fixes were accepted but the split has never been done.

I see. I went through some of the discussions on the thread [0]
and I can see how complicated the subject is...

>My opinion:
>
>1. The best solution would be to make the problematic notifier working
>   with stopped CPUs. The discussion around [v2] suggests that the author
>   made it working at least for x86_64, see
>   https://lore.kernel.org/r/20250221213055.133849-1-hamzamahfooz@linux.microsoft.com

I agree. But I also like the below solution as well!

>2. Another good solution might be to do the split of the notifier
>   chain, for an example, see
>   https://lore.kernel.org/lkml/Yn0TnsWVxCcdB2yO@alley/
>
>   The problematic notifier can be then added into a chain which
>   is called before stopping CPUs.

Thanks for sharing this! Such an interesting discussion on what and when 
should be handled in panic path. I think I have a better picture of panic()
now :).

>3. In the worst case, you could change the ordering as proposed above.
>   I am just afraid that it might bring in new problems. There might
>   be notifiers which were not tested with more running CPUs...
>
>
>In general, the system is in an unpredictable state when panic() is
>called. Notifiers should not expect that non-panic CPUs will be
>able to handle any requests.
>
>Also it looks like a good idea to stop non-panic CPUs as soon as possible.
>Otherwise, they might create more harm than good.

I see. I also think that 3. might introduce new problems...
It goes against the general statements which you pointed out 
in which case its not really a problem of handler anymore.

Sincerely,
Ryo Takakura

>Best Regards,
>Petr

[0] https://lore.kernel.org/lkml/20220427224924.592546-25-gpiccoli@igalia.com/