Message-ID: <92fd9892-e1cf-79f6-529c-c4e5b1516802@oracle.com>
Date: Fri, 20 Jan 2023 15:24:32 -0600
From: Eric DeVolder <eric.devolder@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
x86@...nel.org, kexec@...ts.infradead.org, ebiederm@...ssion.com,
dyoung@...hat.com, bhe@...hat.com, vgoyal@...hat.com
Cc: mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, nramas@...ux.microsoft.com, thomas.lendacky@....com,
robh@...nel.org, efault@....de, rppt@...nel.org, david@...hat.com,
sourabhjain@...ux.ibm.com, konrad.wilk@...cle.com,
boris.ostrovsky@...cle.com
Subject: Re: [PATCH v17 3/6] crash: add generic infrastructure for crash
hotplug support
On 1/19/23 15:31, Thomas Gleixner wrote:
> Eric!
>
> On Wed, Jan 18 2023 at 16:35, Eric DeVolder wrote:
>> CPU and memory change notifications are received in order to
>> regenerate the elfcorehdr.
>>
>> To support cpu hotplug, a callback is registered to capture the
>> CPUHP_AP_ONLINE_DYN online and offline events via
>> cpuhp_setup_state_nocalls().
>
> This sentence does not make sense. The callback is not registered to
> capture CPUHP_AP_ONLINE_DYN events.
>
> What this does is: It installs a dynamic CPU hotplug state with
> callbacks for online and offline. These callbacks store information
> about a CPU coming up and going down. Right?
I agree, the wording is wrong; as you suggest, this code installs a dynamic state and taps into its
callbacks in order to handle the online and offline events.
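For reference, here is a minimal sketch of the registration being discussed. This is an
illustration under assumptions, not the code from the patch: the handler names and the
crash_hotplug_handler() entry point are placeholders of mine.

  #include <linux/cpuhotplug.h>
  #include <linux/init.h>

  /* Placeholder for the series' actual elfcorehdr-regeneration entry point. */
  static void crash_hotplug_handler(unsigned int cpu);

  /* Runs late in onlining, once the CPU reaches the dynamic state. */
  static int crash_cpuhp_online(unsigned int cpu)
  {
          crash_hotplug_handler(cpu);
          return 0;
  }

  /* Runs early in offlining, as the CPU leaves the dynamic state. */
  static int crash_cpuhp_offline(unsigned int cpu)
  {
          crash_hotplug_handler(cpu);
          return 0;
  }

  static int __init crash_cpuhp_init(void)
  {
          int ret;

          /*
           * CPUHP_AP_ONLINE_DYN allocates a dynamic state; the _nocalls
           * variant does not invoke the callbacks for already-online
           * CPUs. On success the dynamically allocated state number
           * (> 0) is returned, negative errno on failure.
           */
          ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "crash/cpuhp",
                                          crash_cpuhp_online,
                                          crash_cpuhp_offline);
          return ret < 0 ? ret : 0;
  }
  subsys_initcall(crash_cpuhp_init);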
>
> But why are they required and what's the value?
>
> This changelog tells WHAT it does and not WHY. I can see the WHAT from
> the patch itself.
>
> Don't tell me the WHY is in the cover letter. The cover letter is not
> part of the commits and changelogs have to be self contained.
>
> Now let me cite from your cover letter:
>
>> When the kdump service is loaded, if a CPU or memory is hot
>> un/plugged, the crash elfcorehdr, which describes the CPUs
>> and memory in the system, must also be updated, else the resulting
>> vmcore is inaccurate (eg. missing either CPU context or memory
>> regions).
I'll work to improve the wording and spell out the why in the next iteration.
>
> The CPU hotplug state you are using for this is patently inaccurate
> too. With your approach the CPU is tracked as online very late in the
> hotplug process and tracked as offline very early on unplug.
>
> So if the kernel crashes before/after the plug/unplug tracking event
> then your recorded state is bogus and given the amount of callbacks
> between the real online/offline and the recording point there is a
> pretty large window.
>
> You can argue that this is better than the current state and considered
> good enough for whatever reason, but such information wants to be in the
> changelog, no?
I agree! I admit that CPUHP_AP_ONLINE_DYN may not be (and likely is not) the best choice. I did
spend time looking at the cpu hotplug infrastructure, but did not find a better/correct way. Fwiw:
https://lore.kernel.org/lkml/20211118174948.37435-1-eric.devolder@oracle.com/:
"The second problem is the use of CPUHP_AP_ONLINE_DYN. The
cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in foreach_present_cpu() during the
regeneration of the elfcorehdr, thus the need to explicitly check and
exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
cpuhp_setup_state() was utilized, then the offline cpu would no longer
be in foreach_present_cpu(), and this change could be eliminated. I do
not understand cpuhp_setup_state() well enough to choose, or create,
appropriate value(s)."
The problem described (and worked around in this patch series) is exactly the behavior/window you
point out. I'd prefer to narrow the window, if possible. The states/values I tried did not work;
any suggestions for a more appropriate state/value would be most welcome!
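To make the quoted problem concrete, here is a hedged sketch of the kind of exclusion described
above. The function and parameter names are placeholders of mine; per the quote, the actual check
lives in crash_prepare_elf64_headers().

  #include <linux/cpumask.h>

  /* Placeholder walk over present CPUs during elfcorehdr regeneration. */
  static void crash_regen_cpu_notes(unsigned int excluded_cpu)
  {
          unsigned int cpu;

          for_each_present_cpu(cpu) {
                  /*
                   * With CPUHP_AP_ONLINE_DYN, the CPU going down still
                   * appears present at this point, so it has to be
                   * skipped explicitly.
                   */
                  if (cpu == excluded_cpu)
                          continue;
                  /* ... emit this CPU's PT_NOTE into the elfcorehdr ... */
          }
  }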
>
> Thanks,
>
> tglx
>
> Hint: The requirements for changelogs are well documented in Documentation/process/
>
>
Thomas, thank you for looking at this!
eric