linux-kernel - Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <bdc42bbc-846e-7695-9912-be9941004830@oracle.com>
Date:   Wed, 7 Apr 2021 15:13:50 -0700
From:   Dongli Zhang <dongli.zhang@...cle.com>
To:     Qais Yousef <qais.yousef@....com>
Cc:     linux-kernel@...r.kernel.org, tglx@...utronix.de,
        peterz@...radead.org, mpe@...erman.id.au,
        aneesh.kumar@...ux.ibm.com, ethp@...com, npiggin@...il.com,
        joe.jin@...cle.com
Subject: Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is
 failed

On 4/6/21 9:10 AM, Qais Yousef wrote:
> On 04/05/21 14:22, Dongli Zhang wrote:
>> May I have if there is any feedback on this? To pr_err_once() here helps narrow
>> down what is the root cause of cpu online failure.
>>
>>
>> The issue fixed by d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64 can't be
>> online/hotpluged") is able to demonstrate how this pr_err_once() helps.
>>
>> Due to VM kernel issue, at most 64 VCPUs can online if host clocksource is
>> switched to hpet.
>>
>> # echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
>>
>>
>> By default, we have no idea why only 64 out of 100 VCPUs are online in VM. We
>> need to customize kernel to debug why some CPUs are not able to online.
>>
>>
>> We will have below and know the root cause is with kvmclock, if we add the
>> pr_err_once().
> 
> Isn't pr_debug() more appropriate here? Don't you want one on the
> cpuhp_down_callbacks() too?

I will add one to cpuhp_down_callbacks() as well. Thank you very much for the
suggestion.

> 
> I *think* it's common now to have CONFIG_DYNAMIC_DEBUG enabled so one can
> enable that line when they need to
> 
> 	https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
> 
> I can see the appeal of pr_err_once() but I think this can fail for legitimate
> reasons and is not necessarily strictly always an error?

The reason I used pr_err_once() is to remind the sysadmin something bad (maybe
not necessarily strictly as error) used to happen. It is quite helpful for
developer/sysadmin that is not familiar with cpu/smp/bootup/online.

Otherwise, someone will need to manually configure kernel command line with
"loglevel=8 dyndbg="file kernel/cpu.c +p".

The good thing with pr_err_once is it is not based on the assumption that
someone already know the root cause is related to kernel/cpu.c.

The good thing with pr_debug is we will not need to print for at most once.

I will send v2 with pr_debug() and let all maintainers to decide if it is
fine/required to print something here. Some places use WARN_ON_ONCE() for the
return value of cpuhp_invoke_callback(). I prefer to use pr_debug/pr_err to
print extra debug information for cpu online/hotplug case.

Thank you very much!

Dongli Zhang

> 
> Thanks
> 
> --
> Qais Yousef
>