linux-kernel - Re: [PATCH v4] cpufreq: intel_pstate: Reporting reasons why driver prematurely exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a2f4128c-54cc-44b3-d615-977e9ed37d07@criteo.com>
Date:   Wed, 13 Feb 2019 08:50:14 +0000
From:   Erwan Velu <e.velu@...teo.com>
To:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Erwan Velu <erwanaliasr1@...il.com>
CC:     Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Len Brown <lenb@...nel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        "open list:INTEL PSTATE DRIVER" <linux-pm@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4] cpufreq: intel_pstate: Reporting reasons why driver
 prematurely exit

Le 13/02/2019 à 00:01, Rafael J. Wysocki a écrit :
[...]
> Newline characters are missing in all of your messages.

oops. Fixing this.

> "ACPI _PSS not found\n"

Done.

>>          return true;
>>   }
>>
>> @@ -2484,10 +2485,15 @@ static bool __init intel_pstate_no_acpi_pcch(void)
>>          acpi_handle handle;
>>
>>          status = acpi_get_handle(NULL, "\\_SB", &handle);
>> -       if (ACPI_FAILURE(status))
>> +       if (ACPI_FAILURE(status)) {
>> +               pr_debug("Cannot detect ACPI SB");
> This is very unlikely to happen, I wouldn't bother to print anything here.

As this is test is made, it means someone considered that case could exist.

That's why a added a message to catch this case while debugging.

I do agree that is not very likely.

>
>>                  return true;
>> +       }
>>
>> -       return !acpi_has_method(handle, "PCCH");
>> +       status = acpi_has_method(handle, "PCCH");
>> +       if (!status)
>> +               pr_debug("Cannot detect ACPI PCCH");
> This message can be printed by the caller and, again, I would prefer
> something like "ACPI PCCH not found".
Fixed too.

[..]
> "ACPI _PPC not found\n"
fixed.
>>          return false;
>>   }
>>
>> @@ -2539,8 +2546,10 @@ static bool __init intel_pstate_platform_pwr_mgmt_exists(void)
>>          id = x86_match_cpu(intel_pstate_cpu_oob_ids);
>>          if (id) {
>>                  rdmsrl(MSR_MISC_PWR_MGMT, misc_pwr);
>> -               if ( misc_pwr & (1 << 8))
>> +               if (misc_pwr & (1 << 8)) {
>> +                       pr_debug("MSR_MISC_PWR_MGMT reports enabled HW coordination");
> IIRC this means that the platform is managing P-states on systems
> without HWP, so I would print something like "P-states managed by the
> platform\n".

That's what the caller reports.

I added this additional message because this function added as special 
case for this MSR setting.

if intel_pstate_platform_pwr_mgmt_exists() is true, that's nice to know 
why when debugging.

>>                          return true;
>> +               }
>>          }
>>
>>          idx = acpi_match_platform_list(plat_info);
>> @@ -2606,22 +2615,28 @@ static int __init intel_pstate_init(void)
>>                  }
>>          } else {
>>                  id = x86_match_cpu(intel_pstate_cpu_ids);
>> -               if (!id)
>> +               if (!id) {
>> +                       pr_warn("CPU ID is not in the list of supported devices\n");
> Why not pr_debug()?
>
> And analogously below?

I'm coming from a use case where on the same hardware, changing the 
kernel changed the performance profile and impacted user's performance.

I had to debug this case from a running server.

 From the dmesg, one of the difference I saw was about the 
missing "Intel P-state driver initializing" message, but nothing else.

The reason why the driver didn't engaged itself wasn't explicit.

As CONFIG_X86_INTEL_PSTATE is set to Y by default, I wasn't enable to 
reload it in any way.

By reading the code, most of the code path and return options doesn't 
have a single pr_<something> call to explain why.

So I had to rebuild the kernel + my patch to understand which path was 
taken and then find the root cause of why the pstate behavior changed.

I wanted to contribute back my experience here by providing messages to 
ease the understanding and potentially debugging of the driver without 
recompiling it.

So the root of this patch is being able to report the main reasons of 
why the driver didn't engaged itself with a pr_warn (could be also a 
pr_info).

All the major and probable "return -ENODEV" calls before pr_info("Intel 
P-state driver initializing\n") deserves to be explicit and reachable on 
dmesg for the operators.

I'm operating 30K+ servers and having this kind direct information would 
save serious time when debugging this situations.

If my experience in that area could help other users, I'd be happy.

I'm positing a v5 with all those changes.

Thanks for the review.

Erwan,