[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a8107f7d-3163-4f03-9f17-82e1a93d8e91@amd.com>
Date: Mon, 19 Jun 2023 13:53:05 -0500
From: "Limonciello, Mario" <mario.limonciello@....com>
To: Guenter Roeck <linux@...ck-us.net>,
Baskaran Kannan <Baski.Kannan@....com>, babu.moger@....com,
clemens@...isch.de, jdelvare@...e.com, linux-hwmon@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] hwmon: (k10temp) Enable AMD3255 Proc to show negative
temperature
On 6/19/2023 1:40 PM, Guenter Roeck wrote:
> On 6/19/23 11:02, Limonciello, Mario wrote:
>>
>> On 6/19/2023 12:07 PM, Guenter Roeck wrote:
>>> On 6/19/23 09:54, Baskaran Kannan wrote:
>>>> Industrial processor i3255 supports temperatures -40 deg celcius
>>>> to 105 deg Celcius. The current implementation of k10temp_read_temp
>>>> rounds off any negative
>>>> temperatures to '0'. To fix this, the following changes have been
>>>> made.
>>>> Added a flag 'disp_negative' to struct k10temp_data to support
>>>> AMD i3255 processors. Flag 'disp_negative' is set if 3255 processor
>>>> is found during k10temp_probe. Flag 'disp_negative' is used to
>>>> determine
>>>> whether to round off negative temperatures to '0' in
>>>> k10temp_read_temp.
>>>>
>>>> Signed-off-by: Baskaran Kannan <Baski.Kannan@....com>
>>>
>>> Now you have made changes you were not asked to make, extended the flag
>>> to cover a range of processors instead of just i3255, and did not
>>> provide
>>> a change log nor a comment in the code describing why processors with
>>> certain model numbers should display negative temperatures.
>>>
>> i3255 happens to be one of the industrial processors in family 17h
>> models
>> 01h through 08h. These are potentially used at subzero temperatures and
>> so displaying negative numbers makes a lot sense.
>>
>> So I think the commit message needs to be be amended to better
>> explain that.
>>
>> I guided Kannan against leaving a comment in the code with specific
>> models
>> because it either won't age well as other industrial processors are
>> introduced or may need to be ping-ponged each time.
>>
> That only applies if there is a guarantee that the check does not
> inadvertently ends up displaying negative temperatures for other CPUs
> which are misconfigured. After all, the current code is just a hack
> working around some problem with bad temperatures reported on other CPUs.
> Personally I'd rather have a clean fix for that. If/since that is not
> available, whatever is done subsequently (including the code suggested
> here)
> is just a hack.
>
> ... and if a hack on top of a hack is introduced, we need to make sure
> that
> it does not undo the previous hack.
>
>> But perhaps it should be more generic like:
>>
>> /* Industrial processors may be used at sub zero temperatures */
>>
>
> You can not just display negative temperatures for family 0x17h models
> 0x00..0x07 without explanation. The above needs to be documented.
> I fail to understand why a variant of
>
> "i3255 happens to be one of the industrial processors in family 17h
> models
> 01h through 08h. These are potentially used at subzero temperatures and
> so displaying negative numbers makes a lot sense."
>
> can not be added as comment and description if that is exactly what
> the code
> checks for. Something like
>
> "Family 17h models 01h through 08h are industrial processors with an
> operational
> temperature of -40°C - 105°C and may be used at subzero temperatures.
> Display negative temperatures for those processors."
>
> makes perfect sense to me. Only of course it is incorrect ...
>
> Model 0x1 was used for the original Zen, and 0x8 is Zen+. 1950X is
> family 0x17 model
> 0x01 per cpuinfo, meaning your hack undoes the original hack, and the bad
> temperatures would again be displayed for the affected systems. That
> is simply
> unacceptable.
I just double checked the documentation and you're correct.
To target just the industrial ones it would need to be > 0x00 and <=
0x08 ALONG with
checking the model_id_str value against the industrial ones.
>
> Yes, it may be a pain to find an acceptable hack to solve the problem,
> but after all this is a self-inflicted problem, so it can't be helped.
> The alternative would always be to find a better means to identify CPUs
> affected by the original problem. If that is not possible, explicitly
> listing CPUs
> which are _not_ affected is the only possible alternative.
So Pinnacle Ridge and Summit Ridge (Zen/Zen+) have model_id_str values
of 'B1' and 'B2'.
I think we should be able to detect those and only avoid showing the
negative values when:
* Family 17h
* Model > 0x00
* Model <= 0x0f
* Model ID str B1 or B2
>
> Note that the code sets disp_negative for model numbers < 0x8, meaning it
> does not include model 0x8. It also sets disp_negative for model 0x00
> which is
> specifically excluded above.
>
> All that is no excuse for not providing change logs.
Appreciate your being thorough.
>
> Guenter
>
>>> Guenter
>>>
>>>> ---
>>>> drivers/hwmon/k10temp.c | 8 ++++++--
>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
>>>> index 7b177b9fbb09..2613420d43ff 100644
>>>> --- a/drivers/hwmon/k10temp.c
>>>> +++ b/drivers/hwmon/k10temp.c
>>>> @@ -86,6 +86,7 @@ struct k10temp_data {
>>>> u32 show_temp;
>>>> bool is_zen;
>>>> u32 ccd_offset;
>>>> + bool disp_negative;
>>>> };
>>>> #define TCTL_BIT 0
>>>> @@ -204,12 +205,12 @@ static int k10temp_read_temp(struct device
>>>> *dev, u32 attr, int channel,
>>>> switch (channel) {
>>>> case 0: /* Tctl */
>>>> *val = get_raw_temp(data);
>>>> - if (*val < 0)
>>>> + if (*val < 0 && !data->disp_negative)
>>>> *val = 0;
>>>> break;
>>>> case 1: /* Tdie */
>>>> *val = get_raw_temp(data) - data->temp_offset;
>>>> - if (*val < 0)
>>>> + if (*val < 0 && !data->disp_negative)
>>>> *val = 0;
>>>> break;
>>>> case 2 ... 13: /* Tccd{1-12} */
>>>> @@ -405,6 +406,9 @@ static int k10temp_probe(struct pci_dev *pdev,
>>>> const struct pci_device_id *id)
>>>> data->pdev = pdev;
>>>> data->show_temp |= BIT(TCTL_BIT); /* Always show Tctl */
>>>> + if (boot_cpu_data.x86 == 0x17 && boot_cpu_data.x86_model < 0x8)
>>>> + data->disp_negative = true;
>>>> +
>>>> if (boot_cpu_data.x86 == 0x15 &&
>>>> ((boot_cpu_data.x86_model & 0xf0) == 0x60 ||
>>>> (boot_cpu_data.x86_model & 0xf0) == 0x70)) {
>>>
>
Powered by blists - more mailing lists