linux-kernel - Re: [PATCH v2] platform/x86/tuxedo: Implement TUXEDO TUXI ACPI TFAN via hwmon

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <c6c5aceb-e329-464e-82ca-b70c5310ca89@tuxedocomputers.com>
Date: Fri, 14 Mar 2025 18:28:16 +0100
From: Werner Sembach <wse@...edocomputers.com>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Cc: Hans de Goede <hdegoede@...hat.com>, Jean Delvare <jdelvare@...e.com>,
 Guenter Roeck <linux@...ck-us.net>, LKML <linux-kernel@...r.kernel.org>,
 platform-driver-x86@...r.kernel.org, linux-hwmon@...r.kernel.org,
 Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [PATCH v2] platform/x86/tuxedo: Implement TUXEDO TUXI ACPI TFAN
 via hwmon

Hi Ilpo,

Am 14.03.25 um 14:59 schrieb Ilpo Järvinen:
> On Fri, 14 Mar 2025, Werner Sembach wrote:
>
>> Sorry, resend, mail client did html message by accident
> Np.
>
>> Am 14.03.25 um 11:05 schrieb Ilpo Järvinen:
>>>>>> +			S32_MAX : (retval - TUXI_FW_TEMP_OFFSET) *
>>>>>> 100;
>>>>> Is the math wrong, as you do retval - TUXI_FW_TEMP_OFFSET before
>>>>> multiplying?
>>>> No, retval is in 10th of °K (but the last number is always 0) so is
>>>> TUXI_FW_TEMP_OFFSET which is there to convert it to 10th of °C, the * 100
>>>> is
>>>> then to bring it in line with hwmon wanting to output milli degrees
>>> So is result of S32_MAX correct when retval is 21474837?
>>>
>>> (21474837-2730)*100
>>> 2147210700
>>> 2^31-1
>>> 2147483647
>>>
>>> 2147210700 would have been representable but the upper bound is
>>> still applied (the value might be large enough to not have practical
>>> significance but to me the code looks still illogical why it applies the
>>> bound prematurely).
>> Yeah my though was: this check is only here to catch the firmware doing some
>> crazy stuff and sending highly unrealistic values, so gifting a small bit of
>> the available range away doesn't matter
> But it does matter as you could note. I stumbled on the logic which didn't
> look right while reviewing. You even claimed afterwards is not wrong when
> I raised this. :-/
>
> Please just correct the logic so it makes sense to the code reader, there
> seems to be no well justified reason to keep the illogical code even if
> the practical impact is very low. It's probably done the way it is only
> because the variable types are what they are so you couldn't do the
> subtraction like I proposed ;-). At minimum you'd need to add a comment to
> warn about the inconsistency at which point rewriting to correct logic is
> already way simpler.
ok
>
>>> I see you already sent another version, it would have been prudent to wait
>>> a bit longer as you contested some of the comments so you could have seen
>>> my replies before sending the next version.
>> I'm sorry. I just wanted to show that I'm iterating as I wait for the reply if
>> the design with the periodic safeguard is acceptable. If that's gets rejected
>> this driver must be rewritten anyway.
> Kernel development is not a sprint. It's better to avoid sending versions
> unnecessarily, a day or two isn't worth it when compared with ending up
> into people's low priority bin which will inevitably happen when the
> version counter starts to grow beyond v5-6.
>
> I (and likely others too) appreciate if they don't have to waste review
> cycles on something that is not "complete" because we have to look at the
> completed one later too. Maintainers work in good faith that developers
> are simply improving their patches (or working on some other great
> improvements to the kernel :-)) while nothing seemingly happens for a
> while. There's no need to prove that something is going on just for the
> sake of proving.
>
> Obviously RFC patches are still fine to ask specific questions about
> something, but that's not about proving progress (in fact, RFC patches are
> more about being "stuck" than about making progress).
Sorry, didn't know though it will eventually just get skipped.
>
>>>>> Shouldn't it be like this:
>>>>>
>>>>> 		retval -= TUXI_FW_TEMP_OFFSET;
>>>>> 		*val = min(retval * 100, (unsigned long long)S32_MAX);
>>>> As retval is unsigned this would not work with (theoretical) negative °C.
>>> So your code relies on implicit type conversion in this: (retval -
>>> TUXI_FW_TEMP_OFFSET) ?
>> I can add an explicit cast, np.
>>
>> [snip]
>>
>>>>>> +	}
>>>>>> +	if (temp >= temp_high)
>>>>>> +		ret = i;
>>> Now that I reread things, is this also incorrect, as "i" is at the
>>> terminator entry at this point?
>> Yes that's intentional, the 3 entries in the array open up 4 ranges:
>>
>> lower then 1st entry i=0, between 1st and 2nd entry i=1, 2nd and 3rd i=2,
>> higher then 3rd i=3 (the value that terminates the for loop)
> I didn't realize that. To me { } looks just an terminating entry. So
> what's the min_speed going to be for that last entry since it's
> initialized to 0?
>
> Oh, I see you're taking .min_speed from temp_levels[temp_level - 1] which
> I don't like either. You have a "state" and then store min_speed for the
> state into other index inside the array?!?

Because state 0 is the implicit low temperature -> no min speed is required

And then high states have higher min speeds. So to sync this up I would need to 
add a dummy entry in the first position of the array.

>
>>>>>> +
>>>>>> +		temp = retval > S32_MAX / 100 ?
>>>>>> +			S32_MAX : (retval - TUXI_FW_TEMP_OFFSET) *
>>>>>> 100;
>>>>> Same math issue comment as above.
>>>>>
>>>>> Why is the read+conversion code duplicated into two places?
>>>> because here it is with special error handling and didn't thought about an
>>>> own
>>>> function for a defacto 2 liner
>>> A function that does read+conversion would be 6-8 lines with the error
>>> handling.
>> I can add it.
>>
>> [snip]
>>
>> Thanks for the code review again.
>>
>>
>> Last but not least: As already mentioned, I still wonder if the design with
>> the periodic safeguard is ok or not or?
> I'm not sure if fully understand Daniel's suggestion [1] as it
> doesn't specify who/what is sending that notification to the thermal
> engine.
>
> [1] https://lore.kernel.org/all/286f5efc-cd15-4e0b-bec2-2e9bbb93dd37@linaro.org/#t

Me too, also at one point he was talking about device trees, but isn't that only 
an ARM thing? Or can you define one for x86 devices too?

I put him in the cc again

>
>
> When it comes to your own concerns, I'm not exactly buying the argument
> that userspace can do dangerous things. Yeah, it can shoot one's own
> foot, no doubt, such as unloading this driver and there goes your periodic
> safeguards. If the argument would be that userspace fails to respond (in
> time), I would have less trouble in accepting that argument.
>
Unloading the driver will activate the auto mode again with firmware handling 
the complete fancurve.

Userspace not responding in time, or not responding at all, or intentionally 
always requesting fan speed 0 kinda all result in the conclusion -> don't trust 
userspace.

Having this in driver is the only surefire way to communicate to the 
User(-space) that this device must not be run under full load with the fans 
turned off.

Best regards,

Werner