Message-ID: <SY4P282MB30634D9D9873C9C8DC41D4EEC5F22@SY4P282MB3063.AUSP282.PROD.OUTLOOK.COM>
Date: Wed, 29 May 2024 17:40:17 +1000
From: Stephen Horvath <s.horvath@...look.com.au>
To: Thomas Weißschuh <linux@...ssschuh.net>,
Guenter Roeck <linux@...ck-us.net>
Cc: Jean Delvare <jdelvare@...e.com>, Benson Leung <bleung@...omium.org>,
Lee Jones <lee@...nel.org>, Guenter Roeck <groeck@...omium.org>,
linux-kernel@...r.kernel.org, linux-hwmon@...r.kernel.org,
chrome-platform@...ts.linux.dev, Dustin Howett <dustin@...ett.net>,
Mario Limonciello <mario.limonciello@....com>,
Moritz Fischer <mdf@...nel.org>
Subject: Re: [PATCH v2 1/2] hwmon: add ChromeOS EC driver
Hi Thomas,
On 29/5/24 16:23, Thomas Weißschuh wrote:
> On 2024-05-29 10:58:23+0000, Stephen Horvath wrote:
>> On 29/5/24 09:29, Guenter Roeck wrote:
>>> On 5/28/24 09:15, Thomas Weißschuh wrote:
>>>> On 2024-05-28 08:50:49+0000, Guenter Roeck wrote:
>>>>> On 5/27/24 17:15, Stephen Horvath wrote:
>>>>>> On 28/5/24 05:24, Thomas Weißschuh wrote:
>>>>>>> On 2024-05-25 09:13:09+0000, Stephen Horvath wrote:
>>>>>>>> Don't forget it can also return `EC_FAN_SPEED_STALLED`.
>
> <snip>
>
>>>>>>>
>>>>>>> Thanks for the hint. I'll need to think about how to
>>>>>>> handle this better.
>>>>>>>
>>>>>>>> Like Guenter, I also don't like returning `-ENODEV`, but I don't
>>>>>>>> have a problem with checking for `EC_FAN_SPEED_NOT_PRESENT` in case
>>>>>>>> it was removed since init or something.
>>>>>>>
>>>>>
>>>>> That won't happen. Chromebooks are not servers, where one might be
>>>>> able to replace a fan tray while the system is running.
>>>>
>>>> In one of my test runs this actually happened.
>>>> When running on battery, one of the CPU sensors sporadically
>>>> returned EC_FAN_SPEED_NOT_PRESENT.
>>>>
>>>
>>> What Chromebook was that ? I can't see the code path in the EC source
>>> that would get me there.
>>>
>>
>> I believe Thomas and I both have the Framework 13 AMD, the source code is
>> here:
>> https://github.com/FrameworkComputer/EmbeddedController/tree/lotus-zephyr
>
> Correct.
>
>> The organisation confuses me a little, but Dustin has previously said on the
>> Framework forums (https://community.frame.work/t/what-ec-is-used/38574/2):
>>
>> "This one is based on the Zephyr port of the ChromeOS EC, and tracks
>> mainline more closely. It is in the branch lotus-zephyr.
>> All of the model-specific code lives in zephyr/program/lotus.
>> The 13"-specific code lives in a few subdirectories off the main tree named
>> azalea."
>
> The EC code is at [0]:
>
> $ ectool version
> RO version: azalea_v3.4.113353-ec:b4c1fb,os
> RW version: azalea_v3.4.113353-ec:b4c1fb,os
> Firmware copy: RO
> Build info: azalea_v3.4.113353-ec:b4c1fb,os:7b88e1,cmsis:4aa3ff 2024-03-26 07:10:22 lotus@...172-26-3-226
> Tool version: 0.0.1-isolate May 6 2024 none
I can confirm mine is the same build too.
> From the build info I gather it should be commit b4c1fb, which is the
> current HEAD of the lotus-zephyr branch.
> Lotus is the Framework 16 AMD, which is very similar to Azalea, the
> Framework 13 AMD, which I tested this against.
> Both share the same codebase.
>
>> Also I just unplugged my fan and you are definitely correct, the EC only
>> generates EC_FAN_SPEED_NOT_PRESENT for fans it does not have the capability
>> to support. Even after a reboot it just returns 0 RPM for an unplugged fan.
>> I thought about simulating a stall too, but I was mildly scared I was going
>> to break one of the tiny blades.
>
> I get the error when unplugging *the charger*.
>
> To be more precise:
>
> It does not happen always.
> It does not happen instantly on unplugging.
> It goes away after a few seconds/minutes.
> During the issue, one specific sensor reads 0xffff.
>
Oh I see, I haven't played around with the temp sensors until now, but I
can confirm the last temp sensor (cpu@4c / temp4) will randomly (every
~2-15 seconds) return EC_TEMP_SENSOR_ERROR (0xfe).
Unplugging the charger doesn't seem to have any impact for me.
The related ACPI sensor also says 180.8°C.
I'll probably create an issue or something shortly.
I was mildly confused by 'CPU sensors' and 'EC_FAN_SPEED_NOT_PRESENT' in
the same sentence, but I'm now assuming you mean the temp sensor?
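A possible explanation for the 180.8°C reading, sketched below: if the raw
byte 0xfe (EC_TEMP_SENSOR_ERROR) is treated as a temperature instead of being
filtered out as an error code, the arithmetic lands exactly on that value.
This assumes the usual EC encoding of "raw byte + EC_TEMP_SENSOR_OFFSET (200)
= Kelvin" and a deci-Kelvin to Celsius conversion with an offset of 2732. The
helper name is made up for illustration, and whether the ACPI path really
does this is only a guess.

```c
#include <stdint.h>

/* Values as found in the ChromeOS EC's ec_commands.h. */
#define EC_TEMP_SENSOR_ERROR  0xfe
#define EC_TEMP_SENSOR_OFFSET 200   /* raw byte + 200 = temperature in Kelvin */

/*
 * Hypothetical helper: convert a raw EC temperature byte to tenths of a
 * degree Celsius WITHOUT checking for error codes first. This mimics what
 * a consumer that ignores EC_TEMP_SENSOR_ERROR might compute.
 * Assumption: deci-Kelvin to deci-Celsius uses an offset of 2732.
 */
static long raw_to_decicelsius_unchecked(uint8_t raw)
{
	long deci_kelvin = ((long)raw + EC_TEMP_SENSOR_OFFSET) * 10;

	return deci_kelvin - 2732;
}
```

With raw = 0xfe this gives (254 + 200) * 10 - 2732 = 1808, i.e. 180.8°C.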
>>>>>>> Ok.
>>>>>>>
>>>>>>>> My approach was to return the speed as `0`, since the fan probably
>>>>>>>> isn't spinning, but set HWMON_F_FAULT for `EC_FAN_SPEED_NOT_PRESENT`
>>>>>>>> and HWMON_F_ALARM for `EC_FAN_SPEED_STALLED`.
>>>>>>>> No idea if this is correct though.
>>>>>>>
>>>>>>> I'm not a fan of returning a speed of 0 in case of errors.
>>>>>>> Rather -EIO which can't be mistaken.
>>>>>>> Maybe -EIO for both EC_FAN_SPEED_NOT_PRESENT (which should never
>>>>>>> happen) and also for EC_FAN_SPEED_STALLED.
>>>>>>
>>>>>> Yeah, that's pretty reasonable.
>>>>>>
>>>>>
>>>>> -EIO is an i/o error. I have trouble reconciling that with
>>>>> EC_FAN_SPEED_NOT_PRESENT or EC_FAN_SPEED_STALLED.
>>>>>
>>>>> Looking into the EC source code [1], I see:
>>>>>
>>>>> EC_FAN_SPEED_NOT_PRESENT means that the fan is not present.
>>>>> That should return -ENODEV in the above code, but only for
>>>>> the purpose of making the attribute invisible.
>>>>>
>>>>> EC_FAN_SPEED_STALLED means exactly that, i.e., that the fan
>>>>> is present but not turning. The EC code does not expect that
>>>>> to happen and generates a thermal event in case it does.
>>>>> Given that, it does make sense to set the fault flag.
>>>>> The actual fan speed value should then be reported as 0 or
>>>>> possibly -ENODATA. It should _not_ generate any other error
>>>>> because that would trip up the "sensors" command for no
>>>>> good reason.
>>>>
>>>> Ack.
>>>>
>>>> Currently I have the following logic (for both fans and temp):
>>>>
>>>>   if NOT_PRESENT during probing:
>>>>     make the attribute invisible.
>>>>
>>>>   if any error during runtime (including NOT_PRESENT):
>>>>     return -ENODATA and a FAULT
>>>>
>>>> This should also handle the sporadic NOT_PRESENT failures.
>>>>
>>>> What do you think?
>>>>
>>>> Is there any other feedback to this revision or should I send the next?
>>>>
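For reference, the logic described above could look roughly like this in
plain C. This is only a sketch and not code from the patch: the three helper
names are invented for illustration, and it covers just the fan case. The two
EC_FAN_SPEED_* values are the ones from the EC's ec_commands.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as found in the ChromeOS EC's ec_commands.h. */
#define EC_FAN_SPEED_NOT_PRESENT 0xffff
#define EC_FAN_SPEED_STALLED     0xfffe

/* Hypothetical helper: does the raw reading encode an error condition? */
static bool fan_raw_is_error(uint16_t raw)
{
	return raw == EC_FAN_SPEED_NOT_PRESENT || raw == EC_FAN_SPEED_STALLED;
}

/*
 * Hypothetical helper for probe time: hide the attribute only when the EC
 * reports the fan as not present.
 */
static bool fan_visible_at_probe(uint16_t raw)
{
	return raw != EC_FAN_SPEED_NOT_PRESENT;
}

/*
 * Hypothetical helper for runtime reads: any error code, including a
 * sporadic NOT_PRESENT, yields -ENODATA; otherwise the RPM is stored.
 */
static int fan_read_rpm(uint16_t raw, long *rpm)
{
	if (fan_raw_is_error(raw))
		return -ENODATA;

	*rpm = raw;
	return 0;
}

/* Hypothetical helper: value for the fault attribute (1 = fault). */
static long fan_fault(uint16_t raw)
{
	return fan_raw_is_error(raw) ? 1 : 0;
}
```

The point of the split is that NOT_PRESENT only affects visibility once, at
probe time. At runtime it is treated like any other error, so a transient
0xffff shows up as a fault and never as a bogus speed.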
>>>
>>> No, except I'd really like to know which Chromebook randomly generates
>>> an EC_FAN_SPEED_NOT_PRESENT response because that really looks like a bug.
>>> Also, can you reproduce the problem with the ectool command ?
>
> Yes, the ectool command reports the same issue at the same time.
>
> The sensor affected was always cpu@4c, which is
> compatible = "amd,sb-tsi".
>
>> I have a feeling it was related to the concurrency problems between ACPI and
>> the CrOS code that are being fixed in another patch by Ben Walsh. I was also
>> seeing some weird behaviour sometimes, but I *believe* it was fixed by that.
>
> I don't think it's this issue.
> Ben's series at [1] is for MEC ECs, which are the older Intel
> Frameworks, not the Framework 13 AMD.
Yeah, sorry. I saw it mentioned AMD and threw it into my kernel. I also
thought it stopped the 'packet too long' messages (for
EC_CMD_CONSOLE_SNAPSHOT), but it did not.
Thanks,
Steve