Message-ID: <c26f4a9d-df14-c8af-4c99-5a670099e8bc@aquantia.com>
Date: Tue, 24 Sep 2019 14:32:46 +0000
From: Igor Russkikh <Igor.Russkikh@...antia.com>
To: Holger Hoffstätte
<holger@...lied-asynchrony.com>, Netdev <netdev@...r.kernel.org>
Subject: Re: atlantic: weird hwmon temperature readings with AQC107 NIC
(kernel 5.2/5.3)
On 24.09.2019 17:30, Holger Hoffstätte wrote:
> On 9/24/19 4:16 PM, Holger Hoffstätte wrote:
>> Hi,
>>
>> I recently upgraded my home network with two AQ107-based NICs and a
>> multi-speed switch. Everything works great, but I couldn't help but notice
>> very weird hwmon temperature output (which I wanted to use for monitoring
>> and alerting).
>>
>> Both cards identify as:
>>
>> $lspci -v -s 06:00.0
>> 06:00.0 Ethernet controller: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz
>> Ethernet Controller [AQtion] (rev 02)
>> Subsystem: ASUSTeK Computer Inc. AQC107 NBase-T/IEEE 802.3bz Ethernet
>> Controller [AQtion]
>>
>> In one machine lm_sensors says:
>>
>> eth0-pci-0200
>> Adapter: PCI adapter
>> PHY Temperature: +315.1°C
>>
>> This seems quite wrong since the card is only slightly warm to the touch, and
>> 315.1 is exactly 255 + 60.1 - the latter value feels more like the actual
>> temperature.
>>
>> On a second machine it says:
>>
>> eth0-pci-0600
>> Adapter: PCI adapter
>> PHY Temperature: +6977.0°C
>>
>> I feel qualified to say that is definitely wrong as well, since the machine is
>> currently not melting its way to the earth's core, and also only slightly warm
>> to the touch. :)
>>
>> Both cards also reported wrong values with kernel 5.2, but since I'm on 5.3.1
>> I might as well report the current wrongness.
>>
>> Do we know who's to blame here - motherboards, NICs, driver, kernel, hwmon
>> infrastructure? I believe the hwmon patches landed first in 5.2.
>
> Another observation: the hwmon output immediately becomes sane (~58°)
> when I down the link with ifconfig. As soon as I bring the link back up,
> the temperature jumps from 58° to 6976° in one second.
> It seems that the presence of the carrier somehow mangles the sensor
> readings. I hope this helps to find the issue.
>
> thanks,
> Holger
Hi Holger,
Thanks for the report.
We recently found that out as well. Could you try applying the following patch:
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
index da726489e3c8..08b026b41571 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c
@@ -337,7 +337,7 @@ static int aq_fw2x_get_phy_temp(struct aq_hw_s *self, int *temp)
 	/* Convert PHY temperature from 1/256 degree Celsius
 	 * to 1/1000 degree Celsius.
 	 */
-	*temp = temp_res * 1000 / 256;
+	*temp = (temp_res & 0xFFFF) * 1000 / 256;
 	return 0;
 }
The funny thing is that this value is read out of HW memory. All readouts are
done as full dwords, but the temperature is only one word wide. The high word
of that dword is the estimated cable length. With the short cables we use, it
is often zero ;)
As I see from your readings, your cables are a bit longer :)
This also explains why the temperature is good when you bring the interface down.
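To illustrate the effect, here is a minimal standalone sketch of the two
conversions. It is not driver code: the function names and the sample raw
value are made up for the example, and only the arithmetic is taken from the
patch.

```c
#include <stdint.h>

/* Hypothetical helper: conversion without the mask, as before the patch.
 * The whole dword is scaled, so the high word (cable length estimate)
 * leaks into the temperature.
 */
int phy_temp_unmasked(uint32_t temp_res)
{
	/* 1/256 degree Celsius -> 1/1000 degree Celsius */
	return (int)(temp_res * 1000u / 256u);
}

/* Hypothetical helper: conversion with the mask, as in the patch.
 * Only the low word, which holds the temperature, is scaled.
 */
int phy_temp_masked(uint32_t temp_res)
{
	return (int)((temp_res & 0xFFFFu) * 1000u / 256u);
}
```

With an assumed raw value of 0x00013B00 (high word 1, low word 59 * 256),
phy_temp_unmasked() returns 315000, i.e. 315.0 degrees, while
phy_temp_masked() returns 59000, i.e. 59.0 degrees. Each unit in the high
word adds 256 degrees to the unmasked result, which would fit the kind of
readings reported above.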
Regards,
Igor