Message-ID: <HE1PR0502MB37537B5DCD0D607DFB7C7099A2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>
Date: Thu, 21 Jun 2018 19:17:03 +0000
From: Vadim Pasternak <vadimp@...lanox.com>
To: Andrew Lunn <andrew@...n.ch>, Guenter Roeck <linux@...ck-us.net>
CC: "davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"jiri@...nulli.us" <jiri@...nulli.us>
Subject: RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for
port temperature reading
> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@...n.ch]
> Sent: Thursday, June 21, 2018 9:35 PM
> To: Vadim Pasternak <vadimp@...lanox.com>; Guenter Roeck <linux@...ck-us.net>
> Cc: davem@...emloft.net; netdev@...r.kernel.org; jiri@...nulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
>
> > Hi Andrew,
>
> Adding Guenter Roeck, the HWMON maintainer.
>
> > The temperature of each individual module can be obtained through
> > ethtool.
>
> You mean via --module-info?
Yes.
>
> FYI: I plan to add hwmon support to the kernel SFP code. So if you ever decide to
> swap to the kernel SFP code, not your own, the raw temperatures will be
> exported.
>
I'm not sure that will work for us, since we read the SFP/QSFP ports through
our SW/FW interface.
Still, it would be nice if you could point me to this code.
> > The worst temperature is necessary for the system cooling control
> > decision.
>
> I would expect the system cooling would understand that.
>
The thermal zone infrastructure takes a single temperature input.
How would you take 64+ different inputs into account?
> > Up to 64 SFP/QSFP modules could be connected to the system.
> > Some of them could be copper modules, which don't provide temperature
> > measurement.
>
> SFP modules are hot-pluggable. So I would also expect the hwmon devices to
> hotplug. If there is no sensor, then there is no hwmon device... If there is no
> hwmon device, it plays no part in the thermal control loop.
>
> > Some of them could be optical modules, providing untrusted temperature
> > measurement, which could impact thermal control of the system.
>
> Why would you not trust it? Are you saying some modules simply have broken
> temperature sensors? Do you have a whitelist/blacklist of modules?
>
We read the temperature info through the firmware.
In the case of a "broken" module (one that is supposed to be capable of
reporting temperature, but returns an invalid value), we get an error
code from the firmware.
> > Also, optical modules could be from different vendors, and it is a
> > real situation that, for example, one module has warning and critical
> > thresholds of 75C and 85C, while another has 70C and 80C.
>
> But hwmon exports both the actual temperature and the alarm temperatures. I
> would expect the thermal control code to use all this information when making
> its decisions, not just the current temperature.
>
All of the information is used, but the decision to increase the fan speed is
taken based on the worst measurement, which is logical.
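
To illustrate what "worst" can mean when thresholds differ per module, here
is a minimal sketch. It is not the driver code: the ModuleReading type, the
worst_module() helper and the choice of "smallest margin to the module's own
warning threshold" as the metric are all assumptions made for illustration.
Modules without a usable reading (copper modules, or a firmware error) are
represented as None and skipped.

```python
from typing import List, NamedTuple, Optional


class ModuleReading(NamedTuple):
    # Hypothetical per-module record, for illustration only.
    temp_c: Optional[float]  # None: no sensor (e.g. copper) or firmware error
    warn_c: float            # module's own warning threshold
    crit_c: float            # module's own critical threshold


def worst_module(readings: List[ModuleReading]) -> Optional[ModuleReading]:
    """Return the reading with the smallest margin to its own warning threshold.

    Assumption: "worst" means closest to (or furthest past) the warning
    threshold, not the highest absolute temperature.
    """
    valid = [r for r in readings if r.temp_c is not None]
    if not valid:
        return None  # nothing to feed into the thermal zone
    return min(valid, key=lambda r: r.warn_c - r.temp_c)


readings = [
    ModuleReading(68.0, 75.0, 85.0),  # margin to warning: 7 degrees
    ModuleReading(66.0, 70.0, 80.0),  # margin to warning: 4 degrees
    ModuleReading(None, 0.0, 0.0),    # no usable reading, skipped
]
worst = worst_module(readings)
```

With the 75C/85C and 70C/80C thresholds from the example above, the module at
66C is picked over the one at 68C, because it is closer to its own warning
threshold.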
> > So the nominal temperature is not what matters here; we should know the
> > "worst" value for the thermal control decision.
>
> What it sounds like to me is you are working around problems in the thermal
> control by fudging the raw temperatures. That is the wrong thing to do. hwmon
> should export the raw data, and you should fix the thermal control code to use it
> correctly.
By default we use the kernel step-wise thermal algorithm, taking into account
the temperatures of all the modules and of the ASIC ambient sensor. This is
not a workaround. In the thermal zone we have one PWM control and one
cumulative temperature from the modules and the ASIC. And it gives stable and
correct results.
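
For readers not familiar with the step-wise approach, a rough sketch of the
idea with one temperature input and one cooling device is below. This is a
simplified model and not the kernel governor code: next_cooling_state() and
its parameters are hypothetical names, and the trend is approximated by
comparing the current sample with the previous one.

```python
def next_cooling_state(cur_state: int, max_state: int,
                       temp_c: float, prev_temp_c: float,
                       trip_c: float) -> int:
    """Move the cooling state (e.g. a PWM level index) by at most one step.

    Simplified model: step up when at or above the trip point and rising,
    step down when below the trip point and falling, otherwise hold.
    """
    throttle = temp_c >= trip_c
    if throttle and temp_c > prev_temp_c:
        return min(cur_state + 1, max_state)  # clamp at the highest state
    if not throttle and temp_c < prev_temp_c:
        return max(cur_state - 1, 0)          # clamp at the lowest state
    return cur_state


# Single cumulative temperature in, single cooling state out.
state = next_cooling_state(cur_state=3, max_state=10,
                           temp_c=78.0, prev_temp_c=76.0, trip_c=75.0)
```

The point of the sketch is only that the algorithm consumes one temperature
value per zone, which is why the module and ASIC readings are combined before
they reach it.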
>
> Andrew