Message-ID: <HE1PR0502MB3753A634C0401556AA599C24A2750@HE1PR0502MB3753.eurprd05.prod.outlook.com>
Date:   Fri, 22 Jun 2018 09:00:50 +0000
From:   Vadim Pasternak <vadimp@...lanox.com>
To:     Guenter Roeck <linux@...ck-us.net>, Andrew Lunn <andrew@...n.ch>
CC:     "davem@...emloft.net" <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "jiri@...nulli.us" <jiri@...nulli.us>
Subject: RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for
 port temperature reading



> -----Original Message-----
> From: Guenter Roeck [mailto:linux@...ck-us.net]
> Sent: Friday, June 22, 2018 1:07 AM
> To: Andrew Lunn <andrew@...n.ch>
> Cc: Vadim Pasternak <vadimp@...lanox.com>; davem@...emloft.net;
> netdev@...r.kernel.org; jiri@...nulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
> 
> On Thu, Jun 21, 2018 at 08:34:40PM +0200, Andrew Lunn wrote:
> > > Hi Andrew,
> >
> > Adding Guenter Roeck, the HWMON maintainer.

Hi Guenter,

Thank you for the reply.
We are going to re-post this patchset and add more people
for review, as Andrew suggested.

> >
> > > The temperature of each individual module can be obtained through
> > > ethtool.
> >
> > You mean via --module-info?
> >
> > FYI: I plan to add hwmon support to the kernel SFP code. So if you
> > ever decide to swap to the kernel SFP code, not your own, the raw
> > temperatures will be exported.
> >
> As should be. Unless adjustments are made by the hardware (eg a thermal diode
> temperature offset register), all adjustments should be made in userspace.
> 

From hardware we read all module temperatures in one request.
A summary of these temperatures is passed to the thermal module.

> > > The worst temperature is necessary for the system cooling control
> > > decision.
> >
> > I would expect the system cooling would understand that.
> >
> > > Up to 64 SFP/QSFP modules could be connected to the system.
> > > Some of them could be copper modules, which don't provide temperature
> > > measurement.
> >
> > SFP modules are hot-plugable. So i would also expect the hwmon devices
> > to hotplug. If there is no sensor, then there is no hwmon device... If
> > there is no hwmon device, it plays no part in the thermal control
> > loop.
> >
> One hardware monitoring device per SFP, and I would assume that the hwmon
> device for an SFP is only instantiated if a thermal sensor is present.
> 
> > > Some of them could be optical modules, providing untrusted
> > > temperature measurement, which could impact thermal control of the
> > > system.
> >
> > Why would you not trust it? Are you saying some modules simply have
> > broken temperature sensors? Do you have a whitelist/blacklist of
> > modules?
> >
> > > Also optical modules could be from the different vendors,  and this
> > > is real situation, when, f.e. one module has the warning and
> > > critical thresholds 75C and 85C, while another 70C and 80C.
> >
> > But hwmon exports both the actual temperature and the alarm
> > temperatures. I would expect the thermal control code to use all this
> > information when making its decisions, not just the current
> > temperature.
> >
> The respective information would either be provided by hardware and reported
> to userspace, or userspace needs to determine the limits and write them into the
> hardware. Either case, that is only relevant if the hardware has limit registers.
> Otherwise all limits can and should be handled in the thermal subsystem.
> 

This is the case. Limits, module temperatures and thresholds are handled in
the thermal subsystem.

> > > So, nominal temperature is not the case here, we should know the
> > > "worst" value for the thermal control decision.
> >
> > What it sounds like to me is you are working around problems in the
> > thermal control by fudging the raw temperatures. That is the wrong
> > thing to do. hwmon should export the raw data, and you should fix the
> > thermal control code to use it correctly.
> >
> Agreed. From the context it sounds like there should be some kind of
> temperature aggregator which would probably reside in the thermal subsystem
> (definitely not in hwmon).

This is exactly that kind of temperature aggregation. It is provided to the
thermal subsystem as the thermal zone temperature.
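
To illustrate, here is a minimal sketch of one possible reading of "worst":
the module with the least headroom to its own critical threshold. This is
an assumption made for the example, not the rule used in the patch. The
struct, field names and worst_module() are hypothetical and are not taken
from mlxsw.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-module reading; not the mlxsw data layout. */
struct module_temp {
	bool has_sensor;	/* false for modules without a sensor */
	int temp_mc;		/* current temperature, millidegrees C */
	int crit_mc;		/* module's own critical threshold */
};

/*
 * Return the index of the module with the least headroom
 * (crit_mc - temp_mc), or -1 if no module reports a temperature.
 */
static int worst_module(const struct module_temp *m, size_t n)
{
	int worst = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!m[i].has_sensor)
			continue;
		if (worst < 0 ||
		    m[i].crit_mc - m[i].temp_mc <
		    m[worst].crit_mc - m[worst].temp_mc)
			worst = (int)i;
	}
	return worst;
}
```

With thresholds of 85C and 80C, a module at 68C with the 80C limit has
less headroom than a module at 70C with the 85C limit, so comparing raw
temperatures alone would pick the wrong one under this reading.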

> 
> I have not seen any hwmon specific patches. For new drivers, please use
> [devm_]hwmon_device_register_with_info().

We already have a hwmon object, which is why I didn't use this
interface. The existing object has been extended with a few new
attributes for fans.
I also added an aggregated FAULT status for the modules. If even
one module is considered "untrusted" (this indication comes from the HW),
this attribute is set to true.
This indication could be used by the user.
In our systems such a fault affects the allowed PWM minimum,
which could be assigned to the thermal zone cooling device.
For example, if in the normal situation the minimum for some particular
system could be set to 20%, then if any untrusted module
is found, this minimum should be increased to 40% (these percentages
depend on the system type).
It doesn't matter how many untrusted modules are inserted. If there
is even one, it could have a very bad impact on the system thermal flow.
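
A minimal sketch of that policy, assuming the 20% and 40% example values
above. The function and macro names are hypothetical and only illustrate
the idea. This is not the driver code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Example percentages only; the real values depend on the system type. */
#define PWM_MIN_NORMAL_PCT	20
#define PWM_MIN_FAULT_PCT	40

/* Aggregated FAULT: true if at least one module is untrusted. */
static bool any_untrusted(const bool *untrusted, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (untrusted[i])
			return true;
	return false;
}

/* Clamp a requested PWM duty (percent) to the currently allowed range. */
static int pwm_clamp(int requested_pct, bool module_fault)
{
	int min_pct = module_fault ? PWM_MIN_FAULT_PCT : PWM_MIN_NORMAL_PCT;

	if (requested_pct < min_pct)
		return min_pct;
	if (requested_pct > 100)
		return 100;
	return requested_pct;
}
```

The count of untrusted modules is deliberately not tracked here: one is
enough to raise the minimum.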

Thanks,
Vadim.

> 
> Guenter
