netdev - Re: [PATCH net-next 3/3] net: phy: realtek: add hwmon support for temp sensor on RTL822x

Message-ID: <0adfb0e4-72b2-48c1-bf65-da75213a5f18@lunn.ch>
Date: Sat, 11 Jan 2025 18:00:14 +0100
From: Andrew Lunn <andrew@...n.ch>
To: Heiner Kallweit <hkallweit1@...il.com>
Cc: Guenter Roeck <linux@...ck-us.net>,
	Russell King - ARM Linux <linux@...linux.org.uk>,
	Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>,
	David Miller <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-hwmon@...r.kernel.org" <linux-hwmon@...r.kernel.org>,
	Jean Delvare <jdelvare@...e.com>
Subject: Re: [PATCH net-next 3/3] net: phy: realtek: add hwmon support for
 temp sensor on RTL822x

> According to Guenter's feedback the alarm attribute must not be written
> and is expected to be self-clearing on read.
> If we would clear the alarm in the chip on alarm attribute read, then
> we can have the following ugly scenario:
> 
> 1. Temperature threshold is exceeded and chip reduces speed to 1Gbps
> 2. Temperature is falling below alarm threshold
> 3. User uses "sensors" to check the current temperature
> 4. The implicit alarm attribute read causes the chip to clear the
>    alarm and re-enable 2.5Gbps speed, resulting in the temperature
>    alarm threshold being exceeded very soon again.
> 
> What isn't nice here is that it's not transparent to the user that
> a read-only command from his perspective causes the protective measure
> of the chip to be cancelled.
> 
> There's no existing hwmon attribute meant to be used by the user
> to clear a hw alarm once he took measures to protect the chip
> from overheating.

It is generally not the kernel's job to implement policy. User space
should be doing that.

I see two different possible policies, and there are maybe others:

1) The user is happy with one-second outages every so often as the
chip cycles between too hot and downshifting, and cool enough to
upshift back to the higher speed.

2) The user prefers to have reliable, slower connectivity and needs to
explicitly do something like down/up the interface to get it back to
the higher speed.

I personally would say, from a user support view, 2) is better. A
one-time, one-second break in connectivity and a kernel message are
going to cause fewer issues.

Maybe the solution is that the hwmon alarm attribute is not directly
the hardware bit, but a software interpretation of the system state.
When the alarm fires, copy it into a software alarm state, but leave
the hardware alarm alone. A hwmon read clears the software state, but
leaves the hardware alone. A down/up of the interface will then clear
both the software and hardware alarm state.
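
The state handling described above can be sketched as a small user-space
model. This is only an illustration of the idea, not driver code: the
struct and function names are hypothetical, and a real implementation
would read and clear the alarm through the PHY registers rather than
plain booleans.

```c
#include <stdbool.h>

/* Hypothetical model of the two alarm states; not taken from the patch. */
struct temp_alarm_state {
	bool hw_alarm;	/* models the latched alarm bit in the chip */
	bool sw_alarm;	/* software copy exposed through hwmon */
};

/* The chip signals over-temperature: latch it and copy it to software. */
void alarm_fired(struct temp_alarm_state *s)
{
	s->hw_alarm = true;
	s->sw_alarm = true;
}

/* hwmon alarm read: report and clear the software copy only. */
bool hwmon_read_alarm(struct temp_alarm_state *s)
{
	bool val = s->sw_alarm;

	s->sw_alarm = false;
	return val;
}

/* Interface down/up: clear both the software and the hardware state. */
void interface_bounce(struct temp_alarm_state *s)
{
	s->hw_alarm = false;
	s->sw_alarm = false;
}

/* The link stays downshifted for as long as the hardware alarm is latched. */
bool speed_limited(const struct temp_alarm_state *s)
{
	return s->hw_alarm;
}
```

With this model, running "sensors" after the alarm fired reports the
alarm once and then reads as clear, while speed_limited() stays true
until interface_bounce() is called.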

Anybody wanting policy 1) would then need a daemon polling the state
and taking action. 2) would be the default.
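
A daemon for policy 1) could look roughly like the sketch below. It is
only an assumption of how such a tool might be structured: the attribute
path, the interface name, and all function names are hypothetical, and
the action is left as a callback because how to bounce the interface is
itself a policy decision.

```c
#include <stdio.h>
#include <unistd.h>

/* Read an alarm attribute file: 1 if set, 0 if clear, -1 on error. */
int read_alarm_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	else
		val = val ? 1 : 0;
	fclose(f);
	return val;
}

/* Example action: only print what a real daemon might do. */
void print_action(const char *ifname)
{
	printf("alarm seen, would down/up %s to restore full speed\n", ifname);
}

/*
 * Poll the alarm attribute max_polls times, interval_sec apart, and call
 * action() each time the alarm is set. Returns the number of actions
 * taken, or -1 if the attribute could not be read.
 */
int run_daemon(const char *path, const char *ifname, unsigned int interval_sec,
	       int max_polls, void (*action)(const char *ifname))
{
	int actions = 0;

	for (int i = 0; i < max_polls; i++) {
		int alarm = read_alarm_attr(path);

		if (alarm < 0)
			return -1;
		if (alarm) {
			action(ifname);
			actions++;
		}
		sleep(interval_sec);
	}
	return actions;
}
```

A real daemon would likely also check that the temperature has dropped
before bringing the link back up, otherwise it just shortens the cycle
described in 1).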

How easy is it for you to get into the alarm state? Did you need an
environmental chamber/oven, or is it happening for you with just lots
of continuous traffic at typical room temperature? Are we talking about
cheap USB dongles in a sealed plastic case with poor thermal design
that are going to be doing this all the time?

	Andrew