[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <442420d6-3911-4956-95f1-c9b278d45cd6@molgen.mpg.de>
Date: Mon, 10 Feb 2025 13:07:28 +0100
From: Paul Menzel <pmenzel@...gen.mpg.de>
To: Jedrzej Jagielski <jedrzej.jagielski@...el.com>
Cc: intel-wired-lan@...ts.osuosl.org,
Anthony L Nguyen <anthony.l.nguyen@...el.com>, netdev@...r.kernel.org,
Simon Horman <horms@...nel.org>,
Przemyslaw Kitszel <przemyslaw.kitszel@...el.com>,
Mateusz Polchlopek <mateusz.polchlopek@...el.com>
Subject: Re: [Intel-wired-lan] [PATCH iwl-next v3] ixgbe: add support for
thermal sensor event reception
Dear Jedrzej,
Thank you for the quick reply.
Am 10.02.25 um 12:59 schrieb Jagielski, Jedrzej:
> From: Paul Menzel <pmenzel@...gen.mpg.de>
> Sent: Monday, February 10, 2025 12:40 PM
>> Am 10.02.25 um 11:40 schrieb Jedrzej Jagielski:
>>> E610 NICs unlike the previous devices utilising ixgbe driver
>>> are notified in the case of overheatning by the FW ACI event.
>>
>> overheating (without n)
>>
>>> In event of overheat when threshold is exceeded, FW suspends all
>>> traffic and sends overtemp event to the driver.
>>
>> I guess this was already the behavior before your patch, and now it’s
>> only logged, and the device stopped properly?
> Without this patch driver cannot receive the overtemp info. FW handles
> the event - device stops but user has no clue what has happened.
> FW does the major work but this patch adds this minimal piece of
> overtemp handling done by SW - informing user at the very end.
Thank you for the confirmation. Maybe add a small note, that it’s not
logged to make it crystal clear for people like myself.
>>> Then driver
>>> logs appropriate message and closes the adapter instance.
>>> The card remains in that state until the platform is rebooted.
>>
>> As a user I’d be interested what the threshold is, and what the measured
>> temperature is. Currently, the log seems to be just generic?
>
> These details are FW internals.
> Driver just gets info that such event has happened.
> There's no additional information.
>
> In that case driver's job is just to inform user that such scenario
> has happened and tell what should be the next steps.
From a user perspective that is a suboptimal behavior, and shows
another time that the Linux kernel should have all the control, and
stuff like this should be moved *out* of the firmware and not into the
firmware.
It’d be great if you could talk to the card designers/engineers to take
that into account, and maybe revert this decision for future versions or
future adapters.
Kind regards,
Paul
>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:static const char ixgbe_overheat_msg[] = "Network adapter has been stopped because it has over heated. Restart the computer. If the problem persists, power off the system and replace the adapter";
>>
>>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@...el.com>
>>> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@...el.com>
>>> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@...el.com>
>>> ---
>>> v2,3 : commit msg tweaks
>>> ---
>>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 +++++
>>> drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h | 3 +++
>>> 2 files changed, 8 insertions(+)
>>
>>
>> Kind regards,
>>
>> Paul
>>
>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> index 7236f20c9a30..5c804948dd1f 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> @@ -3165,6 +3165,7 @@ static void ixgbe_aci_event_cleanup(struct ixgbe_aci_event *event)
>>> static void ixgbe_handle_fw_event(struct ixgbe_adapter *adapter)
>>> {
>>> struct ixgbe_aci_event event __cleanup(ixgbe_aci_event_cleanup);
>>> + struct net_device *netdev = adapter->netdev;
>>> struct ixgbe_hw *hw = &adapter->hw;
>>> bool pending = false;
>>> int err;
>>> @@ -3185,6 +3186,10 @@ static void ixgbe_handle_fw_event(struct ixgbe_adapter *adapter)
>>> case ixgbe_aci_opc_get_link_status:
>>> ixgbe_handle_link_status_event(adapter, &event);
>>> break;
>>> + case ixgbe_aci_opc_temp_tca_event:
>>> + e_crit(drv, "%s\n", ixgbe_overheat_msg);
>>> + ixgbe_close(netdev);
>>> + break;
>>> default:
>>> e_warn(hw, "unknown FW async event captured\n");
>>> break;
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>> index 8d06ade3c7cd..617e07878e4f 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>> @@ -171,6 +171,9 @@ enum ixgbe_aci_opc {
>>> ixgbe_aci_opc_done_alt_write = 0x0904,
>>> ixgbe_aci_opc_clear_port_alt_write = 0x0906,
>>>
>>> + /* TCA Events */
>>> + ixgbe_aci_opc_temp_tca_event = 0x0C94,
>>> +
>>> /* debug commands */
>>> ixgbe_aci_opc_debug_dump_internals = 0xFF08,
Powered by blists - more mailing lists