[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4b5b0f52-7ed8-7eef-2467-fa59ca5de937@intel.com>
Date: Sun, 2 Mar 2025 15:09:35 +0200
From: "Lifshits, Vitaly" <vitaly.lifshits@...el.com>
To: Mark Pearson <mpearson-lenovo@...ebb.ca>, Andrew Lunn <andrew@...n.ch>
CC: <anthony.l.nguyen@...el.com>, <przemyslaw.kitszel@...el.com>,
<andrew+netdev@...n.ch>, <davem@...emloft.net>, <edumazet@...gle.com>,
<kuba@...nel.org>, <pabeni@...hat.com>, <intel-wired-lan@...ts.osuosl.org>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [Intel-wired-lan] [PATCH] e1000e: Link flap workaround option for
false IRP events
Hi Mark,
> Hi Andrew
>
> On Thu, Feb 27, 2025, at 11:07 AM, Andrew Lunn wrote:
>>>>> + e1e_rphy(hw, PHY_REG(772, 26), &phy_data);
>>>>
>>>> Please add some #define for these magic numbers, so we have some idea
>>>> what PHY register you are actually reading. That in itself might help
>>>> explain how the workaround actually works.
>>>>
>>>
>>> I don't know what this register does I'm afraid - that's Intel knowledge and has not been shared.
>>
>> What PHY is it? Often it is just a COTS PHY, and the datasheet might
>> be available.
>>
>> Given your setup description, pause seems like the obvious thing to
>> check. When trying to debug this, did you look at pause settings?
>> Knowing what this register is might also point towards pause, or
>> something totally different.
>>
>> Andrew
>
> For the PHY - do you know a way of determining this easily? I can reach out to the platform team but that will take some time. I'm not seeing anything in the kernel logs, but if there's a recommended way of confirming that would be appreciated.
The PHY is I219 PHY.
The datasheet is indeed accessible to the public:
https://cdrdv2-public.intel.com/612523/ethernet-connection-i219-datasheet.pdf
>
> We did look at at the pause pieces - which I agree seems like an obvious candidate given the speed mismatch on the network.
> Experts on the Intel networking team did reproduce the issue in their lab and looked at this for many weeks without determining root cause. I wish it was as obvious as pause control configuration :)
>
> Thanks
> Mark
>
Reading this register was suggested for debug purposes to understand if
there is some misconfiguration. We did not find any misconfiguration.
The issue as we discovered was a link status change interrupt caused the
CSME to reset the adapter causing the link flap.
We were unable to determine what causes the link status change interrupt
in the first place. As stated in the comment, it was only ever observed
on Lenovo P5/P7systems and we couldn't ever reproduce on other systems.
The reproduction in our lab was on a P5 system as well.
Regarding the suggested workaround, there isn’t a clear understanding
why it works. We suspect that reading a PHY register is probably
prevents the CSME from resetting the PHY when it handles the LSC
interrupt it gets. However, it can also be a matter of slight timing
variations.
We communicated that this solution is not likely to be accepted to the
kernel as is, and the initial responses on the mailing list demonstrate
the pushback. We do understand the frustration of end-users that may
experience the problem. A couple of suggestions that can make it look
less “out-of-the-blue” are: try a short delay instead of the register
read, or read a more common register like PHY STATUS instead.
On a different topic, I suggest removing the part of the comment below:
* Intel unable to determine root cause.
The issue went through joint debug by Intel and Lenovo, and no obvious
spec violations by either party were found. There doesn’t seem to be
value in including this information in the comments of upstream code.
Powered by blists - more mailing lists