Message-ID: <35937fe6d371a43aa0bfe70c9fab549b62089592.camel@kernel.org>
Date: Wed, 19 May 2021 17:07:00 -0700
From: Saeed Mahameed <saeed@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: davem@...emloft.net, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] mlx5: count all link events
On Wed, 2021-05-19 at 14:06 -0700, Jakub Kicinski wrote:
> On Wed, 19 May 2021 13:49:00 -0700 Saeed Mahameed wrote:
> > On Wed, 2021-05-19 at 10:18 -0700, Jakub Kicinski wrote:
> > > mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
> > > events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
> > > breaks link flap detection based on Linux carrier state transition
> > > count as netif_carrier_on() does nothing if carrier is already on.
> > > Make sure we count such events.
> >
> > Can you share more on the actual scenario that has happened?
> > in mlx5 i know of situations where fw might generate such events, just
> > as FYI for virtual ports (vports) on some configuration changes.
> >
> > another explanation is that in the driver we explicitly query the link
> > state and we never take the event value, so it could have been that the
> > link flapped so fast we missed the intermediate state.
>
> The link flaps quite a bit, this is likely a bad cable or port.
> I scanned the fleet a little bit more and I see a couple machines
> in such state, in each case the switch is also seeing the link flaps,
> not just the NIC. Without this patch the driver registers a full flap
> once every ~15min, with the patch it's once a second. That's much
> closer to what the switch registers.
>
> Also the issue affects all hosts in MH, and persists across reboots
> of a single host (hence I could test this patch).
>
Does it reproduce across reboots even with a good cable?
Did you reboot the peer machine or the DUT (device under test)?
> > According to HW spec for some reason we should always query and not
> > rely on the event.
> >
> > <quote>
> > If software retrieves this indication (port state change event), this
> > signifies that the state has been changed and a QUERY_VPORT_STATE
> > command should be performed to get the new state.
> > </quote>
>
> I see, seems reasonable. I'm guessing the FW generates only one of the
> events on minor types of faults? I don't think the link goes fully down,
> because I can SSH to those machines, they just periodically drop
> traffic. But they can't fully retrain the link at such a high rate,
> I don't think.
>
Hmm, then I would like to get to the bottom of this, so I will have to
consult with FW.
Regardless, we can progress with the patch; I think the HW spec
description forces us to do so.
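
For reference, a minimal userspace sketch of the counting problem
discussed in this thread. It is a model only: every name in it
(model_netdev, model_carrier_on, model_carrier_event, and so on) is
hypothetical and is not a kernel or mlx5 symbol, and it is not meant to
reproduce the actual patch. It assumes the handler queries the port
state after each event, as the spec text quoted above requires, and that
a repeated "up" result means a flap whose down state was missed.

```c
#include <stdbool.h>

/* Userspace model only; none of these names exist in the kernel. */
struct model_netdev {
	bool carrier_ok;
	unsigned int carrier_up_count;
	unsigned int carrier_down_count;
};

/* Like the behaviour described in the thread: a no-op if already on. */
static void model_carrier_on(struct model_netdev *dev)
{
	if (dev->carrier_ok)
		return;
	dev->carrier_ok = true;
	dev->carrier_up_count++;
}

static void model_carrier_off(struct model_netdev *dev)
{
	if (!dev->carrier_ok)
		return;
	dev->carrier_ok = false;
	dev->carrier_down_count++;
}

/* Assumption: account one full flap (down + up) without changing state. */
static void model_carrier_event(struct model_netdev *dev)
{
	dev->carrier_down_count++;
	dev->carrier_up_count++;
}

/*
 * Port-change handler. queried_up is the state read back after the
 * event; count_all selects the "count every event" behaviour.
 */
static void model_port_change(struct model_netdev *dev, bool queried_up,
			      bool count_all)
{
	if (!queried_up) {
		model_carrier_off(dev);
		return;
	}
	if (dev->carrier_ok && count_all)
		model_carrier_event(dev);
	else
		model_carrier_on(dev);
}

/* Deliver n events that all query as "up"; return total transitions. */
static unsigned int model_replay_up_events(bool count_all, unsigned int n)
{
	struct model_netdev dev = { .carrier_ok = true };
	unsigned int i;

	for (i = 0; i < n; i++)
		model_port_change(&dev, true, count_all);

	return dev.carrier_up_count + dev.carrier_down_count;
}
```

With count_all false, a run of events that all read back as "up"
leaves both counters untouched, which would match the undercounting
reported above. With count_all true, each such event is accounted as
one down and one up transition.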