netdev - Re: [PATCH net-next] mlx5: count all link events


Message-ID: <20210519174443.39b7cec9@kicinski-fedora-PC1C0HJN>
Date:   Wed, 19 May 2021 17:44:43 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Saeed Mahameed <saeed@...nel.org>
Cc:     davem@...emloft.net, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] mlx5: count all link events

On Wed, 19 May 2021 17:07:00 -0700 Saeed Mahameed wrote:
> On Wed, 2021-05-19 at 14:06 -0700, Jakub Kicinski wrote:
> > On Wed, 19 May 2021 13:49:00 -0700 Saeed Mahameed wrote:  
> > > Can you share more on the actual scenario that has happened?
> > > In mlx5 I know of situations where FW might generate such events,
> > > just as an FYI, for virtual ports (vports) on some configuration
> > > changes.
> > > 
> > > Another explanation is that in the driver we explicitly query the
> > > link state and never take the event value, so it could be that the
> > > link flapped so fast we missed the intermediate state.  
> > 
> > The link flaps quite a bit, this is likely a bad cable or port.
> > I scanned the fleet a little bit more and I see a couple machines 
> > in such state, in each case the switch is also seeing the link flaps,
> > not just the NIC. Without this patch the driver registers a full flap
> > once every ~15min, with the patch it's once a second. That's much
> > closer to what the switch registers.
> > 
> > Also the issue affects all hosts in MH, and persists across reboots
> > of a single host (hence I could test this patch).
> 
> Does it reproduce on reboots even with a good cable?

I don't have access to the machines so the cable stays the same. I was
just saying that it doesn't seem like a driver issue since it persists
across reboots.

> Did you reboot the peer machine or the DUT (device under test)?

DUT

> > > According to the HW spec, for some reason we should always query
> > > and not rely on the event.
> > > 
> > > <quote>
> > > If software retrieves this indication (port state change event),
> > > this signifies that the state has been changed and a
> > > QUERY_VPORT_STATE command should be performed to get the new state.
> > > </quote>  
> > 
> > I see, seems reasonable. I'm guessing the FW generates only one of
> > the events on minor types of faults? I don't think the link goes
> > fully down, because I can SSH to those machines, they just
> > periodically drop traffic. But they can't fully retrain the link at
> > such a high rate, I don't think.
> >   
> 
> Hmm, then I would like to get to the bottom of this, so I will have to
> consult with FW.
> But regardless, we can progress with the patch; I think the HW spec
> description forces us to do so.

SGTM :)
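
For readers following the thread, the effect discussed above can be
illustrated with a small user-space model. This is only a sketch, not
the mlx5 code or the patch itself: struct link_sim, handle_port_event()
and replay_events() are hypothetical names, and treating an event with
no visible state change as one down plus one up transition is an
assumption made for illustration. It models a driver that ignores the
event payload and queries the port state, so a flap faster than the
query leaves the carrier state looking unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a netdev's carrier bookkeeping. */
struct link_sim {
    bool carrier_up;          /* last state reported to the stack */
    unsigned int up_count;    /* counted up transitions */
    unsigned int down_count;  /* counted down transitions */
};

/*
 * Handle one "port state changed" event. The event payload is ignored;
 * queried_up is the state read back after the event arrived.
 */
void handle_port_event(struct link_sim *s, bool queried_up,
                       bool count_all_events)
{
    if (queried_up == s->carrier_up) {
        /*
         * The link changed and changed back before we queried it.
         * Assumption: count this as one full flap (down, then up).
         */
        if (count_all_events) {
            s->down_count++;
            s->up_count++;
        }
        return;
    }

    s->carrier_up = queried_up;
    if (queried_up)
        s->up_count++;
    else
        s->down_count++;
}

/* Replay a sequence of queried states, starting with the carrier up. */
struct link_sim replay_events(const bool *queried, size_t n,
                              bool count_all_events)
{
    struct link_sim s = { .carrier_up = true, .up_count = 0, .down_count = 0 };

    assert(queried != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        handle_port_event(&s, queried[i], count_all_events);
    return s;
}
```

Replaying five events whose queried states are up, up, up, down, up
counts one flap when unchanged-state events are ignored and four flaps
when every event is counted, which is the kind of gap between NIC and
switch counters described earlier in the thread.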