Message-ID: <CA+sq2CfBMqgt+yzbx41d7BJyQJfnGWP6VtgQzRABuAFum+nB2w@mail.gmail.com>
Date:   Sat, 7 Nov 2020 21:21:27 +0530
From:   Sunil Kovvuri <sunil.kovvuri@...il.com>
To:     Saeed Mahameed <saeed@...nel.org>
Cc:     George Cherian <gcherian@...vell.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jiri Pirko <jiri@...dia.com>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Sunil Kovvuri Goutham <sgoutham@...vell.com>,
        Linu Cherian <lcherian@...vell.com>,
        Geethasowjanya Akula <gakula@...vell.com>,
        "masahiroy@...nel.org" <masahiroy@...nel.org>,
        "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>
Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health
 reporters for NIX

On Sat, Nov 7, 2020 at 2:28 AM Saeed Mahameed <saeed@...nel.org> wrote:
>
> On Fri, 2020-11-06 at 00:59 +0530, Sunil Kovvuri wrote:
> > > > > > Output:
> > > > > >  # ./devlink health
> > > > > >  pci/0002:01:00.0:
> > > > > >    reporter npa
> > > > > >      state healthy error 0 recover 0
> > > > > >    reporter nix
> > > > > >      state healthy error 0 recover 0
> > > > > >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> > > > > >   NIX_AF_GENERAL:
> > > > > >          Memory Fault on NIX_AQ_INST_S read: 0
> > > > > >          Memory Fault on NIX_AQ_RES_S write: 0
> > > > > >          AQ Doorbell error: 0
> > > > > >          Rx on unmapped PF_FUNC: 0
> > > > > >          Rx multicast replication error: 0
> > > > > >          Memory fault on NIX_RX_MCE_S read: 0
> > > > > >          Memory fault on multicast WQE read: 0
> > > > > >          Memory fault on mirror WQE read: 0
> > > > > >          Memory fault on mirror pkt write: 0
> > > > > >          Memory fault on multicast pkt write: 0
> > > > > >    NIX_AF_RAS:
> > > > > >          Poisoned data on NIX_AQ_INST_S read: 0
> > > > > >          Poisoned data on NIX_AQ_RES_S write: 0
> > > > > >          Poisoned data on HW context read: 0
> > > > > >          Poisoned data on packet read from mirror buffer: 0
> > > > > >          Poisoned data on packet read from mcast buffer: 0
> > > > > >          Poisoned data on WQE read from mirror buffer: 0
> > > > > >          Poisoned data on WQE read from multicast buffer: 0
> > > > > >          Poisoned data on NIX_RX_MCE_S read: 0
> > > > > >    NIX_AF_RVU:
> > > > > >          Unmap Slot Error: 0
> > > > > >
> > > > >
> > > > > > Now I am a little bit skeptical here. The devlink health reporter
> > > > > > infrastructure was never meant to deal with the dump op only; the
> > > > > > main purpose is to diagnose/dump and recover.
> > > > > >
> > > > > > Especially in your use case, where you only report counters, I
> > > > > > don't believe devlink health dump is a proper interface for this.
> > > > > These are not counters. These are error interrupts raised by HW
> > > > > blocks.
> > > > > The count is provided to show how frequently the errors are seen.
> > > > > Error recovery for some of the blocks happens internally. That is
> > > > > the reason only the dump op is currently added.
> > >
> > > > So you are counting these events in the driver; that sounds like a
> > > > counter to me. I really think this shouldn't belong in devlink
> > > > unless you really utilize devlink health ops for actual reporting
> > > > and recovery.
> > > >
> > > > What's wrong with just dumping these counters to ethtool?
> >
> > > This driver is an administrative driver which handles all the
> > > resources in the system and doesn't do any IO.
> > > NIX and NPA are key co-processor blocks which this driver handles.
> > > With NIX and NPA, there are pieces which get attached to a PCI device
> > > to make it a networking device. We have netdev drivers registered to
> > > this networking device. Some more information about the drivers is
> > > available at
> > > https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/marvell/octeontx2.html
> > >
> > > So we don't have a netdev here to report these co-processor block
> > > level errors over ethtool.
> >
>
> But the AF driver can't operate your HW standalone; it must have a
> PF/VF with a netdev interface to do IO, so even if your model is modular,
> a common user of this driver will always see a netdev.
>

That's right, a user will always see a netdev, but the co-processor
blocks are organized like this:
- Each co-processor has two parts: an AF unit and LF (local function)
units.
- Each co-processor can have multiple LFs. In the case of the NIX
co-processor, each LF provides RQs, SQs, CQs, etc.
- The AF driver handles the co-processor's AF unit and, upon
receiving requests from a PF/VF, attaches LFs to it so that it
can do network IO.
- Within a co-processor, AF unit specific (global) errors are reported
to the AF driver, and LF specific errors are reported to the netdev
driver.
- There can be tens of netdev driver instances in the system, so these
AF unit global errors cannot be routed to and shown in any one
netdev's ethtool.
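
To make the split above concrete, here is a rough, standalone sketch in
plain C. It is not the octeontx2 driver code and does not use any kernel
API; every name in it (struct coproc, af_attach_lf, report_error,
MAX_LFS) is made up for illustration. It only models the routing rule:
AF-global errors are accounted to the one AF driver, while LF-local
errors are accounted to the LF, and so to whichever PF/VF netdev owns it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LFS 8 /* arbitrary size chosen for this sketch */

/* Hypothetical error scopes: AF unit (global) vs. a single LF (local). */
enum err_scope { ERR_AF_GLOBAL, ERR_LF_LOCAL };

struct lf {
	bool attached;
	int owner_pcifunc;      /* PF/VF that requested this LF, -1 if free */
	unsigned int lf_errors; /* would be seen by the owning netdev driver */
};

struct coproc {
	struct lf lfs[MAX_LFS];
	unsigned int af_errors; /* would be seen only by the AF driver */
};

static void coproc_init(struct coproc *c)
{
	assert(c != NULL);
	c->af_errors = 0;
	for (size_t i = 0; i < MAX_LFS; i++) {
		c->lfs[i].attached = false;
		c->lfs[i].owner_pcifunc = -1;
		c->lfs[i].lf_errors = 0;
	}
}

/* AF side: attach the first free LF to the requesting PF/VF.
 * Returns the LF index, or -1 if no LF is free. */
static int af_attach_lf(struct coproc *c, int pcifunc)
{
	for (int i = 0; i < MAX_LFS; i++) {
		if (!c->lfs[i].attached) {
			c->lfs[i].attached = true;
			c->lfs[i].owner_pcifunc = pcifunc;
			return i;
		}
	}
	return -1;
}

/* Route one error event. AF-global errors have no owning LF, so they
 * can only be counted at the AF level; LF-local errors are counted on
 * the LF itself. Returns false if an LF-local error names an LF that
 * is out of range or not attached. */
static bool report_error(struct coproc *c, enum err_scope scope, int lf_idx)
{
	if (scope == ERR_AF_GLOBAL) {
		c->af_errors++;
		return true;
	}
	if (lf_idx < 0 || lf_idx >= MAX_LFS || !c->lfs[lf_idx].attached)
		return false;
	c->lfs[lf_idx].lf_errors++;
	return true;
}
```

In this model an AF-global event never touches any LF's counter, which
is the reason there is no single netdev whose ethtool output would be
the natural place to show it.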

Thanks,
Sunil.
