netdev - Re: [RFC PATCH net-next 0/6] devlink: Add device metric support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200819110725.6e8744ce@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Wed, 19 Aug 2020 11:07:25 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Florian Fainelli <f.fainelli@...il.com>
Cc:     David Ahern <dsahern@...il.com>, Ido Schimmel <idosch@...sch.org>,
        netdev@...r.kernel.org, davem@...emloft.net, jiri@...dia.com,
        amcohen@...dia.com, danieller@...dia.com, mlxsw@...dia.com,
        roopa@...dia.com, andrew@...n.ch, vivien.didelot@...il.com,
        tariqt@...dia.com, ayal@...dia.com, mkubecek@...e.cz,
        Ido Schimmel <idosch@...dia.com>
Subject: Re: [RFC PATCH net-next 0/6] devlink: Add device metric support

On Wed, 19 Aug 2020 10:20:08 -0700 Florian Fainelli wrote:
> > I'm trying to find a solution which will not require a policeman to
> > constantly monitor the compliance. Please see my effort to ensure
> > drivers document and use the same ethtool -S stats in the TLS offload
> > implementations. I've been trying to improve this situation for a long
> > time, and it's getting old.  
> 
> Which is why I am asking genuinely what do you think should be done
> besides doing more code reviews? It does not seem to me that there is an
> easy way to catch new stats being added with tools/scripts/whatever and
> then determine what they are about, right?

I don't have a great way forward in mind, sadly. All I can think of is
that we should try to create more well defined interfaces and steer
away from free-form ones.

Example, here if the stats are vxlan decap/encap/error - we should
expose that from the vxlan module. That way vxlan module defines one
set of stats for everyone.

In general unless we attach stats to the object they relate to, we will
end up building parallel structures for exposing statistics from the
drivers. I posted a set once which was implementing hierarchical stats,
but I've abandoned it for this reason.

> > Please focus on the stats this set adds, instead of fantasizing of what
> > could be. These are absolutely not implementation specific!  
> 
> Not sure if fantasizing is quite what I would use. I am just pointing
> out that given the inability to standardize on statistics maybe we
> should have namespaces and try our best to have everything fit into the
> standard namespace along with a standard set of names, and push back
> whenever we see vendor stats being added (or more pragmatically, ask
> what they are). But maybe this very idea is moot.

IDK. I just don't feel like this is going to fly, see how many names
people invented for the CRC error statistic in ethtool -S, even tho
there is a standard stat for that! And users are actually parsing the
output of ethtool -S to get CRC stats because (a) it became the go-to
place for NIC stats and (b) some drivers forget to report in the
standard place.

The cover letter says this set replaces the bad debugfs with a good,
standard API. It may look good and standard for _vendors_ because they
will know where to dump their counters, but it makes very little
difference for _users_. If I have to parse names for every vendor I use,
I can as well add a per-vendor debugfs path to my script.

The bar for implementation-specific driver stats has to be high.

> >>> If I have to download vendor documentation and tooling, or adapt my own
> >>> scripts for every new vendor, I could have as well downloaded an SDK.    
> >>
> >> Are not you being a bit over dramatic here with your example?   
> > 
> > I hope not. It's very hard/impossible today to run a fleet of Linux
> > machines without resorting to vendor tooling.  
> 
> Your argument was putting on the same level resorting to vendor tooling
> to extract meaningful statistics/counters versus using a SDK to operate
> the hardware (this is how I understood it), and I do not believe this is
> fair.

Okay, fair. I just think that in datacenter deployments we are way
closer to the SDK model than people may want to admit.