Message-ID: <AM0PR05MB48664BACD47F08E6271C43A5D1C10@AM0PR05MB4866.eurprd05.prod.outlook.com>
Date: Thu, 9 Apr 2020 06:43:25 +0000
From: Parav Pandit <parav@...lanox.com>
To: Jakub Kicinski <kuba@...nel.org>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>
CC: Jiri Pirko <jiri@...nulli.us>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>,
Yuval Avnery <yuvalav@...lanox.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
Saeed Mahameed <saeedm@...lanox.com>,
"leon@...nel.org" <leon@...nel.org>,
"andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
"michael.chan@...adcom.com" <michael.chan@...adcom.com>,
Moshe Shemesh <moshe@...lanox.com>,
Aya Levin <ayal@...lanox.com>,
Eran Ben Elisha <eranbe@...lanox.com>,
Vlad Buslov <vladbu@...lanox.com>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
"dchickles@...vell.com" <dchickles@...vell.com>,
"sburla@...vell.com" <sburla@...vell.com>,
"fmanlunas@...vell.com" <fmanlunas@...vell.com>,
Tariq Toukan <tariqt@...lanox.com>,
"oss-drivers@...ronome.com" <oss-drivers@...ronome.com>,
"snelson@...sando.io" <snelson@...sando.io>,
"drivers@...sando.io" <drivers@...sando.io>,
"aelior@...vell.com" <aelior@...vell.com>,
"GR-everest-linux-l2@...vell.com" <GR-everest-linux-l2@...vell.com>,
"grygorii.strashko@...com" <grygorii.strashko@...com>,
mlxsw <mlxsw@...lanox.com>, Ido Schimmel <idosch@...lanox.com>,
Mark Zhang <markz@...lanox.com>,
"jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
Alex Vesker <valex@...lanox.com>,
"linyunsheng@...wei.com" <linyunsheng@...wei.com>,
"lihong.yang@...el.com" <lihong.yang@...el.com>,
"vikas.gupta@...adcom.com" <vikas.gupta@...adcom.com>
Subject: RE: [RFC] current devlink extension plan for NICs
> From: Jakub Kicinski <kuba@...nel.org>
> Sent: Thursday, April 9, 2020 7:37 AM
>
> On Wed, 8 Apr 2020 18:13:50 +0000 Parav Pandit wrote:
> > > From: netdev-owner@...r.kernel.org <netdev-owner@...r.kernel.org> On
> > > Behalf Of Jakub Kicinski
> > >
> > > On Wed, 8 Apr 2020 05:07:04 +0000 Parav Pandit wrote:
> > > > > > > > 3. In the future, at the eswitch PCI port, I will be adding
> > > > > > > > dpipe support for the internal flow tables done by the driver.
> > > > > > > > 4. There were inconsistencies among vendor drivers in
> > > > > > > > using/abusing phys_port_name of the eswitch ports. This is
> > > > > > > > consolidated via devlink port in the core, which provides a
> > > > > > > > consistent view among all vendor drivers.
> > > > > > > >
> > > > > > > > So PCI eswitch side ports are useful regardless of the slice.
> > > > > > > >
> > > > > > > > >> Additionally, the devlink port object doesn't go through
> > > > > > > > >> the same state machine that the slice has to go through.
> > > > > > > > >> So it's weird that some devlink ports have a state machine
> > > > > > > > >> and some don't.
> > > > > > > > >
> > > > > > > > > You mean for VFs? I think you can add the states to the API.
> > > > > > > > >
> > > > > > > > As we agreed above, eswitch-side objects (devlink port
> > > > > > > > and representor netdev) should not be used for a 'portion
> > > > > > > > of the device',
> > > > > > >
> > > > > > > We haven't agreed, I just explained how we differ.
> > > > > >
> > > > > > You mentioned that " Right, in my mental model representor
> > > > > > _is_ a port of the eswitch, so repr would not make sense to me."
> > > > > >
> > > > > > With that I infer that any object that is directly and
> > > > > > _always_ linked to the eswitch and represents an eswitch port
> > > > > > is out of the question; this includes the devlink port of the
> > > > > > eswitch and the netdev representor. Hence the comment that 'we
> > > > > > agree conceptually' not to involve the devlink port of the
> > > > > > eswitch and the representor netdev to represent a 'portion of
> > > > > > the device'.
> > > > >
> > > > > I disagree, repr is one to one with eswitch port. Just because
> > > > > repr is associated with a devlink port doesn't mean devlink port
> > > > > must be associated with a repr or a netdev.
> > > > A devlink port which is on the eswitch side is registered with a
> > > > switch_id and is also linked to the rep netdev.
> > > > From this port, phys_port_name is derived.
> > > > This eswitch port shouldn't represent a 'portion of the device'.
> > >
> > > switch_id is per port, so it's perfectly fine for a devlink port not
> > > to have one, or for two ports of the same device to have a different ID.
> > >
> > > The phys_port_name argument I don't follow. How does that matter in
> > > the "should we create another object" debate?
> > >
> > It's very clear in the net/core/devlink.c code that a devlink port
> > with a switch_id belongs to the switch side and is linked to the
> > eswitch representor netdev.
> >
> > It just cannot/should not be overloaded to drive host-side attributes.
> >
> > > IMO introducing the slice if it's 1:1 with ports is a no-go.
> > I disagree.
> > With that argument, the devlink port for the eswitch should not have
> > existed, and the netdev should have been self-describing.
> > But it is not done that way, for the three reasons I already described
> > in this thread.
> > Please get rid of the devlink eswitch port and put all of it in the
> > representor netdev; after that, the 1:1 no-go point makes sense. :-)
> >
> > Also, we already discussed that it's not 1:1. A slice might not have a
> > devlink port.
> > We don't want to start with the lowest common denominator and a narrow
> > use case.
> >
> > I also described to you that a slice runs through a state machine,
> > which a devlink port doesn't.
> > We don't want to overload the devlink port object.
> >
> > > I also don't like how
> > > creating a slice implicitly creates a devlink port in your design.
> > > If those objects are so strongly linked that creating one implies
> > > the other they should just be merged.
> > I disagree.
> > When a netdev representor is created, its multiple health reporters
> > (strongly linked) are created implicitly.
> > We didn't merge them, and the user didn't explicitly create them, for
> > good reasons.
> >
> > A slice, as described, represents a 'portion of a device'. As described
> > in the RFC, it's the master object for which other associated
> > sub-objects get created, like an optional devlink port, representor,
> > health reporters, and resources.
> > Again, it is not 1:1.
> >
> > As Jiri described, and you acked, a devlink slice does not have to
> > have a devlink port.
> >
> > There are enough examples in the devlink subsystem today where 1:1 and
> > non-1:1 objects can be related.
> > Shared buffers, devlink ports, health reporters, and representors have
> > such mappings with each other.
>
> I'm not going to respond to any of that. We're going in circles.
>
I don't think we are going in circles.
It's clear to me that some devlink objects are 1:1 mapped with other objects, and some are not.
And 1:1 mapping/no-mapping is not a good enough reason to avoid a crisp definition of a host-facing 'portion of the device'.
So let's focus on the technical aspects below.
> I bet you remember the history of PCI ports, and my initial patch set.
>
> We even had a call about this. Clearly all of it was a waste of time.
>
> > > I'm also concerned that the slice is basically a non-networking port.
> > What is the concern?
>
> What I wrote below, but you decided to split off in your reply for whatever
> reason.
>
> > How are shared-buffer and health reporter attributed as networking objects?
>
> By non-networking I mean non-ethernet, or host facing. Which should be
> clear from what I wrote below.
OK. I agree, 'host facing' is a good name.
>
> > > I bet some of the things we add there will one day be useful for
> > > networking or DSA ports.
> > >
> > I think this is a misinterpretation of the devlink slice object.
> > All the things we intend to do in a devlink slice are useful for both
> > networking and non-networking use.
> > So saying 'a devlink slice is a non-networking port, hence it cannot be
> > used for networking' is a wrong interpretation.
> >
> > I do not understand DSA ports much, but what blocks users from using a
> > slice if it fits the need in the future?
> >
> > How are shared buffer and health reporter 'networking' objects which
> > exist under devlink, but not strictly under a devlink port?
>
> E.g. you add rate limiting on the slice. That's something that may be useful
> for other ingress points of the device. But it's added to the slice, not the
> port. So we can't reuse the API for network ports.
>
To my knowledge, no user API exists in the upstream kernel for device (VF/PF) ingress rate limiting.
So in the future, if a user wants to do ingress rate limiting, it is better done using the rich eswitch and representor with tc.
Device egress rate limiting at the slice level is mainly coming from users who are migrating from 'ip link set vf rate',
and from grouping those VFs further, as Jiri described.
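For reference, the legacy interface those users are coming from looks roughly like this (the PF netdev name and VF index are only examples):

    # legacy per-VF TX rate cap in Mbps, configured through the PF
    ip link set enp59s0f0 vf 0 rate 1000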
Maybe at this juncture, we migrate users to start using tc, e.g.:
tc filter add dev enp59s0f0_0 root protocol ip matchall action police rate 1mbit burst 20k
Is there already a good way to group a set of related netdevs and issue a tc filter to the grouped netdev?
Or should we create one?
Is tc block sharing good for that purpose?
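If I understand tc block sharing correctly, the grouping could look something like the sketch below (the block index and representor names are hypothetical):

    # bind the ingress of several representors to one shared tc block
    tc qdisc add dev enp59s0f0_0 ingress_block 22 ingress
    tc qdisc add dev enp59s0f0_1 ingress_block 22 ingress

    # a single filter on the block then polices all member ports
    tc filter add block 22 protocol ip matchall action police rate 1mbit burst 20k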
With that, we have a cleaner slice interface, leaving the legacy behind.
Jiri, Jakub,
What is your input on grouping eswitch ports and configuring their ingress rate limiting, either per individual port or as a group/block/team/something else?
> > > So I'd suggest to maybe step back from the SmartNIC scenario and try
> > > to figure out how slices are useful on their own.
> > I already went through the requirements, scenarios, examples, and usage
> > model in the RFC extension, which describes
> > (a) how the slice fits both smartnic and non-smartnic cases, and
> > (b) how the user gets the same experience and commands regardless of the use case.
> >
> > A 'good' in-kernel example where one object is overloaded to do multiple
> > things would support the idea of overloading the devlink port.
> > For example, merging the macvlan and vlan driver objects to provide both
> > functionalities.
> > Overloading a recently introduced qdisc to do multiple things would be another.