Message-ID: <20200323213219.GC21532@C02YVCJELVCG.greyhouse.net>
Date:   Mon, 23 Mar 2020 17:32:19 -0400
From:   Andy Gospodarek <andy@...yhouse.net>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Jiri Pirko <jiri@...nulli.us>, netdev@...r.kernel.org,
        davem@...emloft.net, parav@...lanox.com, yuvalav@...lanox.com,
        jgg@...pe.ca, saeedm@...lanox.com, leon@...nel.org,
        andrew.gospodarek@...adcom.com, michael.chan@...adcom.com,
        moshe@...lanox.com, ayal@...lanox.com, eranbe@...lanox.com,
        vladbu@...lanox.com, kliteyn@...lanox.com, dchickles@...vell.com,
        sburla@...vell.com, fmanlunas@...vell.com, tariqt@...lanox.com,
        oss-drivers@...ronome.com, snelson@...sando.io,
        drivers@...sando.io, aelior@...vell.com,
        GR-everest-linux-l2@...vell.com, grygorii.strashko@...com,
        mlxsw@...lanox.com, idosch@...lanox.com, markz@...lanox.com,
        jacob.e.keller@...el.com, valex@...lanox.com,
        linyunsheng@...wei.com, lihong.yang@...el.com,
        vikas.gupta@...adcom.com, magnus.karlsson@...el.com
Subject: Re: [RFC] current devlink extension plan for NICs


[Sorry to do this, but I'm going to reply to a few different parts in
different messages.  I'm a bit late to the party and there's lots that
has already been worked out, but I want to address some of those in
detail.  Thanks to Jiri et al at Mellanox for starting this detailed
thread and sharing with the community.]

On Fri, Mar 20, 2020 at 02:25:08PM -0700, Jakub Kicinski wrote:
> On Fri, 20 Mar 2020 08:35:55 +0100 Jiri Pirko wrote:
> > Fri, Mar 20, 2020 at 04:32:53AM CET, kuba@...nel.org wrote:
> > >On Thu, 19 Mar 2020 20:27:19 +0100 Jiri Pirko wrote:  
> > >> 
> > >> ==================================================================
> > >> ||                                                              ||
> > >> ||                             PFs                              ||
> > >> ||                                                              ||
> > >> ==================================================================
> > >> 
> > >> There are 2 flavours of PFs:
> > >> 1) Parent PF. That is coupled with uplink port. The slice flavour is
> > >>    therefore "physical", to be in sync of the flavour of the uplink port.
> > >>    In case this Parent PF is actually a leg of upstream embedded switch,
> > >>    the slice flavour is "virtual" (same as the port flavour).
> > >> 
> > >>    $ devlink port show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> > >> 
> > >>    $ devlink slice show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> > >> 
> > >>    This slice is shown in both "switchdev" and "legacy" modes.
> > >> 
> > >>    If there is another parent PF, say "0000:06:00.1", that shares the
> > >>    same embedded switch, the aliasing is established for devlink handles.
> > >> 
> > >>    The user can use devlink handles:
> > >>    pci/0000:06:00.0
> > >>    pci/0000:06:00.1
> > >>    as equivalents, pointing to the same devlink instance.
> > >> 
> > >>    Parent PFs are the ones that may be in control of managing
> > >>    embedded switch, on any hierarchy level.
> > >> 
> > >> 2) Child PF. This is a leg of a PF put to the parent PF. It is
> > >>    represented by a slice, and a port (with a netdevice):
> > >> 
> > >>    $ devlink port show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1 slice 0
> > >>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2 slice 20
> > >> 
> > >>    $ devlink slice show
> > >>    pci/0000:06:00.0/0: flavour physical pfnum 0 port 0 state active
> > >>    pci/0000:06:00.0/20: flavour pcipf pfnum 1 port 1 hw_addr aa:bb:cc:aa:bb:87 state active  <<<<<<<<<<
> > >> 
> > >>    This is a typical smartnic scenario. You would see this list on
> > >>    the smartnic CPU. The slice pci/0000:06:00.0/20 is a leg to
> > >>    one of the hosts. If you send packets to enp6s0f0pf2, they will
> > >>    go to the host.
> > >> 
> > >>    Note that inside the host, the PF is represented again as "Parent PF"
> > >>    and may be used to configure nested embedded switch.  
> > >
> > >This parent/child PF I don't understand. Does it stem from some HW
> > >limitations you have?  
> > 
> > No limitation. It's just a name for 2 roles. I didn't know how else to
> > name it for the documentation purposes. Perhaps you can help me.
> > 
> > The child can simply manage a "nested eswitch". The "parent eswitch"
> > would see one leg (pf representor) one way or another. Only in case the
> > "nested eswitch" is there, the child would manage it - have separate
> > representors for vfs/sfs under its devlink instance.
> 
> I see! I wouldn't use the term PF. I think we need a notion of 
> a "virtual" port within the NIC to model the eswitch being managed 
> by the Host.
> 
> If Host manages the Eswitch - SmartNIC will no longer deal with its
> PCIe ports, but only with its virtual uplink.
> 

We have been referring to these as PFs for a while, but without any
in-kernel way to differentiate the two roles you describe as a
parent/child relationship.  The term someone came up with was
"PF Pairs", used when all traffic on a SmartNIC goes to a particular
PF on a host.

This is partly because, in the nominal case, when our SmartNIC is booted
the eSwitch is configured so that traffic is passed to the proper PF
based on the destination MAC of the traffic.  Here is a dump of the
interfaces on the smartnic side and server side for a 2-port card:

[root@...rtnic ~]# ip li sh | grep enp -A 1 
2: enP8p1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group de0
    link/ether 00:0a:f7:ac:cf:a0 brd ff:ff:ff:ff:ff:ff
3: enP8p1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group de0
    link/ether 00:0a:f7:ac:cf:a1 brd ff:ff:ff:ff:ff:ff
4: enP8p1s0f2np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT grou0
    link/ether 00:0a:f7:ac:cf:a2 brd ff:ff:ff:ff:ff:ff
5: enP8p1s0f3np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT grou0
    link/ether 00:0a:f7:ac:cf:a3 brd ff:ff:ff:ff:ff:ff

root@...ver:~# ip li sh | grep enp -A 1 
2: enp1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a8 brd ff:ff:ff:ff:ff:ff
3: enp1s0f1d1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:a9 brd ff:ff:ff:ff:ff:ff
4: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:aa brd ff:ff:ff:ff:ff:ff
5: enp1s0f3d1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:ac:cf:ab brd ff:ff:ff:ff:ff:ff

So while it might not make sense at first to have fewer physical
interfaces than PFs on the server or smartnic, this gives the flexibility
to have a PF on the server side that has direct network connectivity:
traffic destined to MAC address 00:0a:f7:ac:cf:a8 will go directly to
enp1s0f0, whether it comes off the wire or from other PFs on the server
or smartnic.
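
As a rough illustration of the nominal case, the destination-MAC steering
can be modelled as a simple lookup table.  The interface names and MAC
addresses are taken from the two dumps above; the table layout, the
forward() helper and the "no match returns None" behaviour are assumptions
made for this sketch, not a description of how the eSwitch is actually
programmed.

```python
from typing import Optional

# Hypothetical model: one entry per PF, keyed by the PF's MAC address.
# Names and MACs come from the `ip li sh` dumps; the structure is assumed.
MAC_TO_PF = {
    "00:0a:f7:ac:cf:a0": "enP8p1s0f0np0",  # smartnic
    "00:0a:f7:ac:cf:a1": "enP8p1s0f1np1",  # smartnic
    "00:0a:f7:ac:cf:a2": "enP8p1s0f2np0",  # smartnic
    "00:0a:f7:ac:cf:a3": "enP8p1s0f3np1",  # smartnic
    "00:0a:f7:ac:cf:a8": "enp1s0f0",       # server
    "00:0a:f7:ac:cf:a9": "enp1s0f1d1",     # server
    "00:0a:f7:ac:cf:aa": "enp1s0f2",       # server
    "00:0a:f7:ac:cf:ab": "enp1s0f3d1",     # server
}


def forward(dst_mac: str) -> Optional[str]:
    """Return the PF that would receive a frame with this destination MAC.

    Returns None when no PF owns the address; what real hardware does
    with such a frame is outside the scope of this sketch.
    """
    return MAC_TO_PF.get(dst_mac.lower())
```

With this table, forward("00:0a:f7:ac:cf:a8") resolves to enp1s0f0
regardless of which port or PF the frame arrived from, which is the
behaviour described above.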

We can also essentially 'lock out' PFs from being able to access the
physical ports if we want.  When that is done, the parent/child
relationship would be as you describe, and we would match up

enP8p1s0f2np0 (smartnic) <---> enp1s0f0 (server)
and
enP8p1s0f3np1 (smartnic) <---> enp1s0f1d1 (server)

and delete enp1s0f2 and enp1s0f3d1 on the server.

In this case PF0 and PF1 (enP8p1s0f0np0 and enP8p1s0f1np1) on the
smartnic effectively become the physical ports as there would be no
other 'ports' on the eswitch that are in the same broadcast domain.
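
The locked-out pairing can be sketched the same way.  The pairs and
interface names below are the ones listed above; the PF_PAIRS mapping and
the unpaired_server_pfs() helper are hypothetical names used only for
illustration, not an existing devlink or driver interface.

```python
# Hypothetical model of the "PF Pairs" layout described above:
# each smartnic-side PF maps to the server-side PF it is paired with.
PF_PAIRS = {
    "enP8p1s0f2np0": "enp1s0f0",
    "enP8p1s0f3np1": "enp1s0f1d1",
}

# Server-side PFs from the dump, in the order they were listed.
SERVER_PFS = ["enp1s0f0", "enp1s0f1d1", "enp1s0f2", "enp1s0f3d1"]


def unpaired_server_pfs(pairs, server_pfs):
    """Return the server PFs that have no smartnic peer.

    In the locked-out configuration these are the ones that would be
    deleted on the server.
    """
    paired = set(pairs.values())
    return [pf for pf in server_pfs if pf not in paired]
```

Applied to the lists above, unpaired_server_pfs(PF_PAIRS, SERVER_PFS)
leaves enp1s0f2 and enp1s0f3d1, matching the two interfaces that would be
deleted.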

I'm sure it comes as no surprise to anyone, but we also have the idea
that VFs can be paired in similar ways to PFs.  Practically speaking,
however, there is not much of a reason to use VFs on the SmartNIC
without VMs on the SmartNIC unless you are using this same parent/child
relationship.  Are you proposing that this will also be an option?

One thing we also find is that customers are generally not very
interested in having these changed in real time.  I do LOVE the idea of
being able to query this information via devlink, however, so let's keep
that rolling, and if people want them to be real PCI b/d/f I think that
should be allowed.
