Message-ID: <20201125105422.7f90184c@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date: Wed, 25 Nov 2020 10:54:22 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Saeed Mahameed <saeedm@...dia.com>, Eli Cohen <elic@...dia.com>,
Leon Romanovsky <leonro@...lanox.com>,
<netdev@...r.kernel.org>, <linux-rdma@...r.kernel.org>,
Eli Cohen <eli@...lanox.com>, Mark Bloch <mbloch@...dia.com>,
Maor Gottlieb <maorg@...dia.com>
Subject: Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX
namespace

On Tue, 24 Nov 2020 15:44:13 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:
> > > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> > > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:
> > > > > > > From: Eli Cohen <eli@...lanox.com>
> > > > > > >
> > > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > > > DPDK to have precedence in packet processing.
> > > > > >
> > > > > > > How do DPDK and VDPA relate in this context?
> > > > >
> > > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > > Up till now, the VDPA implementation would insert a rule into the
> > > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > > all the incoming traffic.
> > > > >
> > > > > > The MLX5_FLOW_NAMESPACE_VDPA hierarchy comes after
> > > > > MLX5_FLOW_NAMESPACE_BYPASS.
> > > >
> > > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > > and then drop the act".
> > >
> > > Huh?
> > >
> > > mlx5 DPDK is an *RDMA* userspace application.
> >
> > Forgive me for my naiveté.
> >
> > Here I thought the RDMA subsystem is for doing RDMA.
>
> RDMA covers a wide range of accelerated networking these days.. Where
> else are you going to put this stuff in the kernel?

IDK what else you got in there :) It's probably a case by case answer.
IMHO even using libibverbs is no strong reason for things to fall under
RDMA exclusively. Client drivers of virtio don't get silently funneled
through a separate tree just because they use a certain spec.

> > I'm sure if you start doing crypto over ibverbs crypto people will want
> > to have a look.
>
> Well, RDMA has crypto transforms for a few years now too.

Are you talking about RDMA traffic being encrypted? That's a different
case.

My example was alluding to access to a generic crypto accelerator
over ibverbs. I hope you'd let crypto people know when merging such
a thing...

> Why would crypto subsystem people be involved? It isn't using or
> duplicating their APIs.
>
> > > libibverbs. It runs on the RDMA stack. It uses RDMA flow
> > > filtering and RDMA raw ethernet QPs.
> >
> > I'm not saying that's not the case. I'm saying I don't think this
> > was something that netdev developers signed-off on.
>
> Part of the point of the subsystem split was to end the fighting that
> started all of it. It was very clear during the whole iWarp and TCP
> Offload Engine business in the mid 2000's that netdev wanted nothing
> to do with the accelerator world.

I was in middle school at the time, not sure what exactly went down :)
But I'm going by common sense here. Perhaps there was an agreement I'm
not aware of?

> So why would netdev need sign off on any accelerator stuff?

I'm not sure why you keep saying accelerators!
What is accelerated in raw Ethernet frame access??

> Do you want to start co-operating now? I'm willing to talk about how
> to do that.

IDK how that's even in question. I always try to bump all RDMA-looking
stuff to linux-rdma when it's not CCed there. That's the bare minimum
of cooperation I'd expect from anyone.

> > And our policy on DPDK is pretty widely known.
>
> I honestly have no idea on the netdev DPDK policy,
>
> I'm maintaining the RDMA subsystem not DPDK :)

That's what I thought, but turns out DPDK is your important user.

> > Would you mind pointing us to the introduction of raw Ethernet QPs?
> >
> > Is there any production use for that without DPDK?
>
> Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
> IBA specification circa 2000. When RoCE was defined (around 2010) they
> were naturally carried forward to Ethernet. The "flow steering"
> concept to make raw ethernet QP useful was added to verbs around 2012
> - 2013. It officially made it upstream in commit 436f2ad05a0b
> ("IB/core: Export ib_create/destroy_flow through uverbs")
>
> If I recall properly the first real application was ultra low latency
> ethernet processing for financial applications.
>
> dpdk later adopted the first mlx4 PMD using this libibverbs API around
> 2015. Interestingly the mlx4 PMD was made through an open source
> process with minimal involvement from Mellanox, based on the
> pre-existing RDMA work.
>
> Currently there are many projects, and many open source, built on top
> of the RDMA raw ethernet QP and RDMA flow steering model. It is now
> long established kernel ABI.
>
> > > It has been like this for years, it is not some "act".
> > >
> > > It is long standing uABI that accelerators like RDMA/etc get to
> > > take the traffic before netdev. This cannot be reverted. I don't
> > > really understand what you are expecting here?
> >
> > Same. I don't really know what you expect me to do either. I don't
> > think I can sign-off on kernel changes needed for DPDK.
>
> This patch is fine tuning the shared logic that splits the traffic to
> accelerator subsystems, I don't think netdev should have a veto
> here. This needs to be consensus among the various communities and
> subsystems that rely on this.
>
> Eli did not explain this well in his commit message. When he said DPDK
> he means RDMA which is the owner of the FLOW_NAMESPACE. Each
> accelerator subsystem gets hooked into this, so here VDPA is getting
> its own hook because re-using the same hook between two kernel
> subsystems is buggy.

I'm not so sure about this.

The switchdev modeling is supposed to give users control over flow of
traffic in a sane, well defined way, as opposed to magic flow filtering
of the early SR-IOV implementations which every vendor had their own
twist on.

Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all switching
happens in the NIC? That breaks the switchdev model. We're back to
per-vendor magic.

And why do you need a separate VDPA table in the first place?
Forwarding to a VDPA device has different semantics than forwarding to
any other VF/SF?