Message-ID: <20250319164802.GA116657@nvidia.com>
Date: Wed, 19 Mar 2025 13:48:02 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Nikolay Aleksandrov <nikolay@...abrica.net>
Cc: netdev@...r.kernel.org, shrijeet@...abrica.net, alex.badea@...sight.com,
eric.davis@...adcom.com, rip.sohan@....com, dsahern@...nel.org,
bmt@...ich.ibm.com, roland@...abrica.net, winston.liu@...sight.com,
dan.mihailescu@...sight.com, kheib@...hat.com,
parth.v.parikh@...sight.com, davem@...hat.com, ian.ziemba@....com,
andrew.tauferner@...nelisnetworks.com, welch@....com,
rakhahari.bhunia@...sight.com, kingshuk.mandal@...sight.com,
linux-rdma@...r.kernel.org, kuba@...nel.org, pabeni@...hat.com
Subject: Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> Hi all,
> This patch-set introduces minimal Ultra Ethernet driver infrastructure and
> the lowest Ultra Ethernet sublayer - the Packet Delivery Sublayer (PDS),
> which underpins the entire communication model of the Ultra Ethernet
> Transport[1] (UET). Ultra Ethernet is a new RDMA transport designed for
> efficient AI and HPC communication.
I was away while this discussion happened so I've gone through and
read the threads, looked at the patches and I don't think I've changed
my view since I talked to Enfabrica privately on this topic almost a
year ago.
I do not agree with creating a new subsystem (or whatever you are
calling drivers/ultraeth) for a single RDMA protocol and see nothing
new here to change my mind. I would likely NAK the direction I see in
this RFC, as I have other past attempts to build RDMA HW interfaces
outside of the RDMA subsystem.
Since none of that past discussion seems to have been acknowledged or
rebutted in this series I will repeat the main points:
1) I'm aware of something like 5-7 new protocols that are competing
for the same market as Ultra Ethernet. We can't give everyone and
their dog a new subsystem (or whatever) and all the maintainability
negatives that come with that. As a matter of maintainability we
need to see consolidation here, not fragmentation!
Yes, UE is a consortium driven standard, which is unique and a big
positive, but I don't believe anyone can say for certain what
direction the industry is going to go in. Many consortium standards
have failed to get adoption in the past even with a large number of
member companies.
Nor can we know what concepts in UE are going to be copied into
other competing RDMA transports. See my other remarks on job key
for an example. Prematurely siloing stuff in drivers/ultraeth is
very much the wrong technical direction for maintainability.
That said, I think UE should be in the kernel and have a fair
chance to compete for market share. Just in a maintainable and
appropriate way while the industry evolves.
2) Due to the above, I'm pretty confident we will see RDMA NICs
supporting a lot of different protocols. In fact they already do.
From a kernel maintainability perspective we really want one RDMA
driver leveraging as much common infrastructure between the
protocols as possible. We do not want to see a single HW driver
further split up needlessly to other subsystems, that would be a
big maintainability downside.
To put a clear point on this, mlx5 has been gaining new protocols
and fitting into the existing driver model for a number of years
now. In fact there is speculation that UE could be implemented in
mlx5 RDMA with minimal kernel changes. There would be no reason to
try to mess up the driver to also interact with this stuff in
drivers/ultraeth as seems to be proposed here.
I think other HW will be similar. UE isn't so radically different
that every HW path will need to diverge from classical RDMA. Nor is
it so dissimilar to other competing proposals. We don't want
artificial differences; we want to create things that can be re-used
when appropriate.
Leon's response to Bart is correct, we already have similar
examples of almost everything UE does. Bart is also correct that
verbs would be a PITA, but RDMA userspace has moved beyond verbs
limitations years ago now. A lot of mlx5 stuff is not using verbs
today, for instance. EFA and other examples use extensive stuff
beyond verbs.
3) Building a user/kernel split HW driver model is very hard. RDMA has
spent 20 years learning how to do this and making a lot of mistakes
along the way. I think we are in a good place now as a lot of new
functionality has been rolled out with very little stress in the
past few years. I see no reason to believe UE would not follow that
same pattern.
Frankly, I see no evidence in this RFC of any of that learning.
Probably because it doesn't actually show any HW or even seem to
contemplate what HW would even look like. There isn't even a call
to pin_user_pages() in this RFC. You can't call yourself *RDMA* if
you are not doing direct access to userspace memory!
So, this RFC is woefully incomplete. I think you greatly underestimate
how much work you are looking at to duplicate and re-invent the
existing RDMA infrastructure. Frankly I'm not even sure why you
sent this RFC when it doesn't show enough to even evaluate.
4) For example, I get the feeling this RFC is repeating the original
cardinal sin of RDMA by biasing the UAPI design toward a single
philosophy.
I.e. you said:
> I should've been more specific - it is not an issue for UEC and the way
> our driver's netlink API is designed. We fully understand the pros and
> cons of our approach.
Which is exactly the kind of narrow thinking that creates long term
trouble in uAPI design. Do your choices actually work for *ALL*
future HW designs and other drivers, not just "our driver's
netlink"? I think not.
Given UE spec doesn't even have something pretending to be a
kernel/user interface standard I think we will see an extreme
variety of HW implementations here.
The proven modern RDMA approach to uAPI design is the right way to
solve this problem. It is shown to work. It already implements
multi-protocol RDMA and has a lot of drivers demonstrating it now.
5) RDMA actually has pretty good infrastructure. It has a lot of
complex infrastructure features, for example see the long threads I
recently wrote on how its hot plug architecture works.
Even "basic" things like mmapping a doorbell page have thousands of
lines of support infrastructure to make the drivers work well and
support enterprise level HA features.
You get to have these features if you write an RDMA
driver. Otherwise you have to clone them all.
From what I can tell in this RFC the implementations of basic
things like the object model are worse than what we have in RDMA
already. Things like a device model don't even exist. Let alone
advanced stuff like hot plug, namespace, cgroups, DMA operations
and all the stuff needed for HW bindings.
It has a *long* way to go to even reach feature parity in terms of
what the core RDMA device model and object model provides a HW
driver, let alone complex things like uverbs :\
This whole RFC reeks of NIH: it is more fun to go off and do
something greenfield than do the maintenance work to evolve an
existing code base.
6) I offered many things, including not having to use libibverbs,
adding someone to maintain the UE specific portions, and helping to
architect the solution within RDMA. So it is not like there is some
blocker that is forcing a drivers/ultraeth, or that someone has
even said no to any proposal made.
For instance I spent a lot of time with the Habana Labs guys to work
out how to fit their almost-RDMA stuff into RDMA. It required some
careful thinking to accommodate their limited HW, but in the end it
did manage to fit in fine.
They also started as you did here with some weird thing. In the end
we all agreed that RDMA HW support belongs in the RDMA subsystem,
using normal RDMA APIs. We are trying not to proliferate these
things.
I feel like this is repeating the drivers/accel vs DRM debate from a
few years ago. All the points DaveA made apply here just as well,
arguably even more so as RDMA has even more robust shared
infrastructure that should be used instead of re-invented. At least
Habana had a reason for accel - they wanted to skip some DRM
rules. This RFC doesn't even have that.
Thus, I don't expect you will get support for something like this to
be merged. You should change directions.
Jason