Message-ID: <20240210010129.GA1010957@nvidia.com>
Date: Fri, 9 Feb 2024 21:01:29 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: David Ahern <dsahern@...nel.org>
Cc: Jakub Kicinski <kuba@...nel.org>, Saeed Mahameed <saeed@...nel.org>,
Arnd Bergmann <arnd@...db.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Leon Romanovsky <leonro@...dia.com>, Jiri Pirko <jiri@...dia.com>,
Leonid Bloch <lbloch@...dia.com>, Itay Avraham <itayavr@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Aron Silverton <aron.silverton@...cle.com>,
Christoph Hellwig <hch@...radead.org>,
andrew.gospodarek@...adcom.com, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH V4 0/5] mlx5 ConnectX control misc driver
On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >>> Ah yes, the high frequency counters. Something that is definitely
> >>> impossible to implement in a generic way. You were literally in the
> >>> room at netconf when David Ahern described his proposal for this.
>
> The key point of that proposal is host memory mapped to userspace where
> H/W counters land (either via direct DMA by a H/W push or a
> kthread/timer pulling in updates). That is similar to what is proposed here.
The counter experiment that inspired Saeed to write about it here was
done using the mlx5ctl interfaces and some other PoC pieces on an RDMA
network, monitoring RDMA workloads and inspecting RDMA objects.
So if your proposal also covers how to select RDMA object counters,
how to control the detailed sampling hardware for RDMA, and how it
works on a netdev-free InfiniBand network, then it might be interesting.
It was actually interesting research; I hope some of the information
will be made public.
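
For concreteness, the "counters land in host memory mapped to
userspace" pattern David describes could look roughly like the sketch
below from the userspace side. The device node, page size and counter
layout are invented purely for illustration; neither proposal defines
them:

/*
 * Hypothetical sketch only: map a counter page exported by a char
 * device and read HW counters straight out of host memory.  The
 * device node, page size and counter layout are assumptions for
 * illustration, not part of either proposal.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/hwctr0", O_RDONLY);		/* assumed node */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* HW DMA or a kthread/timer keeps this page up to date. */
	volatile uint64_t *ctrs = mmap(NULL, 4096, PROT_READ,
				       MAP_SHARED, fd, 0);
	if (ctrs == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Sampling is a plain load, no syscall per read. */
	for (int i = 0; i < 5; i++) {
		printf("counter[0] = %llu\n",
		       (unsigned long long)ctrs[0]);
		usleep(1000);
	}

	munmap((void *)ctrs, 4096);
	close(fd);
	return 0;
}

The point being that once the counter page is mapped, each sample is
just a memory read, whether the page is filled by HW DMA push or by a
kthread/timer pulling updates.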
> BTW, there is already a broadcom driver under drivers/misc that seems to
> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> person could chime in.
Yeah, there are lots of examples of drivers that use this kind of FW API
direct to userspace. It is a common design pattern across the kernel
in many subsystems. At the core it is following the general philosophy
of pushing things to userspace that don't need to be in the kernel. It
is more secure, more hackable and easier to deploy.
It becomes a userspace decision what kind of tooling the community will
develop and what the ultimate user experience will be.
> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.
>
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.
Agree as well; ack from the RDMA community. This is perfectly consistent
with the subsystem's existing design of directly exposing the device
to userspace. It is essential, as we can't piggyback on any "generic"
netdev stuff for InfiniBand HW. Further, I anticipate most mlx5ctl
users will actually be running primarily RDMA-related workloads anyhow.
There are not that many people who buy these expensive cards and don't
use them to their full capability.
Recently at USENIX, Microsoft shared some details of their production
network in the paper "Empowering Azure Storage with RDMA".
Notably they shared that "Traffic statistics of all Azure public
regions between January 18 and February 16, 2023. Traffic was measured
by collecting switch counters of server-facing ports on all Top of
Rack (ToR) switches. Around 70% of traffic was RDMA."
It is a rare public insight into what is going on in the industry at
large, and why RDMA is a significant and important subsystem.
Jason