Message-ID: <Z+QiKan/j3UIhwL1@nvidia.com>
Date: Wed, 26 Mar 2025 12:50:01 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: Leon Romanovsky <leon@...nel.org>,
	Nikolay Aleksandrov <nikolay@...abrica.net>,
	Linux Kernel Network Developers <netdev@...r.kernel.org>,
	Shrijeet Mukherjee <shrijeet@...abrica.net>,
	alex.badea@...sight.com, eric.davis@...adcom.com, rip.sohan@....com,
	David Ahern <dsahern@...nel.org>, bmt@...ich.ibm.com,
	roland@...abrica.net, Winston Liu <winston.liu@...sight.com>,
	dan.mihailescu@...sight.com, kheib@...hat.com,
	parth.v.parikh@...sight.com, davem@...hat.com, ian.ziemba@....com,
	andrew.tauferner@...nelisnetworks.com, welch@....com,
	rakhahari.bhunia@...sight.com, kingshuk.mandal@...sight.com,
	linux-rdma@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>
Subject: Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver
 introduction)

On Tue, Mar 25, 2025 at 10:12:49AM -0400, Jamal Hadi Salim wrote:

> You need to at least construct the message parameterization in user
> space which would require some memory, no? And then copy_from_user
> would still need memory to copy to?
> I am probably missing something basic.

It is usually all stack memory on the userspace side, with no kernel
memory allocation. For example, there is no mandatory SKB in uverbs.

> For a read() to fail at say copy_to_user() feels like your app or
> system must be in really bad shape.

Yes, but the semantic we want is still that if a creation ioctl
returns 0 (success) then the object exists, and if it returns any
error code then the creation was a NOP.
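That all-or-nothing create semantic can be sketched in a few lines. This is an illustrative userspace model, not the uverbs implementation; the object table and function names are hypothetical:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object table; stands in for kernel-side object state. */
struct obj { int id; };
static struct obj *table[16];

/* All-or-nothing create: any failure path leaves no partial state, so
 * the caller can trust "0 => object exists, error => nothing changed". */
static int create_obj(int id, struct obj **out)
{
    if (id < 0 || id >= 16 || table[id])
        return -EINVAL;            /* nothing done yet: trivially a NOP */

    struct obj *o = malloc(sizeof(*o));
    if (!o)
        return -ENOMEM;            /* allocation failed: still a NOP */

    o->id = id;
    table[id] = o;                 /* publish only after everything succeeded */
    *out = o;
    return 0;                      /* 0 => object exists */
}
```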

> A contingency plan could be to replay the message from the app/control
> plane and hope you get an "object doesn't exist" kind of message for a
> failed destroy msg.

Nope, that's racy; it must be multi-thread safe. Another thread could
have created and re-used the object ID.

> IOW, while unwinding is more honorable, unless it comes for cheap it
> may not be worth it.

It was cheap.

> Regardless: How would RDMA unwind in such a case?

The object infrastructure takes care of this with a three step object
creation protocol and some helpers.
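The three-step protocol can be modeled roughly as allocate/reserve, driver initialization, then an atomic commit, with an abort path that makes any failed creation a NOP. A simplified userspace sketch follows; the function names are hypothetical and are not the actual rdma_core helpers:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Step 1 reserves the object (not yet visible), step 2 is driver
 * initialization, step 3 publishes it. Failure before commit aborts. */
struct uobject { int id; bool live; };

static struct uobject *alloc_begin(int id)
{
    struct uobject *u = calloc(1, sizeof(*u));
    if (u)
        u->id = id;                /* step 1: reserved, not yet visible */
    return u;
}

static void alloc_commit(struct uobject *u)
{
    u->live = true;                /* step 3: publish */
}

static void alloc_abort(struct uobject *u)
{
    free(u);                       /* unwind step 1; creation was a NOP */
}

static int create(int id, bool driver_ok, struct uobject **out)
{
    struct uobject *u = alloc_begin(id);
    if (!u)
        return -ENOMEM;
    if (!driver_ok) {              /* step 2: driver init failed */
        alloc_abort(u);
        return -EIO;
    }
    alloc_commit(u);
    *out = u;
    return 0;
}
```

The helpers keep the unwind logic in one place, which is what makes the all-or-nothing guarantee cheap for every creation method.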

> Not sure if this applies to you: Netlink good practise is to ensure
> any structs exchanged are 32b aligned and in cases they are not mostly
> adding explicit pads.

Alignment is less important as an ABI requirement here, since
copy_from_user fixes up alignment when it copies arrays into kernel
memory, which will be properly aligned as required. Netlink has this
issue because it bulk-copies everything into an skb and uses pointers
into that copy. The approach here copies only the small stuff in
advance; larger stuff is not copied until memory is allocated to hold
it.
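That two-phase copy strategy can be sketched as follows. This is illustrative only (the header layout and names are hypothetical, and memcpy stands in for copy_from_user): the small fixed header is copied up front, while the larger payload is copied only after a kernel-side buffer is allocated, so the destination is naturally aligned no matter how the userspace source was laid out:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct hdr { uint32_t payload_len; };

/* memcpy() stands in for copy_from_user(): the buffer we allocate is
 * naturally aligned, so the ABI need not mandate source alignment. */
static int handle_cmd(const void *usrc, size_t usize, void **payload_out)
{
    struct hdr h;
    if (usize < sizeof(h))
        return -EINVAL;
    memcpy(&h, usrc, sizeof(h));           /* small stuff: copied in advance */

    if (usize - sizeof(h) < h.payload_len)
        return -EINVAL;

    void *buf = malloc(h.payload_len);     /* allocate first, then copy */
    if (!buf)
        return -ENOMEM;
    memcpy(buf, (const char *)usrc + sizeof(h), h.payload_len);
    *payload_out = buf;
    return 0;
}
```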

> When you say "driver" you mean "control/provisioning plane" activity
> between a userspace control app and kernel objects which likely
> extend

No, I literally mean driver.

The user of this HW will not do something like socket() as standard
system call abstracted by the kernel. Instead it makes a library call
ib_create_qp() which goes into a library with the userspace driver
components. The abstraction is now done in userspace. The library
figures out what HW the kernel has and loads a userspace driver
component with a driver_create_qp() op that does more processing and
eventually calls the kernel.
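That library-side dispatch can be pictured as an ops table keyed by the detected hardware. The sketch below is loosely modeled on how an RDMA library routes a call like ib_create_qp() into a provider's op; the names and hw-id scheme are illustrative, not the real libibverbs interfaces:

```c
#include <stddef.h>

/* Each userspace driver component supplies an ops table. */
struct provider_ops { int (*create_qp)(void); };

static int vendor_a_create_qp(void) { return 1; }  /* vendor A's driver */
static int vendor_b_create_qp(void) { return 2; }  /* vendor B's driver */

static const struct provider_ops vendor_a = { .create_qp = vendor_a_create_qp };
static const struct provider_ops vendor_b = { .create_qp = vendor_b_create_qp };

/* The library figures out what HW the kernel has and loads the matching
 * driver component; the common entry point just forwards into it. */
static const struct provider_ops *load_driver(int hw_id)
{
    return hw_id == 0 ? &vendor_a : &vendor_b;
}

static int lib_create_qp(int hw_id)
{
    return load_driver(hw_id)->create_qp();
}
```

The abstraction point is thus in userspace: the kernel only sees the vendor-specific command the loaded driver component eventually issues.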

It is "control path" in the sense that it is slow path creating
objects for data transfer, but the purpose of most of the actions is
actually setting up for data plane operations.

> That you have a set of common, agreed-to attributes and then each
> vendor would add their own (separate namespace) attributes?

Yes

> The control app issuing a request would first invoke some common
> interface which would populate the applicable common TLVs for that
> request then call into a vendor interface to populate vendor specific
> attributes.

Yes

> And in the kernel, some common code would process the common
> attributes then pass on the vendor specific data to a vendor driver.

Yes
 
> If my reading is right, some comments:
> 1) You can achieve this fine with netlink. My view of the model is you
> would have a T (call it VendorData, which is is defined within the
> common namespace) that puts the vendor specific TLVs within a
> hierarchy.

Yes, that direction was suggested here too. But when we got to
micro-optimizing the ioctl ABI format it became clear there was a
significant advantage to keeping things one level and not attempting
any kind of nesting. This also gives a nice, simple in-kernel API for
working with method arguments: it is always the same, so we don't have
different APIs depending on driver/common callers.
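An illustrative sketch of that flat, single-level layout (the struct layout and ID ranges are hypothetical, not the real uverbs ABI): common and driver attributes live in one array, distinguished only by ID range, so the same lookup helper serves both kinds of caller:

```c
#include <stddef.h>
#include <stdint.h>

/* One-level attribute array: every method argument, common or
 * driver-specific, is a flat { id, data } entry. No nesting. */
struct attr { uint16_t id; uint64_t data; };

#define ATTR_COMMON_BASE  0x0000   /* common attrs occupy one ID range */
#define ATTR_DRIVER_BASE  0x1000   /* driver attrs occupy another */

/* One helper serves every method and every caller: the in-kernel API
 * for method arguments is always the same. */
static const struct attr *find_attr(const struct attr *a, size_t n,
                                    uint16_t id)
{
    for (size_t i = 0; i < n; i++)
        if (a[i].id == id)
            return &a[i];
    return NULL;
}
```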

> 2) Hopefully the vendor extensions are in the minority. Otherwise the
> complexity of someone writing an app to control multiple vendors would
> be challenging over time as different vendors add more attributes.

Nope, it is about 50/50, and it is not a challenge because the
methodology is that everyone uses the *same* userspace driver code. It
is too complicated for people to reasonably try to rewrite.

> I can't imagine a commonly used utility like iproute2/tc being
> invoked with "when using broadcom then use foo=x bar=y" apply but
> when using intel use "goo=x-1 and gah=y-2".

Right, it doesn't make sense for a tool like iproute, but we aren't
building anything remotely like iproute.

> 3) A Pro/con to #2 depending on which lens you use:  it could be
> "innovation" or "vendor lockin" - depends on the community i.e on the
> one hand a vendor could add features faster and is not bottlenecked by
> endless mailing list discussions but otoh, said vendor may not be in
> any hurry to move such features to the common path (because it gives
> them an advantage).

There is no community advantage to the common kernel path.

The users all use the library, the only thing that matters is how
accessible the vendor has made their unique ideas to the library
users.

For instance, if the user is running an MPI application and the vendor
makes standard open source MPI 5% faster with some unique HW
innovation, should anyone actually care about the "common path" deep,
deep below MPI?

> 1) I am not a fan of the RPC approach because it has a higher
> developer effort when adding new features. Based on my experience, I
> am a fan of CRUD(Create Read Update Delete) 

It suits some things better than others. I don't think "update" is
semantically the right language for most of what is happening here,
and "read" is almost never done. Something like socket() FDs and their
API surface isn't a good fit for CRUD.

> - and with netlink i also
> get for free the subscribe/publish parts; to be specific _all you

publish/subscribe doesn't make sense in this context. We don't do it.

> 2) Using C as the modelling sounds like a good first start to someone
> who knows C well but tbh, those macros hurt my eyes for a bit (and i
> am someone who loves macro witchcraft). The big advantage IMO of using
> yaml or json is mostly the available tooling, example being polyglot.
> I am not sure if that is a requirement in RDMA.

I agree with this. When it was first made I suggested a code generator
instead, but at that time code generators in the kernel did not seem
to be a well-accepted idea. I'm glad to see that improving.

> Again, I could be missing something but the semantics seem to be the
> same as netlink.

AFAIK netlink doesn't have the same notion of objects or having the
validation obtain references and locking on referenced objects at all.

> BTW, do you do fuzzy testing with this?

syzkaller runs on rdma, but I don't recall how much coverage syzkaller
gets on these forms. We fixed a huge number of syzkaller bugs at
least.

Jason
