lists.openwall.net - Open Source and information security mailing list archives
Message-ID: <CAM0EoM=7ac-A=ErU_PojZuuB4eHnoe-CdPxBi3x9d+=PxikfgA@mail.gmail.com>
Date: Tue, 25 Mar 2025 10:12:49 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Leon Romanovsky <leon@...nel.org>, Nikolay Aleksandrov <nikolay@...abrica.net>, 
	Linux Kernel Network Developers <netdev@...r.kernel.org>, Shrijeet Mukherjee <shrijeet@...abrica.net>, alex.badea@...sight.com, 
	eric.davis@...adcom.com, rip.sohan@....com, David Ahern <dsahern@...nel.org>, 
	bmt@...ich.ibm.com, roland@...abrica.net, 
	Winston Liu <winston.liu@...sight.com>, dan.mihailescu@...sight.com, kheib@...hat.com, 
	parth.v.parikh@...sight.com, davem@...hat.com, ian.ziemba@....com, 
	andrew.tauferner@...nelisnetworks.com, welch@....com, 
	rakhahari.bhunia@...sight.com, kingshuk.mandal@...sight.com, 
	linux-rdma@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>
Subject: Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction

On Wed, Mar 19, 2025 at 3:19 PM Jason Gunthorpe <jgg@...dia.com> wrote:
>
> On Wed, Mar 19, 2025 at 02:21:23PM -0400, Jamal Hadi Salim wrote:
>
> > Curious how you guarantee that a "destroy" will not fail under OOM. Do
> > you have pre-allocated memory?
>
> It just never allocates memory? Why would a simple system call like a
> destruction allocate any memory?

You need to at least construct the message parameterization in user
space, which requires some memory, no? And then copy_from_user
would still need kernel memory to copy into?
I am probably missing something basic.

> > > Overall systems calls here should either succeed or fail and be the
> > > same as a NOP. No failure that actually did something and then creates
> > > some resource leak or something because userspace didn't know about
> > > it.
> >
> > Yes, this is how netlink works as well. If a failure to delete an
> > object occurs then every transient state gets restored. This is always
> > the case for simple requests (a delete/create/update). For requests
> > that batch multiple objects there are cases where there is no
> > unwinding.
>
> I'm not sure that is completely true; like if userspace messes up the
> netlink read() side of the API and copy_to_user() fails then you can
> get these inconsistencies. In the RDMA model even those edge cases are
> properly unwound, just like a normal system call would be.
>

For a read() to fail at, say, copy_to_user() suggests your app or
system must be in really bad shape.
A contingency plan could be to replay the message from the app/control
plane and hope you get an "object doesn't exist" kind of response for
a failed destroy msg.
Or, IMO, restart the app or system and try to recover/clean up from
scratch to rebuild a known-good state.
IOW, while unwinding is more honorable, unless it comes cheap it
may not be worth it.
Regardless: How would RDMA unwind in such a case?

> > Makes sense. So ioctls with TLVs ;->
> > I am suspecting you don't have concepts of TLVs inside TLVs for
> > hierarchies within objects.
>
> No, it has not been needed yet, or at least the cases that have come
> up have been happy to use arrays of structs for the nesting. The
> method calls themselves don't tend to have that kind of challenging
> structure for their arguments.
>

ok.
Not sure if this applies to you: netlink good practice is to ensure
any structs exchanged are 32-bit aligned, adding explicit pads in the
cases where they are not.
Fun back in the day on tc: everything worked on x86, then failures
galore on esoteric architectures (I remember trying to run on a switch
which had a PPC CPU with 8-byte alignment). I am searching my brain
cells for what the failures were, but we were getting ENOENT; I think
it was more the way TLV alignment was structured, although it could
have been that offsets of different fields ended up in the wrong
place, etc.

> > > RDMA also has special infrastructure to split up the TLV space between
> > > core code and HW driver code which is a key feature and necessary part
> > > of how you'd build a user/kernel split driver.
> >
> > The T namespace is split between core code and driver code?
> > I can see that as being useful for debugging maybe? What else?
>
> RDMA is all about having a user/kernel driver co-design.  This means a
> driver has code in a userspace library and code in the kernel that
> work together to implement the functionality. The userspace library
> should be thought of as an extension of the kernel driver into
> userspace.
>
> So, there is a lot of traffic between the two driver components that is
> just private and unique to the driver. This is what the driver
> namespace is used for.
>
> For instance there is a common method call to create a queue. The
> queue has a number of core parameters like depth, and address, then it
> calls the driver and there are bunch of device specific parameters
> too, like say queue entry format.
>
> Every driver gets to define its own parameters best suited to its own
> device and its own user/kernel split.
>

I think I got it; to reword what you said:
When you say "driver" you mean "control/provisioning plane" activity
between a userspace control app and kernel objects which likely extend
to hardware (as opposed to datapath send/receive kind of activity).
You have a set of common, agreed-to attributes, and then each vendor
adds their own (separate-namespace) attributes.
The control app issuing a request would first invoke some common
interface which populates the applicable common TLVs for that request,
then call into a vendor interface to populate vendor-specific
attributes.
And in the kernel, some common code would process the common
attributes, then pass the vendor-specific data on to a vendor driver.

If my reading is right, some comments:
1) You can achieve this fine with netlink. My view of the model is you
would have a T (call it VendorData, defined within the common
namespace) that puts the vendor-specific TLVs within a hierarchy, i.e.
when constructing or parsing the VendorData you invoke vendor-specific
extensions.

2) Hopefully the vendor extensions are in the minority. Otherwise the
complexity of writing an app to control multiple vendors would grow
over time as different vendors add more attributes. I can't imagine a
commonly used utility like iproute2/tc being invoked with "when using
Broadcom use foo=x bar=y" but when using Intel use "goo=x-1 and
gah=y-2".

3) A pro/con to #2, depending on which lens you use: it could be
"innovation" or "vendor lock-in" - depends on the community. I.e., on
the one hand a vendor could add features faster and is not
bottlenecked by endless mailing list discussions, but OTOH said vendor
may not be in any hurry to move such features to the common path
(because it gives them an advantage).

> Building a split user/kernel driver is complicated and uAPI is one of
> the biggest challenges :\
>
> > > > - And as Nik mentioned: The new (YAML) model-to-generated-code approach
> > > > that is now common in generic netlink highly reduces developer effort.
> > > > Although in my opinion we really need this stuff integrated into tools
> > > > like iproute2..
> > >
> > > RDMA also has a DSL like scheme for defining schema, and centralized
> > > parsing and validation. IMHO its capability falls someplace between
> > > the old netlink policy stuff and the new YAML stuff.
> > >
> >
> > I meant the ability to start with a data model and generate code as
> > being useful.
> > Where can i find the RDMA DSL?
>
> It is done with the C preprocessor instead of an external YAML
> file. Look at drivers/infiniband/core/uverbs_std_types_mr.c at the
> end. It describes a data model, but it is elaborated at runtime into
> an efficient parse tree, not by using a code generator.
>
> The schema is more classical object oriented RPC type scheme where you
> define objects, methods and then method parameters. The objects have
> an entire kernel side infrastructure to manage their lifecycle and the
> attributes have validation and parsing done prior to reaching the C
> function implementing the method.
>
> I always thought it was netlink inspired, but more suited to building
> a uAPI out of. Like you get actual system call names (eg
> UVERBS_METHOD_REG_DMABUF_MR) that have actual C functions implementing
> them. There is special help to implement object allocation and
> destruction functions, and freedom to have as many methods per object
> as make sense.
>

I took a quick look at what you pointed to. It's RPC-ish (just like
_most_ netlink use is) - so similar roots. IOW, you end up with
methods like create_myfoo() and create_mybar().
Two things:
1) I am not a fan of the RPC approach because it has a higher
developer effort when adding new features. Based on my experience, I
am a fan of CRUD (Create Read Update Delete) - and with netlink I also
get the subscribe/publish parts for free; to be specific, _all you
need_ are CRUDPS methods, i.e. 6 methods tops (which never change).
You can craft any objects to conform to those interfaces. For example,
create(myfoo) is not syntactically different from create(mybar). This
simplifies the data model immensely (and allows for better
automation). Unfortunately the gRPCs and Thrifts out there have
permeated RPC semantics everywhere (Thrift being slightly better
IMO).

2) Using C for the modelling sounds like a good first start for
someone who knows C well, but TBH those macros hurt my eyes a bit (and
I am someone who loves macro witchcraft). The big advantage IMO of
using YAML or JSON is mostly the available tooling, an example being
polyglot code generation. I am not sure if that is a requirement in
RDMA.

> > I don't know enough about RDMA infra to comment but iiuc, you are
> > saying that it is the control infrastructure (that sits in
> > userspace?), that does all those things you mention, that is more
> > important.
>
> There is an entire object model in the kernel and it is linked into
> the schema.
>
> For instance in the above example we have a schema for an object
> method like this:
>
> DECLARE_UVERBS_NAMED_METHOD(
>         UVERBS_METHOD_REG_DMABUF_MR,
>         UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
>                         UVERBS_OBJECT_MR,
>                         UVERBS_ACCESS_NEW,
>                         UA_MANDATORY),
>         UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
>                         UVERBS_OBJECT_PD,
>                         UVERBS_ACCESS_READ,
>                         UA_MANDATORY),
>
> That says it accepts two object handles MR and PD as input to the
> method call.
>
> The core code keeps track of all these object handles, validates the
> ID number given by userspace is referring to the correct object, of the
> correct type, in the correct state. Locks things against concurrent
> destruction, and then gives a trivial way for the C method
> implementation to pick up the object pointer:
>
>         struct ib_pd *pd =
>                 uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
>
> Which can't fail because everything was already checked before we get
> here.  This is all designed to greatly simplify and make robust the
> method implementations that are often in driver code.
>

Again, I could be missing something, but the semantics seem to be the
same as netlink's.

BTW, do you do fuzz testing with this? In the old days the whole
netlink infra assumed a philosophy of "we give you the gun; if you
want to shoot yourself in the small toe then go ahead".
IOW, there was no assumption of people doing stupid things - and that
stupid things would only harm them. Now we have hostile actors,
syzkaller and bounty hunters creating all kinds of UAFs and trying to
trick the kernel into some funky state just because...

cheers,
jamal
