lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0UdAz1mb48kFEngY5sCvHwYM2vYtEK81VceKj-xo6roFyA@mail.gmail.com>
Date: Fri, 5 Apr 2024 08:41:11 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: John Fastabend <john.fastabend@...il.com>, Jiri Pirko <jiri@...nulli.us>, netdev@...r.kernel.org, 
	bhelgaas@...gle.com, linux-pci@...r.kernel.org, 
	Alexander Duyck <alexanderduyck@...com>, davem@...emloft.net, pabeni@...hat.com
Subject: Re: [net-next PATCH 00/15] eth: fbnic: Add network driver for Meta
 Platforms Host Network Interface

On Thu, Apr 4, 2024 at 7:38 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 4 Apr 2024 17:11:47 -0700 Alexander Duyck wrote:
> > > Opensourcing is just one push to github.
> > > There are guarantees we give to upstream drivers.
> >
> > Are there? Do we have them documented somewhere?
>
> I think they are somewhere in Documentation/
> To some extent this question in itself supports my point that written
> down rules, as out of date as they may be, seem to carry more respect
> than what a maintainer says :S

I think the problem is there are multiple maintainers and they all
have different ways of doing things. As a submitter over the years I
have had to deal with probably over a half dozen different maintainers
and each experience has been different. I have always argued that the
netdev tree is one of the better maintained setups.

> > > > Eventually they need some kernel changes and than we block those too
> > > > because we didn't allow the driver that was the use case? This seems
> > > > wrong to me.
> > >
> > > The flip side of the argument is, what if we allow some device we don't
> > > have access to to make changes to the core for its benefit. Owner
> > > reports that some changes broke the kernel for them. Kernel rules,
> > > regression, we have to revert. This is not a hypothetical, "less than
> > > cooperative users" demanding reverts, and "reporting us to Linus"
> > > is a reality :(
> > >
> > > Technical solution? Maybe if it's not a public device regression rules
> > > don't apply? Seems fairly reasonable.
> >
> > This is a hypothetical. This driver currently isn't changing anything
> > outside of itself. At this point the driver would only be build tested
> > by everyone else. They could just not include it in their Kconfig and
> > then out-of-sight, out-of-mind.
>
> Not changing does not mean not depending on existing behavior.
> Investigating and fixing properly even the hardest regressions in
> the stack is a bar that Meta can so easily clear. I don't understand
> why you are arguing.

I wasn't saying the driver wouldn't be dependent on existing behavior.
I was saying that it was a hypothetical that Meta would be a "less
than cooperative user" and demand a revert.  It is also a hypothetical
that Linus wouldn't just propose a revert of the fbnic driver instead
of the API for the crime of being a "less than cooperative maintainer"
and  then give Meta the Nvidia treatment.

> > > > Anyways we have zero ways to enforce such a policy. Have vendors
> > > > ship a NIC to somebody with the v0 of the patch set? Attach a picture?
> > >
> > > GenAI world, pictures mean nothing :) We do have a CI in netdev, which
> > > is all ready to ingest external results, and a (currently tiny amount?)
> > > of test for NICs. Prove that you care about the device by running the
> > > upstream tests and reporting results? Seems fairly reasonable.
> >
> > That seems like an opportunity to be exploited through. Are the
> > results going to be verified in any way? Maybe cryptographically
> > signed? Seems like it would be easy enough to fake the results.
>
> I think it's much easier to just run the tests than write a system
> which will competently lie. But even if we completely suspend trust,
> someone lying is of no cost to the community in this case.

I don't get this part. You are paranoid about bad actors until it
comes to accepting the test results? So write a broken API, "prove" it
works by running it on your broken test setup, and then get it
upstream after establishing a false base for trust. Seems like a
perfect setup for an exploit like what happened with the xz setup.

> > > > Even if vendor X claims they will have a product in N months and
> > > > than only sells it to qualified customers what to do we do then.
> > > > Driver author could even believe the hardware will be available
> > > > when they post the driver, but business may change out of hands
> > > > of the developer.
> > > >
> > > > I'm 100% on letting this through assuming Alex is on top of feedback
> > > > and the code is good.
> > >
> > > I'd strongly prefer if we detach our trust and respect for Alex
> > > from whatever precedent we make here. I can't stress this enough.
> > > IDK if I'm exaggerating or it's hard to appreciate the challenges
> > > of maintainership without living it, but I really don't like being
> > > accused of playing favorites or big companies buying their way in :(
> >
> > Again, I would say we look at the blast radius. That is how we should
> > be measuring any change. At this point the driver is self contained
> > into /drivers/net/ethernet/meta/fbnic/. It isn't exporting anything
> > outside that directory, and it can be switched off via Kconfig.
>
> It is not practical to ponder every change case by case. Maintainers
> are overworked. How long until we send the uAPI patch for RSS on the
> flow label? I'd rather not re-litigate this every time someone posts
> a slightly different feature. Let's cover the obvious points from
> the beginning while everyone is paying attention. We can amend later
> as need be.

Isn't that what we are doing right now? We are basically refusing this
patch set not based on its own merits but on a "what-if" scenario for
a patch set that might come at some point in the future and conjecture
that somehow it is going to be able to add features for just itself
when we haven't allowed that for in the past, for example with things
like GSO partial.

> > When the time comes to start adding new features we can probably start
> > by looking at how to add either generic offloads like was done for
> > GSO, CSO, ect or how it can also be implemented on another vendor's
> > NIC.
> >
> > At this point the only risk the driver presents is that it is yet
> > another driver, done in the same style I did the other Intel drivers,
> > and so any kernel API changes will end up needing to be applied to it
> > just like the other drivers.
>
> The risk is we'll have a fight every time there is a disagreement about
> the expectations.

We always do. I am not sure why you would expect that would change by
blocking this patch set. If anything it sounds like maybe we need to
document some requirements for availability for testing, and
expectations for what it would mean to be a one-off device where only
one entity has access to it. However that is a process problem, and
not so much a patch or driver issue.

> > > > I think any other policy would be very ugly to enforce, prove, and
> > > > even understand. Obviously code and architecture debates I'm all for.
> > > > Ensuring we have a trusted, experienced person signed up to review
> > > > code, address feedback, fix whatever syzbot finds and so on is also a
> > > > must I think. I'm sure Alex will take care of it.
> > >
> > > "Whatever syzbot finds" may be slightly moot for a private device ;)
> > > but otherwise 100%! These are exactly the kind of points I think we
> > > should enumerate. I started writing a list of expectations a while back:
> > >
> > > Documentation/maintainer/feature-and-driver-maintainers.rst
> > >
> > > I think we just need something like this, maybe just a step up, for
> > > non-public devices..
> >
> > I honestly think we are getting the cart ahead of the horse. When we
> > start talking about kernel API changes then we can probably get into
> > the whole "private" versus "publicly available" argument. A good
> > example of the kind of thing I am thinking of is GSO partial where I
> > ended up with Mellanox and Intel sending me 40G and 100G NICs and
> > cables to implement it on their devices as all I had was essentially
> > igb and ixgbe based NICs.
>
> That'd be great. Maybe even more than I'd expect. So why not write
> it down? In case the person doing the coding is not Alex Duyck, and
> just wants to get it done for their narrow use case, get a promo,
> go work on something else?

Write what down? That the vendors didn't like me harassing them to
test my code so they shipped me the NICs and asked me to just test it
myself?

That worked in my scenario as I had a server level system in my home
lab to make that work. We cannot expect everyone to have that kind of
setup for their own development. That is why I am considering the QEMU
approach as that might make this a bit more accessible. I could then
look at enabling the QEMU at the same time I enable the driver.

> > Odds are when we start getting to those kind of things maybe we need
> > to look at having a few systems available for developer use, but until
> > then I am not sure it makes sense to focus on if the device is
> > publicly available or not.
>
> Developer access would be huge.
> A mirage of developer access? immaterial :)

If nothing else, maybe we need a writeup somewhere of the level of
support a driver should expect from the Linux community if the device
is not "easily available". We could probably define that in terms of
what a reasonable expectation would be for a developer to have access
to it.

I would say that "commercial sales" is not a good metric and shouldn't
come into it. If anything it would be about device availability for
development and testing.

In addition I would be good with a definition of "support" for the
case of a device that isn't publicly available being quite limited, as
those with access would have to have active involvement in the
community to enable support. Without that it might as well be
considered orphaned and the driver dropped.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ