lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210211172603.17d6a8f6@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Thu, 11 Feb 2021 17:26:03 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
Cc:     Marek Majtyka <alardam@...il.com>,
        Saeed Mahameed <saeed@...nel.org>,
        David Ahern <dsahern@...il.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        John Fastabend <john.fastabend@...il.com>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Maciej Fijalkowski <maciejromanfijalkowski@...il.com>,
        Björn Töpel <bjorn.topel@...el.com>,
        Andrii Nakryiko <andrii.nakryiko@...il.com>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Network Development <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>, hawk@...nel.org,
        bpf <bpf@...r.kernel.org>,
        intel-wired-lan <intel-wired-lan@...ts.osuosl.org>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        jeffrey.t.kirsher@...el.com
Subject: Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set

On Wed, 10 Feb 2021 23:52:39 +0100 Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@...nel.org> writes:
> > On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote:  
> >> While I do agree that that kind of conformance test would be great, I
> >> don't think it has to hold up this series (the perfect being the enemy
> >> of the good, and all that). We have a real problem today that userspace
> >> can't tell if a given driver implements, say, XDP_REDIRECT, and so
> >> people try to use it and spend days wondering which black hole their
> >> packets disappear into. And for things like container migration we need
> >> to be able to predict whether a given host supports a feature *before*
> >> we start the migration and try to use it.  
> >
> > Unless you have a strong definition of what XDP_REDIRECT means the flag
> > itself is not worth much. We're not talking about normal ethtool feature
> > flags which are primarily stack-driven, XDP is implemented mostly by
> > the driver, each vendor can do their own thing. Maybe I've seen one
> > vendor incompatibility too many at my day job to hope for the best...  
> 
> I'm totally on board with documenting what a feature means.

We're trying documentation in devlink etc. and it's not that great.
It's never clear and comprehensive enough, barely anyone reads it.

> E.g., for
> XDP_REDIRECT, whether it's acceptable to fail the redirect in some
> situations even when it's active, or if there should always be a
> slow-path fallback.
> 
> But I disagree that the flag is worthless without it. People are running
> into real issues with trying to run XDP_REDIRECT programs on a driver
> that doesn't support it at all, and it's incredibly confusing. The
> latest example popped up literally yesterday:
> 
> https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@mail.gmail.com/

To help such confusion we'd actually have to validate the program
against the device caps. But perhaps I'm less concerned about a
newcomer not knowing how to use things and more concerned about
providing abstractions which will make programs dependably working
across vendors and HW generations.

> >> I view the feature flags as a list of features *implemented* by the
> >> driver. Which should be pretty static in a given kernel, but may be
> >> different than the features currently *enabled* on a given system (due
> >> to, e.g., the TX queue stuff).  
> >
> > Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your
> > example) is implemented across drivers differs in a meaningful ways. 
> > Hence the need for conformance testing. We don't have a golden SW
> > standard to fall back on, like we do with HW offloads.  
> 
> I'm not disagreeing that we need to harmonise what "implementing a
> feature" means. Maybe I'm just not sure what you mean by "conformance
> testing"? What would that look like, specifically? 

We developed a pretty good set of tests at my previous job for testing
driver XDP as well as checking that the offload conforms to the SW
behavior. I assume any vendor who takes quality seriously has
comprehensive XDP tests.

If those tests were upstream / common so that we could run them
against every implementation - the features which are supported by 
a driver fall out naturally out of the set of tests which passed.
And the structure of the capability API could be based on what the
tests need to know to make a SKIP vs FAIL decision.

Common tests would obviously also ease the validation burden, burden of
writing tests on vendors and make it far easier for new implementations
to be confidently submitted.

> A script in selftest that sets up a redirect between two interfaces
> that we tell people to run? Or what? How would you catch, say, that
> issue where if a machine has more CPUs than the NIC has TXQs things
> start falling apart?

selftests should be a good place, but I don't mind the location.
The point is having tests which anyone (vendors and users) can run
to test their platforms. One of the tests should indeed test if every
CPU in the platform can XDP_REDIRECT. Shouldn't it be a rather trivial
combination of tun/veth, mh and taskset?

> > Also IDK why those tests are considered such a huge ask. As I said most
> > vendors probably already have them, and so I'd guess do good distros.
> > So let's work together.  
> 
> I guess what I'm afraid of is that this will end up delaying or stalling
> a fix for a long-standing issue (which is what I consider this series as
> shown by the example above). Maybe you can alleviate that by expanding a
> bit on what you mean?

I hope what I wrote helps a little. I'm not good at explaining. 

Perhaps I had seen one too many vendor incompatibility to trust that
adding a driver API without a validation suite will result in something
usable in production settings. 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ