lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 21 Jun 2023 16:15:32 +0200
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@...el.com>, bpf@...r.kernel.org, ast@...nel.org, 
	daniel@...earbox.net, andrii@...nel.org, netdev@...r.kernel.org, 
	magnus.karlsson@...el.com, bjorn@...nel.org, tirthendu.sarkar@...el.com, 
	simon.horman@...igine.com
Subject: Re: [PATCH v4 bpf-next 15/22] xsk: add multi-buffer documentation

On Wed, 21 Jun 2023 at 15:30, Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Magnus Karlsson <magnus.karlsson@...il.com> writes:
>
> > On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@...nel.org> wrote:
> >>
> >> Maciej Fijalkowski <maciej.fijalkowski@...el.com> writes:
> >>
> >> > From: Magnus Karlsson <magnus.karlsson@...el.com>
> >> >
> >> > Add AF_XDP multi-buffer support documentation including two
> >> > pseudo-code samples.
> >> >
> >> > Signed-off-by: Magnus Karlsson <magnus.karlsson@...el.com>
> >> > ---
> >> >  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
> >> >  1 file changed, 177 insertions(+)
> >> >
> >> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >> > index 247c6c4127e9..2b583f58967b 100644
> >> > --- a/Documentation/networking/af_xdp.rst
> >> > +++ b/Documentation/networking/af_xdp.rst
> >> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
> >> >  Gets options from an XDP socket. The only one supported so far is
> >> >  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
> >> >
> >> > +Multi-Buffer Support
> >> > +--------------------
> >> > +
> >> > +With multi-buffer support, programs using AF_XDP sockets can receive
> >> > +and transmit packets consisting of multiple buffers both in copy and
> >> > +zero-copy mode. For example, a packet can consist of two
> >> > +frames/buffers, one with the header and the other one with the data,
> >> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
> >> > +three 4K frames.
> >> > +
> >> > +Some definitions:
> >> > +
> >> > +* A packet consists of one or more frames
> >> > +
> >> > +* A descriptor in one of the AF_XDP rings always refers to a single
> >> > +  frame. In the case the packet consists of a single frame, the
> >> > +  descriptor refers to the whole packet.
> >> > +
> >> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
> >> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
> >> > +will be dropped just as before. Note that the XDP program loaded also
> >> > +needs to be in multi-buffer mode. This can be accomplished by using
> >> > +"xdp.frags" as the section name of the XDP program used.
> >> > +
> >> > +To represent a packet consisting of multiple frames, a new flag called
> >> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
> >> > +descriptors. If it is true (1) the packet continues with the next
> >> > +descriptor and if it is false (0) it means this is the last descriptor
> >> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
> >> > +in many NICs? Just to preserve compatibility with non-multi-buffer
> >> > +applications that have this bit set to false for all packets on Rx,
> >> > +and the apps set the options field to zero for Tx, as anything else
> >> > +will be treated as an invalid descriptor.
> >> > +
> >> > +These are the semantics for producing packets onto AF_XDP Tx ring
> >> > +consisting of multiple frames:
> >> > +
> >> > +* When an invalid descriptor is found, all the other
> >> > +  descriptors/frames of this packet are marked as invalid and not
> >> > +  completed. The next descriptor is treated as the start of a new
> >> > +  packet, even if this was not the intent (because we cannot guess
> >> > +  the intent). As before, if your program is producing invalid
> >> > +  descriptors you have a bug that must be fixed.
> >> > +
> >> > +* Zero length descriptors are treated as invalid descriptors.
> >> > +
> >> > +* For copy mode, the maximum supported number of frames in a packet is
> >> > +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
> >> > +  descriptors accumulated so far are dropped and treated as
> >> > +  invalid. To produce an application that will work on any system
> >> > +  regardless of this config setting, limit the number of frags to 18,
> >> > +  as the minimum value of the config is 17.
> >> > +
> >> > +* For zero-copy mode, the limit is up to what the NIC HW
> >> > +  supports. Usually at least five on the NICs we have checked. We
> >> > +  consciously chose to not enforce a rigid limit (such as
> >> > +  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
> >> > +  resulted in copy actions under the hood to fit into what limit
> >> > +  the NIC supports. Kind of defeats the purpose of zero-copy mode.
> >>
> >> How is an application supposed to discover the actual limit for a given
> >> NIC/driver?
> >
> > Thanks for your comments Toke. I will add an example here of how to
> > discover this. Basically you can send a packet with N frags (N = 2 to
> > start with), check the error stats through the getsockopt. If no
> > invalid_tx_desc error, increase N with one and send this new packet.
> > If you get an error, then the max number of frags is N-1.
>
> Hmm, okay, that sounds pretty tedious :P

Indeed if you had to do it manually ;-). Do not think this max is
important though, see next answer.

> Also, it has side effects (you'll be putting frames on the wire while
> testing, right?), so I guess this is not something you'll do on every
> startup of your application? What are you expecting app developers to do
> here in practice? Run the test while developing and expect the value to
> stay constant for a given driver (does it?), or? Or will most people in
> practice only need a few frags to get up to 9k MTU?

I believe that the question that a developer wants to answer is not
"max number of frags" supported, but "is 3 or 5 frags" supported. This
corresponds to 4K and 2K chunks used to send a 9K MTU. My guess is
that any driver that reports that it supports XDP multibuffer will
support at least 5 frags too. (Just note that I have not checked
around if there are NICs that only support up to 3 frags which would
mean that 4K chunks would have to be used.) This value will not change
for a NIC and the question is if there are XDP MB enabled NIC drivers
that do not support 5 frags? If not, then 5 is the lower bound for the
max number of frags in zc mode, at least for the current set of XDP MB
enabled drivers.

If an application would like to try if 3 or 5 frags (or any other
value) is supported by a NIC when the app is launched, then we would
need to try to send something as you say. I wonder if there is a way
to transmit a frame that does not reach the wire? What happens if I
send a frame with all zeroes, MAC addresses and the like? Will it
reach the wire? Another option would be to just send something you
really would like to send and if it does not complete and/or you get
an error in the getsockopt() stats call, you know that this amount of
frags is not supported and you need to try it again with less frags or
fallback to copy mode.

I do think there will be applications that require more than 5 frags.
I just do not know what they are due to a lack of imagination on my
part. They could use the algorithm from above, or just try the number
of frags they need depending on the flexibility of the code. A
fallback to copy-mode is always possible when the supported number of
frags is not there in the NIC HW.

> >> > +* The ZC batch API guarantees that it will provide a batch of Tx
> >> > +  descriptors that ends with full packet at the end. If not, ZC
> >> > +  drivers would have to gather the full packet on their side. The
> >> > +  approach we picked makes ZC drivers' life much easier (at least on
> >> > +  Tx side).
> >>
> >> This seems like it implies some constraint on how an application can use
> >> the APIs, but it's not quite clear to me what those constraints are, nor
> >> what happens if an application does something different. This should
> >> probably be spelled out...
> >>
> >> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
> >> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
> >> > +detailed before. Zero-copy mode works the same, though the data is not
> >> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
> >> > +flag set to one, it means that the packet consists of multiple buffers
> >> > +and it continues with the next buffer in the following
> >> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
> >> > +means that this is the last buffer of the packet. AF_XDP guarantees
> >> > +that only a complete packet (all frames in the packet) is sent to the
> >> > +application.
> >>
> >> In light of the comment on batch size below, I think it would be useful
> >> to spell out what this means exactly. IIUC correctly, it means that the
> >> kernel will check the ringbuffer before starting to copy the data, and
> >> if there are not enough descriptors available, it will drop the packet
> >> instead of doing a partial copy, right? And this is the case for both ZC
> >> and copy mode?
> >
> > I will make this paragraph and the previous one clearer. And yes, copy
> > mode and zc mode have the same behaviour.
>
> Alright, great!
>
> -Toke
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ