[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87o7l9c58j.fsf@toke.dk>
Date: Wed, 21 Jun 2023 15:30:36 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@...el.com>, bpf@...r.kernel.org,
ast@...nel.org, daniel@...earbox.net, andrii@...nel.org,
netdev@...r.kernel.org, magnus.karlsson@...el.com, bjorn@...nel.org,
tirthendu.sarkar@...el.com, simon.horman@...igine.com
Subject: Re: [PATCH v4 bpf-next 15/22] xsk: add multi-buffer documentation
Magnus Karlsson <magnus.karlsson@...il.com> writes:
> On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@...nel.org> wrote:
>>
>> Maciej Fijalkowski <maciej.fijalkowski@...el.com> writes:
>>
>> > From: Magnus Karlsson <magnus.karlsson@...el.com>
>> >
>> > Add AF_XDP multi-buffer support documentation including two
>> > pseudo-code samples.
>> >
>> > Signed-off-by: Magnus Karlsson <magnus.karlsson@...el.com>
>> > ---
>> > Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
>> > 1 file changed, 177 insertions(+)
>> >
>> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>> > index 247c6c4127e9..2b583f58967b 100644
>> > --- a/Documentation/networking/af_xdp.rst
>> > +++ b/Documentation/networking/af_xdp.rst
>> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
>> > Gets options from an XDP socket. The only one supported so far is
>> > XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
>> >
>> > +Multi-Buffer Support
>> > +--------------------
>> > +
>> > +With multi-buffer support, programs using AF_XDP sockets can receive
>> > +and transmit packets consisting of multiple buffers both in copy and
>> > +zero-copy mode. For example, a packet can consist of two
>> > +frames/buffers, one with the header and the other one with the data,
>> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
>> > +three 4K frames.
>> > +
>> > +Some definitions:
>> > +
>> > +* A packet consists of one or more frames
>> > +
>> > +* A descriptor in one of the AF_XDP rings always refers to a single
>> > + frame. In the case the packet consists of a single frame, the
>> > + descriptor refers to the whole packet.
>> > +
>> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
>> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
>> > +will be dropped just as before. Note that the XDP program loaded also
>> > +needs to be in multi-buffer mode. This can be accomplished by using
>> > +"xdp.frags" as the section name of the XDP program used.
>> > +
>> > +To represent a packet consisting of multiple frames, a new flag called
>> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
>> > +descriptors. If it is true (1) the packet continues with the next
>> > +descriptor and if it is false (0) it means this is the last descriptor
>> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
>> > +in many NICs? Just to preserve compatibility with non-multi-buffer
>> > +applications that have this bit set to false for all packets on Rx,
>> > +and the apps set the options field to zero for Tx, as anything else
>> > +will be treated as an invalid descriptor.
>> > +
>> > +These are the semantics for producing packets onto AF_XDP Tx ring
>> > +consisting of multiple frames:
>> > +
>> > +* When an invalid descriptor is found, all the other
>> > + descriptors/frames of this packet are marked as invalid and not
>> > + completed. The next descriptor is treated as the start of a new
>> > + packet, even if this was not the intent (because we cannot guess
>> > + the intent). As before, if your program is producing invalid
>> > + descriptors you have a bug that must be fixed.
>> > +
>> > +* Zero length descriptors are treated as invalid descriptors.
>> > +
>> > +* For copy mode, the maximum supported number of frames in a packet is
>> > + equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
>> > + descriptors accumulated so far are dropped and treated as
>> > + invalid. To produce an application that will work on any system
>> > + regardless of this config setting, limit the number of frags to 18,
>> > + as the minimum value of the config is 17.
>> > +
>> > +* For zero-copy mode, the limit is up to what the NIC HW
>> > + supports. Usually at least five on the NICs we have checked. We
>> > + consciously chose to not enforce a rigid limit (such as
>> > + CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
>> > + resulted in copy actions under the hood to fit into what limit
>> > + the NIC supports. Kind of defeats the purpose of zero-copy mode.
>>
>> How is an application supposed to discover the actual limit for a given
>> NIC/driver?
>
> Thanks for your comments Toke. I will add an example here of how to
> discover this. Basically you can send a packet with N frags (N = 2 to
> start with), check the error stats through the getsockopt. If no
> invalid_tx_desc error, increase N with one and send this new packet.
> If you get an error, then the max number of frags is N-1.
Hmm, okay, that sounds pretty tedious :P
Also, it has side effects (you'll be putting frames on the wire while
testing, right?), so I guess this is not something you'll do on every
startup of your application? What are you expecting app developers to do
here in practice? Run the test while developing and expect the value to
stay constant for a given driver (does it?), or? Or will most people in
practice only need a few frags to get up to 9k MTU?
>> > +* The ZC batch API guarantees that it will provide a batch of Tx
>> > + descriptors that ends with full packet at the end. If not, ZC
>> > + drivers would have to gather the full packet on their side. The
>> > + approach we picked makes ZC drivers' life much easier (at least on
>> > + Tx side).
>>
>> This seems like it implies some constraint on how an application can use
>> the APIs, but it's not quite clear to me what those constraints are, nor
>> what happens if an application does something different. This should
>> probably be spelled out...
>>
>> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
>> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
>> > +detailed before. Zero-copy mode works the same, though the data is not
>> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
>> > +flag set to one, it means that the packet consists of multiple buffers
>> > +and it continues with the next buffer in the following
>> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
>> > +means that this is the last buffer of the packet. AF_XDP guarantees
>> > +that only a complete packet (all frames in the packet) is sent to the
>> > +application.
>>
>> In light of the comment on batch size below, I think it would be useful
>> to spell out what this means exactly. IIUC correctly, it means that the
>> kernel will check the ringbuffer before starting to copy the data, and
>> if there are not enough descriptors available, it will drop the packet
>> instead of doing a partial copy, right? And this is the case for both ZC
>> and copy mode?
>
> I will make this paragraph and the previous one clearer. And yes, copy
> mode and zc mode have the same behaviour.
Alright, great!
-Toke
Powered by blists - more mailing lists