netdev - Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement tcp offload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKdEHTxv6-Gv1d5-vuw1Yv4ySjK7iC6be+xJJvPvA-rsA@mail.gmail.com>
Date:   Fri, 11 Dec 2020 19:58:51 +0100
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     David Ahern <dsahern@...il.com>,
        Boris Pismenny <borispismenny@...il.com>,
        Boris Pismenny <borisp@...lanox.com>,
        David Miller <davem@...emloft.net>,
        Saeed Mahameed <saeedm@...dia.com>,
        Christoph Hellwig <hch@....de>, sagi@...mberg.me, axboe@...com,
        kbusch@...nel.org, Al Viro <viro@...iv.linux.org.uk>,
        boris.pismenny@...il.com, linux-nvme@...ts.infradead.org,
        netdev <netdev@...r.kernel.org>, benishay@...dia.com,
        ogerlitz@...dia.com, yorayz@...dia.com,
        Ben Ben-Ishay <benishay@...lanox.com>,
        Or Gerlitz <ogerlitz@...lanox.com>,
        Yoray Zack <yorayz@...lanox.com>,
        Boris Pismenny <borisp@...dia.com>
Subject: Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement
 tcp offload

On Fri, Dec 11, 2020 at 7:45 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 10 Dec 2020 19:43:57 -0700 David Ahern wrote:
> > On 12/10/20 7:01 PM, Jakub Kicinski wrote:
> > > On Wed, 9 Dec 2020 21:26:05 -0700 David Ahern wrote:
> > >> Yes, TCP is a byte stream, so the packets could very well show up like this:
> > >>
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  | data - seg 1 | PDU hdr | prev data | TCP hdr | IP hdr | eth |
> > >>  +--------------+---------+-----------+---------+--------+-----+
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  |     payload - seg 2               | TCP hdr | IP hdr | eth |
> > >>  +-----------------------------------+---------+--------+-----+
> > >>  +-------- +-------------------------+---------+--------+-----+
> > >>  | PDU hdr |    payload - seg 3      | TCP hdr | IP hdr | eth |
> > >>  +---------+-------------------------+---------+--------+-----+
> > >>
> > >> If your hardware can extract the NVMe payload into a targeted SGL like
> > >> you want in this set, then it has some logic for parsing headers and
> > >> "snapping" an SGL to a new element. ie., it already knows 'prev data'
> > >> goes with the in-progress PDU, sees more data, recognizes a new PDU
> > >> header and a new payload. That means it already has to handle a
> > >> 'snap-to-PDU' style argument where the end of the payload closes out an
> > >> SGL element and the next PDU hdr starts in a new SGL element (ie., 'prev
> > >> data' closes out sgl[i], and the next PDU hdr starts sgl[i+1]). So in
> > >> this case, you want 'snap-to-PDU' but that could just as easily be 'no
> > >> snap at all', just a byte stream and filling an SGL after the protocol
> > >> headers.
> > >
> > > This 'snap-to-PDU' requirement is something that I don't understand
> > > with the current TCP zero copy. In case of, say, a storage application
> >
> > current TCP zero-copy does not handle this and it can't AFAIK. I believe
> > it requires hardware level support where an Rx queue is dedicated to a
> > flow / socket and some degree of header and payload splitting (header is
> > consumed by the kernel stack and payload goes to socket owner's memory).
>
> Yet, Google claims to use the RX ZC in production, and with a CX3 Pro /
> mlx4 NICs.

And other proprietary NIC

We do not dedicate RX queues, TCP RX zerocopy is not direct placement
into application space.

These slides (from netdev 0x14) might be helpful

https://netdevconf.info/0x14/pub/slides/62/Implementing%20TCP%20RX%20zero%20copy.pdf


>
> Simple workaround that comes to mind is have the headers and payloads
> on separate TCP streams. That doesn't seem too slick.. but neither is
> the 4k MSS, so maybe that's what Google does?
>
> > > which wants to send some headers (whatever RPC info, block number,
> > > etc.) and then a 4k block of data - how does the RX side get just the
> > > 4k block a into a page so it can zero copy it out to its storage device?
> > >
> > > Per-connection state in the NIC, and FW parsing headers is one way,
> > > but I wonder how this record split problem is best resolved generically.
> > > Perhaps by passing hints in the headers somehow?
> > >
> > > Sorry for the slight off-topic :)
> > >
> > Hardware has to be parsing the incoming packets to find the usual
> > ethernet/IP/TCP headers and TCP payload offset. Then the hardware has to
> > have some kind of ULP processor to know how to parse the TCP byte stream
> > at least well enough to find the PDU header and interpret it to get pdu
> > header length and payload length.
>
> The big difference between normal headers and L7 headers is that one is
> at ~constant offset, self-contained, and always complete (PDU header
> can be split across segments).
>
> Edwin Peer did an implementation of TLS ULP for the NFP, it was
> complex. Not to mention it's L7 protocol ossification.
>
> To put it bluntly maybe it's fine for smaller shops but I'm guessing
> it's going to be a hard sell to hyperscalers and people who don't like
> to be locked in to HW.
>
> > At that point you push the protocol headers (eth/ip/tcp) into one buffer
> > for the kernel stack protocols and put the payload into another. The
> > former would be some page owned by the OS and the latter owned by the
> > process / socket (generically, in this case it is a kernel level
> > socket). In addition, since the payload is spread across multiple
> > packets the hardware has to keep track of TCP sequence number and its
> > current place in the SGL where it is writing the payload to keep the
> > bytes contiguous and detect out-of-order.
> >
> > If the ULP processor knows about PDU headers it knows when enough
> > payload has been found to satisfy that PDU in which case it can tell the
> > cursor to move on to the next SGL element (or separate SGL). That's what
> > I meant by 'snap-to-PDU'.
> >
> > Alternatively, if it is any random application with a byte stream not
> > understood by hardware, the cursor just keeps moving along the SGL
> > elements assigned it for this particular flow.
> >
> > If you have a socket whose payload is getting offloaded to its own queue
> > (which this set is effectively doing), you can create the queue with
> > some attribute that says 'NVMe ULP', 'iscsi ULP', 'just a byte stream'
> > that controls the parsing when you stop writing to one SGL element and
> > move on to the next. Again, assuming hardware support for such attributes.
> >
> > I don't work for Nvidia, so this is all supposition based on what the
> > patches are doing.
>
> Ack, these patches are not exciting (to me), so I'm wondering if there
> is a better way. The only reason NIC would have to understand a ULP for
> ZC is to parse out header/message lengths. There's gotta be a way to
> pass those in header options or such...
>
> And, you know, if we figure something out - maybe we stand a chance
> against having 4 different zero copy implementations (this, TCP,
> AF_XDP, netgpu) :(