[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <128d5ddc-ef46-1125-c27e-381f78a49a96@gmail.com>
Date: Mon, 14 Dec 2020 22:19:15 -0700
From: David Ahern <dsahern@...il.com>
To: Boris Pismenny <borispismenny@...il.com>,
Boris Pismenny <borisp@...lanox.com>, kuba@...nel.org,
davem@...emloft.net, saeedm@...dia.com, hch@....de,
sagi@...mberg.me, axboe@...com, kbusch@...nel.org,
viro@...iv.linux.org.uk, edumazet@...gle.com
Cc: boris.pismenny@...il.com, linux-nvme@...ts.infradead.org,
netdev@...r.kernel.org, benishay@...dia.com, ogerlitz@...dia.com,
yorayz@...dia.com, Ben Ben-Ishay <benishay@...lanox.com>,
Or Gerlitz <ogerlitz@...lanox.com>,
Yoray Zack <yorayz@...lanox.com>,
Boris Pismenny <borisp@...dia.com>
Subject: Re: [PATCH v1 net-next 02/15] net: Introduce direct data placement
tcp offload
On 12/13/20 11:21 AM, Boris Pismenny wrote:
>>> as zerocopy for the following reasons:
>>> (1) The former places buffers *exactly* where the user requests
>>> regardless of the order of response arrivals, while the latter places packets
>>> in anonymous buffers according to packet arrival order. Therefore, zerocopy
>>> can be implemented using data placement, but not vice versa.
>>
>> Fundamentally, it is an SGL and a TCP sequence number. There is a
>> starting point where seq N == sgl element 0, position 0. Presumably
>> there is a hardware cursor to track where you are in filling the SGL as
>> packets are processed. You abort on OOO, so it seems like a fairly
>> straightfoward problem.
>>
>
> We do not abort on OOO. Moreover, we can keep going as long as
> PDU headers are not reordered.
Meaning packets received OOO which contain all or part of a PDU header
are aborted, but pure data packets can arrive out-of-order?
Did you measure the affect of OOO packets? e.g., randomly drop 1 in 1000
nvme packets, 1 in 10,000, 1 in 100,000? How does that affect the fio
benchmarks?
>> Similarly for the NVMe SGLs and DDP offload - a more generic solution
>> allows other use cases to build on this as opposed to the checks you
>> want for a special case. For example, a split at the protocol headers /
>> payload boundaries would be a generic solution where kernel managed
>> protocols get data in one buffer and socket data is put into a given
>> SGL. I am guessing that you have to be already doing this to put PDU
>> payloads into an SGL and other headers into other memory to make a
>> complete packet, so this is not too far off from what you are already doing.
>>
>
> Splitting at protocol header boundaries and placing data at socket defined
> SGLs is not enough for nvme-tcp because the nvme-tcp protocol can reorder
> responses. Here is an example:
>
> the host submits the following requests:
> +--------+--------+--------+
> | Read 1 | Read 2 | Read 3 |
> +--------+--------+--------+
>
> the target responds with the following responses:
> +--------+--------+--------+
> | Resp 2 | Resp 3 | Resp 1 |
> +--------+--------+--------+
Does 'Resp N' == 'PDU + data' like this:
+---------+--------+---------+--------+---------+--------+
| PDU hdr | Resp 2 | PDU hdr | Resp 3 | PDU hdr | Resp 1 |
+---------+--------+---------+--------+---------+--------+
or is it 1 PDU hdr + all of the responses?
>
> I think that the interface we created (tcp_ddp) is sufficiently generic
> for the task at hand, which is offloading protocols that can re-order
> their responses, a non-trivial task that we claim is important.
>
> We designed it to support other protocols and not just nvme-tcp,
> which is merely an example. For instance, I think that supporting iSCSI
> would be natural, and that other protocols will fit nicely.
It would be good to add documentation that describes the design, its
assumptions and its limitations. tls has several under
Documentation/networking. e.g., one important limitation to note is that
this design only works for TCP sockets owned by kernel modules.
>
>> ###
>>
>> A dump of other comments about this patch set:
>
> Thanks for reviewing! We will fix and resubmit.
Another one I noticed today. You have several typecasts like this:
cqe128 = (struct mlx5e_cqe128 *)((char *)cqe - 64);
since cqe is a member of cqe128, container_of should be used instead of
the magic '- 64'.
Powered by blists - more mailing lists