[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7eea5714-29a7-c6e6-5f36-3c7754806c8d@gmail.com>
Date: Thu, 21 Jul 2022 11:03:47 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: Willem de Bruijn <willemb@...gle.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org,
"David S . Miller" <davem@...emloft.net>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Jens Axboe <axboe@...nel.dk>, David Ahern <dsahern@...nel.org>,
kernel-team@...com
Subject: Re: [PATCH net-next v5 01/27] ipv4: avoid partial copy for zc
On 7/19/22 10:35, Willem de Bruijn wrote:
> On Tue, Jul 19, 2022 at 3:54 AM Jakub Kicinski <kuba@...nel.org> wrote:
>>
>> On Tue, 12 Jul 2022 21:52:25 +0100 Pavel Begunkov wrote:
>>> Even when zerocopy transmission is requested and possible,
>>> __ip_append_data() will still copy a small chunk of data just because it
>>> allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles
>>> on copy and iter manipulations and also misalignes potentially aligned
>>> data. Avoid such coies. And as a bonus we can allocate smaller skb.
>>
>> s/coies/copies/ can fix when applying
>>
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@...il.com>
>>> ---
>>> net/ipv4/ip_output.c | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> index 00b4bf26fd93..581d1e233260 100644
>>> --- a/net/ipv4/ip_output.c
>>> +++ b/net/ipv4/ip_output.c
>>> @@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
>>> struct inet_sock *inet = inet_sk(sk);
>>> struct ubuf_info *uarg = NULL;
>>> struct sk_buff *skb;
>>> -
>>> struct ip_options *opt = cork->opt;
>>> int hh_len;
>>> int exthdrlen;
>>> @@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
>>> int copy;
>>> int err;
>>> int offset = 0;
>>> + bool zc = false;
>>> unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
>>> int csummode = CHECKSUM_NONE;
>>> struct rtable *rt = (struct rtable *)cork->dst;
>>> @@ -1025,6 +1025,7 @@ static int __ip_append_data(struct sock *sk,
>>> if (rt->dst.dev->features & NETIF_F_SG &&
>>> csummode == CHECKSUM_PARTIAL) {
>>> paged = true;
>>> + zc = true;
>>> } else {
>>> uarg->zerocopy = 0;
>>> skb_zcopy_set(skb, uarg, &extra_uref);
>>> @@ -1091,9 +1092,12 @@ static int __ip_append_data(struct sock *sk,
>>> (fraglen + alloc_extra < SKB_MAX_ALLOC ||
>>> !(rt->dst.dev->features & NETIF_F_SG)))
>>> alloclen = fraglen;
>>> - else {
>>> + else if (!zc) {
>>> alloclen = min_t(int, fraglen, MAX_HEADER);
>>
>> Willem, I think this came in with your GSO work, is there a reason we
>> use MAX_HEADER here? I thought MAX_HEADER is for headers (i.e. more or
>> less to be reserved) not for the min amount of data to be included.
>>
>> I wanna make sure we're not missing something about GSO here.
>>
>> Otherwise I don't think we need the extra branch but that can
>> be a follow up.
I brought it up before but left it for later as I don't know workloads
and there might be perf implications. I'll send a follow up.
> The change was introduced for UDP GSO, to avoid copying most payload
> on software segmentation:
>
> "
> commit 15e36f5b8e982debe43e425d2e12d34e022d51e9
> Author: Willem de Bruijn <willemb@...gle.com>
> Date: Thu Apr 26 13:42:19 2018 -0400
>
> udp: paged allocation with gso
>
> When sending large datagrams that are later segmented, store data in
> page frags to avoid copying from linear in skb_segment.
> "
>
> and in code
>
> - else
> - alloclen = datalen + fragheaderlen;
> + else if (!paged)
> + alloclen = fraglen;
> + else {
> + alloclen = min_t(int, fraglen, MAX_HEADER);
> + pagedlen = fraglen - alloclen;
> + }
>
>
> MAX_HEADER was a short-hand for the exact header length. "alloclen =
> fragheaderlen + transhdrlen;" is probably a better choice indeed.
Great, thanks for taking a look!
>
> Whether with branch or without, the same change needs to be made to
> __ip6_append_data, just as in the referenced commit. Let's keep the
> stacks in sync.
__ip6_append_data() is changed as well but in the following patch.
I had doubts whether it's preferable to keep ipv4 and ipv6 changes
separately.
> This is tricky code. If in doubt, run the msg_zerocopy and udp_gso
> tests from tools/testing/selftests/net, ideally with KASAN.
--
Pavel Begunkov
Powered by blists - more mailing lists