[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <512cb9a1-f69d-2dc0-07c2-e696979ac011@huawei.com>
Date: Fri, 24 Mar 2023 09:53:22 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Nikolay Aleksandrov <razor@...ckwall.org>
Subject: Re: [PATCH v3 net-next] net: introduce a config option to tweak
MAX_SKB_FRAGS
On 2023/3/24 9:40, Eric Dumazet wrote:
> On Thu, Mar 23, 2023 at 6:32 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>>
>> On 2023/3/24 0:28, Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@...gle.com>
>>>
>>> Currently, MAX_SKB_FRAGS value is 17.
>>>
>>> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
>>> attempts order-3 allocations, stuffing 32768 bytes per frag.
>>>
>>> But with zero copy, we use order-0 pages.
>>>
>>> For BIG TCP to show its full potential, we add a config option
>>> to be able to fit up to 45 segments per skb.
>>>
>>> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
>>> does not support skbs with frag list.
>>
>> Just out of curiosity, it is possible to add support for skbs with
>> frag list for zerocopy if the driver also support transmiting skbs
>> with frag list with NETIF_F_FRAGLIST feature on?
>
> We are talking of rx zerocopy, look at net/ipv4/tcp.c (this is not
> tied to NETIF_F_FRAGLIST support
> because packets land into a TCP receive queue)
Sorry for confusion.
So the NETIF_F_FRAGLIST is only for tx?
what about the driver building a skb with frag list if a packet with
descs more than MAX_SKB_FRAGS for rx too?
>
>>
>>>
>>> We have used MAX_SKB_FRAGS=45 value for years at Google before
>>> we deployed 4K MTU, with no adverse effect, other than
>>> a recent issue in mlx4, fixed in commit 26782aad00cc
>>> ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
>>>
>>> Back then, goal was to be able to receive full size (64KB) GRO
>>> packets without the frag_list overhead.
>>>
>>> Note that /proc/sys/net/core/max_skb_frags can also be used to limit
>>> the number of fragments TCP can use in tx packets.
>>>
>>> By default we keep the old/legacy value of 17 until we get
>>> more coverage for the updated values.
>>>
>>> Sizes of struct skb_shared_info on 64bit arches
>>>
>>> MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
>>> ==============================================
>>> 17 320
>>> 21 320+64 = 384
>>> 25 320+128 = 448
>>> 29 320+192 = 512
>>> 33 320+256 = 576
>>> 37 320+320 = 640
>>> 41 320+384 = 704
>>> 45 320+448 = 768
>>>
>>> This inflation might cause problems for drivers assuming they could pack
>>> both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
>>> using build_skb().
>>>
>>> v3: fix build error when CONFIG_NET=n
>>> v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>>> Reviewed-by: Nikolay Aleksandrov <razor@...ckwall.org>
>>> ---
>>> drivers/scsi/cxgbi/libcxgbi.c | 4 ++--
>>> include/linux/skbuff.h | 16 +++++-----------
>>> net/Kconfig | 12 ++++++++++++
>>> net/packet/af_packet.c | 4 ++--
>>> 4 files changed, 21 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
>>> index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
>>> --- a/drivers/scsi/cxgbi/libcxgbi.c
>>> +++ b/drivers/scsi/cxgbi/libcxgbi.c
>>> @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
>>> frags++;
>>>
>>> if (frags >= SKB_WR_LIST_SIZE) {
>>> - pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
>>> + pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
>>> csk, skb_shinfo(skb)->nr_frags, skb->len,
>>> - skb->data_len, SKB_WR_LIST_SIZE);
>>> + skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
>>> return -EINVAL;
>>> }
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -345,18 +345,12 @@ struct sk_buff_head {
>>>
>>> struct sk_buff;
>>>
>>> -/* To allow 64K frame to be packed as single skb without frag_list we
>>> - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
>>> - * buffers which do not start on a page boundary.
>>> - *
>>> - * Since GRO uses frags we allocate at least 16 regardless of page
>>> - * size.
>>> - */
>>> -#if (65536/PAGE_SIZE + 1) < 16
>>> -#define MAX_SKB_FRAGS 16UL
>>> -#else
>>> -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
>>> +#ifndef CONFIG_MAX_SKB_FRAGS
>>> +# define CONFIG_MAX_SKB_FRAGS 17
>>
>> There seems to be an extra space before 'define'.
>
> This is indentation. Pretty standard I would say.
>
> #if xxxxx
> # define ....
> #else
> # define ....
> #endif
>
>
>>
>> Also, is there a reason why not to keep below old logic
>> if CONFIG_MAX_SKB_FRAGS is not defined?
>>
>> #if (65536/PAGE_SIZE + 1) < 16
>> #define MAX_SKB_FRAGS 16UL
>> #else
>> #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
>> #endif
>>
>> It seems with old logic:
>> 1. for kernel with 4K page size, MAX_SKB_FRAGS is 17.
>> 2. for kernel with 64K page size, MAX_SKB_FRAGS is 16.
>
> This is for CONFIG_NET=n configs.
>
> I am pretty sure nobody would care about having 17 or 16 frags per skb
> for such a silly config.
>
> Let's not confuse readers.
>
>>
>>> #endif
>>> +
>>> +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
>>> +
>>> extern int sysctl_max_skb_frags;
>>>
>>> /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
>>> diff --git a/net/Kconfig b/net/Kconfig
>>> index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
>>> --- a/net/Kconfig
>>> +++ b/net/Kconfig
>>> @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
>>> network device refcount are using per cpu variables if this option is set.
>>> This can be forced to N to detect underflows (with a performance drop).
>>>
>>> +config MAX_SKB_FRAGS
>>> + int "Maximum number of fragments per skb_shared_info"
>>> + range 17 45
>>> + default 17
>>> + help
>>> + Having more fragments per skb_shared_info can help GRO efficiency.
>>> + This helps BIG TCP workloads, but might expose bugs in some
>>> + legacy drivers.
>>> + This also increases memory overhead of small packets,
>>> + and in drivers using build_skb().
>>> + If unsure, say 17.
>>> +
>>> config RPS
>>> bool
>>> depends on SMP && SYSFS
>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>>> index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
>>> --- a/net/packet/af_packet.c
>>> +++ b/net/packet/af_packet.c
>>> @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>>> nr_frags = skb_shinfo(skb)->nr_frags;
>>>
>>> if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
>>> - pr_err("Packet exceed the number of skb frags(%lu)\n",
>>> - MAX_SKB_FRAGS);
>>> + pr_err("Packet exceed the number of skb frags(%u)\n",
>>> + (unsigned int)MAX_SKB_FRAGS);
>>> return -EFAULT;
>>> }
>>>
>>>
> .
>
Powered by blists - more mailing lists