Message-ID: <CANn89iKUFrFx09dCkqtH_eAjEJA9LKkbwackBcU37KAkwNHAEw@mail.gmail.com>
Date: Thu, 23 Mar 2023 18:40:15 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Nikolay Aleksandrov <razor@...ckwall.org>
Subject: Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
On Thu, Mar 23, 2023 at 6:32 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2023/3/24 0:28, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@...gle.com>
> >
> > Currently, MAX_SKB_FRAGS value is 17.
> >
> > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > attempts order-3 allocations, stuffing 32768 bytes per frag.
> >
> > But with zero copy, we use order-0 pages.
> >
> > For BIG TCP to show its full potential, we add a config option
> > to be able to fit up to 45 segments per skb.
> >
> > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > does not support skbs with frag list.
>
> Just out of curiosity, is it possible to add support for skbs with a
> frag list for zerocopy, if the driver also supports transmitting skbs
> with a frag list (NETIF_F_FRAGLIST feature on)?
We are talking about rx zerocopy; look at net/ipv4/tcp.c. (This is not
tied to NETIF_F_FRAGLIST support,
because packets land in a TCP receive queue.)
>
> >
> > We have used MAX_SKB_FRAGS=45 value for years at Google before
> > we deployed 4K MTU, with no adverse effect, other than
> > a recent issue in mlx4, fixed in commit 26782aad00cc
> > ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
> >
> > Back then, the goal was to be able to receive full size (64KB) GRO
> > packets without the frag_list overhead.
> >
> > Note that /proc/sys/net/core/max_skb_frags can also be used to limit
> > the number of fragments TCP can use in tx packets.
> >
> > By default we keep the old/legacy value of 17 until we get
> > more coverage for the updated values.
> >
> > Sizes of struct skb_shared_info on 64bit arches
> >
> > MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
> > ==============================================
> > 17 320
> > 21 320+64 = 384
> > 25 320+128 = 448
> > 29 320+192 = 512
> > 33 320+256 = 576
> > 37 320+320 = 640
> > 41 320+384 = 704
> > 45 320+448 = 768
> >
> > This inflation might cause problems for drivers assuming they could pack
> > both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
> > using build_skb().
> >
> > v3: fix build error when CONFIG_NET=n
> > v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > Reviewed-by: Nikolay Aleksandrov <razor@...ckwall.org>
> > ---
> > drivers/scsi/cxgbi/libcxgbi.c | 4 ++--
> > include/linux/skbuff.h | 16 +++++-----------
> > net/Kconfig | 12 ++++++++++++
> > net/packet/af_packet.c | 4 ++--
> > 4 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> > index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
> > --- a/drivers/scsi/cxgbi/libcxgbi.c
> > +++ b/drivers/scsi/cxgbi/libcxgbi.c
> > @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
> > frags++;
> >
> > if (frags >= SKB_WR_LIST_SIZE) {
> > - pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
> > + pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
> > csk, skb_shinfo(skb)->nr_frags, skb->len,
> > - skb->data_len, SKB_WR_LIST_SIZE);
> > + skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
> > return -EINVAL;
> > }
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -345,18 +345,12 @@ struct sk_buff_head {
> >
> > struct sk_buff;
> >
> > -/* To allow 64K frame to be packed as single skb without frag_list we
> > - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
> > - * buffers which do not start on a page boundary.
> > - *
> > - * Since GRO uses frags we allocate at least 16 regardless of page
> > - * size.
> > - */
> > -#if (65536/PAGE_SIZE + 1) < 16
> > -#define MAX_SKB_FRAGS 16UL
> > -#else
> > -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> > +#ifndef CONFIG_MAX_SKB_FRAGS
> > +# define CONFIG_MAX_SKB_FRAGS 17
>
> There seems to be an extra space before 'define'.
This is indentation. Pretty standard I would say.
#if xxxxx
# define ....
#else
# define ....
#endif
>
> Also, is there a reason why not to keep below old logic
> if CONFIG_MAX_SKB_FRAGS is not defined?
>
> #if (65536/PAGE_SIZE + 1) < 16
> #define MAX_SKB_FRAGS 16UL
> #else
> #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> #endif
>
> It seems with old logic:
> 1. for kernel with 4K page size, MAX_SKB_FRAGS is 17.
> 2. for kernel with 64K page size, MAX_SKB_FRAGS is 16.
This is for CONFIG_NET=n configs.
I am pretty sure nobody would care about having 17 or 16 frags per skb
for such a silly config.
Let's not confuse readers.
>
> > #endif
> > +
> > +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
> > +
> > extern int sysctl_max_skb_frags;
> >
> > /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
> > diff --git a/net/Kconfig b/net/Kconfig
> > index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
> > network device refcount are using per cpu variables if this option is set.
> > This can be forced to N to detect underflows (with a performance drop).
> >
> > +config MAX_SKB_FRAGS
> > + int "Maximum number of fragments per skb_shared_info"
> > + range 17 45
> > + default 17
> > + help
> > + Having more fragments per skb_shared_info can help GRO efficiency.
> > + This helps BIG TCP workloads, but might expose bugs in some
> > + legacy drivers.
> > + This also increases memory overhead of small packets,
> > + and in drivers using build_skb().
> > + If unsure, say 17.
> > +
> > config RPS
> > bool
> > depends on SMP && SYSFS
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > nr_frags = skb_shinfo(skb)->nr_frags;
> >
> > if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
> > - pr_err("Packet exceed the number of skb frags(%lu)\n",
> > - MAX_SKB_FRAGS);
> > + pr_err("Packet exceed the number of skb frags(%u)\n",
> > + (unsigned int)MAX_SKB_FRAGS);
> > return -EFAULT;
> > }
> >
> >