Message-ID: <CANn89iKUFrFx09dCkqtH_eAjEJA9LKkbwackBcU37KAkwNHAEw@mail.gmail.com>
Date: Thu, 23 Mar 2023 18:40:15 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
"David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Nikolay Aleksandrov <razor@...ckwall.org>
Subject: Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
On Thu, Mar 23, 2023 at 6:32 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2023/3/24 0:28, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@...gle.com>
> >
> > Currently, MAX_SKB_FRAGS value is 17.
> >
> > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > attempts order-3 allocations, stuffing 32768 bytes per frag.
> >
> > But with zero copy, we use order-0 pages.
> >
> > For BIG TCP to show its full potential, we add a config option
> > to be able to fit up to 45 segments per skb.
> >
> > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > does not support skbs with frag list.
>
> Just out of curiosity, is it possible to add support for skbs with a
> frag list for zerocopy, if the driver also supports transmitting skbs
> with a frag list (NETIF_F_FRAGLIST feature on)?
We are talking about rx zerocopy; look at net/ipv4/tcp.c. (This is not
tied to NETIF_F_FRAGLIST support,
because packets land in a TCP receive queue.)
>
> >
> > We have used MAX_SKB_FRAGS=45 value for years at Google before
> > we deployed 4K MTU, with no adverse effect, other than
> > a recent issue in mlx4, fixed in commit 26782aad00cc
> > ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
> >
> > Back then, the goal was to be able to receive full size (64KB) GRO
> > packets without the frag_list overhead.
> >
> > Note that /proc/sys/net/core/max_skb_frags can also be used to limit
> > the number of fragments TCP can use in tx packets.
> >
> > By default we keep the old/legacy value of 17 until we get
> > more coverage for the updated values.
> >
> > Sizes of struct skb_shared_info on 64bit arches
> >
> > MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
> > ==============================================
> > 17 320
> > 21 320+64 = 384
> > 25 320+128 = 448
> > 29 320+192 = 512
> > 33 320+256 = 576
> > 37 320+320 = 640
> > 41 320+384 = 704
> > 45 320+448 = 768
> >
> > This inflation might cause problems for drivers assuming they could pack
> > both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
> > using build_skb().
> >
> > v3: fix build error when CONFIG_NET=n
> > v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > Reviewed-by: Nikolay Aleksandrov <razor@...ckwall.org>
> > ---
> > drivers/scsi/cxgbi/libcxgbi.c | 4 ++--
> > include/linux/skbuff.h | 16 +++++-----------
> > net/Kconfig | 12 ++++++++++++
> > net/packet/af_packet.c | 4 ++--
> > 4 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> > index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
> > --- a/drivers/scsi/cxgbi/libcxgbi.c
> > +++ b/drivers/scsi/cxgbi/libcxgbi.c
> > @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
> > frags++;
> >
> > if (frags >= SKB_WR_LIST_SIZE) {
> > - pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
> > + pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
> > csk, skb_shinfo(skb)->nr_frags, skb->len,
> > - skb->data_len, SKB_WR_LIST_SIZE);
> > + skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
> > return -EINVAL;
> > }
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -345,18 +345,12 @@ struct sk_buff_head {
> >
> > struct sk_buff;
> >
> > -/* To allow 64K frame to be packed as single skb without frag_list we
> > - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
> > - * buffers which do not start on a page boundary.
> > - *
> > - * Since GRO uses frags we allocate at least 16 regardless of page
> > - * size.
> > - */
> > -#if (65536/PAGE_SIZE + 1) < 16
> > -#define MAX_SKB_FRAGS 16UL
> > -#else
> > -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> > +#ifndef CONFIG_MAX_SKB_FRAGS
> > +# define CONFIG_MAX_SKB_FRAGS 17
>
> There seems to be an extra space before 'define'.
This is indentation. Pretty standard I would say.
#if xxxxx
# define ....
#else
# define ....
#endif
>
> Also, is there a reason why not to keep below old logic
> if CONFIG_MAX_SKB_FRAGS is not defined?
>
> #if (65536/PAGE_SIZE + 1) < 16
> #define MAX_SKB_FRAGS 16UL
> #else
> #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> #endif
>
> It seems with old logic:
> 1. for kernel with 4K page size, MAX_SKB_FRAGS is 17.
> 2. for kernel with 64K page size, MAX_SKB_FRAGS is 16.
This is for CONFIG_NET=n configs.
I am pretty sure nobody would care about having 17 or 16 frags per skb
for such a silly config.
Let's not confuse readers.
>
> > #endif
> > +
> > +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
> > +
> > extern int sysctl_max_skb_frags;
> >
> > /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
> > diff --git a/net/Kconfig b/net/Kconfig
> > index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
> > network device refcount are using per cpu variables if this option is set.
> > This can be forced to N to detect underflows (with a performance drop).
> >
> > +config MAX_SKB_FRAGS
> > + int "Maximum number of fragments per skb_shared_info"
> > + range 17 45
> > + default 17
> > + help
> > + Having more fragments per skb_shared_info can help GRO efficiency.
> > + This helps BIG TCP workloads, but might expose bugs in some
> > + legacy drivers.
> > + This also increases memory overhead of small packets,
> > + and in drivers using build_skb().
> > + If unsure, say 17.
> > +
> > config RPS
> > bool
> > depends on SMP && SYSFS
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> > nr_frags = skb_shinfo(skb)->nr_frags;
> >
> > if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
> > - pr_err("Packet exceed the number of skb frags(%lu)\n",
> > - MAX_SKB_FRAGS);
> > + pr_err("Packet exceed the number of skb frags(%u)\n",
> > + (unsigned int)MAX_SKB_FRAGS);
> > return -EFAULT;
> > }
> >
> >