Message-ID: <CANn89iJz5ExMC6zGwYWQnJDehWsNwfF4xy2T9tiWodM99FnVyA@mail.gmail.com>
Date:   Tue, 16 Nov 2021 13:35:31 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     David Ahern <dsahern@...il.com>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Arjun Roy <arjunroy@...gle.com>
Subject: Re: [PATCH net-next 17/20] tcp: defer skb freeing after socket lock
 is released

On Tue, Nov 16, 2021 at 12:45 PM David Ahern <dsahern@...il.com> wrote:
>
> On 11/16/21 9:46 AM, Eric Dumazet wrote:
> > On Tue, Nov 16, 2021 at 7:27 AM Jakub Kicinski <kuba@...nel.org> wrote:
> >>
> >> On Tue, 16 Nov 2021 07:22:02 -0800 Eric Dumazet wrote:
> >>> Here is the perf top profile on cpu used by user thread doing the
> >>> recvmsg(), at 96 Gbit/s
> >>>
> >>> We no longer see skb freeing related costs, but we still see costs of
> >>> having to process the backlog.
> >>>
> >>>    81.06%  [kernel]       [k] copy_user_enhanced_fast_string
> >>>      2.50%  [kernel]       [k] __skb_datagram_iter
> >>>      2.25%  [kernel]       [k] _copy_to_iter
> >>>      1.45%  [kernel]       [k] tcp_recvmsg_locked
> >>>      1.39%  [kernel]       [k] tcp_rcv_established
> >>
> >> Huh, somehow I assumed your 4k MTU numbers were with zero-copy :o
>
> I thought the same. :-)
>
> >>
> >> Out of curiosity - what's the softirq load with 4k? Do you have an
> >> idea what the load is on the CPU consuming the data vs the softirq
> >> processing with 1500B ?
> >
> > On my testing host,
> >
> > 4K MTU : processing ~2,600,000 packets per second in GRO and other parts
> > uses about 60% of the core in BH.
>
> 4kB or 4kB+hdr MTU? I ask because there is a subtle difference in the
> size of the GRO packet which affects overall efficiency.
>
> e.g., at 1500 MTU, 1448 MSS, a GRO packet has at most 45 segments for a
> GRO size of 65212. At 4000 MTU, 3948 MSS, a GRO packet has at most 16
> segments for a GRO packet size of 63220. I have noticed that 3300 MTU is
> a bit of sweet spot with MLX5/ConnectX-5 at least - 20 segments and
> 65012 GRO packet without triggering nonlinear mode.
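
A quick cross-check of those numbers: a standard GRO/GSO packet is capped at
64KB, so the maximum segment count is floor(65535 / MSS) and the GRO packet
size is that many MSS-sized segments plus one set of headers. The 52-byte
header figure used below (IPv4 + TCP with timestamps) is only inferred from
the MTU - MSS pairs quoted above, so treat this as a rough sketch rather than
kernel code:

    #include <stdio.h>

    /* Reproduce the MTU -> GRO packet size arithmetic discussed above.
     * Assumes 52 bytes of headers (IPv4 + TCP with timestamps) and the
     * standard 64KB GSO/GRO cap.
     */
    static void gro_size(unsigned int mtu)
    {
            unsigned int hdr  = 52;            /* assumed header overhead */
            unsigned int mss  = mtu - hdr;
            unsigned int segs = 65535 / mss;   /* max segments under 64KB */

            printf("MTU %u: MSS %u, %u segments, GRO packet of %u bytes\n",
                   mtu, mss, segs, segs * mss + hdr);
    }

    int main(void)
    {
            gro_size(1500);   /* -> 45 segments, 65212 bytes */
            gro_size(4000);   /* -> 16 segments, 63220 bytes */
            gro_size(3300);   /* -> 20 segments, 65012 bytes */
            return 0;
    }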

We are using 4096 bytes of payload to enable TCP RX zero copy if the
receiver wants it.
(even though in this case I was using TCP_STREAM, which does a standard recvmsg())

Yes, the standard TSO/GRO limit in this case is 15*4K = 61440, but remember
we are also working on BIG TCP packets, so we do not have to find a 'sweet spot' :)

With BIG TCP enabled, I am sending/receiving TSO/GRO packets with 45
4K segments (184320 bytes of payload).
(But the results I gave in this thread were with the standard TSO/GRO limits.)
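
(Spelling out the arithmetic: with 4096-byte segments, floor(65535 / 4096) = 15
segments fit under the standard 64KB cap, i.e. 15 * 4096 = 61440 bytes of
payload, while a 45-segment BIG TCP packet carries 45 * 4096 = 184320 bytes.)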

>
>
> > (Some of this cost comes from a clang issue, and the csum_partial() one
> > I was working on last week)
> > NIC RX interrupts are firing about 25,000 times per second in this setup.
> >
> > 1500 MTU : processing ~5,800,000 packets per second uses one core in
> > BH (and also one core in recvmsg()).
> > We stay in NAPI mode (no IRQ rearming)
> > (That was with a TCP_STREAM run sustaining 70Gbit)
> >
> > BH numbers also depend on IRQ coalescing parameters.
> >
>
> What NIC do you use for testing?

Google proprietary NIC.
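
On the IRQ coalescing point above: those parameters can typically be inspected
and tuned with ethtool, e.g. (interface name and values here are only an
example, not what we run):

    ethtool -c eth0                                    # show coalescing settings
    ethtool -C eth0 adaptive-rx off rx-usecs 50 rx-frames 64

Fewer interrupts per second reduce BH overhead at the cost of latency, so the
BH numbers above will move with these knobs.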
