Message-ID: <rntmnecd6w7ntnazqloxo44dub2snqf73zn2jqwuur6io2xdv7@4iqbg5odgmfq>
Date: Fri, 22 Nov 2024 17:10:06 -0700
From: Daniel Xu <dxu@...uu.xyz>
To: Alexander Lobakin <aleksander.lobakin@...el.com>
Cc: Lorenzo Bianconi <lorenzo@...nel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>, Jakub Kicinski <kuba@...nel.org>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, John Fastabend <john.fastabend@...il.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>,
David Miller <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>
Subject: Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
Hi Olek,
Here are the results.
On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@...el.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@...el.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@...nel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@...nel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>>> Hi Lorenzo,
> >>>>>>>
> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> >>>>>>>> NAPI-kthread pinned on the selected cpu.
> >>>>>>>>
> >>>>>>>> Changes in rfc v2:
> >>>>>>>> - get rid of dummy netdev dependency
> >>>>>>>>
> >>>>>>>> Lorenzo Bianconi (3):
> >>>>>>>> net: Add napi_init_for_gro routine
> >>>>>>>> net: add napi_threaded_poll to netdevice.h
> >>>>>>>> bpf: cpumap: Add gro support
> >>>>>>>>
> >>>>>>>> include/linux/netdevice.h | 3 +
> >>>>>>>> kernel/bpf/cpumap.c | 123 ++++++++++++++++----------------------
> >>>>>>>> net/core/dev.c | 27 ++++++---
> >>>>>>>> 3 files changed, 73 insertions(+), 80 deletions(-)
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.46.0
> >>>>>>>>
> >>>>>>>
> >>>>>>> Sorry about the long delay - finally caught up to everything after
> >>>>>>> conferences.
> >>>>>>>
> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >>>>>>> variable I changed is kernel version - steering prog is active for both.
> >>>>>>>
> >>>>>>>
> >>>>>>> Baseline (again)
> >>>>>>>
> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30    ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>>>>>>
> >>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> >>>>>>> Run 1    2560252       0.00009087       0.00010495       0.00011647       Run 1    15479.31
> >>>>>>> Run 2    2665517       0.00008575       0.00010239       0.00013311       Run 2    15162.48
> >>>>>>> Run 3    2755939       0.00008191       0.00010367       0.00012287       Run 3    14709.04
> >>>>>>> Run 4    2595680       0.00008575       0.00011263       0.00012671       Run 4    15373.06
> >>>>>>> Run 5    2841865       0.00007999       0.00009471       0.00012799       Run 5    15234.91
> >>>>>>> Average  2683850.6     0.000084854      0.00010367       0.00012543       Average  15191.76
> >>>>>>>
> >>>>>>> cpumap NAPI patches v2
> >>>>>>>
> >>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> >>>>>>> Run 1    2577838       0.00008575       0.00012031       0.00013695       Run 1    19914.56
> >>>>>>> Run 2    2729237       0.00007551       0.00013311       0.00017663       Run 2    20140.92
> >>>>>>> Run 3    2689442       0.00008319       0.00010495       0.00013311       Run 3    19887.48
> >>>>>>> Run 4    2862366       0.00008127       0.00009471       0.00010623       Run 4    19374.49
> >>>>>>> Run 5    2700538       0.00008319       0.00010367       0.00012799       Run 5    19784.49
> >>>>>>> Average  2711884.2     0.000081782      0.00011135       0.000136182      Average  19820.388
> >>>>>>> Delta    1.04%         -3.62%           7.41%            8.57%                     30.47%
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Daniel
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >> kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>
Baseline (again)

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1    3169917       0.00007295       0.00007871       0.00009343       Run 1    21749.43
Run 2    3228290       0.00007103       0.00007679       0.00009215       Run 2    21897.17
Run 3    3226746       0.00007231       0.00007871       0.00009087       Run 3    21906.82
Run 4    3191258       0.00007231       0.00007743       0.00009087       Run 4    21155.15
Run 5    3235653       0.00007231       0.00007743       0.00008703       Run 5    21397.06
Average  3210372.8     0.000072182      0.000077814      0.00009087       Average  21621.126

cpumap v2 Olek

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1    3253651       0.00007167       0.00007807       0.00009343       Run 1    13497.57
Run 2    3221492       0.00007231       0.00007743       0.00009087       Run 2    12115.53
Run 3    3296453       0.00007039       0.00007807       0.00009087       Run 3    12323.38
Run 4    3254460       0.00007167       0.00007807       0.00009087       Run 4    12901.88
Run 5    3173327       0.00007295       0.00007871       0.00009215       Run 5    12593.22
Average  3239876.6     0.000071798      0.00007807       0.000091638      Average  12686.316
Delta    0.92%         -0.53%           0.33%            0.85%                     -41.32%
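
For anyone rechecking the Delta row: each entry is the relative change of the
patched average against the baseline average. A minimal Python sketch using
the transaction and throughput runs from the tables above (pct_delta is just
an illustrative helper name, not part of neper or any other tool used here):

```python
from statistics import fmean


def pct_delta(baseline, patched):
    """Relative change of mean(patched) vs. mean(baseline), in percent."""
    base_avg = fmean(baseline)
    return (fmean(patched) - base_avg) / base_avg * 100.0


# tcp_stream throughput (Mbit/s), five runs each, copied from the tables.
baseline_mbit = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched_mbit = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]

# tcp_rr transactions, five runs each, copied from the tables.
baseline_txn = [3169917, 3228290, 3226746, 3191258, 3235653]
patched_txn = [3253651, 3221492, 3296453, 3254460, 3173327]

print(f"throughput delta:   {pct_delta(baseline_mbit, patched_mbit):.2f}%")  # -41.32%
print(f"transactions delta: {pct_delta(baseline_txn, patched_txn):.2f}%")    # 0.92%
```
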
It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's some
output from the profiles I took with:

    perf record -e cycles:k -a -- sleep 10
    perf --no-pager diff perf.data.baseline perf.data.withpatches > ...

# Event 'cycles:k'
# Baseline  Delta Abs  Shared Object                       Symbol
     6.13%     -3.60%  [kernel.kallsyms]                   [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency   [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering  [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                   [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                   [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                   [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                   [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                   [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                   [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                   [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                   [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                   [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                   [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                   [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                   [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                   [k] skb_release_data
               +0.81%  [kernel.kallsyms]                   [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                   [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                   [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                   [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                   [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                   [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                   [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                   [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                   [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                   [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                   [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                   [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                   [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                   [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                           [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                   [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                   [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                   [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                   [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                   [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                   [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                   [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                   [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                   [k] tcp_urg
               +0.39%  [kernel.kallsyms]                   [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                   [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                   [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                   [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                   [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                   [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                   [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                   [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                            [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                   [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                   [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                   [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                   [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                   [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                   [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                   [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                   [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                   [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                   [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                   [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                   [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                   [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                   [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                   [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                   [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                   [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                   [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                   [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                   [k] tcp_add_backlog
<snip>
     5.68%             bpf_prog_17fea1bb6503ed98_steering  [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                   [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                   [k] __memset
     0.54%             [kernel.kallsyms]                   [k] __memcpy
     0.18%             [kernel.kallsyms]                   [k] __irqentry_text_start
<snip>

Please let me know if you want me to collect any other data.
Thanks,
Daniel