Message-ID: <rntmnecd6w7ntnazqloxo44dub2snqf73zn2jqwuur6io2xdv7@4iqbg5odgmfq>
Date: Fri, 22 Nov 2024 17:10:06 -0700
From: Daniel Xu <dxu@...uu.xyz>
To: Alexander Lobakin <aleksander.lobakin@...el.com>
Cc: Lorenzo Bianconi <lorenzo@...nel.org>, 
	"bpf@...r.kernel.org" <bpf@...r.kernel.org>, Jakub Kicinski <kuba@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, John Fastabend <john.fastabend@...il.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, 
	David Miller <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org, 
	Lorenzo Bianconi <lorenzo.bianconi@...hat.com>
Subject: Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase

Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@...el.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@...el.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@...nel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@...nel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>>> Hi Lorenzo,
> >>>>>>>
> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> >>>>>>>> NAPI-kthread pinned on the selected cpu.
> >>>>>>>>
> >>>>>>>> Changes in rfc v2:
> >>>>>>>> - get rid of dummy netdev dependency
> >>>>>>>>
> >>>>>>>> Lorenzo Bianconi (3):
> >>>>>>>>   net: Add napi_init_for_gro routine
> >>>>>>>>   net: add napi_threaded_poll to netdevice.h
> >>>>>>>>   bpf: cpumap: Add gro support
> >>>>>>>>
> >>>>>>>>  include/linux/netdevice.h |   3 +
> >>>>>>>>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
> >>>>>>>>  net/core/dev.c            |  27 ++++++---
> >>>>>>>>  3 files changed, 73 insertions(+), 80 deletions(-)
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.46.0
> >>>>>>>>
> >>>>>>>
> >>>>>>> Sorry about the long delay - finally caught up to everything after
> >>>>>>> conferences.
> >>>>>>>
> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >>>>>>> variable I changed is kernel version - steering prog is active for both.
> >>>>>>>
> >>>>>>>
> >>>>>>> Baseline (again)
> >>>>>>>
> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30			        ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>>>>>>
> >>>>>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> >>>>>>> Run 1	2560252	        0.00009087	0.00010495	0.00011647		Run 1	15479.31
> >>>>>>> Run 2	2665517	        0.00008575	0.00010239	0.00013311		Run 2	15162.48
> >>>>>>> Run 3	2755939	        0.00008191	0.00010367	0.00012287		Run 3	14709.04
> >>>>>>> Run 4	2595680	        0.00008575	0.00011263	0.00012671		Run 4	15373.06
> >>>>>>> Run 5	2841865	        0.00007999	0.00009471	0.00012799		Run 5	15234.91
> >>>>>>> Average	2683850.6	0.000084854	0.00010367	0.00012543		Average	15191.76
> >>>>>>>
> >>>>>>> cpumap NAPI patches v2
> >>>>>>>
> >>>>>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> >>>>>>> Run 1	2577838	        0.00008575	0.00012031	0.00013695		Run 1	19914.56
> >>>>>>> Run 2	2729237	        0.00007551	0.00013311	0.00017663		Run 2	20140.92
> >>>>>>> Run 3	2689442	        0.00008319	0.00010495	0.00013311		Run 3	19887.48
> >>>>>>> Run 4	2862366	        0.00008127	0.00009471	0.00010623		Run 4	19374.49
> >>>>>>> Run 5	2700538	        0.00008319	0.00010367	0.00012799		Run 5	19784.49
> >>>>>>> Average	2711884.2	0.000081782	0.00011135	0.000136182		Average	19820.388
> >>>>>>> Delta	1.04%	        -3.62%	        7.41%	        8.57%			        30.47%
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Daniel
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >>  kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>

Baseline (again)

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126

cpumap v2 Olek

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
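
For anyone who wants to re-derive the Delta row, here is a minimal sketch of the arithmetic, assuming Delta is the relative change of the patched average against the baseline average. The helper names (`mean`, `pct_delta`) are illustrative and are not taken from any benchmark tooling; the throughput values are copied from the two tables above.

```python
def mean(values):
    """Arithmetic mean of a list of per-run measurements."""
    return sum(values) / len(values)


def pct_delta(baseline, patched):
    """Relative change of `patched` versus `baseline`, in percent."""
    return (patched - baseline) / baseline * 100.0


# tcp_stream throughput in Mbit/s, runs 1-5, copied from the tables above.
baseline_tput = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched_tput = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]

delta = pct_delta(mean(baseline_tput), mean(patched_tput))
print(f"tcp_stream throughput delta: {delta:.2f}%")  # -41.32%
```

The same two helpers apply to the tcp_rr columns (transactions and latency percentiles) by swapping in those per-run values.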


It's very interesting that we see -40% tput w/ the patches. I went back
and double checked, and the numbers seem right. Here's some output
from profiles I took with:

    perf record -e cycles:k -a -- sleep 10
    perf --no-pager diff perf.data.baseline perf.data.withpatches > ...

    # Event 'cycles:k'
    # Baseline  Delta Abs  Shared Object                                                    Symbol
     6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency                                [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering                               [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                                                [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                                                [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                                                [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                                                [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                                                [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                                                [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                                                [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                                                [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                                                [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                                                [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                                                [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                                                [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                                                [k] skb_release_data
               +0.81%  [kernel.kallsyms]                                                [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                                                [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                                                [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                                                [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                                                [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                                                [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                                                [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                                                [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                                                [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                                                [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                                                [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                                                [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                                                [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                                                [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                                                        [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                                                [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                                                [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                                                [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                                                [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                                                [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                                                [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                                                [k] tcp_urg
               +0.39%  [kernel.kallsyms]                                                [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                                                [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                                                [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                                                [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                                                [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                                                [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                                                [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                                                [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                                                         [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                                                [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                                                [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                                                [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                                                [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                                                [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                                                [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                                                [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                                                [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                                                [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                                                [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                                                [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                                                [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                                                [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                                                [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                                                [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                                                [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                                                [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                                                [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                                                [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                                                [k] tcp_add_backlog

     <snip>

     5.68%             bpf_prog_17fea1bb6503ed98_steering                               [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                                                [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                                                [k] __memset
     0.54%             [kernel.kallsyms]                                                [k] __memcpy
     0.18%             [kernel.kallsyms]                                                [k] __irqentry_text_start

     <snip>
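
To slice a listing like the one above, a small parser can help. This is only a sketch: it assumes each data row is whitespace-separated, ends with the symbol name, and carries the delta as a signed percentage in one of the first two columns. The function name `parse_perf_diff` and the `SAMPLE` excerpt are illustrative; the rows in `SAMPLE` are copied from the output above with the padding shortened.

```python
import re

# A delta column looks like "+0.98%" or "-3.60%"; baseline columns are unsigned.
DELTA_RE = re.compile(r"^[+-]\d+\.\d+%$")

SAMPLE = """\
# Event 'cycles:k'
     6.13%     -3.60%  [kernel.kallsyms]  [k] _copy_to_iter
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering  [k] bpf_prog_5c74b34eb24d5c9b_steering
     0.00%     +0.98%  [kernel.kallsyms]  [k] tcp6_gro_receive
     0.00%     +0.74%  [kernel.kallsyms]  [k] dev_gro_receive
               +0.72%  [kernel.kallsyms]  [k] tcp_gro_receive
     0.00%     +0.47%  [kernel.kallsyms]  [k] ipv6_gro_receive
     5.68%             bpf_prog_17fea1bb6503ed98_steering  [k] bpf_prog_17fea1bb6503ed98_steering
"""


def parse_perf_diff(text):
    """Map symbol name -> delta (percentage points) for rows that have a delta."""
    deltas = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        delta_tok = next((t for t in tokens[:2] if DELTA_RE.match(t)), None)
        if delta_tok is None:
            continue  # baseline-only row: no delta column to record
        deltas[tokens[-1]] = float(delta_tok.rstrip("%"))
    return deltas


deltas = parse_perf_diff(SAMPLE)

# Rank symbols by how far they moved, in either direction.
ranked = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Sum the deltas of the GRO receive-path symbols in this excerpt.
gro_total = sum(d for sym, d in deltas.items() if "gro_" in sym)
print(ranked[0])
print(f"{gro_total:.2f}")
```

Rows that carry only a baseline value, such as the last line of `SAMPLE`, are skipped because they have no delta column. The sum covers only this excerpt, not the full snipped profile.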

Please let me know if you want me to collect any other data.

Thanks,
Daniel
