Date:   Thu, 4 Oct 2018 23:18:48 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     jeffrey.t.kirsher@...el.com, intel-wired-lan@...ts.osuosl.org,
        Björn Töpel <bjorn.topel@...el.com>,
        magnus.karlsson@...el.com, magnus.karlsson@...il.com,
        ast@...nel.org, daniel@...earbox.net, netdev@...r.kernel.org,
        u9012063@...il.com, tuc@...are.com, jakub.kicinski@...ronome.com,
        brouer@...hat.com
Subject: Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

On Tue,  2 Oct 2018 10:00:29 +0200
Björn Töpel <bjorn.topel@...il.com> wrote:

> From: Björn Töpel <bjorn.topel@...el.com>
> 
> Jeff: Please remove the v1 patches from your dev-queue!
> 
> This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
> driver.
> 
> The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
> analogous to the i40e ZC support. Again, as in i40e, code paths have
> been copied from the XDP path to the zero-copy path. Going forward we
> will try to generalize more code between the AF_XDP ZC drivers, and
> also reduce the heavy C&P.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is GCC 7.3.0. The NIC is Intel
> 82599ES/X520-2 10Gbit/s using the ixgbe driver.
> 
> Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
> for 64B and 1500B packets, generated by a commercial packet generator
> HW blasting packets at full 10Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes.
> 
> AF_XDP performance 64B packets:
> Benchmark   XDP_DRV with zerocopy
> rxdrop        14.7
> txpush        14.6

I see similar performance numbers, but my system can crash with 'txonly'.

See the full crash log and my analysis below.

> l2fwd         11.1

Got l2fwd 13.2 Mpps.
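
For context, the theoretical ceiling at 10 Gbit/s can be estimated from
the on-wire size of each frame. This is only a rough sketch: it assumes
the quoted packet sizes include the 4-byte FCS (the mail does not say)
and the usual 20 bytes of preamble, start-of-frame delimiter and
inter-frame gap per frame. The helper name line_rate_pps is made up for
illustration.

```c
#include <stdint.h>

/* Bytes on the wire per frame that are not part of the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte
 * inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Hypothetical helper: theoretical maximum packets per second for a
 * link speed in bits/s and a frame size in bytes (assumed to include
 * the FCS, e.g. 64 for minimum-size frames). Integer division rounds
 * down. */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame =
		((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;

	return link_bps / bits_per_frame;
}
```

Under these assumptions line_rate_pps(10000000000ULL, 64) evaluates to
14880952, i.e. about 14.88 Mpps, so the 64B rxdrop and txonly numbers in
this thread (14.6 to 14.7 Mpps) are close to line rate.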


> 
> AF_XDP performance 1500B packets:
> Benchmark   XDP_DRV with zerocopy
> rxdrop        0.8
> l2fwd         0.8
> 
> XDP performance on our system as a base line.
> 
> 64B packets:
> XDP stats       CPU     Mpps       issue-pps
> XDP-RX CPU      16      14.7       0
> 
> 1500B packets:
> XDP stats       CPU     Mpps       issue-pps
> XDP-RX CPU      16      0.8        0
> 
> The structure of the patch set is as follows:
> 
> Patch 1: Introduce Rx/Tx ring enable/disable functionality
> Patch 2: Preparatory patch to ixgbe driver code for RX
> Patch 3: ixgbe zero-copy support for RX
> Patch 4: Preparatory patch to ixgbe driver code for TX
> Patch 5: ixgbe zero-copy support for TX
> 
> Changes since v1:
> 
> * Removed redundant AF_XDP precondition checks, pointed out by
>   Jakub. Now, the preconditions are only checked at XDP enable time.
> * Fixed a crash in the egress path, due to incorrect usage of
>   ixgbe_ring queue_index member. In v2 a ring_idx back reference is
>   introduced, and used in favor of queue_index. William reported the
>   crash, and helped me smoke out the issue. Kudos!
> * In ixgbe_xsk_async_xmit, validate qid against num_xdp_queues,
>   instead of num_rx_queues.
> 
> Cheers!
> Björn
> 
> Björn Töpel (5):
>   ixgbe: added Rx/Tx ring disable/enable functions
>   ixgbe: move common Rx functions to ixgbe_txrx_common.h
>   ixgbe: add AF_XDP zero-copy Rx support
>   ixgbe: move common Tx functions to ixgbe_txrx_common.h
>   ixgbe: add AF_XDP zero-copy Tx support
> 
>  drivers/net/ethernet/intel/ixgbe/Makefile     |   3 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |  28 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  17 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 291 ++++++-
>  .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  50 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 803 ++++++++++++++++++
>  6 files changed, 1146 insertions(+), 46 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
>  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c



 sock0@...be2:0 rxdrop 	
                pps         pkts        1.00       
rx              14,572,284  36,093,496 
tx              0           0          


 sock0@...be2:0 l2fwd 	
                pps         pkts        1.00       
rx              13,287,830  108,616,192
tx              13,287,830  108,616,284




Note that the crash only happens sometimes (on the second invocation):

$ sudo ./xdpsock --interface ixgbe2 --txonly --zero
samples/bpf/xdpsock_user.c:kick_tx:749: Assertion failed: 0: errno: 100/"Network is down"

 sock0@...be2:0 txonly 	
                pps         pkts        0.05       
rx              0           0          
tx              33,763      1,709      


$ sudo ./xdpsock --interface ixgbe2 --txonly --zero

 sock0@...be2:0 txonly 	
                pps         pkts        1.00       
rx              0           0          
tx              14,730,354  14,733,404 


$ sudo ./xdpsock --interface ixgbe2 --txonly --zero
samples/bpf/xdpsock_user.c:kick_tx:749: Assertion failed: 0: errno: 100/"Network is down"

 sock0@...be2:0 txonly 	
                pps         pkts        0.26       
rx              0           0          
tx              2,054,927   524,680    

$ sudo ./xdpsock --interface ixgbe2 --txonly --zero


[  249.953547] ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
[  250.204158] ixgbe 0000:01:00.1 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: None
[  257.217496] ixgbe 0000:01:00.1: removed PHC on ixgbe2
[  257.279328] ixgbe 0000:01:00.1: Multiqueue Disabled: Rx Queue count = 1, Tx Queue count = 1 XDP Queue count = 6
[  257.308463] ixgbe 0000:01:00.1: registered PHC device on ixgbe2
[  257.489166] ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
[  257.494923] ixgbe 0000:01:00.1 ixgbe2: initiating reset to clear Tx work after link loss
[  257.716190] ixgbe 0000:01:00.1 ixgbe2: Reset adapter
[  257.968552] ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
[  258.185273] ixgbe 0000:01:00.1 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: None
[  260.836196] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  260.844652] PGD 0 P4D 0 
[  260.847527] Oops: 0002 [#1] PREEMPT SMP PTI
[  260.852042] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.19.0-rc5-bpf-next-xdp-ixgbe-ZC+ #66
[  260.861269] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[  260.869381] RIP: 0010:xsk_umem_consume_tx+0xc9/0x180
[  260.874682] Code: 24 75 be 48 8b 86 08 03 00 00 48 8d b0 f8 fc ff ff 48 39 c7 75 96 e8 26 bd 8a ff 5b 31 c0 41 5a 41 5c 41 5d 5d 49 8d 62 f8 c3 <89> 41 40 8b 4a 24 8b 42 1c 29 c8 75 0b 48 8b 42 28 8b 00 89 42 1c
[  260.894317] RSP: 0018:ffffc9000323bd00 EFLAGS: 00010246
[  260.899873] RAX: 0000000000000000 RBX: ffffc9000323bd68 RCX: 0000000000000000
[  260.907339] RDX: ffff8808553e1c00 RSI: ffff880826e43000 RDI: ffff880854940818
[  260.914801] RBP: ffffc9000323bd20 R08: 0000000000000010 R09: 0000000000000000
[  260.922263] R10: ffffc9000323bd40 R11: 0000000000000000 R12: ffffc9000323bd64
[  260.929726] R13: ffff880854940780 R14: 0000000000000000 R15: 0000000000000000
[  260.937189] FS:  0000000000000000(0000) GS:ffff88085c640000(0000) knlGS:0000000000000000
[  260.945871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  260.951943] CR2: 0000000000000040 CR3: 000000087f20a006 CR4: 00000000003606e0
[  260.959409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  260.966872] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  260.974333] Call Trace:
[  260.977115]  ? ixgbe_clean_xdp_tx_irq+0x19d/0x2e0 [ixgbe]
[  260.982843]  ixgbe_clean_xdp_tx_irq+0x19d/0x2e0 [ixgbe]
[  260.988426]  ixgbe_poll+0x5a/0x700 [ixgbe]
[  260.992850]  net_rx_action+0x141/0x3f0
[  260.996931]  ? sort_range+0x20/0x20
[  261.000743]  __do_softirq+0xe3/0x2f7
[  261.004656]  ? sort_range+0x20/0x20
[  261.008490]  run_ksoftirqd+0x26/0x30
[  261.012420]  smpboot_thread_fn+0x114/0x1d0
[  261.016848]  kthread+0x111/0x130
[  261.020423]  ? kthread_create_worker_on_cpu+0x50/0x50
[  261.025802]  ret_from_fork+0x1f/0x30
[  261.029707] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables tun nfnetlink bridge nf_defrag_ipv6 nf_defrag_ipv4 bpfilter sunrpc coretemp intel_cstate intel_uncore intel_rapl_perf pcspkr i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq sch_fq_codel ixgbe mdio mlx5_core i40e igb nfp ptp i2c_algo_bit devlink i2c_core pps_core hid_generic [last unloaded: x_tables]
[  261.067878] CR2: 0000000000000040
[  261.071526] ---[ end trace f0011e17c3744ee4 ]---
[  261.077903] RIP: 0010:xsk_umem_consume_tx+0xc9/0x180
[  261.083191] Code: 24 75 be 48 8b 86 08 03 00 00 48 8d b0 f8 fc ff ff 48 39 c7 75 96 e8 26 bd 8a ff 5b 31 c0 41 5a 41 5c 41 5d 5d 49 8d 62 f8 c3 <89> 41 40 8b 4a 24 8b 42 1c 29 c8 75 0b 48 8b 42 28 8b 00 89 42 1c
[  261.102852] RSP: 0018:ffffc9000323bd00 EFLAGS: 00010246
[  261.108423] RAX: 0000000000000000 RBX: ffffc9000323bd68 RCX: 0000000000000000
[  261.115889] RDX: ffff8808553e1c00 RSI: ffff880826e43000 RDI: ffff880854940818
[  261.123382] RBP: ffffc9000323bd20 R08: 0000000000000010 R09: 0000000000000000
[  261.130847] R10: ffffc9000323bd40 R11: 0000000000000000 R12: ffffc9000323bd64
[  261.138325] R13: ffff880854940780 R14: 0000000000000000 R15: 0000000000000000
[  261.145788] FS:  0000000000000000(0000) GS:ffff88085c640000(0000) knlGS:0000000000000000
[  261.154503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.160594] CR2: 0000000000000040 CR3: 000000087f20a006 CR4: 00000000003606e0
[  261.168070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  261.175547] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  261.183012] Kernel panic - not syncing: Fatal exception in interrupt
[  261.189743] Kernel Offset: disabled
[  261.194954] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
[  261.203123] ------------[ cut here ]------------
[  261.208071] sched: Unexpected reschedule of offline CPU#0!
[  261.213885] WARNING: CPU: 1 PID: 18 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x31/0x40
[  261.223698] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables tun nfnetlink bridge nf_defrag_ipv6 nf_defrag_ipv4 bpfilter sunrpc coretemp intel_cstate intel_uncore intel_rapl_perf pcspkr i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq sch_fq_codel ixgbe mdio mlx5_core i40e igb nfp ptp i2c_algo_bit devlink i2c_core pps_core hid_generic [last unloaded: x_tables]
[  261.261869] CPU: 1 PID: 18 Comm: ksoftirqd/1 Tainted: G      D           4.19.0-rc5-bpf-next-xdp-ixgbe-ZC+ #66
[  261.272468] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
[  261.280549] RIP: 0010:native_smp_send_reschedule+0x31/0x40
[  261.286361] Code: 48 0f a3 05 91 c7 3d 01 73 12 48 8b 05 e8 11 0c 01 be fd 00 00 00 48 8b 40 30 ff e0 89 fe 48 c7 c7 b8 36 09 82 e8 ff 7d 02 00 <0f> 0b c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 48
[  261.306001] RSP: 0018:ffff88085c643cc0 EFLAGS: 00010082
[  261.311553] RAX: 000000000000002e RBX: ffff88085c6213c0 RCX: 0000000000000006
[  261.319023] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff88085c6555e0
[  261.326483] RBP: ffff88085306a0d4 R08: 0000000000000000 R09: 0000000000000478
[  261.333943] R10: ffff88085c643bf8 R11: ffffffff82acfbad R12: ffff880853069640
[  261.341407] R13: ffff88085c643d10 R14: 0000000000000086 R15: 00000000000213c0
[  261.348869] FS:  0000000000000000(0000) GS:ffff88085c640000(0000) knlGS:0000000000000000
[  261.357555] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.363624] CR2: 0000000000000040 CR3: 000000087f20a006 CR4: 00000000003606e0
[  261.371090] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  261.378554] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  261.386014] Call Trace:
[  261.388788]  <IRQ>
[  261.391128]  check_preempt_curr+0x6f/0x80
[  261.395466]  ttwu_do_wakeup+0x19/0x150
[  261.399548]  try_to_wake_up+0x19c/0x450
[  261.403715]  ? enqueue_entity+0xad/0x2c0
[  261.407964]  __wake_up_common+0x71/0x170
[  261.412220]  ep_poll_callback+0xb5/0x2a0
[  261.416474]  __wake_up_common+0x71/0x170
[  261.420729]  __wake_up_common_lock+0x6c/0x90
[  261.425335]  ? tick_sched_do_timer+0x60/0x60
[  261.429935]  irq_work_run_list+0x47/0x70
[  261.434190]  update_process_times+0x3b/0x50
[  261.438705]  tick_sched_handle+0x21/0x70
[  261.442959]  ? tick_sched_do_timer+0x50/0x60
[  261.447554]  tick_sched_timer+0x37/0x70
[  261.451719]  __hrtimer_run_queues+0xf8/0x2a0
[  261.456317]  hrtimer_interrupt+0xe5/0x240
[  261.460657]  ? sched_clock+0x5/0x10
[  261.464478]  smp_apic_timer_interrupt+0x5e/0x140
[  261.469420]  apic_timer_interrupt+0xf/0x20
[  261.473847]  </IRQ>
[  261.476271] RIP: 0010:panic+0x1e3/0x232
[  261.480433] Code: eb ac 83 3d 30 07 a0 01 00 74 05 e8 39 36 02 00 48 c7 c6 a0 8b ac 82 48 c7 c7 10 af 09 82 e8 84 6a 05 00 fb 66 0f 1f 44 00 00 <31> db e8 f8 22 0b 00 4c 39 eb 7c 17 41 83 f4 01 44 89 e7 ff 15 d6
[  261.500066] RSP: 0018:ffffc9000323baf8 EFLAGS: 00000292 ORIG_RAX: ffffffffffffff13
[  261.508234] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000006
[  261.515696] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff88085c6555e0
[  261.523160] RBP: ffffc9000323bb68 R08: 0000000000000000 R09: 0000000000000476
[  261.530620] R10: 0000000000000008 R11: ffffffff82acfbad R12: 0000000000000000
[  261.538084] R13: 0000000000000000 R14: 0000000000000009 R15: 0000000000000001
[  261.545546]  ? panic+0x1dc/0x232
[  261.549101]  oops_end+0xb9/0xd0
[  261.552569]  no_context+0x156/0x3a0
[  261.556392]  ? cpumask_next_and+0x1a/0x20
[  261.560730]  ? find_busiest_group+0x112/0xa80
[  261.565413]  __do_page_fault+0xd5/0x500
[  261.569579]  page_fault+0x1e/0x30
[  261.573220] RIP: 0010:xsk_umem_consume_tx+0xc9/0x180
[  261.578508] Code: 24 75 be 48 8b 86 08 03 00 00 48 8d b0 f8 fc ff ff 48 39 c7 75 96 e8 26 bd 8a ff 5b 31 c0 41 5a 41 5c 41 5d 5d 49 8d 62 f8 c3 <89> 41 40 8b 4a 24 8b 42 1c 29 c8 75 0b 48 8b 42 28 8b 00 89 42 1c
[  261.598148] RSP: 0018:ffffc9000323bd00 EFLAGS: 00010246
[  261.603703] RAX: 0000000000000000 RBX: ffffc9000323bd68 RCX: 0000000000000000
[  261.611169] RDX: ffff8808553e1c00 RSI: ffff880826e43000 RDI: ffff880854940818
[  261.618631] RBP: ffffc9000323bd20 R08: 0000000000000010 R09: 0000000000000000
[  261.626094] R10: ffffc9000323bd40 R11: 0000000000000000 R12: ffffc9000323bd64
[  261.633557] R13: ffff880854940780 R14: 0000000000000000 R15: 0000000000000000
[  261.641021]  ? ixgbe_clean_xdp_tx_irq+0x19d/0x2e0 [ixgbe]
[  261.646755]  ixgbe_clean_xdp_tx_irq+0x19d/0x2e0 [ixgbe]
[  261.652308]  ixgbe_poll+0x5a/0x700 [ixgbe]
[  261.656735]  net_rx_action+0x141/0x3f0
[  261.660814]  ? sort_range+0x20/0x20
[  261.664627]  __do_softirq+0xe3/0x2f7
[  261.668530]  ? sort_range+0x20/0x20
[  261.672351]  run_ksoftirqd+0x26/0x30
[  261.676250]  smpboot_thread_fn+0x114/0x1d0
[  261.680671]  kthread+0x111/0x130
[  261.684223]  ? kthread_create_worker_on_cpu+0x50/0x50
[  261.689603]  ret_from_fork+0x1f/0x30
[  261.701291] ---[ end trace f0011e17c3744ee5 ]---


(gdb) list *(xsk_umem_consume_tx)+0xc9
0xffffffff81883fe9 is in xsk_umem_consume_tx (./include/linux/compiler.h:214).
209	static __always_inline void __write_once_size(volatile void *p, void *res, int size)
210	{
211		switch (size) {
212		case 1: *(volatile __u8 *)p = *(__u8 *)res; break;
213		case 2: *(volatile __u16 *)p = *(__u16 *)res; break;
214		case 4: *(volatile __u32 *)p = *(__u32 *)res; break;
215		case 8: *(volatile __u64 *)p = *(__u64 *)res; break;
216		default:
217			barrier();
218			__builtin_memcpy((void *)p, (const void *)res, size);


I think the bug occurs in the WRITE_ONCE in xskq_peek_desc(), and that
it corresponds to q->ring == NULL (ring is at offset 40 in struct
xsk_queue; see the pahole output below).

static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
					      struct xdp_desc *desc)
{
	if (q->cons_tail == q->cons_head) {
		WRITE_ONCE(q->ring->consumer, q->cons_tail);
		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);

		/* Order consumer and data */
		smp_rmb();
	}

	return xskq_validate_desc(q, desc);
}

$ pahole -C xsk_queue vmlinux
struct xsk_queue {
	u64                        chunk_mask;           /*     0     8 */
	u64                        size;                 /*     8     8 */
	u32                        ring_mask;            /*    16     4 */
	u32                        nentries;             /*    20     4 */
	u32                        prod_head;            /*    24     4 */
	u32                        prod_tail;            /*    28     4 */
	u32                        cons_head;            /*    32     4 */
	u32                        cons_tail;            /*    36     4 */
	struct xdp_ring *          ring;                 /*    40     8 */
	u64                        invalid_descs;        /*    48     8 */

	/* size: 56, cachelines: 1, members: 10 */
	/* last cacheline: 56 bytes */
};
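
When comparing these numbers, keep in mind that the oops prints
addresses in hex (CR2 is 0x40, i.e. 64) while pahole prints offsets in
decimal (ring is at 40). The userspace sketch below shows how a NULL
q->ring could yield exactly that fault address. It rests on an
assumption that is not shown in this mail: that struct xdp_ring keeps
producer and consumer on separate 64-byte cache lines, which would put
consumer at offset 0x40. The *_sketch types and the helper
consumer_store_address() are stand-ins made up for illustration, not
kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct xdp_ring. ASSUMPTION: producer and
 * consumer are each aligned to a 64-byte cache line; this layout is
 * not shown in the mail. */
struct xdp_ring_sketch {
	_Alignas(64) uint32_t producer;
	_Alignas(64) uint32_t consumer;
};

/* Same member order as the pahole output for struct xsk_queue. */
struct xsk_queue_sketch {
	uint64_t chunk_mask;
	uint64_t size;
	uint32_t ring_mask;
	uint32_t nentries;
	uint32_t prod_head;
	uint32_t prod_tail;
	uint32_t cons_head;
	uint32_t cons_tail;
	struct xdp_ring_sketch *ring;
	uint64_t invalid_descs;
};

/* With the assumed layout, consumer sits at hex 0x40 inside the ring. */
_Static_assert(offsetof(struct xdp_ring_sketch, consumer) == 0x40,
	       "consumer expected at offset 0x40");

/* Hypothetical helper: the address a store to q->ring->consumer would
 * touch, computed without dereferencing the ring pointer. */
static uintptr_t consumer_store_address(const struct xsk_queue_sketch *q)
{
	return (uintptr_t)q->ring +
	       offsetof(struct xdp_ring_sketch, consumer);
}
```

For a zero-initialized queue (ring == NULL) the helper returns 0x40,
which matches CR2 in the oops and the faulting store with RCX == 0, if
the layout assumption holds.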
 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
